
[AI] [FIX] Kaggle P100 PyTorch CPU fallback#367

Open
twq110 wants to merge 1 commit into main from codex/kaggle-pytorch-p100-fallback

Conversation

@twq110
Contributor

@twq110 twq110 commented May 10, 2026

Overview

Prevents PatchTST/TCN training on Kaggle from failing with a CUDA kernel image error when Kaggle allocates a Tesla P100 (sm_60), since the current PyTorch CUDA build supports only sm_70 and above.

Changes

  • Check the device compute capability in PatchTST/TCN Kaggle training instead of relying on CUDA availability alone
  • Fall back to CPU when a CUDA GPU is visible but its capability is below sm_70
  • On fallback, log the visible GPU list and the reason for the CPU fallback
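The capability check described above can be sketched as follows. This is a minimal illustration rather than the PR's exact code; `pick_device` and `MIN_SM_MAJOR` are illustrative names (the PR's helper is `select_torch_device()`).

```python
import torch

MIN_SM_MAJOR = 7  # this PyTorch build ships kernels for sm_70 and above only

def pick_device() -> torch.device:
    """Return a CUDA device only if a compatible GPU is visible, else CPU."""
    if not torch.cuda.is_available():
        print("[INFO] GPU devices: []")
        return torch.device("cpu")
    names = []
    chosen = None
    for idx in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(idx)
        names.append(f"{torch.cuda.get_device_name(idx)} sm_{major}{minor}")
        if major >= MIN_SM_MAJOR and chosen is None:
            chosen = idx  # remember the first compatible GPU
    print(f"[INFO] GPU devices: {names}")
    if chosen is not None:
        return torch.device(f"cuda:{chosen}")
    print("[WARN] Visible GPUs are below sm_70; falling back to CPU.")
    return torch.device("cpu")

print(pick_device())
```

On a Kaggle P100 session this would log the P100 as `sm_60` and return the CPU device instead of crashing later inside a CUDA kernel launch.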

Root cause confirmed in server logs

  • Tesla P100-PCIE-16GB with CUDA capability sm_60 is not compatible with the current PyTorch installation
  • torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
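Both messages stem from a compute-capability mismatch between the wheel and the GPU. A short diagnostic like the following can confirm it (assumes a CUDA build of PyTorch is installed; output depends on the host):

```python
import torch

# Compare the architectures this wheel was compiled for against the
# compute capability of each visible GPU.
print("PyTorch:", torch.__version__)
if torch.cuda.is_available():
    print("Compiled for:", torch.cuda.get_arch_list())  # e.g. ['sm_70', 'sm_75', ...]
    for i in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(i)
        major, minor = torch.cuda.get_device_capability(i)
        print(f"{name}: sm_{major}{minor}")  # a P100 reports sm_60
else:
    print("No CUDA device visible (or CPU-only build).")
```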

Verification

  • python -m compileall -q AI/modules/signal/models/patchtst/train_kaggle.py AI/modules/signal/models/TCN/train_kaggle.py

Summary by CodeRabbit

Release Notes

  • Improvements
    • Strengthened GPU compatibility validation for more reliable device selection
    • Automatic fallback to CPU in incompatible GPU environments
    • Added warning messages for unusable GPU configurations
    • Optimized the training process of the AI signal models (TCN, PatchTST)

Review Change Stack

@twq110 twq110 requested a review from discipline24 as a code owner May 10, 2026 16:15
@coderabbitai

coderabbitai Bot commented May 10, 2026

Overview

Improves the GPU device selection logic in two PyTorch training scripts (TCN and PatchTST). A new select_torch_device() function, based on validating GPU compute capability (sm_70 or higher) rather than a plain CUDA availability check, is introduced and integrated into the training flow.

Changes

GPU device selection logic improvements

Device selection function implementation
AI/modules/signal/models/TCN/train_kaggle.py, AI/modules/signal/models/patchtst/train_kaggle.py
Defines a new select_torch_device() function in both files that inspects the compute capability of the visible GPUs and selects CUDA only when an sm_70+ GPU is present; otherwise it prints a warning and falls back to CPU.

Legacy function removal
AI/modules/signal/models/patchtst/train_kaggle.py
Removes the log_gpu_status() function, which relied solely on torch.cuda.is_available() with no compute-capability validation.

Training flow integration
AI/modules/signal/models/TCN/train_kaggle.py, AI/modules/signal/models/patchtst/train_kaggle.py
Device initialization before model training now calls the new select_torch_device(), applying the improved GPU capability validation.
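The integration step amounts to resolving the device once and moving the model and tensors onto it. A self-contained sketch, with the helper trimmed and nn.Linear standing in for the actual TCN/PatchTST model:

```python
import torch
import torch.nn as nn

def select_torch_device() -> torch.device:
    # Trimmed version of the PR's helper: CUDA only if an sm_70+ GPU is visible.
    if torch.cuda.is_available():
        for idx in range(torch.cuda.device_count()):
            if torch.cuda.get_device_capability(idx)[0] >= 7:
                return torch.device(f"cuda:{idx}")
    return torch.device("cpu")

device = select_torch_device()
model = nn.Linear(8, 1).to(device)        # stand-in for the real model
batch = torch.randn(4, 8, device=device)  # inputs are created on the same device
out = model(batch)
print(out.shape)  # torch.Size([4, 1])
```

Because the device is chosen before any tensor is allocated, an incompatible GPU never receives a kernel launch; the same script runs unchanged on CPU.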

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem (the rabbit's celebratory verse)

A new choice of GPU,
its capability checked anew,
sm_70 or higher, into CUDA we go 🚀
otherwise to the safe haven of the CPU,
a journey of wise computation.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
Title check: ✅ Passed. The PR title clearly identifies the main change, adding a PyTorch CPU fallback for Kaggle P100 GPUs, matching the core objective of preventing CUDA kernel errors on incompatible hardware.
Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
AI/modules/signal/models/patchtst/train_kaggle.py (1)

232-232: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Calling a removed function causes a runtime error.

Line 232 still calls log_gpu_status(), but according to the AI summary this function has been removed. Running train() will therefore raise NameError: name 'log_gpu_status' is not defined.

Since the new select_torch_device() function handles both GPU status logging and device selection, this line should be removed.

🐛 Suggested fix
 def train():
-    log_gpu_status()
     print("=" * 50)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@AI/modules/signal/models/patchtst/train_kaggle.py` at line 232, Call to
removed function log_gpu_status() causes NameError; remove the call at line
where train() invokes log_gpu_status() and rely on the existing
select_torch_device() routine which already handles GPU status logging and
device selection. Locate the invocation of log_gpu_status() in the
train_kaggle.py train() flow (the call shown as log_gpu_status()) and delete
that single call so execution uses select_torch_device() only.
🧹 Nitpick comments (2)
AI/modules/signal/models/patchtst/train_kaggle.py (1)

32-54: ⚡ Quick win

The wrong device may be selected in a multi-GPU environment.

The current implementation returns torch.device("cuda") as soon as any GPU with compute capability major >= 7 is present, which always means cuda:0. This can fail when cuda:0 is incompatible but another device is compatible. Explicitly selecting the first compatible GPU is recommended.

Suggested improved implementation
 def select_torch_device() -> torch.device:
     if not torch.cuda.is_available():
         print("[INFO] GPU devices: []")
         print("[INFO] Using CPU")
         return torch.device("cpu")
 
     devices = []
-    usable_gpu = False
+    usable_device_idx = None
     for idx in range(torch.cuda.device_count()):
         name = torch.cuda.get_device_name(idx)
         major, minor = torch.cuda.get_device_capability(idx)
         devices.append(f"{name} sm_{major}{minor}")
-        if major >= 7:
-            usable_gpu = True
+        if major >= 7 and usable_device_idx is None:
+            usable_device_idx = idx
 
     print(f"[INFO] GPU devices: {devices}")
-    if usable_gpu:
+    if usable_device_idx is not None:
         print("[INFO] Using GPU")
-        return torch.device("cuda")
+        return torch.device(f"cuda:{usable_device_idx}")
 
     print("[WARN] CUDA device is visible, but this PyTorch build requires sm_70+; falling back to CPU.")
     print("[INFO] Using CPU")
     return torch.device("cpu")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@AI/modules/signal/models/patchtst/train_kaggle.py` around lines 32 - 54,
select_torch_device returns "cuda" and therefore always uses cuda:0; when at
least one compatible GPU exists, explicitly select the index of the first
compatible device: track the first index with major >= 7 in the for loop (e.g.
in a compatible_idx variable) and, after the loop, return
torch.device(f"cuda:{compatible_idx}") when such an index exists (index
tracking instead of the existing usable_gpu boolean). Keep the remaining log
output unchanged, and keep the CPU fallback when no compatible device is found.
AI/modules/signal/models/TCN/train_kaggle.py (1)

39-61: ⚡ Quick win

The wrong device may be selected in a multi-GPU environment.

The function sets usable_gpu = True when any GPU with compute capability major >= 7 is present, but ultimately returns torch.device("cuda"). Without an explicit index this means cuda:0, so if cuda:0 is an incompatible GPU (e.g. sm_60) while cuda:1 is compatible, the code will still try to use cuda:0 and fail.

On Kaggle there are typically zero or one GPUs, so this is unlikely in practice, but explicitly selecting the first compatible GPU is still the better choice.

Suggested improved implementation
 def select_torch_device() -> torch.device:
     if not torch.cuda.is_available():
         print("[INFO] GPU devices: []")
         print("[INFO] Using CPU")
         return torch.device("cpu")
 
     devices = []
-    usable_gpu = False
+    usable_device_idx = None
     for idx in range(torch.cuda.device_count()):
         name = torch.cuda.get_device_name(idx)
         major, minor = torch.cuda.get_device_capability(idx)
         devices.append(f"{name} sm_{major}{minor}")
-        if major >= 7:
-            usable_gpu = True
+        if major >= 7 and usable_device_idx is None:
+            usable_device_idx = idx
 
     print(f"[INFO] GPU devices: {devices}")
-    if usable_gpu:
+    if usable_device_idx is not None:
         print("[INFO] Using GPU")
-        return torch.device("cuda")
+        return torch.device(f"cuda:{usable_device_idx}")
 
     print("[WARN] CUDA device is visible, but this PyTorch build requires sm_70+; falling back to CPU.")
     print("[INFO] Using CPU")
     return torch.device("cpu")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@AI/modules/signal/models/TCN/train_kaggle.py` around lines 39 - 61, The
select_torch_device function may return torch.device("cuda") even when the first
visible GPU (cuda:0) is not the compatible one; change the logic to pick and
return the explicit index of the first compatible device (the idx where major>=7
discovered in the for loop) instead of an unindexed "cuda", and update the
prints to show which cuda:{idx} is chosen; ensure you still fall back to cpu as
before if no compatible device is found.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 118eddbc-2795-4cca-9e7f-280bdf5fa46f

📥 Commits

Reviewing files that changed from the base of the PR and between 1bfbb49 and bf34be3.

📒 Files selected for processing (2)
  • AI/modules/signal/models/TCN/train_kaggle.py
  • AI/modules/signal/models/patchtst/train_kaggle.py
