[AI] [FIX] Kaggle P100 PyTorch CPU fallback (#367)
Conversation
Overview
Improves the GPU device selection logic in two PyTorch training scripts (TCN, PatchTST). Instead of a plain CUDA availability check, the new logic validates GPU compute capability (sm_70 or higher).
Changes
Improved GPU device selection logic
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
AI/modules/signal/models/patchtst/train_kaggle.py (1)
232-232: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win
Calling a removed function causes a runtime error.
Line 232 still calls log_gpu_status(), but according to the AI summary this function has been removed. As a result, running train() raises NameError: name 'log_gpu_status' is not defined. Since the new select_torch_device() function already handles both GPU status logging and device selection, this line should be removed.
🐛 Suggested fix
 def train():
-    log_gpu_status()
     print("=" * 50)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@AI/modules/signal/models/patchtst/train_kaggle.py` at line 232, Call to removed function log_gpu_status() causes NameError; remove the call at line where train() invokes log_gpu_status() and rely on the existing select_torch_device() routine which already handles GPU status logging and device selection. Locate the invocation of log_gpu_status() in the train_kaggle.py train() flow (the call shown as log_gpu_status()) and delete that single call so execution uses select_torch_device() only.
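The failure mode described above is the ordinary Python NameError for a call to a definition that no longer exists; a minimal standalone illustration (not the project's actual code):

```python
def train():
    # log_gpu_status was deleted during the refactor, but this call remained
    log_gpu_status()

try:
    train()
except NameError as exc:
    # Python resolves the name only at call time, so the module imports
    # fine and the error surfaces when train() actually runs.
    print(exc)  # name 'log_gpu_status' is not defined
```

This is why the bug is invisible until the training entry point executes, and why removing the single stale call is the whole fix.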
🧹 Nitpick comments (2)
AI/modules/signal/models/patchtst/train_kaggle.py (1)
32-54: ⚡ Quick win
The wrong device may be selected in a multi-GPU environment.
The current implementation returns torch.device("cuda") as soon as any GPU with compute capability major >= 7 is present, which always means cuda:0. This can fail when cuda:0 is incompatible but another device is compatible. It is recommended to explicitly select the first compatible GPU.
Suggested improved implementation
 def select_torch_device() -> torch.device:
     if not torch.cuda.is_available():
         print("[INFO] GPU devices: []")
         print("[INFO] Using CPU")
         return torch.device("cpu")
     devices = []
-    usable_gpu = False
+    usable_device_idx = None
     for idx in range(torch.cuda.device_count()):
         name = torch.cuda.get_device_name(idx)
         major, minor = torch.cuda.get_device_capability(idx)
         devices.append(f"{name} sm_{major}{minor}")
-        if major >= 7:
-            usable_gpu = True
+        if major >= 7 and usable_device_idx is None:
+            usable_device_idx = idx
     print(f"[INFO] GPU devices: {devices}")
-    if usable_gpu:
+    if usable_device_idx is not None:
         print("[INFO] Using GPU")
-        return torch.device("cuda")
+        return torch.device(f"cuda:{usable_device_idx}")
     print("[WARN] CUDA device is visible, but this PyTorch build requires sm_70+; falling back to CPU.")
     print("[INFO] Using CPU")
     return torch.device("cpu")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@AI/modules/signal/models/patchtst/train_kaggle.py` around lines 32 - 54, select_torch_device returns "cuda", which always uses cuda:0; if any compatible GPU exists, explicitly select the index of the first compatible device: in the for loop, store the first index where major >= 7 in a variable such as compatible_idx, and after the loop return torch.device(f"cuda:{compatible_idx}") when such an index exists (track the index instead of the existing usable_gpu boolean). Keep the remaining log output and leave the CPU fallback as-is when no compatible device is found.
AI/modules/signal/models/TCN/train_kaggle.py (1)
39-61: ⚡ Quick win
The wrong device may be selected in a multi-GPU environment.
The function sets usable_gpu = True if any GPU with compute capability major >= 7 exists, but ultimately returns torch.device("cuda"). Without an explicit index this means cuda:0, so if cuda:0 is an incompatible GPU (e.g. sm_60) and cuda:1 is compatible, the code still tries cuda:0 and can fail. In the Kaggle environment there are typically 0-1 GPUs, so this is unlikely in practice, but it is still better to explicitly select the first compatible GPU.
Suggested improved implementation
 def select_torch_device() -> torch.device:
     if not torch.cuda.is_available():
         print("[INFO] GPU devices: []")
         print("[INFO] Using CPU")
         return torch.device("cpu")
     devices = []
-    usable_gpu = False
+    usable_device_idx = None
     for idx in range(torch.cuda.device_count()):
         name = torch.cuda.get_device_name(idx)
         major, minor = torch.cuda.get_device_capability(idx)
         devices.append(f"{name} sm_{major}{minor}")
-        if major >= 7:
-            usable_gpu = True
+        if major >= 7 and usable_device_idx is None:
+            usable_device_idx = idx
     print(f"[INFO] GPU devices: {devices}")
-    if usable_gpu:
+    if usable_device_idx is not None:
         print("[INFO] Using GPU")
-        return torch.device("cuda")
+        return torch.device(f"cuda:{usable_device_idx}")
     print("[WARN] CUDA device is visible, but this PyTorch build requires sm_70+; falling back to CPU.")
     print("[INFO] Using CPU")
     return torch.device("cpu")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@AI/modules/signal/models/TCN/train_kaggle.py` around lines 39 - 61, The select_torch_device function may return torch.device("cuda") even when the first visible GPU (cuda:0) is not the compatible one; change the logic to pick and return the explicit index of the first compatible device (the idx where major>=7 discovered in the for loop) instead of an unindexed "cuda", and update the prints to show which cuda:{idx} is chosen; ensure you still fall back to cpu as before if no compatible device is found.
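The index-tracking change both suggestions describe can be isolated as a small pure function. The sketch below uses a hypothetical helper name (first_compatible_index is not part of the PR) and operates on a list of (major, minor) capability tuples, such as those returned by torch.cuda.get_device_capability:

```python
from typing import List, Optional, Tuple

def first_compatible_index(
    capabilities: List[Tuple[int, int]], min_major: int = 7
) -> Optional[int]:
    """Return the index of the first GPU whose compute-capability major
    version meets the minimum, or None if no device qualifies."""
    for idx, (major, _minor) in enumerate(capabilities):
        if major >= min_major:
            return idx
    return None

# A P100 (sm_60) followed by a V100 (sm_70): the second device qualifies.
print(first_compatible_index([(6, 0), (7, 0)]))  # 1
# Only a P100 visible: no compatible device, so the caller falls back to CPU.
print(first_compatible_index([(6, 0)]))  # None
```

Keeping the selection logic pure like this also makes the sm_70 threshold testable without a GPU, which is the part the boolean usable_gpu flag could not express.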
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@AI/modules/signal/models/patchtst/train_kaggle.py`:
- Line 232: Call to removed function log_gpu_status() causes NameError; remove
the call at line where train() invokes log_gpu_status() and rely on the existing
select_torch_device() routine which already handles GPU status logging and
device selection. Locate the invocation of log_gpu_status() in the
train_kaggle.py train() flow (the call shown as log_gpu_status()) and delete
that single call so execution uses select_torch_device() only.
---
Nitpick comments:
In `@AI/modules/signal/models/patchtst/train_kaggle.py`:
- Around line 32-54: select_torch_device returns "cuda", which always uses
cuda:0; if any compatible GPU exists, explicitly select the index of the first
compatible device: in the for loop, store the first index where major >= 7 in a
variable such as compatible_idx, and after the loop return
torch.device(f"cuda:{compatible_idx}") when such an index exists (track the
index instead of the existing usable_gpu boolean). Keep the remaining log
output and leave the CPU fallback as-is when no compatible device is found.
In `@AI/modules/signal/models/TCN/train_kaggle.py`:
- Around line 39-61: The select_torch_device function may return
torch.device("cuda") even when the first visible GPU (cuda:0) is not the
compatible one; change the logic to pick and return the explicit index of the
first compatible device (the idx where major>=7 discovered in the for loop)
instead of an unindexed "cuda", and update the prints to show which cuda:{idx}
is chosen; ensure you still fall back to cpu as before if no compatible device
is found.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 118eddbc-2795-4cca-9e7f-280bdf5fa46f
📒 Files selected for processing (2)
AI/modules/signal/models/TCN/train_kaggle.py
AI/modules/signal/models/patchtst/train_kaggle.py
Overview
Prevents PatchTST/TCN training from failing with a CUDA kernel image error when Kaggle allocates a Tesla P100 (sm_60), since the current PyTorch CUDA build only supports sm_70 and above.
Changes
Confirmed cause from server logs
Tesla P100-PCIE-16GB with CUDA capability sm_60 is not compatible with the current PyTorch installation
torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
Verification
python -m compileall -q AI/modules/signal/models/patchtst/train_kaggle.py AI/modules/signal/models/TCN/train_kaggle.py
Summary by CodeRabbit
Release notes