Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 22 additions & 0 deletions gpustack_runtime/detector/amd.py
Original file line number Diff line number Diff line change
Expand Up @@ -179,6 +179,17 @@ def detect(self) -> Devices | None:
dev_gpu_vram_usage = pyamdsmi.amdsmi_get_gpu_vram_usage(dev)
dev_mem = dev_gpu_vram_usage.get("vram_total")
dev_mem_used = dev_gpu_vram_usage.get("vram_used")
# On APUs with unified memory (e.g., AMD Strix Halo), VRAM
# reports only the BIOS carveout (~512 MiB); VIS_VRAM reports
# the full usable system memory. Use VIS_VRAM when larger.
with contextlib.suppress(pyrocmsmi.ROCMSMIError):
dev_mem_vis_vram = byte_to_mebibyte(
Comment on lines +182 to +186
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VIS_VRAM selection logic (comment + suppress block + max) is duplicated in both the pyamdsmi and fallback branches. Consider extracting it into a small helper (or applying it once after the try/except) to avoid divergence if this logic needs future changes.

Copilot uses AI. Check for mistakes.
pyrocmsmi.rsmi_dev_memory_total_get(
dev_idx,
pyrocmsmi.RSMI_MEM_TYPE_VIS_VRAM,
),
)
dev_mem = max(dev_mem, dev_mem_vis_vram)
Comment on lines +188 to +192
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dev_gpu_vram_usage.get("vram_total") / get("vram_used") can return None; calling max(dev_mem, dev_mem_vis_vram) will then raise TypeError. Consider defaulting to 0 (or otherwise normalizing to an int) before the max() so the VIS_VRAM fallback cannot crash detection.

Copilot uses AI. Check for mistakes.
Comment on lines +182 to +192

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This block of code for checking VIS_VRAM is duplicated on lines 203-213. This duplication can lead to maintenance issues in the future, for example, if one block is updated but the other is missed. To improve maintainability, consider refactoring this logic into a helper function or restructuring the code to perform this check only once after the initial dev_mem value has been determined.

dev_ecc_count = pyamdsmi.amdsmi_get_gpu_ecc_count(
dev,
pyamdsmi.AmdSmiGpuBlock.UMC,
Expand All @@ -189,6 +200,17 @@ def detect(self) -> Devices | None:
dev_mem = byte_to_mebibyte( # byte to MiB
pyrocmsmi.rsmi_dev_memory_total_get(dev_idx),
)
# On APUs with unified memory (e.g., AMD Strix Halo), VRAM
# reports only the BIOS carveout (~512 MiB); VIS_VRAM reports
# the full usable system memory. Use VIS_VRAM when larger.
with contextlib.suppress(pyrocmsmi.ROCMSMIError):
dev_mem_vis_vram = byte_to_mebibyte(
Comment on lines +203 to +207
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new VIS_VRAM behavior isn’t covered by automated tests. Since tests/gpustack_runtime/detector/test_amd.py exists, consider adding a unit-style test that monkeypatches pyamdsmi.amdsmi_get_gpu_vram_usage and pyrocmsmi.rsmi_dev_memory_total_get to validate that VIS_VRAM is preferred when larger and that ROCm SMI errors are safely suppressed.

Copilot uses AI. Check for mistakes.
pyrocmsmi.rsmi_dev_memory_total_get(
dev_idx,
pyrocmsmi.RSMI_MEM_TYPE_VIS_VRAM,
),
)
dev_mem = max(dev_mem, dev_mem_vis_vram)
dev_mem_used = byte_to_mebibyte( # byte to MiB
pyrocmsmi.rsmi_dev_memory_usage_get(dev_idx),
)
Expand Down
5 changes: 5 additions & 0 deletions gpustack_runtime/detector/pyrocmsmi/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,11 @@
ROCMSMI_IOLINK_TYPE_XGMI = 2
ROCMSMI_IOLINK_TYPE_NUMIOLINKTYPES = 3

## Memory Types ##
RSMI_MEM_TYPE_VRAM = 0
RSMI_MEM_TYPE_GTT = 1
RSMI_MEM_TYPE_VIS_VRAM = 2
Comment on lines +67 to +68

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The values for RSMI_MEM_TYPE_GTT and RSMI_MEM_TYPE_VIS_VRAM appear to be swapped. According to the ROCm SMI library documentation (which is referenced in this file's comments), RSMI_MEM_TYPE_VIS_VRAM should be 1 and RSMI_MEM_TYPE_GTT should be 2. With the current values, the code will query for GTT memory instead of visible VRAM, leading to incorrect memory reporting for APUs.

Suggested change
RSMI_MEM_TYPE_GTT = 1
RSMI_MEM_TYPE_VIS_VRAM = 2
RSMI_MEM_TYPE_GTT = 2
RSMI_MEM_TYPE_VIS_VRAM = 1


Comment on lines +65 to +69
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New RSMI_MEM_TYPE_* constants are introduced here, but rsmi_dev_memory_total_get / rsmi_dev_memory_usage_get still default to rsmi_memory_type_t.RSMI_MEM_TYPE_VRAM. If rsmiBindings isn't importable (bindings path missing), those defaults can raise NameError even though the library can be loaded. Consider switching the defaults to these new constants (or defining these constants from rsmi_memory_type_t when available, with a numeric fallback) so memory calls work reliably across install layouts.

Suggested change
## Memory Types ##
RSMI_MEM_TYPE_VRAM = 0
RSMI_MEM_TYPE_GTT = 1
RSMI_MEM_TYPE_VIS_VRAM = 2
## Memory Types ##
# Fallback numeric values that work even when rsmiBindings cannot be imported.
RSMI_MEM_TYPE_VRAM = 0
RSMI_MEM_TYPE_GTT = 1
RSMI_MEM_TYPE_VIS_VRAM = 2
# If rsmi_memory_type_t is available from rsmiBindings, prefer the values
# defined there so we stay in sync with the underlying library.
if "rsmi_memory_type_t" in globals():
try:
RSMI_MEM_TYPE_VRAM = int(rsmi_memory_type_t.RSMI_MEM_TYPE_VRAM)
RSMI_MEM_TYPE_GTT = int(rsmi_memory_type_t.RSMI_MEM_TYPE_GTT)
RSMI_MEM_TYPE_VIS_VRAM = int(rsmi_memory_type_t.RSMI_MEM_TYPE_VIS_VRAM)
except Exception:
# If anything unexpected happens, keep using the numeric fallbacks above.
pass
else:
# When rsmiBindings is not importable, provide a minimal stand-in so that
# references like rsmi_memory_type_t.RSMI_MEM_TYPE_VRAM used in default
# arguments do not raise NameError at import time.
class rsmi_memory_type_t: # type: ignore[override]
RSMI_MEM_TYPE_VRAM = RSMI_MEM_TYPE_VRAM
RSMI_MEM_TYPE_GTT = RSMI_MEM_TYPE_GTT
RSMI_MEM_TYPE_VIS_VRAM = RSMI_MEM_TYPE_VIS_VRAM

Copilot uses AI. Check for mistakes.
## Error Codes ##
ROCMSMI_ERROR_UNINITIALIZED = -99997
ROCMSMI_ERROR_FUNCTION_NOT_FOUND = -99998
Expand Down
Loading