[ROCm] Replace compile-time warp size with runtime query in host code (#1885)
Conversation
Add `bnb_host_warp_size()` that queries `hipDeviceGetAttribute` at runtime with per-device caching (up to 32 GPUs), replacing the compile-time `BNB_WARP_SIZE` macro in host-side dispatch. This fixes the incorrect default of warp size 64 on RDNA and ensures kernels are dispatched with the proper launch parameters.
Thanks! Looks good apart from a minor lint issue. @Abdennacer-Badaoui do you have any feedback?
@matthewdouglas I have a follow-up to this at #1887
Looks good to me. Thanks, @sstamenk!
I will look at your follow-up PR now.
Abdennacer-Badaoui left a comment
LGTM! The separation of compile-time BNB_WARP_SIZE (for device code) and runtime bnb_host_warp_size() (for host-side launch config) is the right approach for correct RDNA/CDNA dispatch.
Minor nit: the static warp_size cache has a benign data race (could use `std::atomic`), but it's not a blocker.
Can you fix the linting @sstamenk, thanks!
Good catch, I've changed it to use atomics.
Summary
- Replaced the compile-time `BNB_WARP_SIZE` macro with a runtime `bnb_host_warp_size()` query in all host-side dispatch code. The macro is only correctly defined during device-code compilation passes and was silently wrong in host code on ROCm 7.x (which removed `__AMDGCN_WAVEFRONT_SIZE`). On RDNA, which uses a warp size of 32, this logic was defaulting to 64.
- Fixed the `blocksize=64` 4-bit quantization dispatch on HIP to select the correct kernel based on the actual device warp size, instead of unconditionally using `kQuantizeBlockwiseSmall` with a hardcoded 64-thread launch.
- Improved the `BNB_WARP_SIZE` fallback chain in `common.cuh` for ROCm 7.0+, where `__AMDGCN_WAVEFRONT_SIZE` is no longer emitted: fall back to `__GFX9__` (CDNA = 64), then default to 32 (RDNA).

Problem
On RDNA GPUs (gfx10xx/gfx11xx/gfx12xx, wavefront size 32), the old default of `BNB_WARP_SIZE = 64` caused two classes of failures:

- `blocksize=32` crash (Fatal Python error: Aborted): `kQuantizeBlockwiseSmall` was compiled with `THREADS=64` but launched with only 32 threads, causing CUB cooperative operations to abort.
- `blocksize=64` garbage output (mean error ~0.73 vs threshold 0.11): after fixing the macro default to 32, `kQuantizeBlockwiseSmall` was compiled with `THREADS=32` but still launched with 64 threads (hardcoded in the `blocksize=64` HIP dispatch path), producing corrupted quantization output. The unit tests `test_4bit_compressed_stats` and `test_4bit_quant` with `blocksize=64` were broken on RDNA as a result.

The root issue is that `BNB_WARP_SIZE` is architecture-specific and only valid inside device code, but was being used in host-side kernel launch configuration.

Changes
- `csrc/common.cuh` — improved warp-size fallback for device code:
  - `__AMDGCN_WAVEFRONT_SIZE` (preferred, but removed in ROCm 7.0)
  - `__GFX9__` → 64 (CDNA)
  - otherwise → 32 (RDNA, CUDA)
- `csrc/ops.cu` — runtime warp size for host code:
  - `bnb_host_warp_size()`: queries device 0 with `hipDeviceAttributeWarpSize`. Returns 32 on CUDA.
  - `gemm_4bit_inference_naive`: replaced `BNB_WARP_SIZE == 64` with `bnb_host_warp_size() == 64`.
  - `quantizeBlockwise` for `blocksize=64` on HIP: dispatch to `kQuantizeBlockwiseSmall` only when the runtime warp size is 64 (CDNA); fall through to `kQuantizeBlockwise<64, 2>` otherwise (RDNA, same as the CUDA path).

Test plan