Unify MoE test perf timing to CUDA Event bench and fix minor issues#408
Conversation
- Remove run_perftest path from mxscale/wmma test harnesses, keep only _bench_kernel_us (CUDA Event per-iteration timing with L2 flush and IQR outlier filtering) for consistent and accurate benchmarking
- Remove test_graph/benchmark_mode parameters and the CUDA Graph parametrize from pytest tests, since they were only used by the removed perftest path
- Clean up the __main__ CLI in the wmma test to remove the --test_graph/--benchmark args
- Fix is_rdna_arch() to match gfx12* (not just gfx120*) so gfx1250 is detected
- Fix the bench_resolve_tiles() tile_k fallback for non-divisible dimensions

Made-with: Cursor
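The PR text does not include the timer itself, but the IQR outlier filtering it mentions can be sketched in isolation. The helper below is illustrative only (the name and exact policy of `_bench_kernel_us` in the harness may differ): it drops per-iteration samples outside the 1.5×IQR fences and returns the median of what survives.

```python
import statistics

def iqr_filtered_median_us(samples_us):
    """Median of per-iteration timings after 1.5*IQR outlier rejection.

    Samples outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (e.g. iterations hit by
    clock ramp-up or an interrupt) are discarded before taking the median.
    """
    if len(samples_us) < 4:
        return statistics.median(samples_us)
    q1, _, q3 = statistics.quantiles(samples_us, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [s for s in samples_us if lo <= s <= hi]
    # If the fences reject everything (degenerate data), fall back to all samples.
    return statistics.median(kept or samples_us)
```

With nine 10 µs iterations and one 1000 µs straggler, the straggler falls outside the fences and the reported latency stays at 10 µs, which is the point of per-iteration (rather than whole-loop) event timing.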
Pull request overview
This PR standardizes MoE 2-stage kernel timing in the gfx1250 test harnesses by removing the legacy run_perftest/CUDA-graph benchmarking path and using a single CUDA-event-based per-iteration timer, while also fixing a couple of small architecture/tiling utilities.
Changes:
- Remove the run_perftest/test_graph/benchmark_mode paths from the WMMA and MXScale MoE gfx1250 harnesses and always time with bench_kernel_us.
- Update the tile resolution logic in bench_resolve_tiles and broaden RDNA arch detection from gfx120* to gfx12*.
- Simplify the WMMA harness CLI by removing the obsolete --benchmark/--test_graph options.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/kernels/test_moe_gemm_wmma_gfx1250.py | Removes graph/perftest timing paths and CLI args; uses CUDA-event benchmark timer consistently. |
| tests/kernels/test_moe_gemm_mxscale_gfx1250.py | Removes graph/perftest timing paths; uses CUDA-event benchmark timer consistently. |
| tests/kernels/benchmark_common.py | Adjusts tile selection behavior for stage2 tile_k resolution. |
| python/flydsl/runtime/device.py | Updates RDNA arch detection to treat all gfx12* as RDNA. |
```python
parser.add_argument("--benchmark", action="store_true", default=False,
                    help="Use GEMM-style per-iteration event timing with optional L2 flush.")
parser.add_argument("--no_flush_l2", action="store_true", default=False,
                    help="Disable L2 flush in benchmark mode.")
```
The --benchmark flag was removed, but the --no_flush_l2 help text still refers to “benchmark mode”. Since timing is now always done via the CUDA-event benchmark path, consider updating the help text to reflect the current behavior (e.g., “Disable L2 flush during timing”).
Suggested change:
```diff
-help="Disable L2 flush in benchmark mode.")
+help="Disable L2 flush during timing.")
```
```python
    flush_l2=bool(flush_l2),
    prep_fn=_prep_stage1,
)
torch.cuda.synchronize()
```
bench_kernel_us already synchronizes internally before returning. This extra torch.cuda.synchronize() is redundant and adds overhead; consider removing it (and the similar one after the stage2 timing call).
Suggested change:
```diff
-torch.cuda.synchronize()
```
```python
for k in range(target_k, 0, -wmma_k):
    if inter_dim % k != 0:
        # K-padding: run_moe_stage2 auto-pads inter_dim, so accept
        # any tile_k that satisfies the load-mapping constraints.
        total = tile_m * k
        if total % 256 == 0 and (total // 256) % 4 == 0:
            if tile_k2 is None:
                tile_k2 = k
        continue
```
bench_resolve_tiles can now select a tile_k2 even when inter_dim % tile_k2 != 0 (the new non-divisible branch sets tile_k2). Most run_moe_stage2 implementations in this repo (e.g. tests/kernels/test_moe_gemm.py) require inter_dim to be divisible by tile_k2 and will error out; in --bench mode this turns what used to be a clean SKIP into per-token FAILs. Consider only choosing tile_k2 values that divide inter_dim by default, and gate the non-divisible selection behind an explicit allow_k_padding flag / per-dtype capability for harnesses that actually pad K internally.
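The reviewer's suggestion can be sketched as a divisible-first search with K-padded candidates gated behind an explicit flag. The function and flag names below are illustrative, not the repo's actual API; the load-mapping constraint is the one quoted in the hunk above.

```python
def resolve_tile_k2(target_k, wmma_k, tile_m, inter_dim, allow_k_padding=False):
    """Pick the largest tile_k2 <= target_k, stepping down by wmma_k.

    Tile sizes that divide inter_dim are always preferred. Non-divisible
    sizes (which require the kernel to pad K internally) are considered
    only when allow_k_padding is set, and only if they satisfy the
    load-mapping constraint. Returns None when nothing fits, so the
    caller can SKIP instead of failing per token.
    """
    padded_candidate = None
    for k in range(target_k, 0, -wmma_k):
        if inter_dim % k == 0:
            return k  # clean divisor: safe for all run_moe_stage2 variants
        if allow_k_padding and padded_candidate is None:
            total = tile_m * k
            if total % 256 == 0 and (total // 256) % 4 == 0:
                padded_candidate = k  # remember the largest padded option
    return padded_candidate
```

A harness whose stage2 kernel pads K would pass allow_k_padding=True; all others keep the default and fall back to a SKIP when the search returns None.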
```python
    flush_l2=bool(flush_l2),
    prep_fn=_prep_stage1,
)
torch.cuda.synchronize()
```
bench_kernel_us already calls torch.cuda.synchronize() before returning the median latency. The extra torch.cuda.synchronize() here is redundant and adds overhead; removing it will make the harness slightly faster (same applies to the other _bench_kernel_us usage in this file).
Suggested change:
```diff
-torch.cuda.synchronize()
```
```diff
 if arch.startswith("gfx10") or arch.startswith("gfx11"):
     return True
-if arch.startswith("gfx120"):
+if arch.startswith("gfx12"):
```
why rdna true for 1250?
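For context, the check reduces to simple prefix matching, and the question above comes down to string prefixes: "gfx1250" starts with "gfx12" but not with "gfx120", so the old check excluded it. A minimal standalone version of the fixed predicate (hedged: the real function in python/flydsl/runtime/device.py may carry additional cases):

```python
def is_rdna_arch(arch: str) -> bool:
    """Treat all gfx10*, gfx11*, and gfx12* targets as RDNA.

    The previous gfx120* prefix matched gfx1200/gfx1201 but missed
    gfx1250; broadening to gfx12* covers the whole family.
    """
    return arch.startswith(("gfx10", "gfx11", "gfx12"))
```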
Made-with: Cursor
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist