
Unify MoE test perf timing to CUDA Event bench and fix minor issues#408

Open
XingerZhu wants to merge 1 commit into main from
gfx1250_moe_modify

Conversation

@XingerZhu
Collaborator

  • Remove run_perftest path from mxscale/wmma test harnesses, keep only _bench_kernel_us (CUDA Event per-iteration timing with L2 flush and IQR outlier filtering) for consistent and accurate benchmarking
  • Remove test_graph/benchmark_mode parameters and CUDA Graph parametrize from pytest tests since they were only used by the removed perftest path
  • Clean up the __main__ CLI in the wmma test to remove the --test_graph/--benchmark args
  • Fix is_rdna_arch() to match gfx12* (not just gfx120*) for gfx1250
  • Fix bench_resolve_tiles() tile_k fallback for non-divisible dimensions

Made-with: Cursor
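For context, the IQR outlier rejection that a per-iteration event timer applies before reporting a latency can be sketched as follows. This is a hypothetical minimal sketch: the function name `iqr_filter` and the 1.5×IQR fences are assumptions for illustration, not the repo's actual `_bench_kernel_us` internals.

```python
import statistics

def iqr_filter(samples_us):
    """Drop samples outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

    Sketch of the outlier-rejection step a per-iteration CUDA Event
    timer could apply to its measured latencies (in microseconds)
    before aggregating them into a single reported number.
    """
    if len(samples_us) < 4:
        return list(samples_us)  # too few samples to estimate quartiles
    q1, _, q3 = statistics.quantiles(samples_us, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [s for s in samples_us if lo <= s <= hi]
```

Filtering like this keeps one slow iteration (e.g. a stray interrupt or clock ramp) from skewing the aggregate, which is why per-iteration timing pairs naturally with it.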

Copilot AI review requested due to automatic review settings April 16, 2026 08:33
Contributor

Copilot AI left a comment


Pull request overview

This PR standardizes MoE 2-stage kernel timing in the gfx1250 test harnesses by removing the legacy run_perftest/CUDA-graph benchmarking path and using a single CUDA-event-based per-iteration timer, while also fixing a couple of small architecture/tiling utilities.

Changes:

  • Remove run_perftest/test_graph/benchmark_mode paths from the WMMA and MXScale MoE gfx1250 harnesses and always time with bench_kernel_us.
  • Update tile resolution logic in bench_resolve_tiles and broaden RDNA arch detection from gfx120* to gfx12*.
  • Simplify the WMMA harness CLI by removing obsolete --benchmark/--test_graph options.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Files reviewed:
  • tests/kernels/test_moe_gemm_wmma_gfx1250.py: removes graph/perftest timing paths and CLI args; uses the CUDA-event benchmark timer consistently.
  • tests/kernels/test_moe_gemm_mxscale_gfx1250.py: removes graph/perftest timing paths; uses the CUDA-event benchmark timer consistently.
  • tests/kernels/benchmark_common.py: adjusts tile selection behavior for stage2 tile_k resolution.
  • python/flydsl/runtime/device.py: updates RDNA arch detection to treat all gfx12* as RDNA.


parser.add_argument("--benchmark", action="store_true", default=False,
help="Use GEMM-style per-iteration event timing with optional L2 flush.")
parser.add_argument("--no_flush_l2", action="store_true", default=False,
help="Disable L2 flush in benchmark mode.")

Copilot AI Apr 16, 2026


The --benchmark flag was removed, but the --no_flush_l2 help text still refers to “benchmark mode”. Since timing is now always done via the CUDA-event benchmark path, consider updating the help text to reflect the current behavior (e.g., “Disable L2 flush during timing”).

Suggested change
help="Disable L2 flush in benchmark mode.")
help="Disable L2 flush during timing.")

flush_l2=bool(flush_l2),
prep_fn=_prep_stage1,
)
torch.cuda.synchronize()

Copilot AI Apr 16, 2026


bench_kernel_us already synchronizes internally before returning. This extra torch.cuda.synchronize() is redundant and adds overhead; consider removing it (and the similar one after the stage2 timing call).

Suggested change
torch.cuda.synchronize()

Comment on lines 495 to 503
for k in range(target_k, 0, -wmma_k):
if inter_dim % k != 0:
# K-padding: run_moe_stage2 auto-pads inter_dim, so accept
# any tile_k that satisfies the load-mapping constraints.
total = tile_m * k
if total % 256 == 0 and (total // 256) % 4 == 0:
if tile_k2 is None:
tile_k2 = k
continue

Copilot AI Apr 16, 2026


bench_resolve_tiles can now select a tile_k2 even when inter_dim % tile_k2 != 0 (the new non-divisible branch sets tile_k2). Most run_moe_stage2 implementations in this repo (e.g. tests/kernels/test_moe_gemm.py) require inter_dim to be divisible by tile_k2 and will error out; in --bench mode this turns what used to be a clean SKIP into per-token FAILs. Consider only choosing tile_k2 values that divide inter_dim by default, and gate the non-divisible selection behind an explicit allow_k_padding flag / per-dtype capability for harnesses that actually pad K internally.
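The guarded selection this comment suggests could look like the sketch below. This is hypothetical illustration code, not the repo's `bench_resolve_tiles`: the function name, the `wmma_k`/`tile_m` defaults, and the `allow_k_padding` flag are assumptions; the `total % 256` constraint mirrors the quoted snippet above.

```python
def resolve_tile_k2(inter_dim, target_k, wmma_k=16, tile_m=32,
                    allow_k_padding=False):
    """Pick the largest tile_k <= target_k, stepping down by wmma_k.

    Divisible-by-default selection: a tile_k that does not evenly
    divide inter_dim is only accepted as a fallback when the caller
    explicitly opts into K-padding.
    """
    fallback = None
    for k in range(target_k, 0, -wmma_k):
        total = tile_m * k
        if total % 256 != 0 or (total // 256) % 4 != 0:
            continue  # violates the load-mapping constraints
        if inter_dim % k == 0:
            return k  # preferred: evenly divides inter_dim
        if allow_k_padding and fallback is None:
            fallback = k  # first padded candidate, kept only if opted in
    return fallback
```

With `allow_k_padding=False` a harness that cannot pad K gets `None` (and can SKIP cleanly) instead of a tile_k that will fail inside run_moe_stage2.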

flush_l2=bool(flush_l2),
prep_fn=_prep_stage1,
)
torch.cuda.synchronize()

Copilot AI Apr 16, 2026


bench_kernel_us already calls torch.cuda.synchronize() before returning the median latency. The extra torch.cuda.synchronize() here is redundant and adds overhead; removing it will make the harness slightly faster (same applies to the other _bench_kernel_us usage in this file).

Suggested change
torch.cuda.synchronize()

if arch.startswith("gfx10") or arch.startswith("gfx11"):
return True
if arch.startswith("gfx120"):
if arch.startswith("gfx12"):
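Taken together with the surrounding checks in the diff, the updated predicate reduces to a prefix test. A minimal sketch, assuming the real `is_rdna_arch` in python/flydsl/runtime/device.py has no other branches:

```python
def is_rdna_arch(arch: str) -> bool:
    # gfx10* and gfx11* were already matched; the fix broadens the
    # third check from gfx120* to gfx12* so that gfx1250 also matches.
    return arch.startswith(("gfx10", "gfx11", "gfx12"))
```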
Collaborator


Why is RDNA true for gfx1250?

