Unify MoE test perf timing to CUDA Event bench and fix minor issues#408
Conversation
- Remove run_perftest path from mxscale/wmma test harnesses, keep only _bench_kernel_us (CUDA Event per-iteration timing with L2 flush and IQR outlier filtering) for consistent and accurate benchmarking
- Remove test_graph/benchmark_mode parameters and the CUDA Graph parametrize from pytest tests, since they were only used by the removed perftest path
- Clean up the __main__ CLI in the wmma test to remove the --test_graph/--benchmark args
- Fix is_rdna_arch() to match gfx12* (not just gfx120*) so gfx1250 is detected
- Fix the bench_resolve_tiles() tile_k fallback for non-divisible dimensions

Made-with: Cursor
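The PR text does not include the timer itself, but the IQR outlier filtering it mentions can be sketched in isolation. The helper below is illustrative only (the name and exact policy of `_bench_kernel_us` in the harness may differ): it drops per-iteration samples outside the 1.5×IQR fences and returns the median of what survives.

```python
import statistics

def iqr_filtered_median_us(samples_us):
    """Median of per-iteration timings after 1.5*IQR outlier rejection.

    Samples outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (e.g. iterations hit by
    clock ramp-up or an interrupt) are discarded before taking the median.
    """
    if len(samples_us) < 4:
        return statistics.median(samples_us)
    q1, _, q3 = statistics.quantiles(samples_us, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [s for s in samples_us if lo <= s <= hi]
    # If the fences reject everything (degenerate data), fall back to all samples.
    return statistics.median(kept or samples_us)
```

With nine 10 µs iterations and one 1000 µs straggler, the straggler falls outside the fences and the reported latency stays at 10 µs, which is the point of per-iteration (rather than whole-loop) event timing.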
Pull request overview
This PR standardizes MoE 2-stage kernel timing in the gfx1250 test harnesses by removing the legacy run_perftest/CUDA-graph benchmarking path and using a single CUDA-event-based per-iteration timer, while also fixing a couple of small architecture/tiling utilities.
Changes:
- Remove the run_perftest/test_graph/benchmark_mode paths from the WMMA and MXScale MoE gfx1250 harnesses and always time with bench_kernel_us.
- Update the tile resolution logic in bench_resolve_tiles and broaden RDNA arch detection from gfx120* to gfx12*.
- Simplify the WMMA harness CLI by removing the obsolete --benchmark/--test_graph options.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/kernels/test_moe_gemm_wmma_gfx1250.py | Removes graph/perftest timing paths and CLI args; uses CUDA-event benchmark timer consistently. |
| tests/kernels/test_moe_gemm_mxscale_gfx1250.py | Removes graph/perftest timing paths; uses CUDA-event benchmark timer consistently. |
| tests/kernels/benchmark_common.py | Adjusts tile selection behavior for stage2 tile_k resolution. |
| python/flydsl/runtime/device.py | Updates RDNA arch detection to treat all gfx12* as RDNA. |
```python
parser.add_argument("--benchmark", action="store_true", default=False,
                    help="Use GEMM-style per-iteration event timing with optional L2 flush.")
parser.add_argument("--no_flush_l2", action="store_true", default=False,
                    help="Disable L2 flush in benchmark mode.")
```
The --benchmark flag was removed, but the --no_flush_l2 help text still refers to “benchmark mode”. Since timing is now always done via the CUDA-event benchmark path, consider updating the help text to reflect the current behavior (e.g., “Disable L2 flush during timing”).
Suggested change:
```diff
-help="Disable L2 flush in benchmark mode.")
+help="Disable L2 flush during timing.")
```
```python
    flush_l2=bool(flush_l2),
    prep_fn=_prep_stage1,
)
torch.cuda.synchronize()
```
bench_kernel_us already synchronizes internally before returning. This extra torch.cuda.synchronize() is redundant and adds overhead; consider removing it (and the similar one after the stage2 timing call).
Suggested change:
```diff
-torch.cuda.synchronize()
```
```python
for k in range(target_k, 0, -wmma_k):
    if inter_dim % k != 0:
        # K-padding: run_moe_stage2 auto-pads inter_dim, so accept
        # any tile_k that satisfies the load-mapping constraints.
        total = tile_m * k
        if total % 256 == 0 and (total // 256) % 4 == 0:
            if tile_k2 is None:
                tile_k2 = k
        continue
```
bench_resolve_tiles can now select a tile_k2 even when inter_dim % tile_k2 != 0 (the new non-divisible branch sets tile_k2). Most run_moe_stage2 implementations in this repo (e.g. tests/kernels/test_moe_gemm.py) require inter_dim to be divisible by tile_k2 and will error out; in --bench mode this turns what used to be a clean SKIP into per-token FAILs. Consider only choosing tile_k2 values that divide inter_dim by default, and gate the non-divisible selection behind an explicit allow_k_padding flag / per-dtype capability for harnesses that actually pad K internally.
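The reviewer's suggestion can be sketched as a divisible-first search with K-padded candidates gated behind an explicit flag. The function and flag names below are illustrative, not the repo's actual API; the load-mapping constraint is the one quoted in the hunk above.

```python
def resolve_tile_k2(target_k, wmma_k, tile_m, inter_dim, allow_k_padding=False):
    """Pick the largest tile_k2 <= target_k, stepping down by wmma_k.

    Tile sizes that divide inter_dim are always preferred. Non-divisible
    sizes (which require the kernel to pad K internally) are considered
    only when allow_k_padding is set, and only if they satisfy the
    load-mapping constraint. Returns None when nothing fits, so the
    caller can SKIP instead of failing per token.
    """
    padded_candidate = None
    for k in range(target_k, 0, -wmma_k):
        if inter_dim % k == 0:
            return k  # clean divisor: safe for all run_moe_stage2 variants
        if allow_k_padding and padded_candidate is None:
            total = tile_m * k
            if total % 256 == 0 and (total // 256) % 4 == 0:
                padded_candidate = k  # remember the largest padded option
    return padded_candidate
```

A harness whose stage2 kernel pads K would pass allow_k_padding=True; all others keep the default and fall back to a SKIP when the search returns None.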
```python
    flush_l2=bool(flush_l2),
    prep_fn=_prep_stage1,
)
torch.cuda.synchronize()
```
bench_kernel_us already calls torch.cuda.synchronize() before returning the median latency. The extra torch.cuda.synchronize() here is redundant and adds overhead; removing it will make the harness slightly faster (same applies to the other _bench_kernel_us usage in this file).
Suggested change:
```diff
-torch.cuda.synchronize()
```
```diff
 if arch.startswith("gfx10") or arch.startswith("gfx11"):
     return True
-if arch.startswith("gfx120"):
+if arch.startswith("gfx12"):
```
why rdna true for 1250?
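For context, the check reduces to simple prefix matching, and the question above comes down to string prefixes: "gfx1250" starts with "gfx12" but not with "gfx120", so the old check excluded it. A minimal standalone version of the fixed predicate (hedged: the real function in python/flydsl/runtime/device.py may carry additional cases):

```python
def is_rdna_arch(arch: str) -> bool:
    """Treat all gfx10*, gfx11*, and gfx12* targets as RDNA.

    The previous gfx120* prefix matched gfx1200/gfx1201 but missed
    gfx1250; broadening to gfx12* covers the whole family.
    """
    return arch.startswith(("gfx10", "gfx11", "gfx12"))
```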
Made-with: Cursor
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist