MoE prefill bf16 perf improvement for qwen-3.5-35B-A3B #18829
digantdesai merged 5 commits into main from
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18829
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
❌ 1 New Failure, 2 Unrelated Failures as of commit 40c6cc2 with merge base 87e65ac:
- NEW FAILURE: the following job has failed.
- BROKEN TRUNK: the following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
Force-pushed a0d199a to 63548f5 (Compare)
Force-pushed 63548f5 to 5055971 (Compare)
@claude review this PR.
Claude finished @digantdesai's task in 2m 33s - View job
PR Review: MoE prefill bf16 perf improvement for qwen-3.5-35B-A3B
This PR adds a batched tensor-core MoE kernel for prefill (
Overall Architecture
The approach is sound: use the existing
Pull request overview
This PR improves Qwen3.5 MoE prefill throughput by adding a token-sorted, tensor-core (tl.dot) batched MoE Triton kernel and wiring export to use it for the prefill method while keeping the existing vec-mat kernel for decode. It also adds a CUDA AOTI sort_stable shim (Thrust-based) to support the argsort/sort needed by token sorting during export/lowering.
Changes:
- Add a new Triton MoE variant (`fused_moe_batched_gemm`) with token sorting (`moe_align_block_size`) for compute-bound prefill.
- Toggle decode vs. prefill MoE kernel selection during Qwen3.5 export (`use_batched_moe` flag + export-side toggling).
- Add a CUDA runtime shim for the `aten::sort.stable` fallback, plus new CUDA tests and a benchmark script.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| examples/models/qwen3_5_moe/model.py | Adds use_batched_moe switch to route MoE to the new batched kernel. |
| examples/models/qwen3_5_moe/export.py | Exports decode with old kernel and prefill with batched kernel; updates prefill example shapes. |
| backends/cuda/triton/kernels/fused_moe.py | Implements token sorting helper and new batched tensor-core MoE kernels/wrappers; tweaks scale-load logic. |
| backends/cuda/triton/kernels/__init__.py | Exposes new MoE helpers/ops in kernel package exports. |
| backends/cuda/tests/test_sort_shim.py | Adds export-level tests ensuring sort/argsort can lower with the shim. |
| backends/cuda/tests/test_fused_moe.py | Adds correctness tests for batched MoE and for moe_align_block_size. |
| backends/cuda/runtime/shims/sort.h | Declares the AOTI CUDA sort stable shim API. |
| backends/cuda/runtime/shims/sort.cu | Implements Thrust-based stable sort shim (plus optional transpose-permute path). |
| backends/cuda/runtime/TARGETS | Adds sort shim sources/headers to the runtime build. |
| backends/cuda/cuda_backend.py | Registers at::_ops::sort_stable::call as a supported CUDA fallback kernel. |
| backends/cuda/benchmarks/benchmark_moe.py | Adds a benchmark harness for fused MoE kernels. |
| backends/cuda/CMakeLists.txt | Ensures the sort shim is built into aoti_cuda_shims when CUDA is available. |
Excerpt from backends/cuda/runtime/shims/sort.cu:

```cpp
#include <cuda.h>
#include <cuda_bf16.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>
#include <thrust/sort.h>

#include <executorch/backends/aoti/utils.h>
#include <executorch/backends/cuda/runtime/shims/memory.h>
#include <executorch/backends/cuda/runtime/shims/sort.h>
#include <executorch/backends/aoti/slim/cuda/guard.h>
#include <executorch/runtime/platform/log.h>

namespace executorch::backends::cuda {
```
sort.cu uses std::vector later (e.g., contig_strides) but doesn't include `<vector>`, which will fail to compile on toolchains that don't pull it in indirectly. Add the missing standard header include explicitly near the top of the file.
Excerpt from backends/cuda/triton/kernels/fused_moe.py:

```python
        # Load per-group scales and dequantize
        if BLOCK_SIZE_K <= group_size:
            # All K values in this tile share one scale group — load [1, N]
            group_idx = (BLOCK_SIZE_K * k_step) // group_size
            scale_ptrs = (
                B_scale
                + expert_id * stride_bse
                + offs_n[None, :] * stride_bsn
                + group_idx * stride_bsk
            )
            b_scale = tl.load(scale_ptrs, mask=n_mask[None, :], other=0.0).to(
                tl.float32
            )
        else:
```
The fast-path if BLOCK_SIZE_K <= group_size assumes the entire K tile falls within a single quantization group, but that’s only guaranteed when the tile does not straddle a group_size boundary (e.g., when group_size % BLOCK_SIZE_K == 0, or when (BLOCK_SIZE_K*k_step) % group_size + BLOCK_SIZE_K <= group_size). As written, some valid (K, group_size, BLOCK_SIZE_K) combos will load the wrong scale for part of the tile and silently produce incorrect results. Please tighten the condition or fall back to the per-element scale indexing in those cases.
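The boundary condition described above can be checked in plain Python. This is an illustrative sketch, not code from the PR; the function name is hypothetical, and it only demonstrates when a K tile shares a single quantization-group scale.

```python
# Illustrative check (not from the PR): a K tile [start, start + BLOCK_SIZE_K)
# shares one quantization-group scale only if it does not straddle a
# group_size boundary.
def tile_has_single_scale_group(k_step: int, block_size_k: int, group_size: int) -> bool:
    start = k_step * block_size_k
    end = start + block_size_k - 1  # last K index covered by this tile
    return start // group_size == end // group_size

# BLOCK_SIZE_K <= group_size is not sufficient on its own: with group_size=96
# and BLOCK_SIZE_K=64, the second tile covers K in [64, 128), which spans
# groups 0 and 1 even though 64 <= 96.
print(tile_has_single_scale_group(0, 64, 96))  # True: tile [0, 64) lies inside group 0
print(tile_has_single_scale_group(1, 64, 96))  # False: tile [64, 128) straddles groups 0/1

# When group_size % BLOCK_SIZE_K == 0, no tile can straddle a boundary.
assert all(tile_has_single_scale_group(s, 64, 128) for s in range(8))
```

This matches the reviewer's suggested fix: either enforce `group_size % BLOCK_SIZE_K == 0` (or an equivalent alignment check) or fall back to per-lane scale indexing.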
Excerpt from backends/cuda/triton/kernels/fused_moe.py:

```python
        if BLOCK_SIZE_K <= group_size:
            group_idx = (BLOCK_SIZE_K * k_step) // group_size
            scale_ptrs = (
                B_scale
                + expert_id * stride_bse
                + offs_n[None, :] * stride_bsn
                + group_idx * stride_bsk
            )
            b_scale = tl.load(scale_ptrs, mask=n_mask[None, :], other=0.0).to(
                tl.float32
            )
        else:
```
Same issue as above: the if BLOCK_SIZE_K <= group_size scale-loading fast-path is only correct if the K tile cannot cross a quantization-group boundary. For non-multiple group sizes (or misaligned tiles), this can load an incorrect scale for some K lanes and produce wrong outputs. Please either enforce the alignment constraint (with a clear error) or use the general per-lane scale indexing path.
Excerpt from backends/cuda/triton/kernels/fused_moe.py:

```python
    # To avoid argsort (needs sort_stable fallback in AOTI), we use a
    # scatter-based approach:
    # For each pair i, within_expert_rank[i] = sum_{j<i} (flat_ids[j]==e)
    # This is a segmented exclusive prefix sum, computed as:
    # within_expert_rank[i] = cumcount_of_expert[flat_ids[i]] before i
    # We iterate in pair order, incrementing per-expert counters.
    # Since this is a sequential scan, we implement it as a cumsum trick:
    # For each expert e, the pairs assigned to it appear at positions
    # where flat_ids == e. Their within-expert ranks are 0, 1, 2, ...
    # in the order they appear.
    #
    # Vectorized: use argsort on (flat_ids * num_pairs + arange) which
    # gives stable expert-grouped ordering — same result as argsort(flat_ids).
    # The multiplication ensures expert grouping; adding arange breaks ties
    # by original order (equivalent to stable sort).
    sort_keys = flat_ids * num_pairs + torch.arange(
        num_pairs, device=device, dtype=torch.int64
    )
    sorted_order = sort_keys.argsort()  # no `stable` kwarg needed — keys are unique
```
The comment says this avoids argsort due to sort_stable/AOTI constraints, but the implementation still calls argsort() (on sort_keys). If the intent is “stable sort not required because keys are unique”, please rewrite the comment to match the code; otherwise this is misleading for future maintenance/debugging (especially now that a sort shim is being introduced).
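Whichever way the comment is reworded, the underlying trick is easy to verify in plain Python. This sketch (using plain lists instead of tensors) shows why stability is not required when the keys are unique:

```python
# Sketch of the composite-key trick from the diff, in plain Python.
# The keys flat_ids[i] * num_pairs + i are all distinct, so even an unstable
# sort of the keys yields the same ordering as a stable sort of flat_ids
# alone: grouped by expert, original order preserved within each expert.
flat_ids = [2, 0, 2, 1, 0]  # expert id per (token, top-k) pair
num_pairs = len(flat_ids)

sort_keys = [e * num_pairs + i for i, e in enumerate(flat_ids)]
sorted_order = sorted(range(num_pairs), key=lambda i: sort_keys[i])

print(sorted_order)                         # [1, 4, 3, 0, 2]
print([flat_ids[i] for i in sorted_order])  # [0, 0, 1, 2, 2]
```

Within each expert group, the pairs appear in their original order, which is exactly what the within-expert rank computation needs.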
Excerpt from backends/cuda/triton/kernels/fused_moe.py:

```python
    topk_ids: torch.Tensor,
    block_size: int,
    num_experts: int,
) -> tuple[torch.Tensor, torch.Tensor, int]:
```
The return type annotation says the third return value is an int, but moe_align_block_size() actually returns num_tokens_post_padded as a scalar int64 tensor (expert_offsets[num_experts]). Please update the type hint to torch.Tensor (or convert to Python int and adjust the implementation accordingly) to avoid confusing callers and static analysis.
Suggested change:

```diff
-) -> tuple[torch.Tensor, torch.Tensor, int]:
+) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
```
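For reference, here is a hypothetical pure-Python sketch of what `moe_align_block_size` appears to compute, based on the PR description and this discussion (the exact kernel semantics are an assumption): group the flattened top-k expert assignments by expert, pad each expert's run to a multiple of `block_size`, and report the padded total that the third return value carries.

```python
# Hypothetical reference (not the PR's implementation) for moe_align_block_size:
# returns sorted pair indices with padding, per-expert offsets, and the padded
# total length that the disputed third return value represents.
def moe_align_block_size_ref(flat_expert_ids, block_size, num_experts):
    PAD = -1  # sentinel for padding slots
    buckets = [[] for _ in range(num_experts)]
    for pair_idx, e in enumerate(flat_expert_ids):
        buckets[e].append(pair_idx)

    sorted_ids, expert_offsets = [], [0]
    for bucket in buckets:
        padded = -(-len(bucket) // block_size) * block_size  # ceil to block multiple
        sorted_ids.extend(bucket + [PAD] * (padded - len(bucket)))
        expert_offsets.append(len(sorted_ids))

    num_tokens_post_padded = len(sorted_ids)
    return sorted_ids, expert_offsets, num_tokens_post_padded

ids, offsets, total = moe_align_block_size_ref(
    [0, 0, 1, 2, 2, 2], block_size=4, num_experts=3
)
print(total)    # 12: each of the three experts is padded up to one 4-wide block
print(offsets)  # [0, 4, 8, 12]
```

In the actual kernel the padded total is produced on device (as `expert_offsets[num_experts]`), which is why it comes back as a scalar tensor rather than a Python int.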
This seems like a valid comment; any reason why? How can you be sure about the correctness of the new batched version?
mergennachin left a comment
Please see comments
Excerpt from backends/cuda/benchmarks/benchmark_moe.py:

```python
# LICENSE file in the root directory of this source tree.

"""
Benchmark the Triton fused MoE kernel against eager and torch.compile baselines.
```
Do we need this file to be checked in? I'd rather not check it in if it's not being exercised by CI.
Yeah, this can be run by CI or removed.
Also, it would be good to consistently reproduce this number in CI. We could change the prompt length to 512 so that it is comparable to other frameworks.
Inductor emits aten::sort.stable for ops like argsort, but lacks a native c-shim for it. This adds a thrust-based implementation (aoti_torch_cuda_sort_stable) that handles int64, int32, and float32 dtypes on contiguous innermost-dim tensors. Registered as a supported fallback kernel in CudaBackend so AOTI-compiled models can use sort. This PR was authored with the assistance of Claude.
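The contract the shim must honor can be illustrated in plain Python (this is a semantic sketch, not the Thrust implementation): `sort.stable` returns both sorted values and the permutation indices, and equal keys keep their original relative order.

```python
# Plain-Python illustration of the stable-sort contract behind
# aten::sort.stable: equal keys keep their original relative order, and both
# the sorted values and the permutation indices are returned.
def sort_stable_ref(values):
    # Python's built-in sort is stable, so ties preserve original positions.
    indices = sorted(range(len(values)), key=lambda i: values[i])
    return [values[i] for i in indices], indices

vals, idx = sort_stable_ref([3, 1, 3, 1])
print(vals)  # [1, 1, 3, 3]
print(idx)   # [1, 3, 0, 2]: the two 1s (and the two 3s) keep original order
```

On device, Thrust's `thrust::stable_sort_by_key` provides the same guarantee for the (values, indices) pair.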
Sweeps prompt lengths [1..4095] with Qwen3.5-35B-A3B shapes (256 experts, top-8, INT4 W4A16). Validates correctness against loop-based eager reference at small M, benchmarks vectorized eager, torch.compile, and Triton fused_moe. Handles OOM gracefully at large M where eager/compile dequantize all experts. This PR was authored with the assistance of Claude.
When the Triton tile size fits within a single quantization group, load one scale per N-element instead of per (K, N) element. Reduces scale memory traffic in both GEMM1 and GEMM2 vec-mat kernels. This PR was authored with the assistance of Claude.
Adds a batched (M>1) Triton fused MoE kernel using tensor-core mma instructions for prefill workloads. Includes moe_align_block_size for token-expert sorting and scale broadcast optimization in the batched GEMM inner loops. Weight layout: [E, N, K//2] (packed INT4). This PR was authored with the assistance of Claude.
Add use_batched_moe flag on FusedMoEExperts, toggled by _set_batched_moe in export.py before each method's torch.export call. Decode (T=1) uses the vec-mat fused_moe kernel; prefill (T>=2) uses fused_moe_batched_gemm. This PR was authored with the assistance of Claude.
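The toggle-before-export pattern described above can be sketched minimally. The class and helper below are stand-ins (no torch dependency; the real `FusedMoEExperts` and `_set_batched_moe` live in model.py/export.py), showing only the control flow: flip the flag on every MoE module before tracing each method.

```python
# Minimal sketch (hypothetical stand-in classes) of the toggle-before-export
# pattern: decode traces the vec-mat kernel, prefill the batched GEMM kernel.
class FusedMoEExperts:
    def __init__(self):
        self.use_batched_moe = False

    def forward(self, x):
        # In the real model this dispatches to a Triton custom op; here we
        # just report which kernel would be traced.
        return "fused_moe_batched_gemm" if self.use_batched_moe else "fused_moe"

def _set_batched_moe(modules, enabled: bool):
    # Flip the flag on every MoE module before the method's torch.export call.
    for m in modules:
        if isinstance(m, FusedMoEExperts):
            m.use_batched_moe = enabled

experts = [FusedMoEExperts(), FusedMoEExperts()]

_set_batched_moe(experts, False)  # decode (T == 1): vec-mat kernel
assert experts[0].forward(None) == "fused_moe"

_set_batched_moe(experts, True)   # prefill (T >= 2): batched tensor-core kernel
assert experts[1].forward(None) == "fused_moe_batched_gemm"
```

Because the flag is read at trace time, each exported method bakes in exactly one kernel path, with no runtime branching in the lowered graph.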
Force-pushed 5055971 to 40c6cc2 (Compare)
Songhao has a PR for updating the CI setup.
Just an oversight. I did update the benchmark script but not this - let me update.
Fixes in #18934