
Group all same-shape parameters into a single AsyncTask per shape group#28

Open
alint77 wants to merge 1 commit into microsoft:main from alint77:dev/megabatching

Conversation


@alint77 alint77 commented Feb 23, 2026

When profiling NorMuon with FSDP2 on a 12-layer/768-hidden model across 2 GPUs, I noticed that optimizer.step was dominated by GPU idle time between dozens of small, sequential communication rounds. Each world_size-sized batch of same-shape parameters triggered its own pair of all-to-all calls and Newton-Schulz iteration, resulting in ~36 separate rounds per step with visible gaps on the Chrome trace between each one.

This PR collapses all same-shape parameters into a single "mega-batch" per shape group, bringing that down to ~3 rounds (one per unique weight shape in a standard transformer).

What changed
Mega-batched communication (normuon_update_megabatch_async)
Instead of processing world_size matrices at a time through all-to-all → Newton-Schulz → all-to-all, we now:

1. Stack all N same-shape local shards into 3D tensors
2. Do one all-to-all to redistribute the full stack
3. Run Newton-Schulz on a [N/world_size, rows, cols] batch (already supported by the existing NS implementations, since they use dim=(-2,-1) norms and @ broadcasting)
4. Do one all-to-all back
The non-sharded (DDP-style) path gets the same treatment with all-gather, and single-GPU also benefits from batched NS.
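The batched Newton-Schulz call those steps rely on can be sketched as follows. This is a minimal illustration, not code from this PR: the function name is made up, and the quintic coefficients are the ones popularized by Muon. The point is that `dim=(-2, -1)` norms and `@` broadcasting let the same code run on a stacked `[N, rows, cols]` tensor with no per-matrix loop:

```python
import torch

torch.manual_seed(0)

def newton_schulz_batched(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a stacked [N, rows, cols] batch.

    The leading batch dimension rides along for free: the norm reduces
    over dim=(-2, -1) and the matmuls broadcast over leading dims.
    Coefficients are the quintic Newton-Schulz constants from Muon
    (illustrative, not taken from this repository).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    transposed = X.shape[-2] > X.shape[-1]
    if transposed:
        X = X.mT  # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.mT            # [N, rows, rows], one batched matmul
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.mT if transposed else X

# One 3D call replaces N separate 2D Newton-Schulz calls:
batch = torch.randn(8, 64, 128)
orth = newton_schulz_batched(batch)
```

Since each matrix in the stack is processed independently, the batched result matches running the iteration matrix by matrix; the win is purely in kernel count and occupancy.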

Stacked normalization kernel (normuon_normalization_stacked)
The original normuon_normalization operated on a List[Tensor] using torch.foreach* ops. With 48 attention weight matrices, that's 48-element foreach calls — each one launching separate kernels per tensor. The new version stacks everything into a single [N, rows, cols] tensor and does plain tensor ops, which torch.compile fuses into far fewer kernels.
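The foreach-vs-stacked distinction can be illustrated with a simple norm-based normalization (the real NorMuon normalization uses second-moment statistics; this sketch only demonstrates the dispatch pattern, and both function names are hypothetical):

```python
import torch

def normalization_foreach(tensors):
    """List-based pattern: torch._foreach_* ops over a List[Tensor].

    Each foreach call still dispatches work per tensor in the list,
    so a 48-element list means many small kernels.
    """
    norms = torch._foreach_norm(tensors)            # one norm per tensor
    norms = [n.clamp_min(1e-7) for n in norms]
    return torch._foreach_div(tensors, norms)

def normalization_stacked(stacked: torch.Tensor) -> torch.Tensor:
    """Stacked pattern: one [N, rows, cols] tensor, plain tensor ops
    that torch.compile can fuse into a handful of kernels."""
    norms = stacked.norm(dim=(-2, -1), keepdim=True).clamp_min(1e-7)
    return stacked / norms

ts = [torch.randn(4, 4) for _ in range(3)]
old = normalization_foreach(ts)
new = normalization_stacked(torch.stack(ts))
```

Both paths compute the same values; only the launch structure differs.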

Refactored _create_normuon_tasks
Extracted _get_shard_info as a helper method (was duplicated inline for every batch). The task creation now groups parameters by (shape, sharding, dtype) and yields one AsyncTask per group rather than per world_size-chunk.
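The grouping logic amounts to bucketing by a composite key. A hypothetical sketch (the helper name and the `is_sharded` predicate are illustrative, not this repository's API):

```python
from collections import defaultdict

import torch

def group_params(params, is_sharded):
    """Bucket parameters by (shape, sharding, dtype) so that each
    bucket can become one AsyncTask, instead of one task per
    world_size-sized chunk of parameters."""
    groups = defaultdict(list)
    for p in params:
        key = (tuple(p.shape), bool(is_sharded(p)), p.dtype)
        groups[key].append(p)
    return dict(groups)

# e.g. 12 attention-shaped and 12 MLP-shaped weights collapse to two keys:
params = [torch.empty(768, 768) for _ in range(12)] + \
         [torch.empty(3072, 768) for _ in range(12)]
groups = group_params(params, is_sharded=lambda p: False)
```

With one task per key, the number of communication rounds scales with the number of distinct shapes rather than with the number of parameters.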

Backward compatibility
- The old normuon_update_batch_async and normuon_normalization are still present and used for the batch-sharded 3D tensor edge case
- The async yield points are preserved at the same communication boundaries, so the AsyncRuntime overlap behavior is unchanged
- All existing tests pass
Expected impact
For a typical transformer with 3 distinct weight shapes:

- Communication rounds: O(num_params / world_size) → O(num_shapes), e.g. 36 → 3
- Kernel launches: foreach over N-element lists → single stacked tensor ops
- Newton-Schulz: N separate 2D calls → one batched 3D call with better occupancy
The actual speedup will depend on model size and world_size — larger models with more layers benefit more since there are more same-shape matrices to batch together.
BEFORE
[Screenshot 2026-02-23 at 21 13 34]
AFTER
[Screenshot 2026-02-23 at 21 09 59]

Also the loss curve matches perfectly.


alint77 commented Feb 23, 2026

| Metric | OLD | NEW | Speedup |
| --- | --- | --- | --- |
| Optimizer.step total GPU time | 128ms | 18ms | 7x |
| Optimizer.step CPU time per step | 78ms | 12ms | 6x |
| NCCL all-to-all calls/step | 100 | 6 | 17x fewer |
| Total NCCL events/step | 172 | 8 | 21x fewer |
| CPU step time | 194ms | 171ms | -12% |
| GPU step time | 164ms | 164ms | same |

