feat: Add HybridEP support for MoE expert parallelism #1942
Conversation
Walkthrough

The changes add two conditional configuration hooks to the MoE setup function for optional dispatcher and HybridEP parameters, and update the DeepEP dependency reference from a specific commit to the hybrid-ep branch across multiple dependency groups in the project configuration.
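The "conditional configuration hooks" described above can be sketched as follows. This is a hedged illustration of the pattern, not the actual `setup.py` code: the function name `apply_optional_moe_overrides` and the dict-based plumbing are assumptions; only the two config keys come from the PR.

```python
# Hypothetical sketch of conditional MoE config hooks: the two new keys are
# forwarded to the Megatron MoE config only when the user actually sets them,
# so existing configs without these keys behave exactly as before.
def apply_optional_moe_overrides(megatron_cfg: dict, user_cfg: dict) -> dict:
    # Backend for the flex token dispatcher (e.g. "hybridep").
    if user_cfg.get("moe_flex_dispatcher_backend") is not None:
        megatron_cfg["moe_flex_dispatcher_backend"] = user_cfg[
            "moe_flex_dispatcher_backend"
        ]
    # Number of SMs reserved for HybridEP communication kernels.
    if user_cfg.get("moe_hybridep_num_sms") is not None:
        megatron_cfg["moe_hybridep_num_sms"] = user_cfg["moe_hybridep_num_sms"]
    return megatron_cfg
```

The guard on `None` keeps the hooks strictly opt-in, which is why the walkthrough rates the change as trivial.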
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pyproject.toml (1)
324-327: ⚠️ Potential issue | 🟠 Major

Stale dependency-metadata version for deep_ep.

The dependency is pinned to the hybrid-ep branch (which dynamically generates its version from the current commit hash via git rev-parse --short HEAD), but the dependency-metadata version is statically set to v1.2.1+bfded34. This means the metadata version will become stale whenever the branch advances, potentially causing uv resolver failures. Either:
- Pin to a specific commit hash instead of a branch, or
- Update the metadata version to match the current HEAD of hybrid-ep and regenerate it whenever the dependency updates
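The commit-pin option could be expressed roughly as below. This is a sketch: the table layout in the actual pyproject.toml may differ, and the hash shown is the 7febc6e2 commit already referenced elsewhere in this PR, used here purely as an example.

```toml
# Pin deep_ep to a fixed commit on the hybrid-ep branch instead of the
# moving branch ref, so the dependency-metadata version cannot go stale.
[tool.uv.sources]
deep_ep = { git = "https://github.com/deepseek-ai/DeepEP", rev = "7febc6e2" }
```

With a fixed `rev`, the statically declared metadata version only needs regenerating when the pin itself is bumped.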
🤖 Fix all issues with AI agents
In `nemo_rl/models/megatron/setup.py`:
- Around lines 405-412: the new runtime keys moe_flex_dispatcher_backend and moe_hybridep_num_sms are missing from the MegatronConfig TypedDict and from the example configs.
- Add both to the MegatronConfig definition in nemo_rl/models/policy/__init__.py as NotRequired entries (use the exact symbol name MegatronConfig), with short docstrings: "Backend type for MoE flex dispatcher (HybridEP)" for moe_flex_dispatcher_backend and "Number of SMs for HybridEP" for moe_hybridep_num_sms.
- Update at least one exemplar YAML in examples/configs (e.g., a megatron MoE config) to include these keys with sensible defaults so they are documented and visible to users.
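The requested TypedDict additions might look like this. It is a sketch only: the real MegatronConfig in nemo_rl/models/policy/__init__.py has many more fields, elided here.

```python
# NotRequired lives in typing on Python 3.11+, typing_extensions before that.
try:
    from typing import NotRequired, TypedDict
except ImportError:
    from typing_extensions import NotRequired, TypedDict


class MegatronConfig(TypedDict):
    # ... many existing fields elided in this sketch ...
    moe_flex_dispatcher_backend: NotRequired[str]
    """Backend type for MoE flex dispatcher (HybridEP)."""
    moe_hybridep_num_sms: NotRequired[int]
    """Number of SMs for HybridEP."""
```

NotRequired keeps both keys optional, matching the conditional hooks in setup.py that only forward them when set.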
🧹 Nitpick comments (1)
pyproject.toml (1)
70-72: Branch ref instead of pinned commit reduces build reproducibility.

All three dependency groups now point to @hybrid-ep (a moving branch) instead of a fixed commit hash. This means builds are not reproducible: a force-push or new commit on that branch silently changes what gets installed. Consider pinning to a specific commit on the hybrid-ep branch once it stabilizes.
9acbdab to 7cc2e65
- Add deep_ep aarch64 dependency (7febc6e2, hybrid-ep branch)
- Add HybridEP setup in megatron setup.py (IMEX env vars, NVLink domain config)
- Add Qwen3-30B-A3B 4n4g config with moe_flex_dispatcher_backend=hybridep
- Add EP=4, EP=8, sms16 config variants for ablation testing
- Add test scripts for EP variants
- Update Dockerfile for aarch64 deep_ep build support
- Add HybridEP settings to 235B and DeepSeek-V3 performance configs

Signed-off-by: sna <sna@nvidia.com>
7cc2e65 to 6c0cd7e
Resolve conflicts: keep aarch64 deep_ep split, adopt main's flashinfer/cutlass/emerging-optimizers updates. Signed-off-by: sna <sna@nvidia.com>
❌ Submodule Fast-Forward Check Failed

Check based on commit: 0349e83 (PR #1942)

Submodules that need attention:
- Megatron-Bridge: ❌ PR branch is BEHIND main branch
- Megatron-LM: ❌ PR branch is BEHIND main branch

Please ensure all submodule commits are fast-forwards of the main branch before merging.
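The fast-forward condition this check enforces can be verified locally with plain git: the submodule pin on main must be an ancestor of the PR's pin. A sketch on a throwaway repo (the bot's actual implementation is not shown in this PR):

```shell
# A submodule bump fast-forwards main iff main's pinned commit is an
# ancestor of the PR's pinned commit. Demonstrated on a temporary repo.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=a@b -c user.name=t commit -q --allow-empty -m "main pin"
main_pin=$(git rev-parse HEAD)
git -c user.email=a@b -c user.name=t commit -q --allow-empty -m "pr pin"
pr_pin=$(git rev-parse HEAD)
# Exit status 0 means the PR pin is a fast-forward of main's pin.
if git merge-base --is-ancestor "$main_pin" "$pr_pin"; then
  echo "fast-forward: ok"
else
  echo "PR branch is BEHIND main branch"
fi
```

Running the same check with the arguments swapped reproduces the "BEHIND main branch" failure reported above.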
Signed-off-by: sna <sna@nvidia.com>
❌ Submodule Fast-Forward Check Failed

Check based on commit: ec53cb6 (PR #1942)

Submodules that need attention:
- Megatron-Bridge: ❌ PR branch is BEHIND main branch
- Megatron-LM: ❌ PR branch is BEHIND main branch

Please ensure all submodule commits are fast-forwards of the main branch before merging.
Signed-off-by: sna <sna@nvidia.com>
…divisible_by Signed-off-by: sna <sna@nvidia.com>
Review: PR #1942 — Add HybridEP support for MoE expert parallelism
Overall this looks good — all prior review feedback has been addressed (conda removal, CUDA_HOME move, platform markers, cleanup). One minor duplicate line to fix.
Generated by Claude Code
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Seonjin <sna@nvidia.com>
Offline testing update:
Signed-off-by: sna <sna@nvidia.com>
/ok to test e858ed6
Signed-off-by: Seonjin Na <sna@nvidia.com>
/ok to test f832943
Signed-off-by: Seonjin Na <sna@nvidia.com>
Signed-off-by: sna <sna@nvidia.com>
/ok to test ee42b36
/ok to test f45703a |
Two new variants isolate the HybridEP dispatcher effect, following the PR #1942 pattern on GB200 NVL72: flex dispatcher + hybridep backend + 16 SMs, EP=8, no load balancing, no high-priority stream.

- hpstream_04_hybridep_nopack: HybridEP only, sequence packing off.
- hpstream_05_hybridep_pack: HybridEP only, sequence packing on.

These pair with hpstream_03 (HybridEP + seqpack + bias-update LB + HP EP stream) to decompose the contribution of each feature on top of raw HybridEP.

Signed-off-by: sna <sna@nvidia.com>
Applies runtime patches to megatron.core.transformer.moe.fused_a2a to fix:
1. NCCL allgather hang (enable_custom_allgather=True)
2. Buffer overflow on seq_len growth (reinit when seq_len > max)
3. Dirty buffer after backward (needs_reset flag)
4. HybridEPDispatch.backward return count (10 -> 9)
5. HybridEPCombine.backward return count (5 -> 4)

Applied in setup.py when moe_flex_dispatcher_backend=hybridep.

Signed-off-by: sna <sna@nvidia.com>
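Runtime patching of a module attribute, as this commit describes for fused_a2a, generally follows the pattern below. The module and function names here are stand-ins, not the actual Megatron code; the point is the mechanism of installing a replacement only when the hybridep backend is selected.

```python
import types

# Stand-in for megatron.core.transformer.moe.fused_a2a: a module whose
# function we want to replace at runtime when hybridep is selected.
fused_a2a = types.ModuleType("fused_a2a")
fused_a2a.dispatch = lambda tokens: ("orig", tokens)


def patch_fused_a2a(module, backend: str) -> None:
    """Install the patched dispatch only when the hybridep backend is active."""
    if backend != "hybridep":
        return
    original = module.dispatch

    def patched_dispatch(tokens):
        # A real patch would e.g. reset dirty buffers or resize them here
        # before delegating to the original implementation.
        return ("patched", original(tokens))

    module.dispatch = patched_dispatch


patch_fused_a2a(fused_a2a, "hybridep")
```

Because the original function is captured in a closure, the patch wraps rather than discards the upstream behavior, which is what makes this style of fix safe to apply conditionally from setup.py.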
This reverts commit 58b048d.
Commit a48493 fixes the buffer reallocation and backward-state cleanup in deep_ep itself. This eliminates the runtime monkey-patches on fused_a2a.py that were previously required with 7febc6e2, and improves Training/LogProb throughput by ~1-2% on Qwen3-30B-A3B and ~7-10% on Qwen3-235B.

Signed-off-by: sna <sna@nvidia.com>
Usage in config:

```
policy.megatron_cfg.moe_token_dispatcher_type=flex
policy.megatron_cfg.moe_flex_dispatcher_backend=hybridep
policy.megatron_cfg.moe_hybridep_num_sms=32
```

See: https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep
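The same settings expressed in a YAML config file might look like this; the nesting is inferred from the dotted override paths above, and the surrounding keys of a full config are elided:

```yaml
policy:
  megatron_cfg:
    moe_token_dispatcher_type: flex
    moe_flex_dispatcher_backend: hybridep
    moe_hybridep_num_sms: 32
```

Setting moe_token_dispatcher_type to flex is what makes the backend selection take effect; the two hybridep keys are otherwise ignored.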