
feat: Add HybridEP support for MoE expert parallelism#1942

Open
seonjinn wants to merge 21 commits into main from sj/hybridep-support

Conversation

@seonjinn
Contributor

@seonjinn seonjinn commented Feb 13, 2026

  • Update DeepEP dependency to hybrid-ep branch for HybridEP support
    • automodel, vllm, mcore dependency groups updated
  • Add HybridEP configuration options in _apply_moe_config():
    • moe_flex_dispatcher_backend: Flex dispatcher backend (e.g., 'hybridep')
    • moe_hybridep_num_sms: Number of SMs for HybridEP operations

Usage in config:

```
policy.megatron_cfg.moe_token_dispatcher_type=flex
policy.megatron_cfg.moe_flex_dispatcher_backend=hybridep
policy.megatron_cfg.moe_hybridep_num_sms=32
```

See: https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added configuration support for additional Mixture of Experts (MoE) model parameters, including dispatcher backend and HybridEP settings.
  • Dependencies

    • Updated DeepEP dependency reference to use the hybrid-ep branch across multiple dependency groups.

@seonjinn seonjinn requested review from a team as code owners February 13, 2026 19:37
@coderabbitai
Contributor

coderabbitai bot commented Feb 13, 2026

Walkthrough

The changes add two conditional configuration hooks to the MoE setup function for optional dispatcher and HybridEP parameters, and update the DeepEP dependency reference from a specific commit to the hybrid-ep branch across multiple dependency groups in the project configuration.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| MoE Configuration Hooks: `nemo_rl/models/megatron/setup.py` | Adds conditional assignments for `moe_flex_dispatcher_backend` and `moe_hybridep_num_sms` configuration parameters in the `_apply_moe_config` function. |
| Dependency Updates: `pyproject.toml` | Updates the DeepEP git dependency reference from commit `bfded34800dfec415b71503f8205181de90b2480` to the `hybrid-ep` branch across the `automodel`, `vllm`, and `mcore` dependency groups, with explanatory comments. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pyproject.toml (1)

324-327: ⚠️ Potential issue | 🟠 Major

Stale dependency-metadata version for deep_ep.

The dependency is pinned to the hybrid-ep branch (which dynamically generates its version from the current commit hash via git rev-parse --short HEAD), but the dependency-metadata version is statically set to v1.2.1+bfded34. This means the metadata version will become stale whenever the branch advances, potentially causing uv resolver failures.

Either:

  1. Update to pin to a specific commit hash instead of a branch, or
  2. Update the metadata version to match the current HEAD of hybrid-ep and regenerate it whenever the dependency updates
🤖 Fix all issues with AI agents
In `nemo_rl/models/megatron/setup.py`:
- Around line 405-412: The new runtime keys moe_flex_dispatcher_backend and
moe_hybridep_num_sms are missing from the MegatronConfig TypedDict and from
example configs; add both to the MegatronConfig definition in
nemo_rl/models/policy/__init__.py as NotRequired entries (use the exact symbol
name MegatronConfig) with short docstrings: "Backend type for MoE flex
dispatcher (HybridEP)" for moe_flex_dispatcher_backend and "Number of SMs for
HybridEP" for moe_hybridep_num_sms, and then update at least one exemplar YAML
in examples/configs (e.g., a megatron MoE config) to include these keys with
sensible defaults (recommended defaults) so they are documented and visible to
users.
🧹 Nitpick comments (1)
pyproject.toml (1)

70-72: Branch ref instead of pinned commit reduces build reproducibility.

All three dependency groups now point to @hybrid-ep (a moving branch) instead of a fixed commit hash. This means builds are not reproducible — a force-push or new commit on that branch silently changes what gets installed. Consider pinning to a specific commit on the hybrid-ep branch once it stabilizes.
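The pinning the reviewer suggests would look roughly like this (a hypothetical `pyproject.toml` fragment — the group name and placeholder commit are illustrative, only the repository URL and branch name come from the PR):

```toml
# Before: a moving branch ref — what gets installed changes as the branch moves.
# mcore = ["deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@hybrid-ep"]

# After: pin to an exact commit on the hybrid-ep branch for reproducible builds.
mcore = [
    "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@<commit-sha-on-hybrid-ep>",
]
```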

Comment thread nemo_rl/models/megatron/setup.py
@seonjinn seonjinn requested a review from guyueh1 February 13, 2026 19:42
@seonjinn seonjinn requested review from a team as code owners February 13, 2026 23:26
Comment thread pyproject.toml Outdated
Comment thread pyproject.toml Outdated
Comment thread pyproject.toml Outdated
@seonjinn seonjinn added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Mar 4, 2026
@seonjinn seonjinn requested a review from terrykong March 4, 2026 01:59
@guyueh1 guyueh1 added the Performance Related to improving performance label Mar 5, 2026
@seonjinn seonjinn changed the title feat: Add HybridEP support for MoE expert parallelism feat: Add HybridEP/Partial CudaGraph support for MoE expert parallelism Mar 5, 2026
Comment thread ray.sub Outdated
Collaborator

@terrykong terrykong left a comment


mostly lgtm @seonjinn, but the main thing i'm concerned about is the introduction of conda; we can continue discussing on the thread I started

@anwithk anwithk added this to the v0.6 Release milestone Mar 20, 2026
@copy-pr-bot

copy-pr-bot bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread ray.sub Outdated
Comment thread ray.sub Outdated
@guyueh1 guyueh1 changed the title feat: Add HybridEP/Partial CudaGraph support for MoE expert parallelism feat: Add HybridEP support for MoE expert parallelism Apr 8, 2026
@seonjinn seonjinn requested a review from a team as a code owner April 9, 2026 20:43
@seonjinn seonjinn force-pushed the sj/hybridep-support branch 2 times, most recently from 9acbdab to 7cc2e65 on April 13, 2026 01:23
- Add deep_ep aarch64 dependency (7febc6e2, hybrid-ep branch)
- Add HybridEP setup in megatron setup.py (IMEX env vars, NVLink domain config)
- Add Qwen3-30B-A3B 4n4g config with moe_flex_dispatcher_backend=hybridep
- Add EP=4, EP=8, sms16 config variants for ablation testing
- Add test scripts for EP variants
- Update Dockerfile for aarch64 deep_ep build support
- Add HybridEP settings to 235B and DeepSeek-V3 performance configs

Signed-off-by: sna <sna@nvidia.com>
@seonjinn seonjinn force-pushed the sj/hybridep-support branch from 7cc2e65 to 6c0cd7e on April 13, 2026 01:26
Resolve conflicts: keep aarch64 deep_ep split, adopt main's
flashinfer/cutlass/emerging-optimizers updates.

Signed-off-by: sna <sna@nvidia.com>
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 0349e83 (PR #1942 from sj/hybridep-support)

❌ Submodules that need attention:

Megatron-Bridge: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA-NeMo/Megatron-Bridge/commits/7110a964272a5c74dcb6b680b691087e190c220c/
CURRENT (PR #1942 from sj/hybridep-support): https://github.com/NVIDIA-NeMo/Megatron-Bridge/commits/a2bb70b91b827bd6b085a77442c7cf60cfdb59fe/

Megatron-LM: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA/Megatron-LM/commits/17a67b9a97fb11a75933fd7f76ad76e1ac98a53d/
CURRENT (PR #1942 from sj/hybridep-support): https://github.com/NVIDIA/Megatron-LM/commits/9e2810417315a7ee93b41d4e234454abd3c16af5/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: sna <sna@nvidia.com>
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: ec53cb6 (PR #1942 from sj/hybridep-support)

❌ Submodules that need attention:

Megatron-Bridge: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA-NeMo/Megatron-Bridge/commits/7110a964272a5c74dcb6b680b691087e190c220c/
CURRENT (PR #1942 from sj/hybridep-support): https://github.com/NVIDIA-NeMo/Megatron-Bridge/commits/a2bb70b91b827bd6b085a77442c7cf60cfdb59fe/

Megatron-LM: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA/Megatron-LM/commits/17a67b9a97fb11a75933fd7f76ad76e1ac98a53d/
CURRENT (PR #1942 from sj/hybridep-support): https://github.com/NVIDIA/Megatron-LM/commits/9e2810417315a7ee93b41d4e234454abd3c16af5/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: sna <sna@nvidia.com>
…divisible_by

Signed-off-by: sna <sna@nvidia.com>
Collaborator

@terrykong terrykong left a comment


Review: PR #1942 — Add HybridEP support for MoE expert parallelism

Overall this looks good — all prior review feedback has been addressed (conda removal, CUDA_HOME move, platform markers, cleanup). One minor duplicate line to fix.

Generated by Claude Code

Comment thread tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh Outdated
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Seonjin  <sna@nvidia.com>
@terrykong
Collaborator

Offline testing update:

  • qwen3-235b-16n4g: Confirmed working ✅
  • deepseek-v3-32n4g: TBD
  • qwen3-30ba3b-4n4g: Hangs ❌

@seonjinn
Contributor Author

/ok to test e858ed6

Signed-off-by: Seonjin Na <sna@nvidia.com>
@seonjinn
Contributor Author

/ok to test f832943

Comment thread tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh Outdated
guyueh1
guyueh1 previously approved these changes Apr 15, 2026
Contributor

@guyueh1 guyueh1 left a comment


LGTM

Signed-off-by: Seonjin Na <sna@nvidia.com>
@guyueh1
Contributor

guyueh1 commented Apr 15, 2026

/ok to test ee42b36

guyueh1
guyueh1 previously approved these changes Apr 15, 2026
@guyueh1
Contributor

guyueh1 commented Apr 16, 2026

/ok to test f45703a

seonjinn added a commit that referenced this pull request Apr 17, 2026
Two new variants isolate the HybridEP dispatcher effect following the
PR #1942 pattern on GB200 NVL72: flex dispatcher + hybridep backend +
16 SMs, EP=8, no load balancing, no high-priority stream.

  hpstream_04_hybridep_nopack: HybridEP only, sequence packing off.
  hpstream_05_hybridep_pack:   HybridEP only, sequence packing on.

Pairs with hpstream_03 (HybridEP + seqpack + bias-update LB + HP EP
stream) to decompose the contribution of each feature on top of raw
HybridEP.

Signed-off-by: sna <sna@nvidia.com>
Applies runtime patches to megatron.core.transformer.moe.fused_a2a to fix:
1. NCCL allgather hang (enable_custom_allgather=True)
2. Buffer overflow on seq_len growth (reinit when seq_len > max)
3. Dirty buffer after backward (needs_reset flag)
4. HybridEPDispatch.backward return count (10 -> 9)
5. HybridEPCombine.backward return count (5 -> 4)

Applied in setup.py when moe_flex_dispatcher_backend=hybridep.

Signed-off-by: sna <sna@nvidia.com>
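The runtime-patching approach this commit describes follows the usual monkey-patching pattern: replace an attribute on an imported module at setup time. A generic sketch with stand-in names (the real target, `megatron.core.transformer.moe.fused_a2a`, is not imported here):

```python
# Generic sketch of the runtime-patch pattern described above. A stand-in
# module takes the place of fused_a2a; the helper swaps in a fixed function
# while keeping the original reachable for debugging/rollback.
import types


def apply_runtime_patch(module: types.ModuleType, name: str, patched) -> None:
    """Replace module.<name> with `patched`, preserving the original."""
    original = getattr(module, name)
    setattr(module, name + "_unpatched", original)  # keep original around
    setattr(module, name, patched)


# Stand-in for the real fused_a2a module:
fake_fused_a2a = types.ModuleType("fused_a2a")
fake_fused_a2a.dispatch = lambda x: ("buggy", x)


def fixed_dispatch(x):
    # the "fixed" behavior that would be applied when
    # moe_flex_dispatcher_backend == "hybridep"
    return ("fixed", x)


apply_runtime_patch(fake_fused_a2a, "dispatch", fixed_dispatch)
```

In the PR the patches are applied conditionally in `setup.py` when `moe_flex_dispatcher_backend=hybridep`, so the upstream module is untouched in all other configurations.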
Commit a48493 fixes the buffer reallocation and backward-state cleanup in deep_ep itself. This eliminates the need for the runtime monkey-patches on fused_a2a.py that were previously required with 7febc6e2, and improves Training/LogProb throughput by ~1-2% on Qwen3-30B-A3B and ~7-10% on Qwen3-235B.

Signed-off-by: sna <sna@nvidia.com>

Labels

CI:L2 Run doctests, unit tests, functional tests, and convergence tests Performance Related to improving performance r0.6.0


4 participants