
feat: Add HybridEP support for MoE expert parallelism#1942

Open
seonjinn wants to merge 21 commits into main from sj/hybridep-support

Conversation

@seonjinn
Contributor

@seonjinn seonjinn commented Feb 13, 2026

  • Update DeepEP dependency to hybrid-ep branch for HybridEP support
    • automodel, vllm, mcore dependency groups updated
  • Add HybridEP configuration options in _apply_moe_config():
    • moe_flex_dispatcher_backend: Flex dispatcher backend (e.g., 'hybridep')
    • moe_hybridep_num_sms: Number of SMs for HybridEP operations

Usage in config:

```
policy.megatron_cfg.moe_token_dispatcher_type=flex
policy.megatron_cfg.moe_flex_dispatcher_backend=hybridep
policy.megatron_cfg.moe_hybridep_num_sms=32
```

See: https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added configuration support for additional Mixture of Experts (MoE) model parameters, including dispatcher backend and HybridEP settings.
  • Dependencies

    • Updated DeepEP dependency reference to use the hybrid-ep branch across multiple dependency groups.

@seonjinn seonjinn requested review from a team as code owners February 13, 2026 19:37
@coderabbitai
Contributor

coderabbitai bot commented Feb 13, 2026

Walkthrough

The changes add two conditional configuration hooks to the MoE setup function for optional dispatcher and HybridEP parameters, and update the DeepEP dependency reference from a specific commit to the hybrid-ep branch across multiple dependency groups in the project configuration.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| MoE Configuration Hooks: `nemo_rl/models/megatron/setup.py` | Adds conditional assignments for `moe_flex_dispatcher_backend` and `moe_hybridep_num_sms` configuration parameters in the `_apply_moe_config` function. |
| Dependency Updates: `pyproject.toml` | Updates the DeepEP git dependency reference from commit `bfded34800dfec415b71503f8205181de90b2480` to the `hybrid-ep` branch across the `automodel`, `vllm`, and `mcore` dependency groups, with explanatory comments. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pyproject.toml (1)

324-327: ⚠️ Potential issue | 🟠 Major

Stale dependency-metadata version for deep_ep.

The dependency is pinned to the hybrid-ep branch (which dynamically generates its version from the current commit hash via git rev-parse --short HEAD), but the dependency-metadata version is statically set to v1.2.1+bfded34. This means the metadata version will become stale whenever the branch advances, potentially causing uv resolver failures.

Either:

  1. Update to pin to a specific commit hash instead of a branch, or
  2. Update the metadata version to match the current HEAD of hybrid-ep and regenerate it whenever the dependency updates
🤖 Fix all issues with AI agents
In `nemo_rl/models/megatron/setup.py`:
- Around line 405-412: The new runtime keys moe_flex_dispatcher_backend and
moe_hybridep_num_sms are missing from the MegatronConfig TypedDict and from
example configs; add both to the MegatronConfig definition in
nemo_rl/models/policy/__init__.py as NotRequired entries (use the exact symbol
name MegatronConfig) with short docstrings: "Backend type for MoE flex
dispatcher (HybridEP)" for moe_flex_dispatcher_backend and "Number of SMs for
HybridEP" for moe_hybridep_num_sms, and then update at least one exemplar YAML
in examples/configs (e.g., a megatron MoE config) to include these keys with
sensible defaults (recommended defaults) so they are documented and visible to
users.
🧹 Nitpick comments (1)
pyproject.toml (1)

70-72: Branch ref instead of pinned commit reduces build reproducibility.

All three dependency groups now point to @hybrid-ep (a moving branch) instead of a fixed commit hash. This means builds are not reproducible — a force-push or new commit on that branch silently changes what gets installed. Consider pinning to a specific commit on the hybrid-ep branch once it stabilizes.
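The pinning the reviewer suggests would look roughly like this (a hypothetical `pyproject.toml` fragment — the group name and placeholder commit are illustrative, only the repository URL and branch name come from the PR):

```toml
# Before: a moving branch ref — what gets installed changes as the branch moves.
# mcore = ["deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@hybrid-ep"]

# After: pin to an exact commit on the hybrid-ep branch for reproducible builds.
mcore = [
    "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@<commit-sha-on-hybrid-ep>",
]
```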

Comment thread nemo_rl/models/megatron/setup.py
@seonjinn seonjinn requested a review from guyueh1 February 13, 2026 19:42
@seonjinn seonjinn requested review from a team as code owners February 13, 2026 23:26
Comment thread pyproject.toml Outdated
Comment thread pyproject.toml Outdated
Comment thread pyproject.toml Outdated
@seonjinn seonjinn added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Mar 4, 2026
@seonjinn seonjinn requested a review from terrykong March 4, 2026 01:59
@guyueh1 guyueh1 added the Performance Related to improving performance label Mar 5, 2026
@seonjinn seonjinn changed the title feat: Add HybridEP support for MoE expert parallelism feat: Add HybridEP/Partial CudaGraph support for MoE expert parallelism Mar 5, 2026
Comment thread ray.sub Outdated
Collaborator

@terrykong terrykong left a comment


mostly lgtm @seonjinn, but the main thing i'm concerned about is the introduction of conda; we can continue discussing on the thread I started

@anwithk anwithk added this to the v0.6 Release milestone Mar 20, 2026
@copy-pr-bot

copy-pr-bot bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread ray.sub Outdated
Comment thread ray.sub Outdated
@guyueh1 guyueh1 changed the title feat: Add HybridEP/Partial CudaGraph support for MoE expert parallelism feat: Add HybridEP support for MoE expert parallelism Apr 8, 2026
@seonjinn seonjinn requested a review from a team as a code owner April 9, 2026 20:43
@seonjinn seonjinn force-pushed the sj/hybridep-support branch 2 times, most recently from 9acbdab to 7cc2e65 on April 13, 2026 01:23
- Add deep_ep aarch64 dependency (7febc6e2, hybrid-ep branch)
- Add HybridEP setup in megatron setup.py (IMEX env vars, NVLink domain config)
- Add Qwen3-30B-A3B 4n4g config with moe_flex_dispatcher_backend=hybridep
- Add EP=4, EP=8, sms16 config variants for ablation testing
- Add test scripts for EP variants
- Update Dockerfile for aarch64 deep_ep build support
- Add HybridEP settings to 235B and DeepSeek-V3 performance configs

Signed-off-by: sna <sna@nvidia.com>
@seonjinn seonjinn force-pushed the sj/hybridep-support branch from 7cc2e65 to 6c0cd7e on April 13, 2026 01:26
Resolve conflicts: keep aarch64 deep_ep split, adopt main's
flashinfer/cutlass/emerging-optimizers updates.

Signed-off-by: sna <sna@nvidia.com>
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 0349e83 (PR #1942 from sj/hybridep-support)

❌ Submodules that need attention:

Megatron-Bridge: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA-NeMo/Megatron-Bridge/commits/7110a964272a5c74dcb6b680b691087e190c220c/
CURRENT (PR #1942 from sj/hybridep-support): https://github.com/NVIDIA-NeMo/Megatron-Bridge/commits/a2bb70b91b827bd6b085a77442c7cf60cfdb59fe/

Megatron-LM: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA/Megatron-LM/commits/17a67b9a97fb11a75933fd7f76ad76e1ac98a53d/
CURRENT (PR #1942 from sj/hybridep-support): https://github.com/NVIDIA/Megatron-LM/commits/9e2810417315a7ee93b41d4e234454abd3c16af5/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: sna <sna@nvidia.com>
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: ec53cb6 (PR #1942 from sj/hybridep-support)

❌ Submodules that need attention:

Megatron-Bridge: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA-NeMo/Megatron-Bridge/commits/7110a964272a5c74dcb6b680b691087e190c220c/
CURRENT (PR #1942 from sj/hybridep-support): https://github.com/NVIDIA-NeMo/Megatron-Bridge/commits/a2bb70b91b827bd6b085a77442c7cf60cfdb59fe/

Megatron-LM: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA/Megatron-LM/commits/17a67b9a97fb11a75933fd7f76ad76e1ac98a53d/
CURRENT (PR #1942 from sj/hybridep-support): https://github.com/NVIDIA/Megatron-LM/commits/9e2810417315a7ee93b41d4e234454abd3c16af5/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: sna <sna@nvidia.com>
…divisible_by

Signed-off-by: sna <sna@nvidia.com>
Collaborator

@terrykong terrykong left a comment


Review: PR #1942 — Add HybridEP support for MoE expert parallelism

Overall this looks good — all prior review feedback has been addressed (conda removal, CUDA_HOME move, platform markers, cleanup). One minor duplicate line to fix.

Generated by Claude Code

Comment thread tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh Outdated
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Seonjin  <sna@nvidia.com>
@terrykong
Collaborator

Offline testing update:

  • qwen3-235b-16n4g: Confirmed working ✅
  • deepseek-v3-32n4g: TBD
  • qwen3-30ba3b-4n4g: Hangs ❌

@seonjinn
Contributor Author

/ok to test e858ed6

Signed-off-by: Seonjin Na <sna@nvidia.com>
@seonjinn
Contributor Author

/ok to test f832943

Comment thread tests/test_suites/llm/performance/grpo-qwen3-235b-16n4g.sh Outdated
guyueh1
guyueh1 previously approved these changes Apr 15, 2026
Contributor

@guyueh1 guyueh1 left a comment


LGTM

Signed-off-by: Seonjin Na <sna@nvidia.com>
@guyueh1
Contributor

guyueh1 commented Apr 15, 2026

/ok to test ee42b36

guyueh1
guyueh1 previously approved these changes Apr 15, 2026
@guyueh1
Contributor

guyueh1 commented Apr 16, 2026

/ok to test f45703a

seonjinn added a commit that referenced this pull request Apr 17, 2026
Two new variants isolate the HybridEP dispatcher effect following the
PR #1942 pattern on GB200 NVL72: flex dispatcher + hybridep backend +
16 SMs, EP=8, no load balancing, no high-priority stream.

  hpstream_04_hybridep_nopack: HybridEP only, sequence packing off.
  hpstream_05_hybridep_pack:   HybridEP only, sequence packing on.

Pairs with hpstream_03 (HybridEP + seqpack + bias-update LB + HP EP
stream) to decompose the contribution of each feature on top of raw
HybridEP.

Signed-off-by: sna <sna@nvidia.com>
Applies runtime patches to megatron.core.transformer.moe.fused_a2a to fix:
1. NCCL allgather hang (enable_custom_allgather=True)
2. Buffer overflow on seq_len growth (reinit when seq_len > max)
3. Dirty buffer after backward (needs_reset flag)
4. HybridEPDispatch.backward return count (10 -> 9)
5. HybridEPCombine.backward return count (5 -> 4)

Applied in setup.py when moe_flex_dispatcher_backend=hybridep.

Signed-off-by: sna <sna@nvidia.com>
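The runtime-patching approach this commit describes follows the usual monkey-patching pattern: replace an attribute on an imported module at setup time. A generic sketch with stand-in names (the real target, `megatron.core.transformer.moe.fused_a2a`, is not imported here):

```python
# Generic sketch of the runtime-patch pattern described above. A stand-in
# module takes the place of fused_a2a; the helper swaps in a fixed function
# while keeping the original reachable for debugging/rollback.
import types


def apply_runtime_patch(module: types.ModuleType, name: str, patched) -> None:
    """Replace module.<name> with `patched`, preserving the original."""
    original = getattr(module, name)
    setattr(module, name + "_unpatched", original)  # keep original around
    setattr(module, name, patched)


# Stand-in for the real fused_a2a module:
fake_fused_a2a = types.ModuleType("fused_a2a")
fake_fused_a2a.dispatch = lambda x: ("buggy", x)


def fixed_dispatch(x):
    # the "fixed" behavior that would be applied when
    # moe_flex_dispatcher_backend == "hybridep"
    return ("fixed", x)


apply_runtime_patch(fake_fused_a2a, "dispatch", fixed_dispatch)
```

In the PR the patches are applied conditionally in `setup.py` when `moe_flex_dispatcher_backend=hybridep`, so the upstream module is untouched in all other configurations.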
Commit a48493 fixes the buffer reallocation and backward-state cleanup in deep_ep itself. This eliminates the need for the runtime monkey-patches on fused_a2a.py that were previously required with 7febc6e2, and improves Training/LogProb throughput by ~1-2% on Qwen3-30B-A3B and ~7-10% on Qwen3-235B.

Signed-off-by: sna <sna@nvidia.com>

Labels

CI:L2 Run doctests, unit tests, functional tests, and convergence tests Performance Related to improving performance r0.6.0


4 participants