Fix LLM deploy test failure by defaulting expert parallelism to 1#1273

Open
cjluo-nv wants to merge 1 commit into main from chenjiel/fix_test

Conversation

@cjluo-nv
Collaborator

@cjluo-nv cjluo-nv commented Apr 16, 2026

What does this PR do?

Type of change: Bug fix

Fixes TRT-LLM DeepEP kernel failures during LLM deployment on unsupported GPUs (e.g. Blackwell SM 12.0) by defaulting expert parallelism (ep) to 1 instead of auto-setting it to the GPU count for MoE models.

Previously, when the model config contained expert-related keys, ep was automatically set to torch.cuda.device_count(), which triggered DeepEP kernel failures on GPUs that don't support it. Now ep defaults to 1 while still enabling attention data parallelism for MoE models. Expert parallelism can be enabled explicitly by the caller when the environment is known to support it.
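The behavior change described above can be sketched as follows. This is a minimal illustration, not the actual code in modelopt/deploy/llm/generate.py: the function name `choose_parallel_sizes` and the expert-key detection are assumptions for the example.

```python
# Sketch of the fix: default ep to 1 for MoE models instead of scaling to
# the GPU count. Names and config-key detection here are illustrative only.

def choose_parallel_sizes(model_config: dict, gpu_count: int) -> tuple[int, bool]:
    """Return (expert_parallel_size, enable_attention_dp) for a model config."""
    is_moe = any("experts" in key for key in model_config)
    if not is_moe:
        return 1, False
    # Before the fix: ep = gpu_count, which triggered DeepEP kernel failures
    # on GPUs that don't support it (e.g. Blackwell SM 12.0).
    # After the fix: ep defaults to 1; a caller can raise it explicitly when
    # the environment is known to support expert parallelism.
    ep = 1
    enable_attention_dp = True  # attention data parallelism stays on for MoE
    return ep, enable_attention_dp

print(choose_parallel_sizes({"num_experts": 8, "hidden_size": 4096}, gpu_count=2))  # (1, True)
print(choose_parallel_sizes({"hidden_size": 4096}, gpu_count=2))                    # (1, False)
```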

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: N/A

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested a review from a team as a code owner April 16, 2026 06:04
@cjluo-nv cjluo-nv requested a review from meenchen April 16, 2026 06:04
@coderabbitai
Contributor

coderabbitai bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

The LLM.__init__ method in modelopt/deploy/llm/generate.py was modified to force expert-parallel size (ep) to 1 instead of dynamically scaling it based on GPU device count when specific configuration keys are detected. Related inline comments were updated accordingly.

Changes

Expert Parallel Initialization Logic (modelopt/deploy/llm/generate.py):
Modified the ep (expert-parallel size) assignment from dynamic GPU-count scaling to a fixed value of 1. The enable_attention_dp flag behavior remains unchanged. Comments were updated to reflect the new DeepEP/GPU compatibility handling.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).
  • Security Anti-Patterns — ✅ Passed: The PR implements the trust_remote_code parameter with a safe default of False in LLM.__init__, allowing caller configuration for tokenizer loading operations.
  • Title Check — ✅ Passed: The title accurately describes the main change: defaulting expert parallelism to 1 to fix a test failure in LLM deployment.
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@cjluo-nv cjluo-nv changed the title from "Fix llm_ptq test bug" to "Fix LLM deploy test failure by defaulting expert parallelism to 1" Apr 16, 2026
@cjluo-nv cjluo-nv requested a review from kevalmorabia97 April 16, 2026 06:05
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/deploy/llm/generate.py`:
- Around line 112-115: The code currently forces ep = 1 unconditionally,
preventing callers from enabling expert parallelism; modify the wrapper
signature (e.g., add a parameter named moe_expert_parallel_size or similar to
the generate function and/or its callers) with a default of 1, replace the
hard-coded ep = 1 assignment with ep = moe_expert_parallel_size, and propagate
that value to where ep is used (the variable referenced at Line ~145); keep the
existing comment about forcing ep=1 as the safe default for unsupported GPUs but
allow callers to set moe_expert_parallel_size when their environment supports
EP.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9d0505c9-f557-4c9f-aa4e-4849c08e2a7f

📥 Commits

Reviewing files that changed from the base of the PR and between d45219b and 89d724b.

📒 Files selected for processing (1)
  • modelopt/deploy/llm/generate.py

Comment on lines +112 to 115
# Force ep=1 to avoid TRT-LLM DeepEP kernel failures on unsupported GPUs
# (e.g. Blackwell SM 12.0). Expert parallelism can be enabled explicitly
# by the caller when the environment is known to support it.
ep = 1
Contributor

@coderabbitai coderabbitai bot Apr 16, 2026


⚠️ Potential issue | 🟠 Major

Comment/behavior mismatch: expert parallelism is now hard-disabled with no caller override.

Line 113 says EP can be enabled explicitly by callers, but this wrapper has no moe_expert_parallel_size input and always sets ep = 1 (propagated at Line 145). That makes EP impossible to enable through this API and can regress MoE multi-GPU setups that expect configurable EP (as shown in examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py and examples/speculative_decoding/collect_hidden_states/compute_hidden_states_trtllm.py).

Suggested fix
 class LLM(TRTLLM):
     def __init__(
         self,
         checkpoint_dir: str | Path,
         tokenizer: "str | Path | None" = None,
         medusa_choices: Any = None,
         tp: int = 0,
+        moe_expert_parallel_size: int | None = None,
         trust_remote_code: bool = False,
         max_seq_len: int = 0,
         max_batch_size: int = 0,
     ):
@@
-        # Force ep=1 to avoid TRT-LLM DeepEP kernel failures on unsupported GPUs
-        # (e.g. Blackwell SM 12.0). Expert parallelism can be enabled explicitly
-        # by the caller when the environment is known to support it.
-        ep = 1
+        # Default to EP=1 to avoid TRT-LLM DeepEP kernel failures on unsupported GPUs.
+        # Allow explicit override when the caller knows the environment supports EP.
+        ep = 1 if moe_expert_parallel_size is None else moe_expert_parallel_size

Collaborator


@cjluo-nv is this important to address?

Contributor


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Collaborator Author


It just means that when TRT-LLM supports it, we will remove the hardcoded EP. Should be OK.

@github-actions
Contributor

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1273/

Built to branch gh-pages at 2026-04-16 06:09 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov

codecov bot commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.56%. Comparing base (07ae8e7) to head (89d724b).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1273      +/-   ##
==========================================
+ Coverage   75.61%   76.56%   +0.94%     
==========================================
  Files         459      459              
  Lines       48597    48596       -1     
==========================================
+ Hits        36747    37206     +459     
+ Misses      11850    11390     -460     
Flag coverage:
  • examples: 41.43% <ø> (+11.59% ⬆️)
  • gpu: 59.95% <ø> (−0.55% ⬇️)
  • unit: 52.23% <ø> (+<0.01% ⬆️)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

Collaborator

@kevalmorabia97 kevalmorabia97 left a comment


Passes on 2 GPUs now.

backend="pytorch",
model=checkpoint_dir,
tensor_parallel_size=tp,
moe_expert_parallel_size=ep,
Collaborator


Would it work if we removed this moe_expert_parallel_size assignment so TRT-LLM finds the best ep by itself?

