Fix LLM deploy test failure by defaulting expert parallelism to 1#1273

Open
cjluo-nv wants to merge 1 commit into main from chenjiel/fix_test

Conversation

@cjluo-nv
Collaborator

@cjluo-nv cjluo-nv commented Apr 16, 2026

What does this PR do?

Type of change: Bug fix

Fixes TRT-LLM DeepEP kernel failures during LLM deployment on unsupported GPUs (e.g. Blackwell SM 12.0) by defaulting expert parallelism (ep) to 1 instead of auto-setting it to the GPU count for MoE models.

Previously, when the model config contained expert-related keys, ep was automatically set to torch.cuda.device_count(), which triggered DeepEP kernel failures on GPUs that don't support it. Now ep defaults to 1 while still enabling attention data parallelism for MoE models. Expert parallelism can be enabled explicitly by the caller when the environment is known to support it.
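The behavior change described above can be sketched as follows. This is a minimal illustration, not the actual code in modelopt/deploy/llm/generate.py: the function name `choose_parallel_sizes` and the expert-key detection are assumptions for the example.

```python
# Sketch of the fix: default ep to 1 for MoE models instead of scaling to
# the GPU count. Names and config-key detection here are illustrative only.

def choose_parallel_sizes(model_config: dict, gpu_count: int) -> tuple[int, bool]:
    """Return (expert_parallel_size, enable_attention_dp) for a model config."""
    is_moe = any("experts" in key for key in model_config)
    if not is_moe:
        return 1, False
    # Before the fix: ep = gpu_count, which triggered DeepEP kernel failures
    # on GPUs that don't support it (e.g. Blackwell SM 12.0).
    # After the fix: ep defaults to 1; a caller can raise it explicitly when
    # the environment is known to support expert parallelism.
    ep = 1
    enable_attention_dp = True  # attention data parallelism stays on for MoE
    return ep, enable_attention_dp

print(choose_parallel_sizes({"num_experts": 8, "hidden_size": 4096}, gpu_count=2))  # (1, True)
print(choose_parallel_sizes({"hidden_size": 4096}, gpu_count=2))                    # (1, False)
```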

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: N/A

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested a review from a team as a code owner April 16, 2026 06:04
@cjluo-nv cjluo-nv requested a review from meenchen April 16, 2026 06:04
@coderabbitai
Contributor

coderabbitai bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

The LLM.__init__ method in modelopt/deploy/llm/generate.py was modified to force expert-parallel size (ep) to 1 instead of dynamically scaling it based on GPU device count when specific configuration keys are detected. Related inline comments were updated accordingly.

Changes

Expert Parallel Initialization Logic (modelopt/deploy/llm/generate.py):
Modified the ep (expert-parallel size) assignment from dynamic GPU-count scaling to a fixed value of 1. The enable_attention_dp flag behavior remains unchanged. Comments were updated to reflect the new DeepEP/GPU compatibility handling.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).
  • Security Anti-Patterns — ✅ Passed: The PR implements the trust_remote_code parameter with a safe default of False in LLM.__init__, allowing caller configuration for tokenizer loading operations.
  • Title Check — ✅ Passed: The title accurately describes the main change: defaulting expert parallelism to 1 to fix a test failure in LLM deployment.
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@cjluo-nv cjluo-nv changed the title from "Fix llm_ptq test bug" to "Fix LLM deploy test failure by defaulting expert parallelism to 1" Apr 16, 2026
@cjluo-nv cjluo-nv requested a review from kevalmorabia97 April 16, 2026 06:05
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/deploy/llm/generate.py`:
- Around line 112-115: The code currently forces ep = 1 unconditionally,
preventing callers from enabling expert parallelism; modify the wrapper
signature (e.g., add a parameter named moe_expert_parallel_size or similar to
the generate function and/or its callers) with a default of 1, replace the
hard-coded ep = 1 assignment with ep = moe_expert_parallel_size, and propagate
that value to where ep is used (the variable referenced at Line ~145); keep the
existing comment about forcing ep=1 as the safe default for unsupported GPUs but
allow callers to set moe_expert_parallel_size when their environment supports
EP.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9d0505c9-f557-4c9f-aa4e-4849c08e2a7f

📥 Commits

Reviewing files that changed from the base of the PR and between d45219b and 89d724b.

📒 Files selected for processing (1)
  • modelopt/deploy/llm/generate.py

Comment on lines +112 to 115
# Force ep=1 to avoid TRT-LLM DeepEP kernel failures on unsupported GPUs
# (e.g. Blackwell SM 12.0). Expert parallelism can be enabled explicitly
# by the caller when the environment is known to support it.
ep = 1
Contributor

@coderabbitai coderabbitai bot Apr 16, 2026


⚠️ Potential issue | 🟠 Major

Comment/behavior mismatch: expert parallelism is now hard-disabled with no caller override.

Line 113 says EP can be enabled explicitly by callers, but this wrapper has no moe_expert_parallel_size input and always sets ep = 1 (propagated at Line 145). That makes EP impossible to enable through this API and can regress MoE multi-GPU setups that expect configurable EP (as shown in examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py and examples/speculative_decoding/collect_hidden_states/compute_hidden_states_trtllm.py).

Suggested fix
 class LLM(TRTLLM):
     def __init__(
         self,
         checkpoint_dir: str | Path,
         tokenizer: "str | Path | None" = None,
         medusa_choices: Any = None,
         tp: int = 0,
+        moe_expert_parallel_size: int | None = None,
         trust_remote_code: bool = False,
         max_seq_len: int = 0,
         max_batch_size: int = 0,
     ):
@@
-        # Force ep=1 to avoid TRT-LLM DeepEP kernel failures on unsupported GPUs
-        # (e.g. Blackwell SM 12.0). Expert parallelism can be enabled explicitly
-        # by the caller when the environment is known to support it.
-        ep = 1
+        # Default to EP=1 to avoid TRT-LLM DeepEP kernel failures on unsupported GPUs.
+        # Allow explicit override when the caller knows the environment supports EP.
+        ep = 1 if moe_expert_parallel_size is None else moe_expert_parallel_size

Collaborator


@cjluo-nv is this important to address?

Contributor


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Collaborator Author


It just means that when TRT-LLM supports it, we will remove the hardcoded EP. Should be OK.

@github-actions
Contributor

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1273/

Built to branch gh-pages at 2026-04-16 06:09 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov

codecov bot commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.56%. Comparing base (07ae8e7) to head (89d724b).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1273      +/-   ##
==========================================
+ Coverage   75.61%   76.56%   +0.94%     
==========================================
  Files         459      459              
  Lines       48597    48596       -1     
==========================================
+ Hits        36747    37206     +459     
+ Misses      11850    11390     -460     
Flag coverage:
  • examples: 41.43% <ø> (+11.59% ⬆️)
  • gpu: 59.95% <ø> (−0.55% ⬇️)
  • unit: 52.23% <ø> (+<0.01% ⬆️)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

Collaborator

@kevalmorabia97 kevalmorabia97 left a comment


Passes on 2 GPUs now.

backend="pytorch",
model=checkpoint_dir,
tensor_parallel_size=tp,
moe_expert_parallel_size=ep,
Collaborator


Would it work if we removed this moe_expert_parallel_size assignment so TRT-LLM finds the best ep by itself?

