Fix LLM deploy test failure by defaulting expert parallelism to 1#1273
Conversation
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
📝 Walkthrough
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes. Pre-merge checks: ✅ 4 passed.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/deploy/llm/generate.py`:
- Around line 112-115: The code currently forces ep = 1 unconditionally,
preventing callers from enabling expert parallelism; modify the wrapper
signature (e.g., add a parameter named moe_expert_parallel_size or similar to
the generate function and/or its callers) with a default of 1, replace the
hard-coded ep = 1 assignment with ep = moe_expert_parallel_size, and propagate
that value to where ep is used (the variable referenced at Line ~145); keep the
existing comment about forcing ep=1 as the safe default for unsupported GPUs but
allow callers to set moe_expert_parallel_size when their environment supports
EP.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 9d0505c9-f557-4c9f-aa4e-4849c08e2a7f
📒 Files selected for processing (1)
modelopt/deploy/llm/generate.py
# Force ep=1 to avoid TRT-LLM DeepEP kernel failures on unsupported GPUs
# (e.g. Blackwell SM 12.0). Expert parallelism can be enabled explicitly
# by the caller when the environment is known to support it.
ep = 1
Comment/behavior mismatch: expert parallelism is now hard-disabled with no caller override.
Line 113 says EP can be enabled explicitly by callers, but this wrapper has no moe_expert_parallel_size input and always sets ep = 1 (propagated at Line 145). That makes EP impossible to enable through this API and can regress MoE multi-GPU setups that expect configurable EP (as shown in examples/specdec_bench/specdec_bench/models/trtllm_torch_api.py and examples/speculative_decoding/collect_hidden_states/compute_hidden_states_trtllm.py).
Suggested fix
class LLM(TRTLLM):
def __init__(
self,
checkpoint_dir: str | Path,
tokenizer: "str | Path | None" = None,
medusa_choices: Any = None,
tp: int = 0,
+ moe_expert_parallel_size: int | None = None,
trust_remote_code: bool = False,
max_seq_len: int = 0,
max_batch_size: int = 0,
):
@@
- # Force ep=1 to avoid TRT-LLM DeepEP kernel failures on unsupported GPUs
- # (e.g. Blackwell SM 12.0). Expert parallelism can be enabled explicitly
- # by the caller when the environment is known to support it.
- ep = 1
+ # Default to EP=1 to avoid TRT-LLM DeepEP kernel failures on unsupported GPUs.
+ # Allow explicit override when the caller knows the environment supports EP.
+ ep = 1 if moe_expert_parallel_size is None else moe_expert_parallel_size
It just means that when TRT-LLM supports it, we will remove the hardcoded EP. Should be OK.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
@@ Coverage Diff @@
## main #1273 +/- ##
==========================================
+ Coverage 75.61% 76.56% +0.94%
==========================================
Files 459 459
Lines 48597 48596 -1
==========================================
+ Hits 36747 37206 +459
+ Misses 11850 11390 -460
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
kevalmorabia97 left a comment:
Passes on 2 GPUs now.
backend="pytorch",
model=checkpoint_dir,
tensor_parallel_size=tp,
moe_expert_parallel_size=ep,
Would it work if we remove this moe_expert_parallel_size assignment so TRT-LLM finds the best ep by itself?
What does this PR do?
Type of change: Bug fix
Fixes TRT-LLM DeepEP kernel failures during LLM deployment on unsupported GPUs (e.g. Blackwell SM 12.0) by defaulting expert parallelism (`ep`) to 1 instead of auto-setting it to the GPU count for MoE models.

Previously, when the model config contained expert-related keys, `ep` was automatically set to `torch.cuda.device_count()`, which triggered DeepEP kernel failures on GPUs that don't support it. Now `ep` defaults to 1 while still enabling attention data parallelism for MoE models. Expert parallelism can be enabled explicitly by the caller when the environment is known to support it.

Testing
The `llm_ptq` test passes with this fix on Blackwell GPUs.

Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: N/A
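The before/after `ep` selection described in this PR can be sketched roughly as follows (the function and its arguments are hypothetical illustrations, not the actual modelopt code; `gpu_count` stands in for `torch.cuda.device_count()` so the sketch stays self-contained):

```python
def pick_expert_parallel_size(
    model_config: dict,
    gpu_count: int,
    old_behavior: bool = False,
) -> int:
    """Sketch of the ep-defaulting change: old auto-scaling vs. new default of 1."""
    # MoE models are detected by expert-related keys in the model config.
    has_moe = any("expert" in key for key in model_config)
    if not has_moe:
        return 1
    if old_behavior:
        # Old: ep auto-set to the GPU count, which triggered DeepEP kernel
        # failures on unsupported GPUs (e.g. Blackwell SM 12.0).
        return max(gpu_count, 1)
    # New: default to 1; callers enable EP explicitly when their
    # environment is known to support it.
    return 1
```

Under the old behavior a MoE config on a 2-GPU box got `ep = 2`; under the new behavior it gets `ep = 1` unless the caller overrides it.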