Handle zero-amax per-channel activation scaling for MoE export #1265

AEON-7 wants to merge 1 commit into NVIDIA:main
Conversation
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py`:
- Around line 199-211: Restrict the repair to exact zeros: change zero_mask to
use activation_scaling_factor == 0, then compute positive =
activation_scaling_factor[~zero_mask] and further filter positive =
positive[positive > 0] (so negatives are not considered recoverable); if
positive.numel() > 0 replace zeros with positive.min(), else if there are only
zeros (no negatives present) fall back to torch.full_like(..., 1e-8) to keep the
tensor valid, but if negatives exist leave activation_scaling_factor untouched
so the existing assert can catch the error. Ensure these updates are applied
around the activation_scaling_factor / zero_mask logic in nvfp4_tensor.py.
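The suggested restriction can be sketched as a standalone helper. This is a hypothetical illustration of the reviewer's logic, not the code in `nvfp4_tensor.py`; the function name `repair_zero_scales` is an assumption:

```python
import torch

def repair_zero_scales(scale: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of the suggested repair, not the merged code.

    Repairs only exact zeros; negatives are left in place so the existing
    positivity assertion still catches genuine upstream bugs.
    """
    zero_mask = scale == 0
    if not zero_mask.any():
        return scale
    positive = scale[scale > 0]
    if positive.numel() > 0:
        # Replace zeros with the quietest live channel's scale.
        return torch.where(zero_mask, positive.min(), scale)
    if (scale < 0).any():
        # Negatives present: leave untouched so the assert fires downstream.
        return scale
    # Every entry is zero: fall back to a small positive floor.
    return torch.full_like(scale, 1e-8)
```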
Force-pushed from 8b3a4eb to 7cb5851
NVFP4QTensor.get_activation_scaling_factor asserts `torch.all(activation_scaling_factor > 0)`, but on MoE models some per-channel activation amax entries are exactly zero: routing sparsity means certain input slots on rarely-routed experts never receive any tokens during calibration, so their observed amax stays at initialization (zero). The derived scaling factor (`amax / (maxbound * 448)`) is then zero too, and the assertion trips during `export_hf_checkpoint()`. In practice this fires immediately after the (separate) fused-linear fusion step completes, on the first expert whose calibration-time coverage left even a single channel dark. With 128 experts and ~6% activation rate per expert per token, this is routine rather than exceptional.

This change:

- Detects exact-zero entries in the computed scaling factor tensor via `== 0` (not `<= 0`), so that negative entries, which would indicate a genuine upstream bug rather than sparsity, remain untouched and continue to trip the existing positivity assertion instead of being silently masked.
- Replaces the zero entries with the minimum strictly-positive value in the same tensor (elementwise `torch.where`), preserving the per-channel shape and the positivity invariant downstream code relies on.
- Falls back to a small positive floor (1e-8) only when no positive entries exist (every channel in the tensor is zero).

Why this is numerically safe: a zero amax channel means no activation was ever observed there during calibration. Any value flowing through that channel at inference time is therefore statistically near-zero relative to the observed distribution. Scaling that near-zero value by the "quietest live channel's" scaling factor quantizes it to near-zero and dequantizes it back to near-zero: the same end result as with a genuinely zero scale, minus the NaN/division hazards.

Validated end-to-end on SuperGemma4 26B (128-expert Gemma 4 MoE) with `NVFP4_AWQ_FULL_CFG`: export completes, the serialized checkpoint loads into transformers via `mto.restore`, and sampled generation is semantically equivalent to the BF16 baseline on fact-recall, creative, and technical prompts.

Signed-off-by: AEON-7 <m2vgz48wpp@privaterelay.appleid.com>
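The zero-scale derivation described above can be reproduced in isolation. This is an illustrative sketch only; the `maxbound` value here is assumed for the example, not taken from the library:

```python
import torch

# Illustrative repro: one per-channel amax entry is zero because the
# corresponding expert input slot never received a token during calibration.
maxbound = 6.0  # assumed max-magnitude bound, for illustration only
amax = torch.tensor([1.5, 0.0, 2.0])

# Same derivation as in the text: amax / (maxbound * 448)
scale = amax / (maxbound * 448)

# The zero channel yields a zero scale, which is what trips
# torch.all(activation_scaling_factor > 0) at export time.
assert not torch.all(scale > 0)
```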
Force-pushed from 7cb5851 to c6edb16
Good catch, you're right. Fixed. Diff:

```diff
-        zero_mask = activation_scaling_factor <= 0
+        zero_mask = activation_scaling_factor == 0
         if zero_mask.any():
-            positive = activation_scaling_factor[~zero_mask]
-            if positive.numel() > 0:
-                activation_scaling_factor = torch.where(
-                    zero_mask, positive.min(), activation_scaling_factor
-                )
-            else:
-                activation_scaling_factor = torch.full_like(
-                    activation_scaling_factor, 1e-8
-                )
+            positive = activation_scaling_factor[activation_scaling_factor > 0]
+            replacement = (
+                positive.min()
+                if positive.numel() > 0
+                else torch.tensor(
+                    1e-8,
+                    device=activation_scaling_factor.device,
+                    dtype=activation_scaling_factor.dtype,
+                )
+            )
+            activation_scaling_factor = torch.where(
+                zero_mask, replacement, activation_scaling_factor
+            )
```

Thanks for the review.
What
`NVFP4QTensor.get_activation_scaling_factor` asserts `torch.all(activation_scaling_factor > 0)`. On MoE models, some per-channel activation amax entries are exactly zero because routing sparsity leaves certain input slots on rarely-routed experts un-activated during calibration. The derived scaling factor (`amax / (maxbound * 448)`) is then zero and the assertion trips.

How to reproduce
Any MoE model with per-expert-decomposed linears quantized using `NVFP4_AWQ_FULL_CFG`. On SuperGemma4 26B (128 experts, ~6% activation rate per expert per token), this fires on the first expert whose calibration-time coverage left even a single channel dark. It is the routine case, not the edge case.

The fix

Detect zero entries in the computed `activation_scaling_factor` tensor and replace them with the minimum positive value in the same tensor via `torch.where`. Fall back to a small positive floor (1e-8) for the pathological case where every channel in the tensor is zero (block entirely un-activated).

Why this is numerically safe
A zero amax channel means no activation was ever observed there during calibration. Any value flowing through that channel at inference is therefore statistically near-zero relative to the observed distribution. Scaling that near-zero value by the "quietest live channel's" scaling factor quantizes it to near-zero and dequantizes back to near-zero: the same end result as a genuinely zero scale, minus the NaN/division hazards.
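The argument can be checked with a toy round-trip. The integer code range and the specific scale values below are illustrative stand-ins, not the actual NVFP4 quantizer:

```python
import torch

# Toy round-trip: quantize a near-zero activation on a "dark" channel using
# the quietest live channel's scale (all values are illustrative).
scale = torch.tensor(3e-4)   # smallest strictly-positive per-channel scale
x = torch.tensor(1e-6)       # near-zero value on the never-calibrated channel

code = torch.round(x / scale).clamp(-6, 6)  # stand-in integer code range
deq = code * scale

# x / scale is about 0.003, which rounds to code 0, so the value dequantizes
# back to ~0: the same result a genuinely zero scale would give, without the
# division-by-zero hazard.
assert deq.abs().item() < 1e-5
```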
The assertion after the fix remains strict (`torch.all(scale > 0)`), so downstream code that relies on the positivity invariant is unaffected.

Validation
End-to-end on SuperGemma4 26B (Gemma 4 MoE, 128 experts, per-expert-decomposed plugin) with `NVFP4_AWQ_FULL_CFG`:

- Before the fix: `AssertionError: activation scaling factor tensor([...]) not positive.` on a per-channel tensor whose printed head hides the zeros in the `...` ellipsis.
- After the fix: export completes. The resulting quantized model ships at AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4.
Companion PR
Depends on / pairs with #1264 (non-scalar input amax in `preprocess_linear_fusion`). Both are orthogonal bugs on the same NVFP4 + per-expert-MoE export path; this PR fixes the bug that fires after #1264's fix unblocks the fusion step.