
Handle zero-amax per-channel activation scaling for MoE export #1265

Open
AEON-7 wants to merge 1 commit into NVIDIA:main from AEON-7:aeon7/fix-zero-amax-scaling-factor-moe-export

Conversation

@AEON-7

@AEON-7 AEON-7 commented Apr 15, 2026

What

NVFP4QTensor.get_activation_scaling_factor asserts:

assert torch.all(activation_scaling_factor > 0), (
    f" activation scaling factor {activation_scaling_factor} not positive."
)

In MoE models, some per-channel activation amax entries are exactly zero: routing sparsity means rarely-routed experts never receive tokens on certain input channels during calibration, so their observed amax stays at its zero initialization. The derived scaling factor (amax / (maxbound * 448)) is then zero, and the assertion trips.
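A torch-free sketch of how a dark channel produces the failure (the maxbound value here is hypothetical; the 448 factor is quoted from the formula above):

```python
# Per-channel scales derived as amax / (maxbound * 448); one channel's amax
# is zero because no token ever reached it during calibration.
amax = [3.2, 0.0, 1.6]   # per-channel calibration amax, one dark channel
maxbound = 6.0           # illustrative quantizer maxbound
scales = [a / (maxbound * 448) for a in amax]

# The positivity check that the assertion enforces now trips:
assert not all(s > 0 for s in scales)
```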

How to reproduce

Reproducible on any MoE model with per-expert-decomposed linears quantized using NVFP4_AWQ_FULL_CFG. On SuperGemma4 26B (128 experts, ~6% activation rate per expert per token), this fires on the first expert whose calibration-time coverage left even a single channel dark. It is the routine case, not the edge case.

The fix

Detect zero entries in the computed activation_scaling_factor tensor and replace them with the minimum positive value in the same tensor via torch.where. Fall back to a small positive floor (1e-8) for the pathological case where every channel in the tensor is zero (block entirely un-activated).
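The replacement logic can be sketched torch-free (the actual change uses torch.where on the scaling-factor tensor; this list-based helper is illustrative only):

```python
def repair_zero_scales(scales, floor=1e-8):
    """Replace exact-zero scales with the minimum strictly-positive scale.

    Falls back to `floor` when no positive entry exists (block entirely
    un-activated). Negative entries are deliberately left untouched so a
    later positivity assertion can still catch genuine upstream bugs.
    """
    positive = [s for s in scales if s > 0]
    replacement = min(positive) if positive else floor
    return [replacement if s == 0 else s for s in scales]

print(repair_zero_scales([0.5, 0.0, 0.25, 0.0]))  # zeros become 0.25
print(repair_zero_scales([0.0, 0.0]))             # all-zero falls back to 1e-8
```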

Why this is numerically safe

A zero amax channel means no activation was ever observed there during calibration. Any value flowing through that channel at inference is therefore statistically near-zero relative to the observed distribution. Scaling that near-zero value by the "quietest live channel's" scaling factor quantizes it to near-zero and dequantizes back to near-zero — the same end result as a genuinely zero scale, minus the NaN/division hazards.
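The round trip can be seen in a toy fake-quantizer (scale and qmax values are hypothetical; this is not the library's implementation):

```python
def fake_quant(x, scale, qmax=6.0):
    """Symmetric fake-quantize: scale, round, clamp, then dequantize."""
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale

# A near-zero value on a dark channel, quantized with the quietest live
# channel's scale, rounds to the zero code and dequantizes back to zero.
dark = fake_quant(1e-7, scale=1e-3)
# A value on a live channel survives the round trip as usual.
live = fake_quant(0.004, scale=1e-3)
```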

The assertion after the fix remains strict (torch.all(scale > 0)), so downstream code that relies on the positivity invariant is unaffected.

Validation

End-to-end on SuperGemma4 26B (Gemma 4 MoE, 128 experts, per-expert-decomposed plugin) with NVFP4_AWQ_FULL_CFG:

  • Before: AssertionError: activation scaling factor tensor([...]) not positive. on a per-channel tensor whose printed head hides zeros in the ... ellipsis.
  • After: scaling factors land strictly positive; export produces a valid NVFP4 checkpoint. Side-by-side sampled generation (seed=42, temperature=0.7, top_p=0.9) against the BF16 baseline matches on fact-recall ("The capital of France is Paris." — identical) and produces coherent, well-formed outputs on creative (haiku) and technical (neural network explanation) prompts.

The resulting quantized model ships at AEON-7/supergemma4-26b-abliterated-multimodal-nvfp4.

Companion PR

Depends on / pairs with #1264 (non-scalar input amax in preprocess_linear_fusion). Both are orthogonal bugs on the same NVFP4 + per-expert-MoE export path; this PR fixes the bug that fires after #1264's fix unblocks the fusion step.

Summary by CodeRabbit

  • Bug Fixes
    • Fixed an edge-case in quantization where exact-zero scaling factors could occur; those zeros are now replaced with safe positive defaults to prevent downstream dequantization errors and ensure stable, correct behavior in rare inputs.

@AEON-7 AEON-7 requested a review from a team as a code owner April 15, 2026 04:26
@AEON-7 AEON-7 requested a review from kaix-nv April 15, 2026 04:26
@copy-pr-bot

copy-pr-bot bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 4cf5210f-4ca3-4d46-9b95-a498d62a0e71

📥 Commits

Reviewing files that changed from the base of the PR and between 7cb5851 and c6edb16.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/qtensor/nvfp4_tensor.py

📝 Walkthrough

Walkthrough

The get_activation_scaling_factor function in the NVFP4 tensor module now detects exact-zero scaling-factor entries and replaces them with the tensor's minimum strictly-positive scaling factor when available, otherwise 1e-8 (same device/dtype). Negative scaling factors are left unchanged; the existing assert torch.all(activation_scaling_factor > 0) remains.

Changes

  • NVFP4 Tensor zero-scaling handling (modelopt/torch/quantization/qtensor/nvfp4_tensor.py): Detects exact-zero entries in the computed activation_scaling_factor and substitutes them with the minimum strictly-positive value (or 1e-8 if none), using torch.where. Leaves negative values unchanged; keeps the existing > 0 assertion.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'Handle zero-amax per-channel activation scaling for MoE export' directly and specifically describes the main change: detecting and handling zero activation scaling factors in MoE models, which is the core fix implemented in the PR.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Security Anti-Patterns: ✅ Passed. PR contains no security anti-patterns from SECURITY.md; only algorithmic modifications using safe PyTorch operations with no new dependencies.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


Inline comments:
In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py`:
- Around line 199-211: Restrict the repair to exact zeros: change zero_mask to
use activation_scaling_factor == 0, then compute positive =
activation_scaling_factor[~zero_mask] and further filter positive =
positive[positive > 0] (so negatives are not considered recoverable); if
positive.numel() > 0 replace zeros with positive.min(), else if there are only
zeros (no negatives present) fall back to torch.full_like(..., 1e-8) to keep the
tensor valid, but if negatives exist leave activation_scaling_factor untouched
so the existing assert can catch the error. Ensure these updates are applied
around the activation_scaling_factor / zero_mask logic in nvfp4_tensor.py.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a54c8117-c13f-42c0-86a5-b85150490b56

📥 Commits

Reviewing files that changed from the base of the PR and between c9b1155 and 8b3a4eb.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/qtensor/nvfp4_tensor.py

@AEON-7 AEON-7 force-pushed the aeon7/fix-zero-amax-scaling-factor-moe-export branch from 8b3a4eb to 7cb5851 Compare April 15, 2026 04:32
NVFP4QTensor.get_activation_scaling_factor asserts
`torch.all(activation_scaling_factor > 0)` but on MoE models some
per-channel activation amax entries are exactly zero: routing sparsity
means certain input slots on rarely-routed experts never receive any
tokens during calibration, so their observed amax stays at initialization
(zero). The derived scaling factor (`amax / (maxbound * 448)`) is then
zero too, and the assertion trips during `export_hf_checkpoint()`.

In practice this fires immediately after the (separate) fused-linear
fusion step completes, on the first expert whose calibration-time
coverage left even a single channel dark. With 128 experts and
~6% activation rate per expert per token, this is routine rather than
exceptional.

This change:
- Detects exact-zero entries in the computed scaling factor tensor via
  `== 0` (not `<= 0`), so that negative entries — which would indicate
  a genuine upstream bug, not sparsity — remain untouched and continue
  to trip the existing positivity assertion rather than being silently
  masked.
- Replaces the zero entries with the minimum strictly-positive value in
  the same tensor (elementwise `torch.where`), preserving the per-channel
  shape and the positivity invariant downstream code relies on.
- Falls back to a small positive floor (1e-8) only when no positive
  entries exist (every channel in the tensor is zero).

Why this is numerically safe: a zero amax channel means no activation
was ever observed there during calibration. Any value flowing through
that channel at inference time is therefore statistically near-zero
relative to the observed distribution. Scaling that near-zero value by
the "quietest live channel's" scaling factor quantizes it to near-zero
and dequantizes it back to near-zero — the same end result as with a
genuinely zero scale, minus the NaN/division hazards.

Validated end-to-end on SuperGemma4 26B (128-expert Gemma 4 MoE) with
`NVFP4_AWQ_FULL_CFG`: export completes, the serialized checkpoint loads
into transformers via `mto.restore`, and sampled generation is
semantically equivalent to the BF16 baseline on fact-recall, creative,
and technical prompts.

Signed-off-by: AEON-7 <m2vgz48wpp@privaterelay.appleid.com>
@AEON-7 AEON-7 force-pushed the aeon7/fix-zero-amax-scaling-factor-moe-export branch from 7cb5851 to c6edb16 Compare April 15, 2026 13:12
@AEON-7
Author

AEON-7 commented Apr 15, 2026

Good catch — you're right, <= 0 would let negative entries (which only arise from upstream bugs, not routing sparsity) be silently "repaired" alongside genuine zeros, masking real problems.

Fixed in c6edb1685f: zero_mask now matches exactly-zero entries only. Negative entries are left untouched so the existing positivity assertion below still catches them. Also tightened the positive selection to filter on > 0 rather than ~zero_mask, so the replacement value can never be negative even if one somehow slipped through. The all-zero fallback path is preserved.

Diff:

-        zero_mask = activation_scaling_factor <= 0
+        zero_mask = activation_scaling_factor == 0
         if zero_mask.any():
-            positive = activation_scaling_factor[~zero_mask]
-            if positive.numel() > 0:
-                activation_scaling_factor = torch.where(
-                    zero_mask, positive.min(), activation_scaling_factor
-                )
-            else:
-                activation_scaling_factor = torch.full_like(
-                    activation_scaling_factor, 1e-8
-                )
+            positive = activation_scaling_factor[activation_scaling_factor > 0]
+            replacement = (
+                positive.min()
+                if positive.numel() > 0
+                else torch.tensor(
+                    1e-8,
+                    device=activation_scaling_factor.device,
+                    dtype=activation_scaling_factor.dtype,
+                )
+            )
+            activation_scaling_factor = torch.where(
+                zero_mask, replacement, activation_scaling_factor
+            )

Thanks for the review.

@shengliangxu shengliangxu requested a review from meenchen April 16, 2026 17:29
