
GPTQ vector #1223

Open
sugunav14 wants to merge 5 commits into main from svelury/gptq-vector

Conversation

@sugunav14
Contributor

@sugunav14 sugunav14 commented Apr 9, 2026

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this
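A minimal sketch based on the walkthrough below, assuming mtq.quantize as the entry point; "VECTOR_LUT_CFG" is a hypothetical config name standing in for whatever mtq attribute the new fallback resolves:

import modelopt.torch.quantization as mtq

def quantize_with_custom_format(model, calibrate_loop):
    # Hypothetical: resolve a config that is an mtq attribute rather than a
    # QUANT_CFG_CHOICES key, as hf_ptq.py now allows via getattr(mtq, args.qformat).
    quant_cfg = getattr(mtq, "VECTOR_LUT_CFG")  # hypothetical attribute name
    return mtq.quantize(model, quant_cfg, forward_loop=calibrate_loop)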

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Broader quantization format support with a fallback to custom formats
    • Vector lookup table (LUT) codebook support for advanced vector quantization
  • Bug Fixes

    • Fixed model loading precision keyword compatibility for certain checkpoints
    • Prevented KV-cache accumulation during calibration runs
    • Ensured weight quantizers are disabled after GPTQ weight updates
  • Tests

    • Added GPU-focused tests validating vector-LUT and GPTQ integration

@copy-pr-bot

copy-pr-bot bot commented Apr 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough


Replaces the model precision keyword (dtype → torch_dtype), broadens quantization-format resolution with a fallback to modelopt attributes, prevents KV-cache accumulation during calibration replays, adds vector-LUT GPTQ update logic, and introduces GPU tests for blockwise vector quantization.

Changes

Cohort / File(s): Summary

Model loading precision (examples/llm_ptq/example_utils.py)
Replaced dtype with torch_dtype in model-construction kwargs for non-VILA checkpoints (2 occurrences).

Quantization format validation & CLI logic (examples/llm_ptq/hf_ptq.py)
Replaced strict QUANT_CFG_CHOICES assertions with a fallback that resolves args.qformat via modelopt.torch.quantization attributes; adjusted the mono_quantize conditional accordingly; minor whitespace tweak.

Calibration KV-cache and GPTQ update flow (modelopt/torch/quantization/model_calib.py)
In the per-layer replay loop, detect and neutralize past_key_values between replays (clone kwargs, reset the cache or set it to None) to avoid KV accumulation. After GPTQHelper.update_weights(...), disable the layer's weight_quantizer when present.

GPTQ vector-LUT support & utilities (modelopt/torch/quantization/utils/calib_utils.py)
Added GPTQHelper.is_vector_lut(), the weight_slice_scales() context manager, and _blockwise_vector_update(); updated update_weights() to branch to the vector-LUT update path when configured.

GPU tests for vector quantization (tests/gpu/torch/quantization/test_gptq_vq.py)
New GPU test module with reference VQ/GPTQ implementations, helper utilities (codebook, Hessian, mock quantizer), and three tests comparing LUT-based and reference quantization paths and end-to-end behavior.
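A minimal sketch of the KV-cache neutralization described above (not the PR's exact code; the kwargs cloning and the reset() hook are assumptions):

import copy

def neutralize_kv_cache(kwargs_input):
    """Return a shallow clone of forward kwargs with any accumulated KV cache cleared."""
    kwargs = copy.copy(kwargs_input)          # clone so the stored replay kwargs stay untouched
    cache = kwargs.get("past_key_values")
    if cache is not None:
        if hasattr(cache, "reset"):           # cache object exposing an in-place reset
            cache.reset()
        else:
            kwargs["past_key_values"] = None  # otherwise drop the cache entirely
    return kwargs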

Sequence Diagram(s)

sequenceDiagram
    participant Driver as Calibration/Quantize Driver
    participant Layer as Model Layer
    participant GPTQ as GPTQHelper
    participant Quant as Quantizer Backend
    participant IO as Codebook / LUT IO

    Driver->>Layer: run forward(batch, kwargs_input)
    alt kwargs_input.past_key_values present
        Layer->>Layer: clone kwargs_input
        Layer->>Layer: reset cache or set past_key_values=None
    end
    Layer->>Driver: return activations

    Driver->>GPTQ: compute h_inv / gather stats
    GPTQ->>Quant: query backend (quantizer(self.weight)) to populate _psx_scales
    Quant-->>GPTQ: return per-block scales
    GPTQ->>Quant: quantize sub-vectors with provided slice-scales
    GPTQ->>Layer: write quantized weights back (column updates)
    GPTQ->>Layer: disable weight_quantizer attribute (if present)

    GPTQ->>IO: (optional) load/save vector LUT codebook

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

Title check: ❓ Inconclusive
Explanation: The PR title "GPTQ vector" is too vague and does not clearly describe the main changes. While it references GPTQ and vector quantization concepts present in the changeset, it lacks specificity about what is being implemented or changed.
Resolution: Consider a more descriptive title such as "Add GPTQ support for vector quantization" or "Implement blockwise vector update for GPTQ quantization" to clearly convey the primary changes.

✅ Passed checks (3 passed)

Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
Security Anti-Patterns: ✅ Passed. Comprehensive review of all changed files reveals no security anti-patterns, including unsafe load operations, eval/exec on untrusted input, hardcoded trust_remote_code, or nosec comments.



@github-actions
Contributor

github-actions bot commented Apr 9, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1223/

Built to branch gh-pages at 2026-04-16 01:16 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 11.11111% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.24%. Comparing base (b6c6ec3) to head (78f0d8a).
⚠️ Report is 9 commits behind head on main.

Files with missing lines | Patch % | Lines
modelopt/torch/quantization/utils/calib_utils.py | 10.86% | 41 Missing ⚠️
modelopt/torch/quantization/model_calib.py | 12.50% | 7 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1223       +/-   ##
===========================================
- Coverage   76.90%   61.24%   -15.67%     
===========================================
  Files         350      459      +109     
  Lines       40524    49354     +8830     
===========================================
- Hits        31166    30226      -940     
- Misses       9358    19128     +9770     
Flag | Coverage Δ
examples | 43.61% <9.25%> (+0.98%) ⬆️
gpu | 18.94% <9.25%> (-38.58%) ⬇️
unit | 51.98% <11.11%> (-3.64%) ⬇️

Flags with carried forward coverage won't be shown.


Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14 sugunav14 force-pushed the svelury/gptq-vector branch from 6e9b1d1 to 7a29dc0 on April 14, 2026 22:54
@sugunav14 sugunav14 marked this pull request as ready for review April 14, 2026 23:28
@sugunav14 sugunav14 requested review from a team as code owners April 14, 2026 23:28
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
examples/llm_ptq/hf_ptq.py (2)

1075-1082: Same inconsistency: error message doesn't mention mtq attribute fallback.

For consistency, the error message should mention both supported sources.

♻️ Suggested fix
             else:
                 raise AssertionError(
-                    f"Unsupported quantization format: {args.qformat}, choices are: {list(QUANT_CFG_CHOICES.keys())}"
+                    f"Unsupported quantization format: {args.qformat}, choices are: {list(QUANT_CFG_CHOICES.keys())} or any mtq attribute"
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 1075 - 1082, The AssertionError
message for unsupported quantization format only lists QUANT_CFG_CHOICES keys
but omits the mtq attribute fallback; update the error message in the block
handling args.qformat (where QUANT_CFG_CHOICES, mtq, getattr(mtq, args.qformat),
and quant_cfg are used) to mention both valid keys from QUANT_CFG_CHOICES and
that additional formats may be available as attributes on mtq (or list mtq
attributes if desirable) so the error clearly indicates both supported sources.

423-431: Inconsistent error message between low-memory mode and standard mode.

The error message in low-memory mode (line 429-431) only mentions QUANT_CFG_CHOICES.keys() but doesn't mention that mtq attributes are also supported (which lines 425-426 check for). Consider aligning the error message with the actual supported formats.

♻️ Suggested fix for consistent error message
         else:
             raise AssertionError(
                 f"Quantization format is not supported for low memory mode. "
-                f"Supported formats: {QUANT_CFG_CHOICES.keys()}"
+                f"Supported formats: {list(QUANT_CFG_CHOICES.keys())} or any mtq attribute"
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 423 - 431, The error message for the
low-memory branch (the block that checks args.qformat against QUANT_CFG_CHOICES
and mtq via hasattr/getattr and sets quant_cfg) is inconsistent: update the
AssertionError text to reflect both supported sources (keys of QUANT_CFG_CHOICES
and supported attributes on mtq) — e.g., mention QUANT_CFG_CHOICES.keys() and
the list of mtq attribute names or a phrase like "or available mtq formats" — so
the message accurately describes what formats are accepted when raising the
AssertionError for args.qformat.
tests/gpu/torch/quantization/test_gptq_vq.py (1)

307-316: Consider adding test cases for edge cases and error conditions.

The current tests cover the happy path with specific dimensions. Consider adding tests for:

  • Misaligned dimensions (e.g., in_features not divisible by vector_size)
  • Invalid configurations that should raise assertions
  • Different block size combinations

This would improve confidence in the assertion checks added in the implementation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/quantization/test_gptq_vq.py` around lines 307 - 316, Add
unit tests in test_gptq_vq.py that exercise edge and error cases using the
existing constants (SEED, IN_FEATURES, OUT_FEATURES, VECTOR_SIZE,
GPTQ_BLOCK_SIZE, QUANT_BLOCK_SIZE, N_CODEWORDS, PERC_DAMP, SCALE_TYPE) by: 1)
creating a case where IN_FEATURES is not divisible by VECTOR_SIZE and asserting
the code raises the appropriate exception/AssertionError; 2) creating invalid
config combinations (e.g., VECTOR_SIZE <= 0, N_CODEWORDS out of range, PERC_DAMP
negative) and asserting they raise errors; and 3) parametrizing tests to run
across multiple GPTQ_BLOCK_SIZE/QUANT_BLOCK_SIZE combinations (including
mismatched sizes) to validate assertion checks and expected failures.
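A hypothetical sketch of such parametrized edge-case tests; OUT_FEATURES/IN_FEATURES mirror the module's constants, while run_vector_gptq stands in for the module's actual entry point and is not its real API:

import pytest
import torch

OUT_FEATURES, IN_FEATURES = 64, 128  # mirrors the module's constants

@pytest.mark.parametrize(
    ("in_features", "vector_size"),
    [(130, 4), (127, 8)],  # in_features deliberately not divisible by vector_size
)
def test_misaligned_vector_size_raises(in_features, vector_size):
    weight = torch.randn(OUT_FEATURES, in_features, device="cuda")
    with pytest.raises((AssertionError, ValueError)):
        run_vector_gptq(weight, vector_size=vector_size)  # hypothetical helper

@pytest.mark.parametrize("perc_damp", [-0.01, -1.0])
def test_negative_damping_raises(perc_damp):
    weight = torch.randn(OUT_FEATURES, IN_FEATURES, device="cuda")
    with pytest.raises((AssertionError, ValueError)):
        run_vector_gptq(weight, perc_damp=perc_damp)  # hypothetical helper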
modelopt/torch/quantization/utils/calib_utils.py (1)

50-72: Adding explicit weights_only=True to torch.load would improve code clarity.

The torch.load on line 65 is already safe because the project requires PyTorch 2.8+, which defaults weights_only=True. However, making the parameter explicit improves readability and documents intent, since the codebook file only contains tensor data.

Optional improvement
     if "sorted" in encode_format:
-        cb = torch.load(encode_path + encode_format + ".pt", map_location="cpu")
+        cb = torch.load(encode_path + encode_format + ".pt", map_location="cpu", weights_only=True)
         codebook = cb["sorted_values"].cuda().float()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 50 - 72, In
load_vector_lut_codebook, make the torch.load call explicit about loading
tensor-only data by adding weights_only=True to the torch.load invocation (the
branch that loads cb = torch.load(...)). Update the torch.load call inside the
load_vector_lut_codebook function so it passes weights_only=True (and keep
map_location="cpu") to clearly document intent that the file contains only
tensor data.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 8f2a775c-a4a0-4aaf-aa80-b16f69d1d713

📥 Commits

Reviewing files that changed from the base of the PR and between 73be810 and 7a29dc0.

📒 Files selected for processing (5)
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/utils/calib_utils.py
  • tests/gpu/torch/quantization/test_gptq_vq.py

Comment on lines +258 to +314
def _blockwise_vector_update(self, block_size):
    """GPTQ blockwise update for vector LUT quantizers.

    Pre-computes scales once, then runs the standard GPTQ 3-loop
    with per-vector-group static quantization via clip_vector_prescaled.
    """
    import torch.nn.functional as F
    from luts import clip_vector_prescaled, clip_vector_scalesign_fast

    codebook, quant_block_size, scale_type = load_vector_lut_codebook(
        self.module.weight_quantizer
    )

    # Get vector size from codebook
    vector_size = codebook.shape[1]

    assert self.weight is not None and self.h_inv is not None
    num_cols = self.weight.shape[1]
    assert block_size % quant_block_size == 0

    # Pre-compute scales once outside the GPTQ loop
    _, scales = clip_vector_scalesign_fast(
        self.weight,
        codebook,
        quant_block_size,
        scale_type,
        scale_algo="max",
        sign_scale=True,
        return_scales=True,
    )
    scales_2d = scales.reshape(self.weight.shape[0], -1)

    w = self.weight.clone()
    h_inv = self.h_inv

    for blk_start in range(0, num_cols, block_size):
        blk_end = min(blk_start + block_size, num_cols)
        errs = torch.zeros_like(w[:, blk_start:blk_end])

        for j in range(blk_start, blk_end, vector_size):
            d = min(vector_size, blk_end - j)
            s = scales_2d[:, j // quant_block_size].contiguous()

            sub = w[:, j : j + d].contiguous()
            if d < vector_size:
                sub = F.pad(sub, (0, vector_size - d))
            q_sub = clip_vector_prescaled(sub, codebook, s)

            for k in range(d):
                col = j + k
                self.weight[:, col] = q_sub[:, k]
                err = (w[:, col] - q_sub[:, k]) / h_inv[col, col]
                errs[:, col - blk_start] = err
                w[:, col:blk_end].addr_(err, h_inv[col, col:blk_end], alpha=-1)

        if blk_end < num_cols:
            w[:, blk_end:] -= errs @ h_inv[blk_start:blk_end, blk_end:]
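For reference, the inner loop above is the standard GPTQ error-propagation rule (Frantar et al.), applied one column at a time within each block:

$$e = \frac{W_{:,col} - Q_{:,col}}{[H^{-1}]_{col,col}}$$
$$W_{:,\,col:blk\_end} \mathrel{-}= e \otimes [H^{-1}]_{col,\,col:blk\_end} \quad \text{(the in-place addr\_ rank-1 update)}$$
$$W_{:,\,blk\_end:} \mathrel{-}= E \cdot [H^{-1}]_{blk\_start:blk\_end,\,blk\_end:} \quad \text{(carry-over to later blocks)}$$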

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if there are any constraints documented between quant_block_size and vector_size
rg -n "quant_block_size.*vector" --type py
rg -n "vector_size.*block" --type py

Repository: NVIDIA/Model-Optimizer

Length of output: 193


🏁 Script executed:

# Find the load_vector_lut_codebook function definition
rg -n "def load_vector_lut_codebook" --type py -A 20

Repository: NVIDIA/Model-Optimizer

Length of output: 1926


🏁 Script executed:

# Check if there are any tests for this functionality
fd ".*test.*\.py" --type f | xargs rg -l "blockwise_vector_update|load_vector_lut_codebook" 2>/dev/null

Repository: NVIDIA/Model-Optimizer

Length of output: 110


🏁 Script executed:

# Look for any documentation or validation of quant_block_size and vector_size
rg -n "vector_size|quant_block_size" modelopt/torch/quantization/utils/calib_utils.py --type py -B 2 -A 2 | head -80

Repository: NVIDIA/Model-Optimizer

Length of output: 1415


🏁 Script executed:

# Examine the test file to understand the dimension relationships
cat tests/gpu/torch/quantization/test_gptq_vq.py | head -150

Repository: NVIDIA/Model-Optimizer

Length of output: 5382


🏁 Script executed:

# Check for any comments or documentation about vector_size and quant_block_size in the entire calib_utils file
cat modelopt/torch/quantization/utils/calib_utils.py | grep -n "vector_size\|quant_block_size" | head -30

Repository: NVIDIA/Model-Optimizer

Length of output: 633


🏁 Script executed:

# Check the complete load_vector_lut_codebook function
sed -n '50,100p' modelopt/torch/quantization/utils/calib_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1929


Add assertion to validate quant_block_size % vector_size == 0.

The scale indexing at line 299 (j // quant_block_size) assumes vector groups align with quantization block boundaries. While line 276 asserts block_size % quant_block_size == 0, there's no explicit check that quant_block_size % vector_size == 0. The reference implementation in the test file (test_gptq_vq.py) validates this constraint explicitly; adding this assertion here ensures correct scale lookups.

Suggested fix
         # Get vector size from codebook
         vector_size = codebook.shape[1]
 
         assert self.weight is not None and self.h_inv is not None
         num_cols = self.weight.shape[1]
         assert block_size % quant_block_size == 0
+        assert quant_block_size % vector_size == 0, (
+            f"quant_block_size ({quant_block_size}) must be divisible by vector_size ({vector_size})"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 258 - 314, The
_blockwise_vector_update method assumes quant_block_size aligns with vector_size
for correct scale indexing; add an assertion after loading codebook/computing
vector_size and quant_block_size to validate quant_block_size % vector_size == 0
(e.g., in _blockwise_vector_update right after vector_size = codebook.shape[1])
so the j // quant_block_size scale lookup in the loop is safe and fails fast
with a clear message if misconfigured.

Comment thread tests/gpu/torch/quantization/test_gptq_vq.py
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
modelopt/torch/quantization/utils/calib_utils.py (1)

261-280: ⚠️ Potential issue | 🟠 Major

Add explicit vector/block alignment validation before scale indexing.

Line 280 uses j // quant_block_size for scale lookup. Without a guard that quant_block_size is divisible by vector_size, vector chunks can straddle quant-block boundaries and select incorrect scales.

Suggested fix
         vector_size = quantizer.backend_extra_args["vector_size"]
         quant_block_size = quantizer.backend_extra_args["block_sizes"]

         assert self.weight is not None and self.h_inv is not None
         num_cols = self.weight.shape[1]
         assert block_size % quant_block_size == 0
+        assert quant_block_size % vector_size == 0, (
+            f"quant_block_size ({quant_block_size}) must be divisible by vector_size ({vector_size})"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 261 - 280, The
code can index quantizer._psx_scales incorrectly when vector chunks cross
quant-block boundaries; add an explicit validation before the blk loop ensuring
vector/block alignment (e.g., assert quant_block_size % vector_size == 0 and
vector_size <= quant_block_size, or raise a clear ValueError) so that the
expression s = quantizer._psx_scales[:, j // quant_block_size] is safe; place
this guard near the uses of vector_size/quant_block_size (around where
quant_block_size = quantizer.backend_extra_args["block_sizes"] and before the
for blk_start... loop) and include a descriptive message referencing vector_size
and quant_block_size.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 48a84d8f-2322-42e2-9fb8-609c5557c778

📥 Commits

Reviewing files that changed from the base of the PR and between 7a29dc0 and 78f0d8a.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/utils/calib_utils.py

Collaborator

@cjluo-nv cjluo-nv left a comment


The PR adds GPTQ support for vector LUT quantization with several bug fixes. While the core algorithm and tests look solid, there are several issues to address:

  1. Bare assert for runtime validation in _blockwise_vector_update — these should be RuntimeError/ValueError since asserts can be optimized away with -O.
  2. Tests depend on undocumented luts module — all 3 tests import from luts which is not a standard package and will fail without it. Tests need skip markers or the dependency needs to be documented.
  3. getattr(mtq, args.qformat) fallback in hf_ptq.py allows arbitrary attribute access — should validate the returned value is actually a valid quant config dict.
  4. Inconsistent dtype → torch_dtype fix — one instance of dtype="auto" on line ~641 of example_utils.py was not updated.
  5. weight_quantizer.disable() after GPTQ is a broad behavioral change affecting all GPTQ paths — this needs to be explicitly called out as it may affect downstream code that expects the quantizer to still be enabled after GPTQ.

quantizer = self.module.weight_quantizer
assert quantizer.backend_extra_args.get("static_scales", False), (
    "GPTQ vector update requires static_scales=True in backend_extra_args."
)

Bug (per codebase guidelines): Bare assert statements are used for runtime validation here and throughout _blockwise_vector_update. These can be silently skipped with python -O. Replace with explicit exceptions:

if not quantizer.backend_extra_args.get("static_scales", False):
    raise RuntimeError("GPTQ vector update requires static_scales=True in backend_extra_args.")

Same applies to lines 266-267 (assert self.weight ...) and 268 (assert block_size % quant_block_size == 0).

        quantizer._psx_scales = original

    def update_weights(self, block_size, perc_damp):
        """Run GPTQ blockwise weight update on this module.

Nit: is_vector_lut() accesses self.module.weight_quantizer without guarding for its existence. While hasattr is used for backend_extra_args, if the module has no weight_quantizer attribute at all, this will raise AttributeError. Consider adding a guard:

weight_quantizer = getattr(self.module, 'weight_quantizer', None)
if weight_quantizer is None:
    return False
extra_args = getattr(weight_quantizer, 'backend_extra_args', None)

print_rank_0("Updating weights using GPTQ algorithm...")
for handle in gptq_handles.values():
    handle.update_weights(block_size, perc_damp)


Important: This weight_quantizer.disable() after GPTQ is a behavioral change that affects all GPTQ paths (not just vector LUT). Previously, after GPTQ, the weight quantizer remained enabled, meaning subsequent forward passes would re-quantize the already-QDQ'd weights.

The fix seems correct in principle (weights are already quantized), but it's a significant behavioral change. Is this intentional for all formats (NVFP4, FP8, INT4)? The existing test_gptq_updates and test_gptq_e2e_flow tests should be verified to still pass with this change. Also, the hasattr check is unnecessary since GPTQ only runs on quantized linears which always have weight_quantizer.
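A minimal sketch of the behavior in question (helper/module names follow the walkthrough; whether disabling is right for every format is exactly the open question above):

def finalize_gptq(module, helper, block_size, perc_damp):
    # After GPTQ, module.weight already holds fake-quantized (QDQ) values, so an
    # enabled weight_quantizer would quantize them a second time on every forward.
    helper.update_weights(block_size, perc_damp)
    weight_quantizer = getattr(module, "weight_quantizer", None)
    if weight_quantizer is not None:
        weight_quantizer.disable()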


SEED = 42
OUT_FEATURES = 64
IN_FEATURES = 128

Tests use `from luts import clip_vector_prescaled, clip_vector_scalesign_fast`, but luts is not a standard Python package and doesn't appear to be defined anywhere in this repository. This means all three tests will fail with ModuleNotFoundError in standard CI environments.

Please either:

  1. Add a pytest.importorskip('luts') at the top of the file
  2. Use @pytest.mark.skipif decorators on the tests
  3. Or document the dependency and how to install it

Without this, the test file is effectively non-functional for most environments.
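A minimal sketch of option 1, which skips the entire module at collection time when luts is absent:

import pytest

# Skip every test in this module if the optional `luts` package is missing.
pytest.importorskip("luts")

from luts import clip_vector_prescaled, clip_vector_scalesign_fast  # noqa: E402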

# ---------------------------------------------------------------------------


def _make_codebook_file(tmp_dir, name, n_codewords, vector_size):

torch.save({"sorted_values": codebook}, ...) in _make_codebook_file — this is fine for tests, but note that whoever loads this file will need weights_only=False in torch.load() (since it's a dict containing a tensor), which is flagged by the repo's security guidelines. Ensure the production load_vector_lut_codebook function uses weights_only=True or safetensors.

 )
-quant_cfg = QUANT_CFG_CHOICES[args.qformat]
+if args.qformat in QUANT_CFG_CHOICES:
+    quant_cfg = QUANT_CFG_CHOICES[args.qformat]

The getattr(mtq, args.qformat) fallback could return any attribute of the mtq module (functions, classes, constants) if the format name happens to match. Consider validating that the result is actually a dict (quant config):

elif hasattr(mtq, args.qformat):
    quant_cfg = getattr(mtq, args.qformat)
    if not isinstance(quant_cfg, dict):
        raise ValueError(
            f"mtq.{args.qformat} is not a valid quantization config (got {type(quant_cfg).__name__})"
        )

This same pattern is duplicated at line ~1078. Consider extracting to a helper function to avoid duplication.
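A sketch of the suggested shared helper (the function name is illustrative; QUANT_CFG_CHOICES and mtq are the names already used in hf_ptq.py):

def resolve_quant_cfg(qformat: str) -> dict:
    """Resolve a quantization config from QUANT_CFG_CHOICES or an mtq attribute."""
    if qformat in QUANT_CFG_CHOICES:
        return QUANT_CFG_CHOICES[qformat]
    quant_cfg = getattr(mtq, qformat, None)
    if not isinstance(quant_cfg, dict):
        raise ValueError(
            f"Unsupported quantization format: {qformat}, choices are: "
            f"{list(QUANT_CFG_CHOICES.keys())} or any dict-valued mtq attribute"
        )
    return quant_cfg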

 # Don't set torch_dtype for VILA models as they handle it explicitly in their builder
 if "vila" not in ckpt_path.lower():
-    model_kwargs.setdefault("dtype", "auto")
+    model_kwargs.setdefault("torch_dtype", "auto")

Good fix: dtype → torch_dtype matches the HuggingFace from_pretrained API.

However, note there's still dtype="auto" on line ~641 inside the has_pack_quantized_config block:

model = AutoModelForCausalLM.from_pretrained(
    ckpt_path,
    device_map="auto",
    trust_remote_code=trust_remote_code,
    dtype="auto",  # <-- should this also be torch_dtype?
)

Is this intentional?

s = quantizer._psx_scales[:, j // quant_block_size].contiguous()

sub = w[:, j : j + d].contiguous()
if d < vector_size:

In the inner loop, w[:, col:blk_end].addr_(err, h_inv[col, col:blk_end], alpha=-1) updates the working copy w in-place. When d < vector_size (padding case), the quantizer sees a padded sub-vector and returns a full-sized q_sub. The loop only uses q_sub[:, :d], which is correct. However, the padded zeros could affect the quantizer's codeword selection for the last valid elements in the padded vector, potentially introducing different quantization compared to a non-padded path. Is this edge case tested? (It would occur when blk_end - j < vector_size.)
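A small self-contained repro of the padding concern (assumes nothing beyond torch; whether the zero tail actually changes codeword selection depends on luts internals):

import torch
import torch.nn.functional as F

vector_size, d = 4, 3                      # tail case: blk_end - j < vector_size
sub = torch.randn(8, d)
padded = F.pad(sub, (0, vector_size - d))  # zero tail, as in _blockwise_vector_update
# If codeword distance is computed over the whole padded vector, the zero tail can
# shift which codeword wins for the first d elements versus an unpadded reference.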

