
GPTQ vector #1223

Open
sugunav14 wants to merge 5 commits into main from svelury/gptq-vector

Conversation

@sugunav14
Contributor

@sugunav14 sugunav14 commented Apr 9, 2026

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this
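A minimal sketch based on the walkthrough below, assuming mtq.quantize as the entry point; "VECTOR_LUT_CFG" is a hypothetical config name standing in for whatever mtq attribute the new fallback resolves:

import modelopt.torch.quantization as mtq

def quantize_with_custom_format(model, calibrate_loop):
    # Hypothetical: resolve a config that is an mtq attribute rather than a
    # QUANT_CFG_CHOICES key, as hf_ptq.py now allows via getattr(mtq, args.qformat).
    quant_cfg = getattr(mtq, "VECTOR_LUT_CFG")  # hypothetical attribute name
    return mtq.quantize(model, quant_cfg, forward_loop=calibrate_loop)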

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Broader quantization format support with a fallback to custom formats
    • Vector lookup table (LUT) codebook support for advanced vector quantization
  • Bug Fixes

    • Fixed model loading precision keyword compatibility for certain checkpoints
    • Prevented KV-cache accumulation during calibration runs
    • Ensured weight quantizers are disabled after GPTQ weight updates
  • Tests

    • Added GPU-focused tests validating vector-LUT and GPTQ integration

@copy-pr-bot

copy-pr-bot bot commented Apr 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough


Replaces the model precision keyword (dtype → torch_dtype), broadens quantization-format resolution with a fallback to modelopt attributes, prevents KV-cache accumulation during calibration replays, adds vector-LUT GPTQ update logic, and introduces GPU tests for blockwise vector quantization.

Changes

Cohort / File(s): Summary

Model loading precision (examples/llm_ptq/example_utils.py)
Replaced dtype with torch_dtype in model-construction kwargs for non-VILA checkpoints (2 occurrences).

Quantization format validation & CLI logic (examples/llm_ptq/hf_ptq.py)
Replaced strict QUANT_CFG_CHOICES assertions with a fallback that resolves args.qformat via modelopt.torch.quantization attributes; adjusted the mono_quantize conditional accordingly; minor whitespace tweak.

Calibration KV-cache and GPTQ update flow (modelopt/torch/quantization/model_calib.py)
In the per-layer replay loop, detect and neutralize past_key_values between replays (clone kwargs, reset the cache or set it to None) to avoid KV accumulation. After GPTQHelper.update_weights(...), disable the layer's weight_quantizer when present.

GPTQ vector-LUT support & utilities (modelopt/torch/quantization/utils/calib_utils.py)
Added GPTQHelper.is_vector_lut(), the weight_slice_scales() context manager, and _blockwise_vector_update(); updated update_weights() to branch to the vector-LUT update path when configured.

GPU tests for vector quantization (tests/gpu/torch/quantization/test_gptq_vq.py)
New GPU test module with reference VQ/GPTQ implementations, helper utilities (codebook, Hessian, mock quantizer), and three tests comparing LUT-based and reference quantization paths and end-to-end behavior.
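A minimal sketch of the KV-cache neutralization described above (not the PR's exact code; the kwargs cloning and the reset() hook are assumptions):

import copy

def neutralize_kv_cache(kwargs_input):
    """Return a shallow clone of forward kwargs with any accumulated KV cache cleared."""
    kwargs = copy.copy(kwargs_input)          # clone so the stored replay kwargs stay untouched
    cache = kwargs.get("past_key_values")
    if cache is not None:
        if hasattr(cache, "reset"):           # cache object exposing an in-place reset
            cache.reset()
        else:
            kwargs["past_key_values"] = None  # otherwise drop the cache entirely
    return kwargs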

Sequence Diagram(s)

sequenceDiagram
    participant Driver as Calibration/Quantize Driver
    participant Layer as Model Layer
    participant GPTQ as GPTQHelper
    participant Quant as Quantizer Backend
    participant IO as Codebook / LUT IO

    Driver->>Layer: run forward(batch, kwargs_input)
    alt kwargs_input.past_key_values present
        Layer->>Layer: clone kwargs_input
        Layer->>Layer: reset cache or set past_key_values=None
    end
    Layer->>Driver: return activations

    Driver->>GPTQ: compute h_inv / gather stats
    GPTQ->>Quant: query backend (quantizer(self.weight)) to populate _psx_scales
    Quant-->>GPTQ: return per-block scales
    GPTQ->>Quant: quantize sub-vectors with provided slice-scales
    GPTQ->>Layer: write quantized weights back (column updates)
    GPTQ->>Layer: disable weight_quantizer attribute (if present)

    GPTQ->>IO: (optional) load/save vector LUT codebook

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

Title check: ❓ Inconclusive
Explanation: The PR title "GPTQ vector" is too vague and does not clearly describe the main changes. While it references GPTQ and vector quantization concepts present in the changeset, it lacks specificity about what is being implemented or changed.
Resolution: Consider a more descriptive title such as "Add GPTQ support for vector quantization" or "Implement blockwise vector update for GPTQ quantization" to clearly convey the primary changes.

✅ Passed checks (3 passed)

Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
Security Anti-Patterns: ✅ Passed. Comprehensive review of all changed files reveals no security anti-patterns, including unsafe load operations, eval/exec on untrusted input, hardcoded trust_remote_code, or nosec comments.



@github-actions
Contributor

github-actions bot commented Apr 9, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1223/

Built to branch gh-pages at 2026-04-16 01:16 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 11.11111% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.24%. Comparing base (b6c6ec3) to head (78f0d8a).
⚠️ Report is 9 commits behind head on main.

Files with missing lines | Patch % | Lines
modelopt/torch/quantization/utils/calib_utils.py | 10.86% | 41 Missing ⚠️
modelopt/torch/quantization/model_calib.py | 12.50% | 7 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1223       +/-   ##
===========================================
- Coverage   76.90%   61.24%   -15.67%     
===========================================
  Files         350      459      +109     
  Lines       40524    49354     +8830     
===========================================
- Hits        31166    30226      -940     
- Misses       9358    19128     +9770     
Flag | Coverage Δ
examples | 43.61% <9.25%> (+0.98%) ⬆️
gpu | 18.94% <9.25%> (-38.58%) ⬇️
unit | 51.98% <11.11%> (-3.64%) ⬇️

Flags with carried forward coverage won't be shown.


Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
@sugunav14 sugunav14 force-pushed the svelury/gptq-vector branch from 6e9b1d1 to 7a29dc0 on April 14, 2026 22:54
@sugunav14 sugunav14 marked this pull request as ready for review April 14, 2026 23:28
@sugunav14 sugunav14 requested review from a team as code owners April 14, 2026 23:28
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
examples/llm_ptq/hf_ptq.py (2)

1075-1082: Same inconsistency: error message doesn't mention mtq attribute fallback.

For consistency, the error message should mention both supported sources.

♻️ Suggested fix
             else:
                 raise AssertionError(
-                    f"Unsupported quantization format: {args.qformat}, choices are: {list(QUANT_CFG_CHOICES.keys())}"
+                    f"Unsupported quantization format: {args.qformat}, choices are: {list(QUANT_CFG_CHOICES.keys())} or any mtq attribute"
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 1075 - 1082, The AssertionError
message for unsupported quantization format only lists QUANT_CFG_CHOICES keys
but omits the mtq attribute fallback; update the error message in the block
handling args.qformat (where QUANT_CFG_CHOICES, mtq, getattr(mtq, args.qformat),
and quant_cfg are used) to mention both valid keys from QUANT_CFG_CHOICES and
that additional formats may be available as attributes on mtq (or list mtq
attributes if desirable) so the error clearly indicates both supported sources.

423-431: Inconsistent error message between low-memory mode and standard mode.

The error message in low-memory mode (line 429-431) only mentions QUANT_CFG_CHOICES.keys() but doesn't mention that mtq attributes are also supported (which lines 425-426 check for). Consider aligning the error message with the actual supported formats.

♻️ Suggested fix for consistent error message
         else:
             raise AssertionError(
                 f"Quantization format is not supported for low memory mode. "
-                f"Supported formats: {QUANT_CFG_CHOICES.keys()}"
+                f"Supported formats: {list(QUANT_CFG_CHOICES.keys())} or any mtq attribute"
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 423 - 431, The error message for the
low-memory branch (the block that checks args.qformat against QUANT_CFG_CHOICES
and mtq via hasattr/getattr and sets quant_cfg) is inconsistent: update the
AssertionError text to reflect both supported sources (keys of QUANT_CFG_CHOICES
and supported attributes on mtq) — e.g., mention QUANT_CFG_CHOICES.keys() and
the list of mtq attribute names or a phrase like "or available mtq formats" — so
the message accurately describes what formats are accepted when raising the
AssertionError for args.qformat.
tests/gpu/torch/quantization/test_gptq_vq.py (1)

307-316: Consider adding test cases for edge cases and error conditions.

The current tests cover the happy path with specific dimensions. Consider adding tests for:

  • Misaligned dimensions (e.g., in_features not divisible by vector_size)
  • Invalid configurations that should raise assertions
  • Different block size combinations

This would improve confidence in the assertion checks added in the implementation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/quantization/test_gptq_vq.py` around lines 307 - 316, Add
unit tests in test_gptq_vq.py that exercise edge and error cases using the
existing constants (SEED, IN_FEATURES, OUT_FEATURES, VECTOR_SIZE,
GPTQ_BLOCK_SIZE, QUANT_BLOCK_SIZE, N_CODEWORDS, PERC_DAMP, SCALE_TYPE) by: 1)
creating a case where IN_FEATURES is not divisible by VECTOR_SIZE and asserting
the code raises the appropriate exception/AssertionError; 2) creating invalid
config combinations (e.g., VECTOR_SIZE <= 0, N_CODEWORDS out of range, PERC_DAMP
negative) and asserting they raise errors; and 3) parametrizing tests to run
across multiple GPTQ_BLOCK_SIZE/QUANT_BLOCK_SIZE combinations (including
mismatched sizes) to validate assertion checks and expected failures.
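A hypothetical sketch of such parametrized edge-case tests; OUT_FEATURES/IN_FEATURES mirror the module's constants, while run_vector_gptq stands in for the module's actual entry point and is not its real API:

import pytest
import torch

OUT_FEATURES, IN_FEATURES = 64, 128  # mirrors the module's constants

@pytest.mark.parametrize(
    ("in_features", "vector_size"),
    [(130, 4), (127, 8)],  # in_features deliberately not divisible by vector_size
)
def test_misaligned_vector_size_raises(in_features, vector_size):
    weight = torch.randn(OUT_FEATURES, in_features, device="cuda")
    with pytest.raises((AssertionError, ValueError)):
        run_vector_gptq(weight, vector_size=vector_size)  # hypothetical helper

@pytest.mark.parametrize("perc_damp", [-0.01, -1.0])
def test_negative_damping_raises(perc_damp):
    weight = torch.randn(OUT_FEATURES, IN_FEATURES, device="cuda")
    with pytest.raises((AssertionError, ValueError)):
        run_vector_gptq(weight, perc_damp=perc_damp)  # hypothetical helper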
modelopt/torch/quantization/utils/calib_utils.py (1)

50-72: Adding explicit weights_only=True to torch.load would improve code clarity.

The torch.load on line 65 is already safe because the project requires PyTorch 2.8+, which defaults weights_only=True. However, making the parameter explicit improves readability and documents intent, since the codebook file only contains tensor data.

Optional improvement
     if "sorted" in encode_format:
-        cb = torch.load(encode_path + encode_format + ".pt", map_location="cpu")
+        cb = torch.load(encode_path + encode_format + ".pt", map_location="cpu", weights_only=True)
         codebook = cb["sorted_values"].cuda().float()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 50 - 72, In
load_vector_lut_codebook, make the torch.load call explicit about loading
tensor-only data by adding weights_only=True to the torch.load invocation (the
branch that loads cb = torch.load(...)). Update the torch.load call inside the
load_vector_lut_codebook function so it passes weights_only=True (and keep
map_location="cpu") to clearly document intent that the file contains only
tensor data.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 8f2a775c-a4a0-4aaf-aa80-b16f69d1d713

📥 Commits

Reviewing files that changed from the base of the PR and between 73be810 and 7a29dc0.

📒 Files selected for processing (5)
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/utils/calib_utils.py
  • tests/gpu/torch/quantization/test_gptq_vq.py

Comment on lines +258 to +314
def _blockwise_vector_update(self, block_size):
    """GPTQ blockwise update for vector LUT quantizers.

    Pre-computes scales once, then runs the standard GPTQ 3-loop
    with per-vector-group static quantization via clip_vector_prescaled.
    """
    import torch.nn.functional as F
    from luts import clip_vector_prescaled, clip_vector_scalesign_fast

    codebook, quant_block_size, scale_type = load_vector_lut_codebook(
        self.module.weight_quantizer
    )

    # Get vector size from codebook
    vector_size = codebook.shape[1]

    assert self.weight is not None and self.h_inv is not None
    num_cols = self.weight.shape[1]
    assert block_size % quant_block_size == 0

    # Pre-compute scales once outside the GPTQ loop
    _, scales = clip_vector_scalesign_fast(
        self.weight,
        codebook,
        quant_block_size,
        scale_type,
        scale_algo="max",
        sign_scale=True,
        return_scales=True,
    )
    scales_2d = scales.reshape(self.weight.shape[0], -1)

    w = self.weight.clone()
    h_inv = self.h_inv

    for blk_start in range(0, num_cols, block_size):
        blk_end = min(blk_start + block_size, num_cols)
        errs = torch.zeros_like(w[:, blk_start:blk_end])

        for j in range(blk_start, blk_end, vector_size):
            d = min(vector_size, blk_end - j)
            s = scales_2d[:, j // quant_block_size].contiguous()

            sub = w[:, j : j + d].contiguous()
            if d < vector_size:
                sub = F.pad(sub, (0, vector_size - d))
            q_sub = clip_vector_prescaled(sub, codebook, s)

            for k in range(d):
                col = j + k
                self.weight[:, col] = q_sub[:, k]
                err = (w[:, col] - q_sub[:, k]) / h_inv[col, col]
                errs[:, col - blk_start] = err
                w[:, col:blk_end].addr_(err, h_inv[col, col:blk_end], alpha=-1)

        if blk_end < num_cols:
            w[:, blk_end:] -= errs @ h_inv[blk_start:blk_end, blk_end:]
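For reference, the inner loop above is the standard GPTQ error-propagation rule (Frantar et al.), applied one column at a time within each block:

$$e = \frac{W_{:,col} - Q_{:,col}}{[H^{-1}]_{col,col}}$$
$$W_{:,\,col:blk\_end} \mathrel{-}= e \otimes [H^{-1}]_{col,\,col:blk\_end} \quad \text{(the in-place addr\_ rank-1 update)}$$
$$W_{:,\,blk\_end:} \mathrel{-}= E \cdot [H^{-1}]_{blk\_start:blk\_end,\,blk\_end:} \quad \text{(carry-over to later blocks)}$$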

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if there are any constraints documented between quant_block_size and vector_size
rg -n "quant_block_size.*vector" --type py
rg -n "vector_size.*block" --type py

Repository: NVIDIA/Model-Optimizer

Length of output: 193


🏁 Script executed:

# Find the load_vector_lut_codebook function definition
rg -n "def load_vector_lut_codebook" --type py -A 20

Repository: NVIDIA/Model-Optimizer

Length of output: 1926


🏁 Script executed:

# Check if there are any tests for this functionality
fd ".*test.*\.py" --type f | xargs rg -l "blockwise_vector_update|load_vector_lut_codebook" 2>/dev/null

Repository: NVIDIA/Model-Optimizer

Length of output: 110


🏁 Script executed:

# Look for any documentation or validation of quant_block_size and vector_size
rg -n "vector_size|quant_block_size" modelopt/torch/quantization/utils/calib_utils.py --type py -B 2 -A 2 | head -80

Repository: NVIDIA/Model-Optimizer

Length of output: 1415


🏁 Script executed:

# Examine the test file to understand the dimension relationships
cat tests/gpu/torch/quantization/test_gptq_vq.py | head -150

Repository: NVIDIA/Model-Optimizer

Length of output: 5382


🏁 Script executed:

# Check for any comments or documentation about vector_size and quant_block_size in the entire calib_utils file
cat modelopt/torch/quantization/utils/calib_utils.py | grep -n "vector_size\|quant_block_size" | head -30

Repository: NVIDIA/Model-Optimizer

Length of output: 633


🏁 Script executed:

# Check the complete load_vector_lut_codebook function
sed -n '50,100p' modelopt/torch/quantization/utils/calib_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1929


Add assertion to validate quant_block_size % vector_size == 0.

The scale indexing at line 299 (j // quant_block_size) assumes vector groups align with quantization block boundaries. While line 276 asserts block_size % quant_block_size == 0, there's no explicit check that quant_block_size % vector_size == 0. The reference implementation in the test file (test_gptq_vq.py) validates this constraint explicitly; adding this assertion here ensures correct scale lookups.

Suggested fix
         # Get vector size from codebook
         vector_size = codebook.shape[1]
 
         assert self.weight is not None and self.h_inv is not None
         num_cols = self.weight.shape[1]
         assert block_size % quant_block_size == 0
+        assert quant_block_size % vector_size == 0, (
+            f"quant_block_size ({quant_block_size}) must be divisible by vector_size ({vector_size})"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 258 - 314, The
_blockwise_vector_update method assumes quant_block_size aligns with vector_size
for correct scale indexing; add an assertion after loading codebook/computing
vector_size and quant_block_size to validate quant_block_size % vector_size == 0
(e.g., in _blockwise_vector_update right after vector_size = codebook.shape[1])
so the j // quant_block_size scale lookup in the loop is safe and fails fast
with a clear message if misconfigured.

Comment thread tests/gpu/torch/quantization/test_gptq_vq.py
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
modelopt/torch/quantization/utils/calib_utils.py (1)

261-280: ⚠️ Potential issue | 🟠 Major

Add explicit vector/block alignment validation before scale indexing.

Line 280 uses j // quant_block_size for scale lookup. Without a guard that quant_block_size is divisible by vector_size, vector chunks can straddle quant-block boundaries and select incorrect scales.

Suggested fix
         vector_size = quantizer.backend_extra_args["vector_size"]
         quant_block_size = quantizer.backend_extra_args["block_sizes"]

         assert self.weight is not None and self.h_inv is not None
         num_cols = self.weight.shape[1]
         assert block_size % quant_block_size == 0
+        assert quant_block_size % vector_size == 0, (
+            f"quant_block_size ({quant_block_size}) must be divisible by vector_size ({vector_size})"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 261 - 280, The
code can index quantizer._psx_scales incorrectly when vector chunks cross
quant-block boundaries; add an explicit validation before the blk loop ensuring
vector/block alignment (e.g., assert quant_block_size % vector_size == 0 and
vector_size <= quant_block_size, or raise a clear ValueError) so that the
expression s = quantizer._psx_scales[:, j // quant_block_size] is safe; place
this guard near the uses of vector_size/quant_block_size (around where
quant_block_size = quantizer.backend_extra_args["block_sizes"] and before the
for blk_start... loop) and include a descriptive message referencing vector_size
and quant_block_size.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 48a84d8f-2322-42e2-9fb8-609c5557c778

📥 Commits

Reviewing files that changed from the base of the PR and between 7a29dc0 and 78f0d8a.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/utils/calib_utils.py

Collaborator

@cjluo-nv cjluo-nv left a comment


The PR adds GPTQ support for vector LUT quantization with several bug fixes. While the core algorithm and tests look solid, there are several issues to address:

  1. Bare assert for runtime validation in _blockwise_vector_update — these should be RuntimeError/ValueError since asserts can be optimized away with -O.
  2. Tests depend on undocumented luts module — all 3 tests import from luts which is not a standard package and will fail without it. Tests need skip markers or the dependency needs to be documented.
  3. getattr(mtq, args.qformat) fallback in hf_ptq.py allows arbitrary attribute access — should validate the returned value is actually a valid quant config dict.
  4. Inconsistent dtype → torch_dtype fix — one instance of dtype="auto" on line ~641 of example_utils.py was not updated.
  5. weight_quantizer.disable() after GPTQ is a broad behavioral change affecting all GPTQ paths — this needs to be explicitly called out as it may affect downstream code that expects the quantizer to still be enabled after GPTQ.

quantizer = self.module.weight_quantizer
assert quantizer.backend_extra_args.get("static_scales", False), (
    "GPTQ vector update requires static_scales=True in backend_extra_args."
)

Bug (per codebase guidelines): Bare assert statements are used for runtime validation here and throughout _blockwise_vector_update. These can be silently skipped with python -O. Replace with explicit exceptions:

if not quantizer.backend_extra_args.get("static_scales", False):
    raise RuntimeError("GPTQ vector update requires static_scales=True in backend_extra_args.")

Same applies to lines 266-267 (assert self.weight ...) and 268 (assert block_size % quant_block_size == 0).

        quantizer._psx_scales = original

    def update_weights(self, block_size, perc_damp):
        """Run GPTQ blockwise weight update on this module.

Nit: is_vector_lut() accesses self.module.weight_quantizer without guarding for its existence. While hasattr is used for backend_extra_args, if the module has no weight_quantizer attribute at all, this will raise AttributeError. Consider adding a guard:

weight_quantizer = getattr(self.module, 'weight_quantizer', None)
if weight_quantizer is None:
    return False
extra_args = getattr(weight_quantizer, 'backend_extra_args', None)

print_rank_0("Updating weights using GPTQ algorithm...")
for handle in gptq_handles.values():
    handle.update_weights(block_size, perc_damp)


Important: This weight_quantizer.disable() after GPTQ is a behavioral change that affects all GPTQ paths (not just vector LUT). Previously, after GPTQ, the weight quantizer remained enabled, meaning subsequent forward passes would re-quantize the already-QDQ'd weights.

The fix seems correct in principle (weights are already quantized), but it's a significant behavioral change. Is this intentional for all formats (NVFP4, FP8, INT4)? The existing test_gptq_updates and test_gptq_e2e_flow tests should be verified to still pass with this change. Also, the hasattr check is unnecessary since GPTQ only runs on quantized linears which always have weight_quantizer.
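A minimal sketch of the behavior in question (helper/module names follow the walkthrough; whether disabling is right for every format is exactly the open question above):

def finalize_gptq(module, helper, block_size, perc_damp):
    # After GPTQ, module.weight already holds fake-quantized (QDQ) values, so an
    # enabled weight_quantizer would quantize them a second time on every forward.
    helper.update_weights(block_size, perc_damp)
    weight_quantizer = getattr(module, "weight_quantizer", None)
    if weight_quantizer is not None:
        weight_quantizer.disable()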


SEED = 42
OUT_FEATURES = 64
IN_FEATURES = 128

Tests use `from luts import clip_vector_prescaled, clip_vector_scalesign_fast`, but luts is not a standard Python package and doesn't appear to be defined anywhere in this repository. This means all three tests will fail with ModuleNotFoundError in standard CI environments.

Please either:

  1. Add a pytest.importorskip('luts') at the top of the file
  2. Use @pytest.mark.skipif decorators on the tests
  3. Or document the dependency and how to install it

Without this, the test file is effectively non-functional for most environments.
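A minimal sketch of option 1, which skips the entire module at collection time when luts is absent:

import pytest

# Skip every test in this module if the optional `luts` package is missing.
pytest.importorskip("luts")

from luts import clip_vector_prescaled, clip_vector_scalesign_fast  # noqa: E402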

# ---------------------------------------------------------------------------


def _make_codebook_file(tmp_dir, name, n_codewords, vector_size):

torch.save({"sorted_values": codebook}, ...) in _make_codebook_file — this is fine for tests, but note that whoever loads this file will need weights_only=False in torch.load() (since it's a dict containing a tensor), which is flagged by the repo's security guidelines. Ensure the production load_vector_lut_codebook function uses weights_only=True or safetensors.

 )
-quant_cfg = QUANT_CFG_CHOICES[args.qformat]
+if args.qformat in QUANT_CFG_CHOICES:
+    quant_cfg = QUANT_CFG_CHOICES[args.qformat]

The getattr(mtq, args.qformat) fallback could return any attribute of the mtq module (functions, classes, constants) if the format name happens to match. Consider validating that the result is actually a dict (quant config):

elif hasattr(mtq, args.qformat):
    quant_cfg = getattr(mtq, args.qformat)
    if not isinstance(quant_cfg, dict):
        raise ValueError(
            f"mtq.{args.qformat} is not a valid quantization config (got {type(quant_cfg).__name__})"
        )

This same pattern is duplicated at line ~1078. Consider extracting to a helper function to avoid duplication.
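A sketch of the suggested shared helper (the function name is illustrative; QUANT_CFG_CHOICES and mtq are the names already used in hf_ptq.py):

def resolve_quant_cfg(qformat: str) -> dict:
    """Resolve a quantization config from QUANT_CFG_CHOICES or an mtq attribute."""
    if qformat in QUANT_CFG_CHOICES:
        return QUANT_CFG_CHOICES[qformat]
    quant_cfg = getattr(mtq, qformat, None)
    if not isinstance(quant_cfg, dict):
        raise ValueError(
            f"Unsupported quantization format: {qformat}, choices are: "
            f"{list(QUANT_CFG_CHOICES.keys())} or any dict-valued mtq attribute"
        )
    return quant_cfg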

 # Don't set torch_dtype for VILA models as they handle it explicitly in their builder
 if "vila" not in ckpt_path.lower():
-    model_kwargs.setdefault("dtype", "auto")
+    model_kwargs.setdefault("torch_dtype", "auto")

Good fix: dtype → torch_dtype matches the HuggingFace from_pretrained API.

However, note there's still dtype="auto" on line ~641 inside the has_pack_quantized_config block:

model = AutoModelForCausalLM.from_pretrained(
    ckpt_path,
    device_map="auto",
    trust_remote_code=trust_remote_code,
    dtype="auto",  # <-- should this also be torch_dtype?
)

Is this intentional?

s = quantizer._psx_scales[:, j // quant_block_size].contiguous()

sub = w[:, j : j + d].contiguous()
if d < vector_size:

In the inner loop, w[:, col:blk_end].addr_(err, h_inv[col, col:blk_end], alpha=-1) updates the working copy w in-place. When d < vector_size (padding case), the quantizer sees a padded sub-vector and returns a full-sized q_sub. The loop only uses q_sub[:, :d], which is correct. However, the padded zeros could affect the quantizer's codeword selection for the last valid elements in the padded vector, potentially introducing different quantization compared to a non-padded path. Is this edge case tested? (It would occur when blk_end - j < vector_size.)
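A small self-contained repro of the padding concern (assumes nothing beyond torch; whether the zero tail actually changes codeword selection depends on luts internals):

import torch
import torch.nn.functional as F

vector_size, d = 4, 3                      # tail case: blk_end - j < vector_size
sub = torch.randn(8, d)
padded = F.pad(sub, (0, vector_size - d))  # zero tail, as in _blockwise_vector_update
# If codeword distance is computed over the whole padded vector, the zero tail can
# shift which codeword wins for the first d elements versus an unpadded reference.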

