Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
📝 Walkthrough

Replaces the model precision keyword (`dtype` with `torch_dtype` in model-loading kwargs).

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Driver as Calibration/Quantize Driver
    participant Layer as Model Layer
    participant GPTQ as GPTQHelper
    participant Quant as Quantizer Backend
    participant IO as Codebook / LUT IO
    Driver->>Layer: run forward(batch, kwargs_input)
    alt kwargs_input.past_key_values present
        Layer->>Layer: clone kwargs_input
        Layer->>Layer: reset cache or set past_key_values=None
    end
    Layer->>Driver: return activations
    Driver->>GPTQ: compute h_inv / gather stats
    GPTQ->>Quant: query backend (quantizer(self.weight)) to populate _psx_scales
    Quant-->>GPTQ: return per-block scales
    GPTQ->>Quant: quantize sub-vectors with provided slice-scales
    GPTQ->>Layer: write quantized weights back (column updates)
    GPTQ->>Layer: disable weight_quantizer attribute (if present)
    GPTQ->>IO: (optional) load/save vector LUT codebook
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (3 passed)
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1223       +/-   ##
===========================================
- Coverage   76.90%   61.24%   -15.67%
===========================================
  Files         350      459      +109
  Lines       40524    49354     +8830
===========================================
- Hits        31166    30226      -940
- Misses       9358    19128     +9770
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
6e9b1d1 to 7a29dc0 (Compare)
Actionable comments posted: 2
🧹 Nitpick comments (4)
examples/llm_ptq/hf_ptq.py (2)
**1075-1082**: Same inconsistency: error message doesn't mention the mtq attribute fallback.

For consistency, the error message should mention both supported sources.

♻️ Suggested fix

```diff
 else:
     raise AssertionError(
-        f"Unsupported quantization format: {args.qformat}, choices are: {list(QUANT_CFG_CHOICES.keys())}"
+        f"Unsupported quantization format: {args.qformat}, choices are: {list(QUANT_CFG_CHOICES.keys())} or any mtq attribute"
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 1075 - 1082, The AssertionError message for unsupported quantization format only lists QUANT_CFG_CHOICES keys but omits the mtq attribute fallback; update the error message in the block handling args.qformat (where QUANT_CFG_CHOICES, mtq, getattr(mtq, args.qformat), and quant_cfg are used) to mention both valid keys from QUANT_CFG_CHOICES and that additional formats may be available as attributes on mtq (or list mtq attributes if desirable) so the error clearly indicates both supported sources.
**423-431**: Inconsistent error message between low-memory mode and standard mode.

The error message in low-memory mode (lines 429-431) only mentions `QUANT_CFG_CHOICES.keys()` but doesn't mention that `mtq` attributes are also supported (which lines 425-426 check for). Consider aligning the error message with the actual supported formats.

♻️ Suggested fix for consistent error message

```diff
 else:
     raise AssertionError(
         f"Quantization format is not supported for low memory mode. "
-        f"Supported formats: {QUANT_CFG_CHOICES.keys()}"
+        f"Supported formats: {list(QUANT_CFG_CHOICES.keys())} or any mtq attribute"
     )
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 423 - 431, The error message for the low-memory branch (the block that checks args.qformat against QUANT_CFG_CHOICES and mtq via hasattr/getattr and sets quant_cfg) is inconsistent: update the AssertionError text to reflect both supported sources (keys of QUANT_CFG_CHOICES and supported attributes on mtq) — e.g., mention QUANT_CFG_CHOICES.keys() and the list of mtq attribute names or a phrase like "or available mtq formats" — so the message accurately describes what formats are accepted when raising the AssertionError for args.qformat.

tests/gpu/torch/quantization/test_gptq_vq.py (1)
**307-316**: Consider adding test cases for edge cases and error conditions.

The current tests cover the happy path with specific dimensions. Consider adding tests for:

- Misaligned dimensions (e.g., `in_features` not divisible by `vector_size`)
- Invalid configurations that should raise assertions
- Different block size combinations

This would improve confidence in the assertion checks added in the implementation.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/quantization/test_gptq_vq.py` around lines 307 - 316, Add unit tests in test_gptq_vq.py that exercise edge and error cases using the existing constants (SEED, IN_FEATURES, OUT_FEATURES, VECTOR_SIZE, GPTQ_BLOCK_SIZE, QUANT_BLOCK_SIZE, N_CODEWORDS, PERC_DAMP, SCALE_TYPE) by: 1) creating a case where IN_FEATURES is not divisible by VECTOR_SIZE and asserting the code raises the appropriate exception/AssertionError; 2) creating invalid config combinations (e.g., VECTOR_SIZE <= 0, N_CODEWORDS out of range, PERC_DAMP negative) and asserting they raise errors; and 3) parametrizing tests to run across multiple GPTQ_BLOCK_SIZE/QUANT_BLOCK_SIZE combinations (including mismatched sizes) to validate assertion checks and expected failures.

modelopt/torch/quantization/utils/calib_utils.py (1)
**50-72**: Adding explicit `weights_only=True` to `torch.load` would improve code clarity.

The `torch.load` on line 65 is already safe because the project requires PyTorch 2.8+, which defaults `weights_only=True`. However, making the parameter explicit improves readability and documents intent, since the codebook file only contains tensor data.

Optional improvement

```diff
 if "sorted" in encode_format:
-    cb = torch.load(encode_path + encode_format + ".pt", map_location="cpu")
+    cb = torch.load(encode_path + encode_format + ".pt", map_location="cpu", weights_only=True)
     codebook = cb["sorted_values"].cuda().float()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 50 - 72, In load_vector_lut_codebook, make the torch.load call explicit about loading tensor-only data by adding weights_only=True to the torch.load invocation (the branch that loads cb = torch.load(...)). Update the torch.load call inside the load_vector_lut_codebook function so it passes weights_only=True (and keep map_location="cpu") to clearly document intent that the file contains only tensor data.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/utils/calib_utils.py`:
- Around line 258-314: The _blockwise_vector_update method assumes
quant_block_size aligns with vector_size for correct scale indexing; add an
assertion after loading codebook/computing vector_size and quant_block_size to
validate quant_block_size % vector_size == 0 (e.g., in _blockwise_vector_update
right after vector_size = codebook.shape[1]) so the j // quant_block_size scale
lookup in the loop is safe and fails fast with a clear message if misconfigured.
In `@tests/gpu/torch/quantization/test_gptq_vq.py`:
- Around line 320-321: The tests import luts (clip_vector_prescaled,
clip_vector_scalesign_fast) in tests/gpu/torch/quantization/test_gptq_vq.py and
calib_utils.py also imports luts at two locations; add luts as an optional test
dependency in pyproject.toml (e.g., under a [project.optional-dependencies]
group like "dev-test") or modify the test modules to skip when missing by
calling pytest.importorskip("luts") at the top of the failing test functions (or
module-level in test_gptq_vq.py) and similarly protect the imports/usage sites
in modelopt/torch/quantization/utils/calib_utils.py (the two import/usage
locations around the current lines) so runtime failures are avoided when luts is
not installed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 8f2a775c-a4a0-4aaf-aa80-b16f69d1d713
📒 Files selected for processing (5)
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/utils/calib_utils.py
- tests/gpu/torch/quantization/test_gptq_vq.py
```python
def _blockwise_vector_update(self, block_size):
    """GPTQ blockwise update for vector LUT quantizers.

    Pre-computes scales once, then runs the standard GPTQ 3-loop
    with per-vector-group static quantization via clip_vector_prescaled.
    """
    import torch.nn.functional as F
    from luts import clip_vector_prescaled, clip_vector_scalesign_fast

    codebook, quant_block_size, scale_type = load_vector_lut_codebook(
        self.module.weight_quantizer
    )

    # Get vector size from codebook
    vector_size = codebook.shape[1]

    assert self.weight is not None and self.h_inv is not None
    num_cols = self.weight.shape[1]
    assert block_size % quant_block_size == 0

    # Pre-compute scales once outside the GPTQ loop
    _, scales = clip_vector_scalesign_fast(
        self.weight,
        codebook,
        quant_block_size,
        scale_type,
        scale_algo="max",
        sign_scale=True,
        return_scales=True,
    )
    scales_2d = scales.reshape(self.weight.shape[0], -1)

    w = self.weight.clone()
    h_inv = self.h_inv

    for blk_start in range(0, num_cols, block_size):
        blk_end = min(blk_start + block_size, num_cols)
        errs = torch.zeros_like(w[:, blk_start:blk_end])

        for j in range(blk_start, blk_end, vector_size):
            d = min(vector_size, blk_end - j)
            s = scales_2d[:, j // quant_block_size].contiguous()

            sub = w[:, j : j + d].contiguous()
            if d < vector_size:
                sub = F.pad(sub, (0, vector_size - d))
            q_sub = clip_vector_prescaled(sub, codebook, s)

            for k in range(d):
                col = j + k
                self.weight[:, col] = q_sub[:, k]
                err = (w[:, col] - q_sub[:, k]) / h_inv[col, col]
                errs[:, col - blk_start] = err
                w[:, col:blk_end].addr_(err, h_inv[col, col:blk_end], alpha=-1)

        if blk_end < num_cols:
            w[:, blk_end:] -= errs @ h_inv[blk_start:blk_end, blk_end:]
```
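The inner `k` loop is the standard GPTQ error-feedback scheme: each quantized column's residual, divided by the diagonal of the inverse Hessian, is propagated as a rank-1 update to the columns not yet processed. A minimal sketch of that update with a stand-in round-to-nearest quantizer and toy values (all names and numbers here are illustrative, not from the PR):

```python
import torch

def gptq_column_update(w, h_inv, quantize):
    """Quantize columns left to right, compensating later columns for each error."""
    w = w.clone()
    q = torch.empty_like(w)
    for col in range(w.shape[1]):
        q[:, col] = quantize(w[:, col])
        err = (w[:, col] - q[:, col]) / h_inv[col, col]
        # rank-1 update: spread this column's error over the remaining columns
        w[:, col:] -= torch.outer(err, h_inv[col, col:])
    return q

# toy example: identity h_inv means no cross-column compensation
w = torch.tensor([[0.4, 1.6], [2.3, -0.7]])
q = gptq_column_update(w, torch.eye(2), torch.round)
# q is simply round(w) column by column: [[0., 2.], [2., -1.]]
```

With a non-identity `h_inv`, the off-diagonal terms shift the not-yet-quantized columns before their turn, which is exactly what the `addr_` call in the reviewed code does over a block.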
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Check if there are any constraints documented between quant_block_size and vector_size
rg -n "quant_block_size.*vector" --type py
rg -n "vector_size.*block" --type py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 193

🏁 Script executed:

```shell
# Find the load_vector_lut_codebook function definition
rg -n "def load_vector_lut_codebook" --type py -A 20
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1926

🏁 Script executed:

```shell
# Check if there are any tests for this functionality
fd ".*test.*\.py" --type f | xargs rg -l "blockwise_vector_update|load_vector_lut_codebook" 2>/dev/null
```

Repository: NVIDIA/Model-Optimizer
Length of output: 110

🏁 Script executed:

```shell
# Look for any documentation or validation of quant_block_size and vector_size
rg -n "vector_size|quant_block_size" modelopt/torch/quantization/utils/calib_utils.py --type py -B 2 -A 2 | head -80
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1415

🏁 Script executed:

```shell
# Examine the test file to understand the dimension relationships
cat tests/gpu/torch/quantization/test_gptq_vq.py | head -150
```

Repository: NVIDIA/Model-Optimizer
Length of output: 5382

🏁 Script executed:

```shell
# Check for any comments or documentation about vector_size and quant_block_size in the entire calib_utils file
cat modelopt/torch/quantization/utils/calib_utils.py | grep -n "vector_size\|quant_block_size" | head -30
```

Repository: NVIDIA/Model-Optimizer
Length of output: 633

🏁 Script executed:

```shell
# Check the complete load_vector_lut_codebook function
sed -n '50,100p' modelopt/torch/quantization/utils/calib_utils.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1929
**Add assertion to validate `quant_block_size % vector_size == 0`.**

The scale indexing at line 299 (`j // quant_block_size`) assumes vector groups align with quantization block boundaries. While line 276 asserts `block_size % quant_block_size == 0`, there's no explicit check that `quant_block_size % vector_size == 0`. The reference implementation in the test file (test_gptq_vq.py) validates this constraint explicitly; adding this assertion here ensures correct scale lookups.

Suggested fix

```diff
     # Get vector size from codebook
     vector_size = codebook.shape[1]

     assert self.weight is not None and self.h_inv is not None
     num_cols = self.weight.shape[1]
     assert block_size % quant_block_size == 0
+    assert quant_block_size % vector_size == 0, (
+        f"quant_block_size ({quant_block_size}) must be divisible by vector_size ({vector_size})"
+    )
```
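A toy calculation illustrates why this divisibility guard matters: with hypothetical sizes where `quant_block_size` is not a multiple of `vector_size`, a vector can span two quant blocks while `j // quant_block_size` selects only one scale.

```python
def blocks_touched(j, d, qbs):
    """Indices of quant blocks covered by the vector spanning columns [j, j + d)."""
    return {c // qbs for c in range(j, j + d)}

# Misaligned (quant_block_size = 6, vector_size = 4): the vector at j = 4
# covers columns 4..7, straddling blocks 0 and 1, yet j // 6 picks block 0 only.
straddling = [j for j in range(0, 12, 4) if len(blocks_touched(j, 4, 6)) > 1]

# Aligned (quant_block_size = 8, divisible by 4): every vector stays in one block.
aligned = all(len(blocks_touched(j, 4, 8)) == 1 for j in range(0, 16, 4))
```

In the misaligned case the straddling vector would be quantized entirely with block 0's scale, silently distorting the columns that belong to block 1; the assertion turns this into a loud failure instead.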
♻️ Duplicate comments (1)

modelopt/torch/quantization/utils/calib_utils.py (1)

**261-280**: ⚠️ Potential issue | 🟠 Major: Add explicit vector/block alignment validation before scale indexing.

Line 280 uses `j // quant_block_size` for scale lookup. Without a guard that `quant_block_size` is divisible by `vector_size`, vector chunks can straddle quant-block boundaries and select incorrect scales.

Suggested fix

```diff
 vector_size = quantizer.backend_extra_args["vector_size"]
 quant_block_size = quantizer.backend_extra_args["block_sizes"]

 assert self.weight is not None and self.h_inv is not None
 num_cols = self.weight.shape[1]
 assert block_size % quant_block_size == 0
+assert quant_block_size % vector_size == 0, (
+    f"quant_block_size ({quant_block_size}) must be divisible by vector_size ({vector_size})"
+)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 261 - 280, The code can index quantizer._psx_scales incorrectly when vector chunks cross quant-block boundaries; add an explicit validation before the blk loop ensuring vector/block alignment (e.g., assert quant_block_size % vector_size == 0 and vector_size <= quant_block_size, or raise a clear ValueError) so that the expression s = quantizer._psx_scales[:, j // quant_block_size] is safe; place this guard near the uses of vector_size/quant_block_size (around where quant_block_size = quantizer.backend_extra_args["block_sizes"] and before the for blk_start... loop) and include a descriptive message referencing vector_size and quant_block_size.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 48a84d8f-2322-42e2-9fb8-609c5557c778
📒 Files selected for processing (1)
modelopt/torch/quantization/utils/calib_utils.py
**cjluo-nv** left a comment
The PR adds GPTQ support for vector LUT quantization with several bug fixes. While the core algorithm and tests look solid, there are several issues to address:

- Bare `assert` used for runtime validation in `_blockwise_vector_update`: these should be `RuntimeError`/`ValueError`, since asserts can be optimized away with `-O`.
- Tests depend on the undocumented `luts` module: all 3 tests import from `luts`, which is not a standard package, so they will fail without it. The tests need skip markers or the dependency needs to be documented.
- The `getattr(mtq, args.qformat)` fallback in hf_ptq.py allows arbitrary attribute access: the returned value should be validated as an actual quant config dict.
- Inconsistent `dtype` to `torch_dtype` fix: one instance of `dtype="auto"` on line ~641 of example_utils.py was not updated.
- `weight_quantizer.disable()` after GPTQ is a broad behavioral change affecting all GPTQ paths: this needs to be explicitly called out, as it may affect downstream code that expects the quantizer to still be enabled after GPTQ.
```python
quantizer = self.module.weight_quantizer
assert quantizer.backend_extra_args.get("static_scales", False), (
    "GPTQ vector update requires static_scales=True in backend_extra_args."
)
```
**Bug (per codebase guidelines):** Bare `assert` statements are used for runtime validation here and throughout `_blockwise_vector_update`. These can be silently skipped with `python -O`. Replace with explicit exceptions:

```python
if not quantizer.backend_extra_args.get("static_scales", False):
    raise RuntimeError("GPTQ vector update requires static_scales=True in backend_extra_args.")
```

Same applies to lines 266-267 (`assert self.weight ...`) and 268 (`assert block_size % quant_block_size == 0`).
```python
        quantizer._psx_scales = original

    def update_weights(self, block_size, perc_damp):
        """Run GPTQ blockwise weight update on this module.
```
Nit: `is_vector_lut()` accesses `self.module.weight_quantizer` without guarding for its existence. While `hasattr` is used for `backend_extra_args`, if the module has no `weight_quantizer` attribute at all, this will raise `AttributeError`. Consider adding a guard:

```python
weight_quantizer = getattr(self.module, 'weight_quantizer', None)
if weight_quantizer is None:
    return False
extra_args = getattr(weight_quantizer, 'backend_extra_args', None)
```

```python
print_rank_0("Updating weights using GPTQ algorithm...")
for handle in gptq_handles.values():
    handle.update_weights(block_size, perc_damp)
```
Important: This weight_quantizer.disable() after GPTQ is a behavioral change that affects all GPTQ paths (not just vector LUT). Previously, after GPTQ, the weight quantizer remained enabled, meaning subsequent forward passes would re-quantize the already-QDQ'd weights.
The fix seems correct in principle (weights are already quantized), but it's a significant behavioral change. Is this intentional for all formats (NVFP4, FP8, INT4)? The existing test_gptq_updates and test_gptq_e2e_flow tests should be verified to still pass with this change. Also, the hasattr check is unnecessary since GPTQ only runs on quantized linears which always have weight_quantizer.
```python
SEED = 42
OUT_FEATURES = 64
IN_FEATURES = 128
```
Tests do `from luts import clip_vector_prescaled, clip_vector_scalesign_fast`, but `luts` is not a standard Python package and doesn't appear to be defined anywhere in this repository. This means all three tests will fail with `ModuleNotFoundError` in standard CI environments.

Please either:

- Add a `pytest.importorskip('luts')` at the top of the file
- Use `@pytest.mark.skipif` decorators on the tests
- Or document the dependency and how to install it

Without this, the test file is effectively non-functional for most environments.
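For reference, a small sketch of the `pytest.importorskip` behavior the first option relies on, demonstrated with stand-in module names since `luts` itself may be absent:

```python
import pytest

# When the dependency is installed, importorskip returns the module object
# (using the stdlib "math" as a stand-in for "luts" here):
mod = pytest.importorskip("math")
assert mod.sqrt(4.0) == 2.0

# When it is missing, pytest's Skipped outcome is raised at collection time,
# so the test (or whole module) is reported as skipped, not as an error:
try:
    pytest.importorskip("luts_missing_demo_module")  # hypothetical absent name
    was_skipped = False
except BaseException:  # pytest.skip.Exception subclasses BaseException
    was_skipped = True
```

Placed at module level in test_gptq_vq.py, a single `pytest.importorskip("luts")` line would skip all three tests cleanly when the package is unavailable.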
```python
# ---------------------------------------------------------------------------


def _make_codebook_file(tmp_dir, name, n_codewords, vector_size):
```
`torch.save({"sorted_values": codebook}, ...)` in `_make_codebook_file` is fine for tests; note that a dict containing only tensors still loads under `weights_only=True`, which is what the repo's security guidelines call for (they flag `weights_only=False`). Ensure the production `load_vector_lut_codebook` function uses `weights_only=True` or safetensors.
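A self-contained sketch of the round-trip under discussion (hypothetical file name; the point is that a dict holding only tensors loads cleanly under `weights_only=True`, which refuses to unpickle arbitrary Python objects):

```python
import os
import tempfile

import torch

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sorted_codebook.pt")  # hypothetical file name
    # Save a tensor-only dict, as the test helper does.
    torch.save({"sorted_values": torch.randn(16, 4)}, path)
    # Load with the safe setting: allowed, since the payload is pure tensor data.
    cb = torch.load(path, map_location="cpu", weights_only=True)
    shape = tuple(cb["sorted_values"].shape)
```

If the file ever grew non-tensor objects (custom classes, lambdas), this load would raise instead of silently unpickling them, which is the desired failure mode.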
```python
)
quant_cfg = QUANT_CFG_CHOICES[args.qformat]
if args.qformat in QUANT_CFG_CHOICES:
    quant_cfg = QUANT_CFG_CHOICES[args.qformat]
```
The `getattr(mtq, args.qformat)` fallback could return any attribute of the `mtq` module (functions, classes, constants) if the format name happens to match. Consider validating that the result is actually a dict (quant config):

```python
elif hasattr(mtq, args.qformat):
    quant_cfg = getattr(mtq, args.qformat)
    if not isinstance(quant_cfg, dict):
        raise ValueError(
            f"mtq.{args.qformat} is not a valid quantization config (got {type(quant_cfg).__name__})"
        )
```

This same pattern is duplicated at line ~1078. Consider extracting to a helper function to avoid duplication.
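One possible shape for such a helper, with a hypothetical name and stub objects standing in for the real `mtq` module (this is a sketch of the suggested refactor, not code from the PR):

```python
import types

def resolve_quant_cfg(qformat, choices, mtq_module):
    """Resolve a quant config from the choices dict, falling back to mtq attributes.

    Hypothetical helper consolidating the duplicated lookup in hf_ptq.py.
    """
    if qformat in choices:
        return choices[qformat]
    cfg = getattr(mtq_module, qformat, None)
    if not isinstance(cfg, dict):
        raise ValueError(
            f"Unsupported quantization format: {qformat}, "
            f"choices are: {list(choices)} or any mtq dict attribute"
        )
    return cfg

# Stand-in for the real mtq module: one dict-valued config, one non-config attribute.
mtq_stub = types.SimpleNamespace(MY_CFG={"quant_cfg": {}}, not_a_cfg=len)
cfg = resolve_quant_cfg("MY_CFG", {}, mtq_stub)
```

Both call sites (the low-memory branch near line 423 and the standard branch near line 1075) could then share one lookup and one error message.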
```diff
 # Don't set torch_dtype for VILA models as they handle it explicitly in their builder
 if "vila" not in ckpt_path.lower():
-    model_kwargs.setdefault("dtype", "auto")
+    model_kwargs.setdefault("torch_dtype", "auto")
```
Good fix: `dtype` to `torch_dtype` matches the HuggingFace `from_pretrained` API.

However, note there's still `dtype="auto"` on line ~641 inside the `has_pack_quantized_config` block:

```python
model = AutoModelForCausalLM.from_pretrained(
    ckpt_path,
    device_map="auto",
    trust_remote_code=trust_remote_code,
    dtype="auto",  # <-- should this also be torch_dtype?
)
```

Is this intentional?
```python
s = quantizer._psx_scales[:, j // quant_block_size].contiguous()

sub = w[:, j : j + d].contiguous()
if d < vector_size:
```

In the inner loop, `w[:, col:blk_end].addr_(err, h_inv[col, col:blk_end], alpha=-1)` updates the working copy `w` in-place. When `d < vector_size` (padding case), the quantizer sees a padded sub-vector and returns a full-sized `q_sub`. The loop only uses `q_sub[:, :d]`, which is correct. However, the padded zeros could affect the quantizer's codeword selection for the last valid elements in the padded vector, potentially introducing different quantization compared to a non-padded path. Is this edge case tested? (It would occur when `blk_end - j < vector_size`.)
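A toy nearest-codeword example shows the mechanism in question: a joint (vector) quantizer matches the whole padded vector against the codebook, so the zero written into the padded slot participates in the distance computation and can steer codeword selection for the valid elements. The codebook and values below are illustrative, with a plain Euclidean nearest-neighbor standing in for `clip_vector_prescaled`:

```python
import torch
import torch.nn.functional as F

def nearest_codeword(vecs, codebook):
    """Joint vector quantization: snap each row to its nearest codeword."""
    dists = torch.cdist(vecs, codebook)
    return codebook[dists.argmin(dim=1)]

codebook = torch.tensor([[1.0, 1.0], [-1.0, -1.0]])
tail = torch.tensor([[0.9]])      # d = 1 valid element, vector_size = 2
padded = F.pad(tail, (0, 1))      # [[0.9, 0.0]]: zero fills the padded slot

# The zero contributes to both distances, so the codeword chosen for the one
# valid element depends on the padding, not only on the 0.9 itself.
q_tail = nearest_codeword(padded, codebook)[:, :1]
```

A dedicated test hitting `blk_end - j < vector_size` (e.g., `num_cols` not a multiple of `vector_size`) would pin down whether this padding interaction is acceptable for the real LUT quantizer.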
What does this PR do?

Type of change: ?

Usage

```python
# Add a code snippet demonstrating how to use this
```

Testing

Before your PR is "Ready for review"

- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
New Features
Bug Fixes
Tests