Skip to content

Exposition of KMeans param object for PQ#2005

Open
lowener wants to merge 11 commits intorapidsai:mainfrom
lowener:26.06-pq-kmeans
Open

Exposition of KMeans param object for PQ#2005
lowener wants to merge 11 commits intorapidsai:mainfrom
lowener:26.06-pq-kmeans

Conversation

@lowener
Copy link
Copy Markdown
Contributor

@lowener lowener commented Apr 9, 2026

Closes #1999.
Behavior change:

  • The pq-kmeans params now follows the default kmeans params. For classical kmeans the init algorithm is now set to Kmeans++

Breaking change:

  • pq::params now uses a constructor

Signed-off-by: Mickael Ide <mide@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@lowener lowener added breaking Introduces a breaking change feature request New feature or request C++ labels Apr 9, 2026
lowener added 2 commits April 9, 2026 22:08
Signed-off-by: Mickael Ide <mide@nvidia.com>
@cjnolet cjnolet moved this to In Progress in Unstructured Data Processing Apr 10, 2026
@lowener lowener marked this pull request as ready for review April 15, 2026 16:41
@lowener lowener requested review from a team as code owners April 15, 2026 16:41
Copy link
Copy Markdown
Contributor

@tarang-jain tarang-jain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for this! I just added a comment about one small thing. Apart from that, I have no issues with this PR.

Comment thread cpp/src/preprocessing/quantize/detail/pq.cuh Outdated
lowener added 2 commits April 16, 2026 13:41
Signed-off-by: Mickael Ide <mide@nvidia.com>
Copy link
Copy Markdown
Contributor

@viclafargue viclafargue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Here are some comments.

Also, shouldn't this feature be made available in C/Python?

Comment thread cpp/src/neighbors/detail/vpq_dataset.cuh
Comment thread c/src/preprocessing/quantize/pq.cpp
Comment thread cpp/include/cuvs/preprocessing/quantize/pq.hpp Outdated
Comment thread cpp/include/cuvs/preprocessing/quantize/pq.hpp Outdated
Comment thread cpp/src/neighbors/detail/vpq_dataset.cuh Outdated
using KP = std::decay_t<decltype(base_kmeans_params)>;
if constexpr (std::is_same_v<KP, cuvs::cluster::kmeans::balanced_params>) {
auto bal_params = base_kmeans_params;
bal_params.metric = cuvs::distance::DistanceType::L2Expanded;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a RAFT_EXPECTS check to ensure that the struct was already using the L2Expanded metric. A user may attempt to use PQ with an incompatible dataset/metric.

Also, it could be interested to check that n_clusters == 1 << pq_bits.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n_clusters is always infered from pq_bits. I think it's not even necessary to check it since there are no other value possible. It can just be set.

Comment thread cpp/src/neighbors/detail/vpq_dataset.cuh
Comment thread cpp/src/neighbors/detail/vpq_dataset.cuh Outdated
Comment thread cpp/tests/preprocessing/product_quantization.cu
@lowener lowener requested a review from viclafargue April 22, 2026 12:56
Copy link
Copy Markdown
Contributor

@viclafargue viclafargue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM

Noting that the PR does not forward the feature downstream to the C and Python layer. The PR would need a rename to clearly state that it is C++ only. Also, it may be interesting to open an issue if we plan to implement the bindings at a later date.

in_params.kmeans_n_iters,
in_params.pq_kmeans_type,
std::min<uint32_t>(in_params.max_train_points_per_pq_code,
n_rows * in_params.pq_kmeans_trainset_fraction / pq_n_centers),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a detail, but integer truncation can make this calculation imprecise.

Comment thread cpp/include/cuvs/preprocessing/quantize/pq.hpp Outdated
Comment thread cpp/src/preprocessing/quantize/detail/pq.cuh Outdated
Signed-off-by: Mickael Ide <mide@nvidia.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 22, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added flexible K-means configuration options for Product Quantization, supporting both balanced and classical approaches via a variant-based parameter structure.
    • Introduced validation to ensure K-means configurations use the correct distance metric.
  • Refactor

    • Restructured Product Quantization parameters for improved clarity and extensibility.
  • Tests

    • Added comprehensive test coverage for parameter construction and validation behavior.

Walkthrough

The PR refactors the PQ (Product Quantization) preprocessing API to use a std::variant for k-means parameters. It replaces separate kmeans_n_iters and pq_kmeans_type fields in the pq::params struct with a single kmeans_params_variant that holds either balanced_params or k-means params. Backward-compatible constructors are added, and internal implementations are updated to use std::visit for variant dispatch.

Changes

Cohort / File(s) Summary
API & Type Definitions
cpp/include/cuvs/preprocessing/quantize/pq.hpp
Introduced kmeans_params_variant type alias and refactored pq::params struct. Removed kmeans_n_iters and pq_kmeans_type fields; added kmeans_params_variant kmeans_params field with default balanced_params{}. Added two new public constructors for backward compatibility and one default constructor.
Core PQ Implementation
cpp/src/preprocessing/quantize/detail/pq.cuh, cpp/src/neighbors/detail/vpq_dataset.cuh
Added helper functions (is_balanced_kmeans, get_kmeans_n_iters, get_kmeans_metric, make_pq_params_from_vpq) to extract kmeans configuration from variant. Updated train_pq_centers and train_pq to use variant dispatch with std::visit. Replaced direct parameter access with helper function calls. Added runtime validation for L2Expanded metric and rejection of Array initialization method.
Adapter Implementations
c/src/preprocessing/quantize/pq.cpp, cpp/src/neighbors/scann/detail/scann_build.cuh
Updated PQ parameter construction from aggregate initialization to explicit constructor calls. Refactored internal scann_build to use new pq::params constructor with all required arguments.
Test Coverage
cpp/tests/preprocessing/product_quantization.cu
Updated existing test to construct kmeans_params_variant explicitly. Added new Parameters test exercising constructor variants, field extraction, and validation of invalid kmeans configurations (Array initialization and unsupported metrics).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the main change: exposing KMeans parameters for PQ preprocessing. This is a primary objective of the PR.
Description check ✅ Passed The description is directly related to the changeset, citing issue #1999 and documenting breaking changes to pq::params and behavioral changes to kmeans initialization.
Linked Issues check ✅ Passed The PR fully addresses issue #1999 by exposing kmeans parameters via std::variant, allowing users to configure both classic and balanced kmeans options for PQ preprocessing.
Out of Scope Changes check ✅ Passed All changes are directly related to exposing kmeans parameters for PQ. No unrelated modifications to other subsystems or functionality were introduced.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
cpp/include/cuvs/preprocessing/quantize/pq.hpp (1)

34-57: Simplified constructor has a long positional-boolean signature and doesn't cover the full field surface.

With six positional arguments including two back-to-back bools (use_subspaces, use_vq) before an unsigned count (vq_n_centers), call sites are prone to silent argument-swap bugs (e.g. the scann_build.cuh and C-API call sites), and the constructor cannot be used to set any of the classic-kmeans-only fields (e.g. init, tol, oversampling_factor, seed) — callers that want those still have to use the variant constructor. Consider either (a) keeping only the kmeans_params_variant constructor plus params() = default and exposing a small factory like make_balanced_kmeans_params(n_iters) / make_classic_kmeans_params(pq_bits, n_iters), or (b) grouping the PQ-only fields into their own aggregate so callers can keep designated-initializer syntax.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/cuvs/preprocessing/quantize/pq.hpp` around lines 34 - 57, The
params(...) constructor has a long positional boolean-heavy signature and
doesn't allow configuring classic-kmeans-only fields; refactor to remove this
error-prone overload and provide safer alternatives: either delete the long
positional params(uint32_t pq_bits, uint32_t pq_dim, bool use_subspaces, bool
use_vq, uint32_t vq_n_centers, uint32_t kmeans_n_iters, ...) constructor and
keep a default params() plus a single constructor that accepts
kmeans_params_variant, or keep params() = default and add small factory
functions (e.g. make_balanced_kmeans_params(n_iters) and
make_classic_kmeans_params(pq_bits, n_iters)) so callers construct
kmeans_params_variant explicitly; alternatively group PQ-specific fields into an
aggregate struct (e.g. PQSettings) and replace the many positional args with
that aggregate to allow designated initializers and avoid boolean-argument
swaps; update all call sites (including uses in scann_build.cuh and C-API
bindings) to use the new factories/aggregate or the kmeans_params_variant
constructor.
cpp/src/neighbors/scann/detail/scann_build.cuh (1)

163-172: Confirm positional arguments match the new constructor signature.

The call relies on the order (pq_bits, pq_dim, use_subspaces, use_vq, vq_n_centers, kmeans_n_iters, pq_kmeans_type, max_train_points_per_pq_code, max_train_points_per_vq_cluster). Here num_subspaces is passed as pq_dim (consistent with the struct's definition of pq_dim as the post-PQ dimensionality = number of subspaces), and max_train_points_per_vq_cluster=1024 is supplied even though use_vq=false, which is harmless but dead. Given the positional-only API and the bool-heavy signature, it's easy to misalign arguments at call sites; consider adding inline /*use_subspaces=*/true, /*use_vq=*/false, /*vq_n_centers=*/0 argument-name comments here to make the intent obvious and guard against future reorderings.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/scann/detail/scann_build.cuh` around lines 163 - 172, The
call constructing cuvs::preprocessing::quantize::pq::params via pq_build_params
uses a positional, bool-heavy signature and can be misread or misordered; update
the call site (the pq_build_params construction) to annotate positional booleans
and the VQ-related int with inline argument-name comments (e.g.
/*use_subspaces=*/true, /*use_vq=*/false, /*vq_n_centers=*/0,
/*kmeans_n_iters=*/params.pq_train_iters,
/*pq_kmeans_type=*/cuvs::cluster::kmeans::kmeans_type::KMeansBalanced,
/*max_train_points_per_pq_code=*/pq_n_rows_train / num_clusters,
/*max_train_points_per_vq_cluster=*/1024) so the intent and mapping to the ctor
fields in cuvs::preprocessing::quantize::pq::params is explicit and future-proof
against parameter reordering.
cpp/tests/preprocessing/product_quantization.cu (1)

256-313: LGTM — but consider also asserting that a user-specified classic-kmeans setting is actually honored end-to-end.

The new Parameters test nicely covers the three constructor paths and the two invalid-configuration build() rejections (Array init, non-L2 metric). What it does not yet do — and what the earlier review thread specifically asked about — is verify that e.g. InitMethod::Random vs some other honored setting is actually propagated into the internal kmeans fit (rather than being silently overridden, as happens for metric/n_clusters in train_pq_centers). A light-weight assertion (e.g. observing different PQ codebooks for two distinct honored settings on the same seed, or exposing the effective kmeans params from the built quantizer) would close that gap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/preprocessing/product_quantization.cu` around lines 256 - 313, The
test currently doesn't verify that a user-specified classic kmeans setting
(InitMethod) is actually honored end-to-end; add a small end-to-end assertion:
create two pq::params objects that are identical except for
cuvs::cluster::kmeans::params::InitMethod (e.g., Random vs another valid
InitMethod), call build(handle, pq_params,
raft::make_const_mdspan(dataset_host.view())) for each, and compare the
resulting quantizers' learned centroids/codebooks (or any exposed effective
kmeans params) to assert they differ (or that the effective params match the
requested InitMethod); reference kmeans_params_variant and build to locate where
to change the test.
cpp/src/neighbors/detail/vpq_dataset.cuh (1)

84-120: Silent override of user-supplied n_clusters/metric in the classic-kmeans branch.

In both branches the metric on the user's kmeans params struct is unconditionally reset to L2Expanded, and in the classic branch n_clusters is also reset to pq_centers_view.extent(0). The upstream build() in pq.cuh already rejects non-L2Expanded via RAFT_EXPECTS, so the metric reset here is redundant/defensive — but any other entry point that calls train_pq_centers directly (or in the future) will silently discard the caller's metric instead of failing loudly. Consider replacing the silent overwrites with a RAFT_EXPECTS that the incoming params are already consistent (metric == L2Expanded, and either unset or n_clusters == pq_centers_view.extent(0)), keeping the assignment only to fill in the unset case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/detail/vpq_dataset.cuh` around lines 84 - 120, Replace the
silent overwrites of user kmeans params inside the std::visit lambda: instead of
unconditionally assigning bal_params.metric and classic_params.metric and
forcing classic_params.n_clusters, add RAFT_EXPECTS checks that the incoming
kmeans params (kmeans_params / base_kmeans_params) already have metric ==
cuvs::distance::DistanceType::L2Expanded and that for the classic branch
classic_params.n_clusters is either unset (treat as 0) or equals
pq_centers_view.extent(0); only assign classic_params.n_clusters =
pq_centers_view.extent(0) when it is explicitly unset (e.g., == 0), and do not
silently overwrite a user-specified metric — fail fast with RAFT_EXPECTS
otherwise; apply the same pattern for the balanced branch (check metric or set
only if unset).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/include/cuvs/preprocessing/quantize/pq.hpp`:
- Around line 30-78: Add Doxygen blocks for the two public params constructors
in struct params (the params(uint32_t,pq_dim,bool,use_vq,...) overload that
builds kmeans_params and the params(..., kmeans_params_variant, ...) overload)
including `@brief` and `@param` for every parameter (pq_bits, pq_dim, use_subspaces,
use_vq, vq_n_centers, kmeans_n_iters/pq_kmeans_type where applicable,
kmeans_params, max_train_points_per_pq_code, max_train_points_per_vq_cluster).
Also add a short migration note in the header comment for struct params
explaining the breaking change (kmeans_n_iters and pq_kmeans_type are no longer
public fields and designated-initializer usage like pq::params{.pq_bits = 8,
.use_vq = true} is no longer supported) and provide a suggested migration
snippet (use the new constructors or supply a kmeans_params_variant). Finally
add a deprecation path by reintroducing the old public fields or constructor
overloads guarded by a DEPRECATED macro/attribute that forwards to the new
constructors (or mark an inline compatibility constructor deprecated) so
downstream users get warnings while migrating.

In `@cpp/src/preprocessing/quantize/detail/pq.cuh`:
- Around line 385-407: The make_pq_params_from_vpq helper computes train-set
sizes using n_rows * fraction / n_centers which can yield zero for uninitialized
fraction values and causes unsafe narrowing from double to uint32_t; update
make_pq_params_from_vpq to (1) add precondition checks (e.g., RAFT_EXPECTS or
equivalent) that in_params.pq_kmeans_trainset_fraction and
in_params.vq_kmeans_trainset_fraction are > 0 when used, and (2) avoid implicit
double->uint32_t narrowing by computing the min using double and then explicitly
static_cast<uint32_t>(...) or compute in integer order to bound the result
safely before assigning to max_train_points_per_vq_cluster and the pq train-size
field so truncation/overflow warnings are eliminated (refer to function
make_pq_params_from_vpq and symbols pq_n_centers,
max_train_points_per_vq_cluster).

---

Nitpick comments:
In `@cpp/include/cuvs/preprocessing/quantize/pq.hpp`:
- Around line 34-57: The params(...) constructor has a long positional
boolean-heavy signature and doesn't allow configuring classic-kmeans-only
fields; refactor to remove this error-prone overload and provide safer
alternatives: either delete the long positional params(uint32_t pq_bits,
uint32_t pq_dim, bool use_subspaces, bool use_vq, uint32_t vq_n_centers,
uint32_t kmeans_n_iters, ...) constructor and keep a default params() plus a
single constructor that accepts kmeans_params_variant, or keep params() =
default and add small factory functions (e.g.
make_balanced_kmeans_params(n_iters) and make_classic_kmeans_params(pq_bits,
n_iters)) so callers construct kmeans_params_variant explicitly; alternatively
group PQ-specific fields into an aggregate struct (e.g. PQSettings) and replace
the many positional args with that aggregate to allow designated initializers
and avoid boolean-argument swaps; update all call sites (including uses in
scann_build.cuh and C-API bindings) to use the new factories/aggregate or the
kmeans_params_variant constructor.

In `@cpp/src/neighbors/detail/vpq_dataset.cuh`:
- Around line 84-120: Replace the silent overwrites of user kmeans params inside
the std::visit lambda: instead of unconditionally assigning bal_params.metric
and classic_params.metric and forcing classic_params.n_clusters, add
RAFT_EXPECTS checks that the incoming kmeans params (kmeans_params /
base_kmeans_params) already have metric ==
cuvs::distance::DistanceType::L2Expanded and that for the classic branch
classic_params.n_clusters is either unset (treat as 0) or equals
pq_centers_view.extent(0); only assign classic_params.n_clusters =
pq_centers_view.extent(0) when it is explicitly unset (e.g., == 0), and do not
silently overwrite a user-specified metric — fail fast with RAFT_EXPECTS
otherwise; apply the same pattern for the balanced branch (check metric or set
only if unset).

In `@cpp/src/neighbors/scann/detail/scann_build.cuh`:
- Around line 163-172: The call constructing
cuvs::preprocessing::quantize::pq::params via pq_build_params uses a positional,
bool-heavy signature and can be misread or misordered; update the call site (the
pq_build_params construction) to annotate positional booleans and the VQ-related
int with inline argument-name comments (e.g. /*use_subspaces=*/true,
/*use_vq=*/false, /*vq_n_centers=*/0, /*kmeans_n_iters=*/params.pq_train_iters,
/*pq_kmeans_type=*/cuvs::cluster::kmeans::kmeans_type::KMeansBalanced,
/*max_train_points_per_pq_code=*/pq_n_rows_train / num_clusters,
/*max_train_points_per_vq_cluster=*/1024) so the intent and mapping to the ctor
fields in cuvs::preprocessing::quantize::pq::params is explicit and future-proof
against parameter reordering.

In `@cpp/tests/preprocessing/product_quantization.cu`:
- Around line 256-313: The test currently doesn't verify that a user-specified
classic kmeans setting (InitMethod) is actually honored end-to-end; add a small
end-to-end assertion: create two pq::params objects that are identical except
for cuvs::cluster::kmeans::params::InitMethod (e.g., Random vs another valid
InitMethod), call build(handle, pq_params,
raft::make_const_mdspan(dataset_host.view())) for each, and compare the
resulting quantizers' learned centroids/codebooks (or any exposed effective
kmeans params) to assert they differ (or that the effective params match the
requested InitMethod); reference kmeans_params_variant and build to locate where
to change the test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 8cd620b4-d725-4a36-92d4-ff2cb66ecb05

📥 Commits

Reviewing files that changed from the base of the PR and between c999208 and 48d01fc.

📒 Files selected for processing (6)
  • c/src/preprocessing/quantize/pq.cpp
  • cpp/include/cuvs/preprocessing/quantize/pq.hpp
  • cpp/src/neighbors/detail/vpq_dataset.cuh
  • cpp/src/neighbors/scann/detail/scann_build.cuh
  • cpp/src/preprocessing/quantize/detail/pq.cuh
  • cpp/tests/preprocessing/product_quantization.cu

Comment on lines 30 to +78
struct params {
/**
* Simplified constructor that will build an appropriate kmeans params object.
*/
params(uint32_t pq_bits,
uint32_t pq_dim,
bool use_subspaces,
bool use_vq,
uint32_t vq_n_centers,
uint32_t kmeans_n_iters,
cuvs::cluster::kmeans::kmeans_type pq_kmeans_type =
cuvs::cluster::kmeans::kmeans_type::KMeansBalanced,
uint32_t max_train_points_per_pq_code = 256,
uint32_t max_train_points_per_vq_cluster = 1024)
: pq_bits(pq_bits),
pq_dim(pq_dim),
use_subspaces(use_subspaces),
use_vq(use_vq),
vq_n_centers(vq_n_centers),
kmeans_params(
pq_kmeans_type == cuvs::cluster::kmeans::kmeans_type::KMeansBalanced
? kmeans_params_variant{cuvs::cluster::kmeans::balanced_params{.n_iters = kmeans_n_iters}}
: kmeans_params_variant{cuvs::cluster::kmeans::params{
.n_clusters = 1 << pq_bits, .max_iter = static_cast<int>(kmeans_n_iters)}}),
max_train_points_per_pq_code(max_train_points_per_pq_code),
max_train_points_per_vq_cluster(max_train_points_per_vq_cluster)
{
}

params(uint32_t pq_bits,
uint32_t pq_dim,
bool use_subspaces,
bool use_vq,
uint32_t vq_n_centers,
kmeans_params_variant kmeans_params,
uint32_t max_train_points_per_pq_code = 256,
uint32_t max_train_points_per_vq_cluster = 1024)
: pq_bits(pq_bits),
pq_dim(pq_dim),
use_subspaces(use_subspaces),
use_vq(use_vq),
vq_n_centers(vq_n_centers),
kmeans_params(kmeans_params),
max_train_points_per_pq_code(max_train_points_per_pq_code),
max_train_points_per_vq_cluster(max_train_points_per_vq_cluster)
{
}

params() = default;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add Doxygen for the new constructors and a migration note for this breaking change.

The two new public constructors are part of the documented API but lack @brief/@param descriptions for their arguments (the second constructor has no Doxygen block at all). Additionally, since this PR removes kmeans_n_iters and pq_kmeans_type as public fields and disables the previous designated-initializer usage pattern (e.g. pq::params{.pq_bits = 8, .use_vq = true} no longer compiles), please add a migration/upgrade note and, ideally, a deprecation path for downstream users. As per coding guidelines: "For public C++ API headers ... Doxygen documentation for all public functions/classes ... Breaking changes require deprecation warnings and migration guide updates."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/cuvs/preprocessing/quantize/pq.hpp` around lines 30 - 78, Add
Doxygen blocks for the two public params constructors in struct params (the
params(uint32_t,pq_dim,bool,use_vq,...) overload that builds kmeans_params and
the params(..., kmeans_params_variant, ...) overload) including `@brief` and
`@param` for every parameter (pq_bits, pq_dim, use_subspaces, use_vq,
vq_n_centers, kmeans_n_iters/pq_kmeans_type where applicable, kmeans_params,
max_train_points_per_pq_code, max_train_points_per_vq_cluster). Also add a short
migration note in the header comment for struct params explaining the breaking
change (kmeans_n_iters and pq_kmeans_type are no longer public fields and
designated-initializer usage like pq::params{.pq_bits = 8, .use_vq = true} is no
longer supported) and provide a suggested migration snippet (use the new
constructors or supply a kmeans_params_variant). Finally add a deprecation path
by reintroducing the old public fields or constructor overloads guarded by a
DEPRECATED macro/attribute that forwards to the new constructors (or mark an
inline compatibility constructor deprecated) so downstream users get warnings
while migrating.

Comment on lines +385 to +407
inline auto make_pq_params_from_vpq(const cuvs::neighbors::vpq_params& in_params,
const uint64_t n_rows)
-> cuvs::preprocessing::quantize::pq::params
{
const uint32_t pq_n_centers = 1 << in_params.pq_bits;
uint32_t max_train_points_per_vq_cluster = in_params.max_train_points_per_vq_cluster;
if (in_params.vq_n_centers > 0) {
max_train_points_per_vq_cluster =
std::min<uint32_t>(max_train_points_per_vq_cluster,
n_rows * in_params.vq_kmeans_trainset_fraction / in_params.vq_n_centers);
}
return cuvs::preprocessing::quantize::pq::params{
in_params.pq_bits,
in_params.pq_dim,
true,
true,
in_params.vq_n_centers,
in_params.kmeans_n_iters,
in_params.pq_kmeans_type,
std::min<uint32_t>(in_params.max_train_points_per_pq_code,
n_rows * in_params.pq_kmeans_trainset_fraction / pq_n_centers),
max_train_points_per_vq_cluster};
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Guard against zero/overflow in the n_rows * fraction / n_centers computations.

The expressions at lines 394 and 404-405 compute n_rows * fraction / n_centers as double and then narrow to uint32_t via std::min<uint32_t>. A few concerns worth confirming:

  • If in_params.pq_kmeans_trainset_fraction == 0 or in_params.vq_kmeans_trainset_fraction == 0 (which is the sentinel "unfilled" value in vpq_params), the result collapses to 0 and max_train_points_per_* becomes 0, making downstream train_pq fail the n_rows_train >= pq_n_centers check in a non-obvious way. vpq_build calls fill_missing_params_heuristics first so this is fine today, but make_pq_params_from_vpq is a public helper in this header and could be reached with unfilled vpq_params in the future — a RAFT_EXPECTS on the fractions would harden it.
  • Narrowing a double result to uint32_t via std::min<uint32_t>(..., static_cast<double>(...)) will raise implicit-conversion warnings under some compilers and silently truncate for very large n_rows. An explicit static_cast<uint32_t>(std::min<double>(..., ...)) (or computing in integer after ordering the division) would be more defensible.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/preprocessing/quantize/detail/pq.cuh` around lines 385 - 407, The
make_pq_params_from_vpq helper computes train-set sizes using n_rows * fraction
/ n_centers which can yield zero for uninitialized fraction values and causes
unsafe narrowing from double to uint32_t; update make_pq_params_from_vpq to (1)
add precondition checks (e.g., RAFT_EXPECTS or equivalent) that
in_params.pq_kmeans_trainset_fraction and in_params.vq_kmeans_trainset_fraction
are > 0 when used, and (2) avoid implicit double->uint32_t narrowing by
computing the min using double and then explicitly static_cast<uint32_t>(...) or
compute in integer order to bound the result safely before assigning to
max_train_points_per_vq_cluster and the pq train-size field so
truncation/overflow warnings are eliminated (refer to function
make_pq_params_from_vpq and symbols pq_n_centers,
max_train_points_per_vq_cluster).

Copy link
Copy Markdown
Contributor

@tarang-jain tarang-jain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added one small comment. But otherwise approving!

params config_;
};

TEST(ProductQuantizationTestF, Parameters)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not seen tests that simply check if parameter are set. I think we should be mindful with CI resources already overloaded with cpp tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking Introduces a breaking change C++ feature request New feature or request

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

[FEA] Expose kmeans param object for PQ preprocessing

4 participants