Exposition of KMeans param object for PQ#2005
Exposition of KMeans param object for PQ#2005lowener wants to merge 11 commits intorapidsai:mainfrom
Conversation
Signed-off-by: Mickael Ide <mide@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Signed-off-by: Mickael Ide <mide@nvidia.com>
Signed-off-by: Mickael Ide <mide@nvidia.com>
tarang-jain
left a comment
There was a problem hiding this comment.
LGTM. Thanks for this! I just added a comment about one small thing. Apart from that, I have no issues with this PR.
Signed-off-by: Mickael Ide <mide@nvidia.com>
viclafargue
left a comment
There was a problem hiding this comment.
Thanks! Here are some comments.
Also, shouldn't this feature be made available in C/Python?
| using KP = std::decay_t<decltype(base_kmeans_params)>; | ||
| if constexpr (std::is_same_v<KP, cuvs::cluster::kmeans::balanced_params>) { | ||
| auto bal_params = base_kmeans_params; | ||
| bal_params.metric = cuvs::distance::DistanceType::L2Expanded; |
There was a problem hiding this comment.
Please add a RAFT_EXPECTS check to ensure that the struct was already using the L2Expanded metric. A user may attempt to use PQ with an incompatible dataset/metric.
Also, it could be interested to check that n_clusters == 1 << pq_bits.
There was a problem hiding this comment.
n_clusters is always infered from pq_bits. I think it's not even necessary to check it since there are no other value possible. It can just be set.
Signed-off-by: Mickael Ide <mide@nvidia.com>
viclafargue
left a comment
There was a problem hiding this comment.
Thanks! LGTM
Noting that the PR does not forward the feature downstream to the C and Python layer. The PR would need a rename to clearly state that it is C++ only. Also, it may be interesting to open an issue if we plan to implement the bindings at a later date.
| in_params.kmeans_n_iters, | ||
| in_params.pq_kmeans_type, | ||
| std::min<uint32_t>(in_params.max_train_points_per_pq_code, | ||
| n_rows * in_params.pq_kmeans_trainset_fraction / pq_n_centers), |
There was a problem hiding this comment.
Just a detail, but integer truncation can make this calculation imprecise.
Signed-off-by: Mickael Ide <mide@nvidia.com>
📝 WalkthroughSummary by CodeRabbit
WalkthroughThe PR refactors the PQ (Product Quantization) preprocessing API to use a Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (4)
cpp/include/cuvs/preprocessing/quantize/pq.hpp (1)
34-57: Simplified constructor has a long positional-boolean signature and doesn't cover the full field surface.With six positional arguments including two back-to-back
bools (use_subspaces,use_vq) before an unsigned count (vq_n_centers), call sites are prone to silent argument-swap bugs (e.g. thescann_build.cuhand C-API call sites), and the constructor cannot be used to set any of the classic-kmeans-only fields (e.g.init,tol,oversampling_factor,seed) — callers that want those still have to use the variant constructor. Consider either (a) keeping only thekmeans_params_variantconstructor plusparams() = defaultand exposing a small factory likemake_balanced_kmeans_params(n_iters)/make_classic_kmeans_params(pq_bits, n_iters), or (b) grouping the PQ-only fields into their own aggregate so callers can keep designated-initializer syntax.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/cuvs/preprocessing/quantize/pq.hpp` around lines 34 - 57, The params(...) constructor has a long positional boolean-heavy signature and doesn't allow configuring classic-kmeans-only fields; refactor to remove this error-prone overload and provide safer alternatives: either delete the long positional params(uint32_t pq_bits, uint32_t pq_dim, bool use_subspaces, bool use_vq, uint32_t vq_n_centers, uint32_t kmeans_n_iters, ...) constructor and keep a default params() plus a single constructor that accepts kmeans_params_variant, or keep params() = default and add small factory functions (e.g. make_balanced_kmeans_params(n_iters) and make_classic_kmeans_params(pq_bits, n_iters)) so callers construct kmeans_params_variant explicitly; alternatively group PQ-specific fields into an aggregate struct (e.g. PQSettings) and replace the many positional args with that aggregate to allow designated initializers and avoid boolean-argument swaps; update all call sites (including uses in scann_build.cuh and C-API bindings) to use the new factories/aggregate or the kmeans_params_variant constructor.cpp/src/neighbors/scann/detail/scann_build.cuh (1)
163-172: Confirm positional arguments match the new constructor signature.The call relies on the order
(pq_bits, pq_dim, use_subspaces, use_vq, vq_n_centers, kmeans_n_iters, pq_kmeans_type, max_train_points_per_pq_code, max_train_points_per_vq_cluster). Herenum_subspacesis passed aspq_dim(consistent with the struct's definition ofpq_dimas the post-PQ dimensionality = number of subspaces), andmax_train_points_per_vq_cluster=1024is supplied even thoughuse_vq=false, which is harmless but dead. Given the positional-only API and the bool-heavy signature, it's easy to misalign arguments at call sites; consider adding inline/*use_subspaces=*/true, /*use_vq=*/false, /*vq_n_centers=*/0argument-name comments here to make the intent obvious and guard against future reorderings.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/scann/detail/scann_build.cuh` around lines 163 - 172, The call constructing cuvs::preprocessing::quantize::pq::params via pq_build_params uses a positional, bool-heavy signature and can be misread or misordered; update the call site (the pq_build_params construction) to annotate positional booleans and the VQ-related int with inline argument-name comments (e.g. /*use_subspaces=*/true, /*use_vq=*/false, /*vq_n_centers=*/0, /*kmeans_n_iters=*/params.pq_train_iters, /*pq_kmeans_type=*/cuvs::cluster::kmeans::kmeans_type::KMeansBalanced, /*max_train_points_per_pq_code=*/pq_n_rows_train / num_clusters, /*max_train_points_per_vq_cluster=*/1024) so the intent and mapping to the ctor fields in cuvs::preprocessing::quantize::pq::params is explicit and future-proof against parameter reordering.cpp/tests/preprocessing/product_quantization.cu (1)
256-313: LGTM — but consider also asserting that a user-specified classic-kmeans setting is actually honored end-to-end.The new
Parameterstest nicely covers the three constructor paths and the two invalid-configuration build() rejections (Array init, non-L2 metric). What it does not yet do — and what the earlier review thread specifically asked about — is verify that e.g.InitMethod::Randomvs some other honored setting is actually propagated into the internal kmeans fit (rather than being silently overridden, as happens formetric/n_clustersintrain_pq_centers). A light-weight assertion (e.g. observing different PQ codebooks for two distinct honored settings on the same seed, or exposing the effective kmeans params from the built quantizer) would close that gap.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/preprocessing/product_quantization.cu` around lines 256 - 313, The test currently doesn't verify that a user-specified classic kmeans setting (InitMethod) is actually honored end-to-end; add a small end-to-end assertion: create two pq::params objects that are identical except for cuvs::cluster::kmeans::params::InitMethod (e.g., Random vs another valid InitMethod), call build(handle, pq_params, raft::make_const_mdspan(dataset_host.view())) for each, and compare the resulting quantizers' learned centroids/codebooks (or any exposed effective kmeans params) to assert they differ (or that the effective params match the requested InitMethod); reference kmeans_params_variant and build to locate where to change the test.cpp/src/neighbors/detail/vpq_dataset.cuh (1)
84-120: Silent override of user-suppliedn_clusters/metricin the classic-kmeans branch.In both branches the
metricon the user's kmeans params struct is unconditionally reset toL2Expanded, and in the classic branchn_clustersis also reset topq_centers_view.extent(0). The upstreambuild()inpq.cuhalready rejects non-L2ExpandedviaRAFT_EXPECTS, so the metric reset here is redundant/defensive — but any other entry point that callstrain_pq_centersdirectly (or in the future) will silently discard the caller's metric instead of failing loudly. Consider replacing the silent overwrites with aRAFT_EXPECTSthat the incoming params are already consistent (metric == L2Expanded, and either unset orn_clusters == pq_centers_view.extent(0)), keeping the assignment only to fill in the unset case.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/vpq_dataset.cuh` around lines 84 - 120, Replace the silent overwrites of user kmeans params inside the std::visit lambda: instead of unconditionally assigning bal_params.metric and classic_params.metric and forcing classic_params.n_clusters, add RAFT_EXPECTS checks that the incoming kmeans params (kmeans_params / base_kmeans_params) already have metric == cuvs::distance::DistanceType::L2Expanded and that for the classic branch classic_params.n_clusters is either unset (treat as 0) or equals pq_centers_view.extent(0); only assign classic_params.n_clusters = pq_centers_view.extent(0) when it is explicitly unset (e.g., == 0), and do not silently overwrite a user-specified metric — fail fast with RAFT_EXPECTS otherwise; apply the same pattern for the balanced branch (check metric or set only if unset).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/include/cuvs/preprocessing/quantize/pq.hpp`:
- Around line 30-78: Add Doxygen blocks for the two public params constructors
in struct params (the params(uint32_t,pq_dim,bool,use_vq,...) overload that
builds kmeans_params and the params(..., kmeans_params_variant, ...) overload)
including `@brief` and `@param` for every parameter (pq_bits, pq_dim, use_subspaces,
use_vq, vq_n_centers, kmeans_n_iters/pq_kmeans_type where applicable,
kmeans_params, max_train_points_per_pq_code, max_train_points_per_vq_cluster).
Also add a short migration note in the header comment for struct params
explaining the breaking change (kmeans_n_iters and pq_kmeans_type are no longer
public fields and designated-initializer usage like pq::params{.pq_bits = 8,
.use_vq = true} is no longer supported) and provide a suggested migration
snippet (use the new constructors or supply a kmeans_params_variant). Finally
add a deprecation path by reintroducing the old public fields or constructor
overloads guarded by a DEPRECATED macro/attribute that forwards to the new
constructors (or mark an inline compatibility constructor deprecated) so
downstream users get warnings while migrating.
In `@cpp/src/preprocessing/quantize/detail/pq.cuh`:
- Around line 385-407: The make_pq_params_from_vpq helper computes train-set
sizes using n_rows * fraction / n_centers which can yield zero for uninitialized
fraction values and causes unsafe narrowing from double to uint32_t; update
make_pq_params_from_vpq to (1) add precondition checks (e.g., RAFT_EXPECTS or
equivalent) that in_params.pq_kmeans_trainset_fraction and
in_params.vq_kmeans_trainset_fraction are > 0 when used, and (2) avoid implicit
double->uint32_t narrowing by computing the min using double and then explicitly
static_cast<uint32_t>(...) or compute in integer order to bound the result
safely before assigning to max_train_points_per_vq_cluster and the pq train-size
field so truncation/overflow warnings are eliminated (refer to function
make_pq_params_from_vpq and symbols pq_n_centers,
max_train_points_per_vq_cluster).
---
Nitpick comments:
In `@cpp/include/cuvs/preprocessing/quantize/pq.hpp`:
- Around line 34-57: The params(...) constructor has a long positional
boolean-heavy signature and doesn't allow configuring classic-kmeans-only
fields; refactor to remove this error-prone overload and provide safer
alternatives: either delete the long positional params(uint32_t pq_bits,
uint32_t pq_dim, bool use_subspaces, bool use_vq, uint32_t vq_n_centers,
uint32_t kmeans_n_iters, ...) constructor and keep a default params() plus a
single constructor that accepts kmeans_params_variant, or keep params() =
default and add small factory functions (e.g.
make_balanced_kmeans_params(n_iters) and make_classic_kmeans_params(pq_bits,
n_iters)) so callers construct kmeans_params_variant explicitly; alternatively
group PQ-specific fields into an aggregate struct (e.g. PQSettings) and replace
the many positional args with that aggregate to allow designated initializers
and avoid boolean-argument swaps; update all call sites (including uses in
scann_build.cuh and C-API bindings) to use the new factories/aggregate or the
kmeans_params_variant constructor.
In `@cpp/src/neighbors/detail/vpq_dataset.cuh`:
- Around line 84-120: Replace the silent overwrites of user kmeans params inside
the std::visit lambda: instead of unconditionally assigning bal_params.metric
and classic_params.metric and forcing classic_params.n_clusters, add
RAFT_EXPECTS checks that the incoming kmeans params (kmeans_params /
base_kmeans_params) already have metric ==
cuvs::distance::DistanceType::L2Expanded and that for the classic branch
classic_params.n_clusters is either unset (treat as 0) or equals
pq_centers_view.extent(0); only assign classic_params.n_clusters =
pq_centers_view.extent(0) when it is explicitly unset (e.g., == 0), and do not
silently overwrite a user-specified metric — fail fast with RAFT_EXPECTS
otherwise; apply the same pattern for the balanced branch (check metric or set
only if unset).
In `@cpp/src/neighbors/scann/detail/scann_build.cuh`:
- Around line 163-172: The call constructing
cuvs::preprocessing::quantize::pq::params via pq_build_params uses a positional,
bool-heavy signature and can be misread or misordered; update the call site (the
pq_build_params construction) to annotate positional booleans and the VQ-related
int with inline argument-name comments (e.g. /*use_subspaces=*/true,
/*use_vq=*/false, /*vq_n_centers=*/0, /*kmeans_n_iters=*/params.pq_train_iters,
/*pq_kmeans_type=*/cuvs::cluster::kmeans::kmeans_type::KMeansBalanced,
/*max_train_points_per_pq_code=*/pq_n_rows_train / num_clusters,
/*max_train_points_per_vq_cluster=*/1024) so the intent and mapping to the ctor
fields in cuvs::preprocessing::quantize::pq::params is explicit and future-proof
against parameter reordering.
In `@cpp/tests/preprocessing/product_quantization.cu`:
- Around line 256-313: The test currently doesn't verify that a user-specified
classic kmeans setting (InitMethod) is actually honored end-to-end; add a small
end-to-end assertion: create two pq::params objects that are identical except
for cuvs::cluster::kmeans::params::InitMethod (e.g., Random vs another valid
InitMethod), call build(handle, pq_params,
raft::make_const_mdspan(dataset_host.view())) for each, and compare the
resulting quantizers' learned centroids/codebooks (or any exposed effective
kmeans params) to assert they differ (or that the effective params match the
requested InitMethod); reference kmeans_params_variant and build to locate where
to change the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 8cd620b4-d725-4a36-92d4-ff2cb66ecb05
📒 Files selected for processing (6)
c/src/preprocessing/quantize/pq.cppcpp/include/cuvs/preprocessing/quantize/pq.hppcpp/src/neighbors/detail/vpq_dataset.cuhcpp/src/neighbors/scann/detail/scann_build.cuhcpp/src/preprocessing/quantize/detail/pq.cuhcpp/tests/preprocessing/product_quantization.cu
| struct params { | ||
| /** | ||
| * Simplified constructor that will build an appropriate kmeans params object. | ||
| */ | ||
| params(uint32_t pq_bits, | ||
| uint32_t pq_dim, | ||
| bool use_subspaces, | ||
| bool use_vq, | ||
| uint32_t vq_n_centers, | ||
| uint32_t kmeans_n_iters, | ||
| cuvs::cluster::kmeans::kmeans_type pq_kmeans_type = | ||
| cuvs::cluster::kmeans::kmeans_type::KMeansBalanced, | ||
| uint32_t max_train_points_per_pq_code = 256, | ||
| uint32_t max_train_points_per_vq_cluster = 1024) | ||
| : pq_bits(pq_bits), | ||
| pq_dim(pq_dim), | ||
| use_subspaces(use_subspaces), | ||
| use_vq(use_vq), | ||
| vq_n_centers(vq_n_centers), | ||
| kmeans_params( | ||
| pq_kmeans_type == cuvs::cluster::kmeans::kmeans_type::KMeansBalanced | ||
| ? kmeans_params_variant{cuvs::cluster::kmeans::balanced_params{.n_iters = kmeans_n_iters}} | ||
| : kmeans_params_variant{cuvs::cluster::kmeans::params{ | ||
| .n_clusters = 1 << pq_bits, .max_iter = static_cast<int>(kmeans_n_iters)}}), | ||
| max_train_points_per_pq_code(max_train_points_per_pq_code), | ||
| max_train_points_per_vq_cluster(max_train_points_per_vq_cluster) | ||
| { | ||
| } | ||
|
|
||
| params(uint32_t pq_bits, | ||
| uint32_t pq_dim, | ||
| bool use_subspaces, | ||
| bool use_vq, | ||
| uint32_t vq_n_centers, | ||
| kmeans_params_variant kmeans_params, | ||
| uint32_t max_train_points_per_pq_code = 256, | ||
| uint32_t max_train_points_per_vq_cluster = 1024) | ||
| : pq_bits(pq_bits), | ||
| pq_dim(pq_dim), | ||
| use_subspaces(use_subspaces), | ||
| use_vq(use_vq), | ||
| vq_n_centers(vq_n_centers), | ||
| kmeans_params(kmeans_params), | ||
| max_train_points_per_pq_code(max_train_points_per_pq_code), | ||
| max_train_points_per_vq_cluster(max_train_points_per_vq_cluster) | ||
| { | ||
| } | ||
|
|
||
| params() = default; |
There was a problem hiding this comment.
Add Doxygen for the new constructors and a migration note for this breaking change.
The two new public constructors are part of the documented API but lack @brief/@param descriptions for their arguments (the second constructor has no Doxygen block at all). Additionally, since this PR removes kmeans_n_iters and pq_kmeans_type as public fields and disables the previous designated-initializer usage pattern (e.g. pq::params{.pq_bits = 8, .use_vq = true} no longer compiles), please add a migration/upgrade note and, ideally, a deprecation path for downstream users. As per coding guidelines: "For public C++ API headers ... Doxygen documentation for all public functions/classes ... Breaking changes require deprecation warnings and migration guide updates."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/include/cuvs/preprocessing/quantize/pq.hpp` around lines 30 - 78, Add
Doxygen blocks for the two public params constructors in struct params (the
params(uint32_t,pq_dim,bool,use_vq,...) overload that builds kmeans_params and
the params(..., kmeans_params_variant, ...) overload) including `@brief` and
`@param` for every parameter (pq_bits, pq_dim, use_subspaces, use_vq,
vq_n_centers, kmeans_n_iters/pq_kmeans_type where applicable, kmeans_params,
max_train_points_per_pq_code, max_train_points_per_vq_cluster). Also add a short
migration note in the header comment for struct params explaining the breaking
change (kmeans_n_iters and pq_kmeans_type are no longer public fields and
designated-initializer usage like pq::params{.pq_bits = 8, .use_vq = true} is no
longer supported) and provide a suggested migration snippet (use the new
constructors or supply a kmeans_params_variant). Finally add a deprecation path
by reintroducing the old public fields or constructor overloads guarded by a
DEPRECATED macro/attribute that forwards to the new constructors (or mark an
inline compatibility constructor deprecated) so downstream users get warnings
while migrating.
| inline auto make_pq_params_from_vpq(const cuvs::neighbors::vpq_params& in_params, | ||
| const uint64_t n_rows) | ||
| -> cuvs::preprocessing::quantize::pq::params | ||
| { | ||
| const uint32_t pq_n_centers = 1 << in_params.pq_bits; | ||
| uint32_t max_train_points_per_vq_cluster = in_params.max_train_points_per_vq_cluster; | ||
| if (in_params.vq_n_centers > 0) { | ||
| max_train_points_per_vq_cluster = | ||
| std::min<uint32_t>(max_train_points_per_vq_cluster, | ||
| n_rows * in_params.vq_kmeans_trainset_fraction / in_params.vq_n_centers); | ||
| } | ||
| return cuvs::preprocessing::quantize::pq::params{ | ||
| in_params.pq_bits, | ||
| in_params.pq_dim, | ||
| true, | ||
| true, | ||
| in_params.vq_n_centers, | ||
| in_params.kmeans_n_iters, | ||
| in_params.pq_kmeans_type, | ||
| std::min<uint32_t>(in_params.max_train_points_per_pq_code, | ||
| n_rows * in_params.pq_kmeans_trainset_fraction / pq_n_centers), | ||
| max_train_points_per_vq_cluster}; | ||
| } |
There was a problem hiding this comment.
Guard against zero/overflow in the n_rows * fraction / n_centers computations.
The expressions at lines 394 and 404-405 compute n_rows * fraction / n_centers as double and then narrow to uint32_t via std::min<uint32_t>. A few concerns worth confirming:
- If
in_params.pq_kmeans_trainset_fraction == 0orin_params.vq_kmeans_trainset_fraction == 0(which is the sentinel "unfilled" value invpq_params), the result collapses to 0 andmax_train_points_per_*becomes 0, making downstreamtrain_pqfail then_rows_train >= pq_n_centerscheck in a non-obvious way.vpq_buildcallsfill_missing_params_heuristicsfirst so this is fine today, butmake_pq_params_from_vpqis a public helper in this header and could be reached with unfilledvpq_paramsin the future — aRAFT_EXPECTSon the fractions would harden it. - Narrowing a
doubleresult touint32_tviastd::min<uint32_t>(..., static_cast<double>(...))will raise implicit-conversion warnings under some compilers and silently truncate for very largen_rows. An explicitstatic_cast<uint32_t>(std::min<double>(..., ...))(or computing in integer after ordering the division) would be more defensible.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/preprocessing/quantize/detail/pq.cuh` around lines 385 - 407, The
make_pq_params_from_vpq helper computes train-set sizes using n_rows * fraction
/ n_centers which can yield zero for uninitialized fraction values and causes
unsafe narrowing from double to uint32_t; update make_pq_params_from_vpq to (1)
add precondition checks (e.g., RAFT_EXPECTS or equivalent) that
in_params.pq_kmeans_trainset_fraction and in_params.vq_kmeans_trainset_fraction
are > 0 when used, and (2) avoid implicit double->uint32_t narrowing by
computing the min using double and then explicitly static_cast<uint32_t>(...) or
compute in integer order to bound the result safely before assigning to
max_train_points_per_vq_cluster and the pq train-size field so
truncation/overflow warnings are eliminated (refer to function
make_pq_params_from_vpq and symbols pq_n_centers,
max_train_points_per_vq_cluster).
tarang-jain
left a comment
There was a problem hiding this comment.
Added one small comment. But otherwise approving!
| params config_; | ||
| }; | ||
|
|
||
| TEST(ProductQuantizationTestF, Parameters) |
There was a problem hiding this comment.
I have not seen tests that simply check if parameter are set. I think we should be mindful with CI resources already overloaded with cpp tests.
Closes #1999.
Behavior change:
Breaking change: