[Cleanup] Combine Batched and Regular KMeans Impl#2015
tarang-jain wants to merge 40 commits into rapidsai:main from combine-batch
Conversation
viclafargue
left a comment
Thanks! Here are some comments.
```cpp
auto minClusterAndDistance = raft::make_device_vector<raft::KeyValuePair<IndexT, DataT>, IndexT>(
  handle, streaming_batch_size);
auto L2NormBatch = raft::make_device_vector<DataT, IndexT>(handle, streaming_batch_size);
```
params.streaming_batch_size is 0 by default in the data-on-device case, but nothing prevents a user from setting a value. This would allocate an L2NormBatch smaller than n_samples, which would cause OOB writes (and later reads) during norm computation.
We should probably guard this with a check:
RAFT_EXPECTS(streaming_batch_size == n_samples || !data_on_device, ...)
I have updated this so that for device arrays, we simply ignore the streaming_batch_size and use the entire dataset always.
```cpp
auto init_sample =
  raft::make_device_matrix<DataT, IndexT>(handle, init_sample_size, n_features);
raft::matrix::sample_rows(handle, random_state, X, init_sample.view());
```
```diff
-auto init_sample =
-  raft::make_device_matrix<DataT, IndexT>(handle, init_sample_size, n_features);
-raft::matrix::sample_rows(handle, random_state, X, init_sample.view());
+if (init_sample_size == n_samples && data_on_device) {
+  auto init_sample_const = raft::make_device_matrix_view<const DataT, IndexT>(X.data_handle(), n_samples, n_features);
+  // pass directly to kmeansPlusPlus / initScalableKMeansPlusPlus
+} else {
+  auto init_sample = raft::make_device_matrix<DataT, IndexT>(handle, init_sample_size, n_features);
+  raft::matrix::sample_rows(handle, random_state, X, init_sample.view());
+  // pass init_sample to kmeansPlusPlus / initScalableKMeansPlusPlus
+}
```
If init_size = 0 in the data on device path, we basically double memory use by copying the dataset over. Let's skip this by creating a view on the dataset.
I completely skipped the sampling for the device path. That is how it was being done earlier. The init size is only used if the data is on host.
```cpp
auto batch_workspace = rmm::device_uvector<char>(
  current_batch_sz, stream, raft::resource::get_workspace_resource(handle));
```
Every call to process_batch allocates both this workspace and the device scalar below. Both buffers could be instantiated out of the process_batch function.
Moved the workspace buffer allocation outside the process_batch function.
```cpp
raft::matrix::sample_rows(handle, random_state, X, centroidsRawData);
} else if (iter_params.init == cuvs::cluster::kmeans::params::InitMethod::KMeansPlusPlus) {
  IndexT default_init_size =
    data_on_device ? n_samples : std::min(static_cast<IndexT>(3 * n_clusters), n_samples);
```
Unlikely to be an actual issue, but n_clusters could be cast before the multiplication to avoid any risk of integer overflow.
I have gotten rid of the batching for the device path. So when a user sets a batch size for device mdspan, we just set it to n_samples and warn the user. We should definitely not be creating a new buffer just for the init sample if we can accommodate the entire input matrix on device already.
```cpp
DataT curClusteringCost = DataT{0};
raft::copy(&curClusteringCost, clustering_cost.data_handle(), 1, stream);
raft::resource::sync_stream(handle, stream);

if (curClusteringCost == DataT{0}) {
  RAFT_LOG_WARN("Zero clustering cost detected: all points coincide with their centroids.");
```
Going from ASSERT to RAFT_LOG_WARN may indeed be useful for the spectral clustering case. However, removing the inertia_check option forces the sync at every iteration. Do we truly need to drop this option?
I'm not sure we need the log and an assert might be better?
Early stopping (aka skipping iterations) is ultimately going to be the best way to extract perf here. Whether it's by explicitly computing inertia or just looking at the residuals of the centroids from the prior iteration.
Seems like inertia check / residuals could be done on gpu if we had to in order to avoid syncing so we would only need to sync in the final iteration, right?
> Seems like inertia check / residuals could be done on gpu if we had to in order to avoid syncing so we would only need to sync in the final iteration, right?
Until the iteration has completed, the CPU should not start the next iteration. So all the operations on the GPU stream must complete to finish the iteration.
> Seems like inertia check / residuals could be done on gpu if we had to in order to avoid syncing so we would only need to sync in the final iteration, right?
Yes, but this is throwing an error in the spectral clustering case wherein all the points converge on the centroids themselves. This is happening in one of the spectral tests and an assertion here is leading to an error, where instead it should simply return those centroids directly.
> I'm not sure we need the log and an assert might be better?
Therefore, I had to change it to a warning instead of an assertion. Earlier those spectral tests were skipping the inertia check which was avoiding the assertion.
```cpp
} else {
  std::vector<DataT> h_weights(n_samples);
  auto d_view = raft::make_device_vector_view<const DataT, IndexT>(weight_ptr, n_samples);
  auto h_view = raft::make_host_vector_view<DataT, IndexT>(h_weights.data(), n_samples);
  raft::copy(handle, h_view, d_view);
  raft::resource::sync_stream(handle);
  for (IndexT i = 0; i < n_samples; ++i) {
    wt_sum += h_weights[i];
  }
```
In the data-on-device case, since the data is already on device it would be much faster to sum-reduce with cub::DeviceReduce::Sum or raft::linalg::reduce. The summation would also have better precision, since a tree-shaped reduction accumulates rounding error in O(log N) rather than O(N).
When it's device accessible, I have changed that to a raft::linalg::mapThenSumReduce. I have also removed this function and directly updated checkWeight (renamed to weightSum). We do the scaling after the weight sum is computed.
```diff
@@ -33,9 +33,10 @@ cdef extern from "cuvs/cluster/kmeans.h" nogil:
         int batch_samples,
         int batch_centroids,
         bool inertia_check,
```
Maybe add a comment saying the field is present but deprecated.
I don't think it's necessary to add a comment here in the .pxd. The C header already has that information. And this file will be updated along with the C headers / src files.
@viclafargue ABI won't break. We are allowed to add new members to structs and enums. However I have added the
Actionable comments posted: 4
♻️ Duplicate comments (1)
cpp/src/cluster/detail/kmeans.cuh (1)
591-599: ⚠️ Potential issue | 🟡 Minor

Exact float comparison for wt_sum == n_samples is inconsistent with kmeans_predict.

kmeans_predict (lines 998-999) uses a relative-tolerance check:

```cpp
const DataT rel_tol = n_samples * std::numeric_limits<DataT>::epsilon();
if (std::abs(wt_sum - n_samples) > rel_tol) { ... }
```

But kmeans_fit still compares with != on both line 595 (to emit the debug message) and line 696 (inside prepare_batch_weights, which governs whether the per-batch scaling op runs). With a large n_samples, summation imprecision means the != branch will almost always trigger even when user weights genuinely sum to n_samples, causing an unnecessary map/divide on every batch and every outer iteration.

🛠️ Proposed fix: mirror the predict path

```diff
-  DataT wt_sum = sample_weight.has_value() ? weightSum(handle, sample_weight.value())
-                                           : static_cast<DataT>(n_samples);
-  if (sample_weight.has_value() && wt_sum != static_cast<DataT>(n_samples)) {
+  DataT wt_sum = sample_weight.has_value() ? weightSum(handle, sample_weight.value())
+                                           : static_cast<DataT>(n_samples);
+  const DataT wt_rel_tol = static_cast<DataT>(n_samples) * std::numeric_limits<DataT>::epsilon();
+  const bool needs_wt_norm =
+    sample_weight.has_value() && std::abs(wt_sum - static_cast<DataT>(n_samples)) > wt_rel_tol;
+  if (needs_wt_norm) {
     RAFT_LOG_DEBUG(
       "[Warning!] KMeans: normalizing the user provided sample weight to sum up to %zu samples",
       static_cast<size_t>(n_samples));
   }
```

Then replace wt_sum != static_cast<DataT>(n_samples) on line 696 with the captured needs_wt_norm flag.
wt_sum != static_cast<DataT>(n_samples)on line 696 with the capturedneeds_wt_normflag.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/cluster/detail/kmeans.cuh` around lines 591 - 599, In kmeans_fit, avoid exact float comparison of wt_sum to n_samples: compute rel_tol = n_samples * std::numeric_limits<DataT>::epsilon(); set a bool needs_wt_norm = (std::abs(wt_sum - static_cast<DataT>(n_samples)) > rel_tol) (mirroring kmeans_predict) and use that flag for the debug RAFT_LOG_DEBUG and to decide whether to run the per-batch scaling in prepare_batch_weights; replace the existing != checks on wt_sum with checks against needs_wt_norm and pass/use that flag instead of recomputing or comparing wt_sum directly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@c/include/cuvs/cluster/kmeans.h`:
- Around line 123-199: Summary: cuvsKMeansParams_v1 is declared but not exposed
via the C API pattern, making it unreachable for C consumers; add the missing
typedef and lifecycle APIs or remove the struct until helpers land. Fix: add a
typedef cuvsKMeansParams_v1_t for cuvsKMeansParams_v1 and declare public C API
functions cuvsKMeansParamsCreate_v1(...) and
cuvsKMeansParamsDestroy_v1(cuvsKMeansParams_v1_t*) following the
signatures/semantics of existing cuvsKMeansParamsCreate/cuvsKMeansParamsDestroy,
include Doxygen comments for each new symbol (typedef and functions), and mark
any older/changed APIs with deprecation notes and update the migration guide per
repo guidelines; alternatively, if you choose to defer, remove
cuvsKMeansParams_v1 now to avoid dead surface area.
- Line 126: Remove the stray leading whitespace before the struct declaration so
the line starts with "struct cuvsKMeansParams_v1 {" (no leading spaces); update
the declaration in the header (symbol: cuvsKMeansParams_v1) to match the file's
code style/indentation conventions so the struct aligns with other top-level
declarations.
In `@cpp/include/cuvs/cluster/kmeans.hpp`:
- Around line 115-123: Restore a deprecated stub for the removed field in the
cuvs::cluster::kmeans::params struct by reintroducing a [[deprecated("ignored;
use init_size instead")]] bool inertia_check = false; member so existing C++
call sites compile with a warning (reference cuvs::cluster::kmeans::params and
inertia_check), and update the project docs/CHANGELOG with a short migration
note stating inertia_check is deprecated/ignored and that inertia-based
convergence is now always on (point users to use init_size instead); keep the C
API behavior unchanged.
In `@python/cuvs/cuvs/cluster/kmeans/kmeans.pyx`:
- Around line 79-82: Update the KMeans docstring for init_size to state that it
applies to both device and host data and document the differing
defaults/behaviors (on device default = n_samples when init_size==0; on host
default = min(3 * n_clusters, n_samples)) referencing the same semantics as
cpp/include/cuvs/cluster/kmeans.hpp; also restore acceptance of the
inertia_check keyword in KMeansParams.__init__ (class KMeansParams) so existing
callers do not raise TypeError, emit a DeprecationWarning and ignore the
provided value for one release (no-op) to match the C-struct ABI-preservation
strategy.
---
Duplicate comments:
In `@cpp/src/cluster/detail/kmeans.cuh`:
- Around line 591-599: In kmeans_fit, avoid exact float comparison of wt_sum to
n_samples: compute rel_tol = n_samples * std::numeric_limits<DataT>::epsilon();
set a bool needs_wt_norm = (std::abs(wt_sum - static_cast<DataT>(n_samples)) >
rel_tol) (mirroring kmeans_predict) and use that flag for the debug
RAFT_LOG_DEBUG and to decide whether to run the per-batch scaling in
prepare_batch_weights; replace the existing != checks on wt_sum with checks
against needs_wt_norm and pass/use that flag instead of recomputing or comparing
wt_sum directly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: bc3ad388-89a6-4de0-87f7-fd8710a65b58
📒 Files selected for processing (15)

- c/include/cuvs/cluster/kmeans.h
- c/src/cluster/kmeans.cpp
- cpp/include/cuvs/cluster/kmeans.hpp
- cpp/src/cluster/detail/kmeans.cuh
- cpp/src/cluster/detail/kmeans_batched.cuh
- cpp/src/cluster/detail/kmeans_common.cuh
- cpp/src/cluster/detail/kmeans_mg.cuh
- cpp/src/cluster/detail/minClusterDistanceCompute.cu
- cpp/src/cluster/kmeans.cuh
- cpp/src/cluster/kmeans_fit_double.cu
- cpp/src/cluster/kmeans_fit_float.cu
- cpp/src/cluster/kmeans_impl.cuh
- cpp/tests/cluster/kmeans.cu
- python/cuvs/cuvs/cluster/kmeans/kmeans.pxd
- python/cuvs/cuvs/cluster/kmeans/kmeans.pyx
💤 Files with no reviewable changes (2)
- cpp/src/cluster/kmeans.cuh
- cpp/src/cluster/detail/kmeans_batched.cuh
```c
/**
 * @brief Hyper-parameters for the kmeans algorithm
 */
struct cuvsKMeansParams_v1 {
  cuvsDistanceType metric;

  /**
   * The number of clusters to form as well as the number of centroids to generate (default:8).
   */
  int n_clusters;

  /**
   * Method for initialization, defaults to k-means++:
   * - cuvsKMeansInitMethod::KMeansPlusPlus (k-means++): Use scalable k-means++ algorithm
   *   to select the initial cluster centers.
   * - cuvsKMeansInitMethod::Random (random): Choose 'n_clusters' observations (rows) at
   *   random from the input data for the initial centroids.
   * - cuvsKMeansInitMethod::Array (ndarray): Use 'centroids' as initial cluster centers.
   */
  cuvsKMeansInitMethod init;

  /**
   * Maximum number of iterations of the k-means algorithm for a single run.
   */
  int max_iter;

  /**
   * Relative tolerance with regards to inertia to declare convergence.
   */
  double tol;

  /**
   * Number of instances the k-means algorithm will be run with different seeds.
   */
  int n_init;

  /**
   * Oversampling factor for use in the k-means|| algorithm
   */
  double oversampling_factor;

  /**
   * batch_samples and batch_centroids are used to tile 1NN computation which is
   * useful to optimize/control the memory footprint
   * Default tile is [batch_samples x n_clusters] i.e. when batch_centroids is 0
   * then don't tile the centroids
   */
  int batch_samples;

  /**
   * if 0 then batch_centroids = n_clusters
   */
  int batch_centroids;

  /**
   * Whether to use hierarchical (balanced) kmeans or not
   */
  bool hierarchical;

  /**
   * For hierarchical k-means, defines the number of training iterations
   */
  int hierarchical_n_iters;

  /**
   * Number of samples to process per GPU batch for the batched (host-data) API.
   * When set to 0, defaults to n_samples (process all at once).
   */
  int64_t streaming_batch_size;

  /**
   * Number of samples to draw for KMeansPlusPlus initialization.
   * When set to 0, uses heuristic min(3 * n_clusters, n_samples) for host data,
   * or n_samples for device data.
   */
  int64_t init_size;
};
```
cuvsKMeansParams_v1 is declared but not reachable through the standard C API pattern.
The existing API exposes cuvsKMeansParams via cuvsKMeansParams_t typedef and cuvsKMeansParamsCreate/cuvsKMeansParamsDestroy. The new cuvsKMeansParams_v1 struct has:
- No corresponding cuvsKMeansParams_v1_t typedef.
- No cuvsKMeansParamsCreate_v1 / cuvsKMeansParamsDestroy_v1 (or equivalent) declarations.
As a result, C consumers cannot allocate/initialize/free a cuvsKMeansParams_v1 using the established pattern, and the struct effectively becomes dead surface area until 26.08. Please either add the companion typedef and lifecycle APIs now, or defer adding the v1 struct until the matching helpers land together.
As per coding guidelines, c/include/cuvs/**/*: public C API headers must include Doxygen documentation for all public functions/classes, and API changes require deprecation warnings and migration guide updates.
```c
/**
 * @brief Hyper-parameters for the kmeans algorithm
 */
 struct cuvsKMeansParams_v1 {
```
Stray leading whitespace before struct cuvsKMeansParams_v1.
```diff
-/**
- * @brief Hyper-parameters for the kmeans algorithm
- */
- struct cuvsKMeansParams_v1 {
+/**
+ * @brief Hyper-parameters for the kmeans algorithm
+ */
+struct cuvsKMeansParams_v1 {
```
```diff
 /**
- * If true, check inertia during iterations for early convergence.
+ * Number of samples to randomly draw for the KMeansPlusPlus initialization
+ * step. A random subset of this size is used for centroid seeding.
+ * When set to 0 the default depends on the data location:
+ * - Device data: n_samples (use the full dataset).
+ * - Host data: min(3 * n_clusters, n_samples).
+ * Default: 0.
  */
-bool inertia_check = false;
+int64_t init_size = 0;
```
Breaking removal of inertia_check from the public C++ params struct lacks deprecation/migration notes.
The cuvs::cluster::kmeans::params::inertia_check field has been removed outright (replaced by init_size). Any downstream C++ code that sets or reads params.inertia_check will now fail to compile with a hard error, despite the PR being labeled non-breaking. The C API preserves the field for ABI compatibility (marked deprecated/ignored), but the C++ API does not offer the same grace period.
Please either:
- Retain a deprecated [[deprecated("ignored; use init_size instead")]] bool inertia_check = false; stub in the struct for one release cycle so existing call sites keep compiling with a warning, and
- Add a migration note in the project docs/CHANGELOG explaining that inertia_check is gone and that inertia-based convergence is now always on.
As per coding guidelines: "API changes flagged for docs/ updates" and "Breaking changes require deprecation warnings and migration guide updates".
```python
init_size : int
    Number of samples to draw for KMeansPlusPlus initialization with
    host (out-of-core) data. When set to 0, uses the heuristic
    min(3 * n_clusters, n_samples). Default: 0.
```
Docstring understates init_size scope and Python API drops inertia_check without deprecation.
Two concerns:
1. The docstring says init_size applies only to "host (out-of-core) data", but cpp/include/cuvs/cluster/kmeans.hpp documents it for both device and host data (defaulting to n_samples on device vs. min(3 * n_clusters, n_samples) on host when set to 0). Users on device data will not realize they can tune this parameter.
2. The previously accepted inertia_check= keyword on KMeansParams.__init__ was removed outright. Existing Python code passing KMeansParams(inertia_check=...) will now raise TypeError: unexpected keyword argument. Consider accepting the kwarg for one release cycle (warn via DeprecationWarning and ignore) to match the ABI-preservation strategy used for the C struct.
📝 Suggested docstring adjustment

```diff
 init_size : int
-    Number of samples to draw for KMeansPlusPlus initialization with
-    host (out-of-core) data. When set to 0, uses the heuristic
-    min(3 * n_clusters, n_samples). Default: 0.
+    Number of samples to draw for the KMeansPlusPlus initialization
+    step. When set to 0, the default depends on data location:
+    ``n_samples`` for device data, ``min(3 * n_clusters, n_samples)``
+    for host data. Default: 0.
```
dantegd
left a comment
This PR improves things nicely, the code gets a significant improvement, love to see this!
```cpp
DataT curClusteringCost = DataT{0};
raft::copy(&curClusteringCost, clustering_cost.data_handle(), 1, stream);
raft::resource::sync_stream(handle, stream);

if (curClusteringCost == DataT{0}) {
  RAFT_LOG_WARN("Zero clustering cost detected: all points coincide with their centroids.");
} else if (n_current_iter > 1) {
  DataT delta = curClusteringCost / priorClusteringCost;
  if (delta > 1 - iter_params.tol) done = true;
}
```
Seems that this block was gated by params.inertia_check, which defaulted to false; now it runs unconditionally, which forces the device-to-host copy on every iteration. Is this change intentional? I think we could avoid this copy, or make it conditional, or at least explicitly mention this change if it's intended.
This was intended. According to my discussion with @cjnolet we should always be early stopping. Yes this does mean that we now will always have to wait for one iteration to complete before the next one because the CPU should be made aware of whether or not it should proceed with the next iteration. Keeping this thread open to discussion if others have some inputs.
```cpp
comm.allreduce(&(clusterCostD.data()->value),
               &(clusterCostD.data()->value),
               1,
               raft::comms::op_t::SUM,
               stream);

DataT curClusteringCost = 0;
raft::copy(handle,
           raft::make_host_scalar_view(&curClusteringCost),
           raft::make_device_scalar_view(&(clusterCostD.data()->value)));

ASSERT(comm.sync_stream(stream) == raft::comms::status_t::SUCCESS,
       "An error occurred in the distributed operation. This can result "
       "from a failed rank");
if (curClusteringCost == (DataT)0.0) {
  RAFT_LOG_WARN("Zero clustering cost detected: all points coincide with their centroids.");
} else if (n_iter[0] > 1) {
  DataT delta = curClusteringCost / priorClusteringCost;
  if (delta > 1 - params.tol) done = true;
```
We have the same change of behavior here with this now running unconditionally, so we need to do the same thing we do in the single gpu change I mentioned before.
@viclafargue will be taking care of the iteration loop in kmeans_mg in #2017
```cpp
run_kmeanspp(X);
} else {
  IndexT default_init_size =
    std::min(static_cast<IndexT>(std::int64_t{3} * n_clusters), n_samples);
```
The reasoning for the change makes sense to me, but we should add a log/warning to alert users of the behavior change that otherwise will go silent for users that don't set init_size, and add it to the param doc in kmeans.hpp.
```cpp
auto centroids_const = raft::make_device_matrix_view<const DataT, IndexT>(
  cur_centroids_ptr, n_clusters, n_features);

iter_inertia = DataT{0};
data_batches.reset();
weight_batches.reset();
auto wt_it = weight_batches.begin();
for (const auto& data_batch : data_batches) {
  IndexT cur_batch_size = static_cast<IndexT>(data_batch.size());
  const auto& wt_batch  = *wt_it;
  ++wt_it;

  auto batch_data_view = raft::make_device_matrix_view<const DataT, IndexT>(
    data_batch.data(), cur_batch_size, n_features);

  std::optional<raft::device_vector_view<const DataT, IndexT>> batch_sw = std::nullopt;
  if (weight_ptr != nullptr) { batch_sw = prepare_batch_weights(wt_batch, cur_batch_size); }

  DataT batch_cost = DataT{0};
  cuvs::cluster::kmeans::cluster_cost(handle,
                                      batch_data_view,
                                      centroids_const,
                                      raft::make_host_scalar_view(&batch_cost),
                                      batch_sw);

  iter_inertia += batch_cost;
}
}
```
Now that clustering_cost is accumulated on device every iteration inside the Lloyd loop, the final accepted iteration's clustering_cost is exactly the inertia to report, no?
This extra post-loop batched pass (one full read of X per n_init) duplicates work that was just done if I'm not mistaken. If so, consider copying the last clustering_cost scalar to iter_inertia
The final inertia was being computed even before this PR with a cuvs::cluster::kmeans::cluster_cost call. The thing is that the centroids are modified in the end after the inertia computation. So one extra pass is needed with the final centroids.
```diff
-params.inertia_check = true;
-params.max_iter = 20;
+params.init = cuvs::cluster::kmeans::params::Array;
+params.max_iter = 20;
```
A few new tests could be useful to consider:
- a test that runs fit with host input and device input on the same data/seed and asserts bit-identical centroids/inertia
- a test that exercises init_size (including 0 default and an explicit value)
- a test for the zero-cost warning path (e.g., n_samples == n_clusters with distinct points), to lock down the new behavior
The first one (bit-identical centroids and inertia with streaming host data and data on device) was already being tested.
I added tests for init_size and zero-cost paths as well as n_init > 0.
```python
self.params.batch_centroids = batch_centroids
if inertia_check is not None:
    self.params.inertia_check = inertia_check
if init_size is not None:
```
Existing Python code that passes inertia_check=... will now raise TypeError. We should consider accepting and doing a deprecation warning for one release, then removing. Same applies to the docstring update
Because the C API still has it, this won't raise an error, but I added a deprecation warning.
♻️ Duplicate comments (1)
cpp/src/cluster/detail/kmeans.cuh (1)
591-602: ⚠️ Potential issue | 🟠 Major

Missing wt_sum > 0 guard in kmeans_fit; inconsistent with kmeans_predict.

If the caller passes a sample_weight whose elements all sum to 0, wt_sum == 0 here, needs_wt_rescale becomes true (since |0 - n_samples| > wt_rel_tol), and prepare_batch_weights at lines 716-720 applies div_const_op<DataT>{wt_sum}, so every batch weight becomes inf/NaN. Because wt_sum is captured once and reused for every batch of every Lloyd iteration across all n_init seeds, the corruption silently propagates into centroid_sums/weight_per_cluster and the final centroids, with no diagnostic.

Note that the peer kmeans_predict function already guards this at line 1015 with RAFT_EXPECTS(wt_sum > DataT{0}, ...); please apply the same guard here for consistent behavior between training and inference.

🛡️ Proposed fix

```diff
   DataT wt_sum = sample_weight.has_value() ? weightSum(handle, sample_weight.value())
                                            : static_cast<DataT>(n_samples);
+  RAFT_EXPECTS(wt_sum > DataT{0},
+               "KMeans: sum of sample weights must be strictly positive");
   const DataT wt_rel_tol = n_samples * std::numeric_limits<DataT>::epsilon();
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/cluster/detail/kmeans.cuh` around lines 591 - 602, In kmeans_fit, guard against a zero total sample weight like kmeans_predict does: after computing wt_sum (and before using it / before setting needs_wt_rescale and calling prepare_batch_weights), add a precondition check (e.g., RAFT_EXPECTS(wt_sum > DataT{0}, "KMeans: sample_weight must sum to > 0")) so that wt_sum==0 is rejected early; this prevents prepare_batch_weights (and the div_const_op captured wt_sum) from producing inf/NaN and keeps behavior consistent with kmeans_predict.
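The failure mode is easy to reproduce in plain C++. Below is a hypothetical host-side sketch (the function name `normalize_weights` and the use of `std::vector` are illustrative stand-ins, not the cuvs implementation) of how dividing by a zero weight sum would poison every weight, and how a guard like the one proposed rejects it up front:

```cpp
#include <cassert>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical sketch of the rescaling idea: each weight is scaled by
// n_samples / wt_sum, so wt_sum == 0 would turn every weight into inf/NaN
// unless it is rejected before the division happens.
std::vector<double> normalize_weights(std::vector<double> w)
{
  double wt_sum = std::accumulate(w.begin(), w.end(), 0.0);
  // The guard: reject a non-positive total before dividing by it.
  if (!(wt_sum > 0.0)) {
    throw std::invalid_argument("sum of sample weights must be strictly positive");
  }
  for (auto& x : w) { x = x / wt_sum * static_cast<double>(w.size()); }
  return w;
}
```

With the guard in place, an all-zero weight vector fails loudly instead of silently producing inf weights that corrupt every later accumulation.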
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/cluster/detail/kmeans.cuh`:
- Around line 703-714: The code currently dereferences the weight_batches
iterator (e.g., using const auto& wt_batch = *wt_it) even when weight_ptr ==
nullptr; move all iterator access of weight_batches (and any wt_it / wt_batch
dereferences) inside branches guarded by weight_ptr != nullptr and, for the
unweighted case, use the prefilled batch_weights_buf produced by
raft::matrix::fill and prepare_batch_weights lambda instead; update the loops
around the weight handling (the weight_batches iterator creation,
prepare_batch_weights lambda, and the loops that currently dereference wt_it —
including the similar sections at the other occurrences referenced in the
comment) so that when weight_ptr is null you never call *wt_it and always
operate on the filled batch_weights_buf.
```diff
 template <typename DataT, typename IndexT, typename Accessor>
 DataT weightSum(
   raft::resources const& handle,
   raft::mdspan<const DataT, raft::vector_extent<IndexT>, raft::layout_right, Accessor> weight)
 {
-  cudaStream_t stream = raft::resource::get_cuda_stream(handle);
-  auto wt_aggr        = raft::make_device_scalar<DataT>(handle, 0);
-  auto n_samples      = weight.extent(0);
-
-  size_t temp_storage_bytes = 0;
-  RAFT_CUDA_TRY(cub::DeviceReduce::Sum(
-    nullptr, temp_storage_bytes, weight.data_handle(), wt_aggr.data_handle(), n_samples, stream));
-
-  workspace.resize(temp_storage_bytes, stream);
-
-  RAFT_CUDA_TRY(cub::DeviceReduce::Sum(workspace.data(),
-                                       temp_storage_bytes,
-                                       weight.data_handle(),
-                                       wt_aggr.data_handle(),
-                                       n_samples,
-                                       stream));
-  DataT wt_sum = 0;
-  raft::copy(handle,
-             raft::make_host_scalar_view(&wt_sum),
-             raft::make_device_scalar_view(wt_aggr.data_handle()));
-  raft::resource::sync_stream(handle, stream);
-
-  if (wt_sum != n_samples) {
-    RAFT_LOG_DEBUG(
-      "[Warning!] KMeans: normalizing the user provided sample weight to "
-      "sum up to %d samples",
-      n_samples);
-
-    auto scale = static_cast<DataT>(n_samples) / wt_sum;
-    raft::linalg::map(
-      handle, weight, raft::mul_const_op<DataT>{scale}, raft::make_const_mdspan(weight));
-  }
+  auto n_samples = weight.extent(0);
+
+  DataT wt_sum = DataT{0};
+  if constexpr (raft::is_device_mdspan_v<decltype(weight)>) {
+    auto stream   = raft::resource::get_cuda_stream(handle);
+    auto d_wt_sum = raft::make_device_scalar<DataT>(handle, DataT{0});
+    raft::linalg::mapThenSumReduce(
+      d_wt_sum.data_handle(), n_samples, raft::identity_op{}, stream, weight.data_handle());
+    raft::copy(&wt_sum, d_wt_sum.data_handle(), 1, stream);
+    raft::resource::sync_stream(handle);
+  } else {
+    for (IndexT i = 0; i < n_samples; ++i) {
+      wt_sum += weight(i);
+    }
+  }
+  RAFT_EXPECTS(wt_sum > DataT{0}, "invalid parameter (sum of sample weights must be positive)");
+  return wt_sum;
```
Reject negative sample weights, not just non-positive totals.

`wt_sum > 0` still allows inputs like `[-1, 2]`, which can make `weight_per_cluster`, centroids, and inertia invalid downstream. Please validate each weight is finite/non-negative before returning the sum.

🛡️ Suggested direction

```diff
   DataT wt_sum = DataT{0};
+  bool has_invalid_weight = false;
   if constexpr (raft::is_device_mdspan_v<decltype(weight)>) {
     auto stream   = raft::resource::get_cuda_stream(handle);
     auto d_wt_sum = raft::make_device_scalar<DataT>(handle, DataT{0});
+    auto d_invalid_count = raft::make_device_scalar<int>(handle, int{0});
     raft::linalg::mapThenSumReduce(
       d_wt_sum.data_handle(), n_samples, raft::identity_op{}, stream, weight.data_handle());
+    raft::linalg::mapThenSumReduce(
+      d_invalid_count.data_handle(),
+      n_samples,
+      [] __device__(DataT w) { return !(w >= DataT{0}) ? int{1} : int{0}; },
+      stream,
+      weight.data_handle());
     raft::copy(&wt_sum, d_wt_sum.data_handle(), 1, stream);
+    int invalid_count = 0;
+    raft::copy(&invalid_count, d_invalid_count.data_handle(), 1, stream);
     raft::resource::sync_stream(handle);
+    has_invalid_weight = invalid_count != 0;
   } else {
     for (IndexT i = 0; i < n_samples; ++i) {
+      has_invalid_weight |= !(weight(i) >= DataT{0});
       wt_sum += weight(i);
     }
   }
+  RAFT_EXPECTS(!has_invalid_weight, "invalid parameter (sample weights must be non-negative)");
   RAFT_EXPECTS(wt_sum > DataT{0}, "invalid parameter (sum of sample weights must be positive)");
```
```cpp
cuvs::cluster::kmeans::params recluster_params;
recluster_params.n_clusters = params.n_clusters;
recluster_params.init       = cuvs::cluster::kmeans::params::InitMethod::Array;
recluster_params.n_init     = 1;

auto weight_opt = std::make_optional(raft::make_const_mdspan(weight.view()));
cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(
  handle,
  recluster_params,
  raft::make_const_mdspan(potentialCentroids),
  weight_opt,
  centroidsRawData,
  inertia.view(),
  n_iter.view(),
  std::ref(workspace));
```
Preserve user parameters during KMeans|| reclustering.

Starting from a default `kmeans::params` drops caller settings such as `metric`, `max_iter`, `tol`, `batch_samples`, `batch_centroids`, and verbosity during the weighted reclustering step.

🐛 Proposed fix

```diff
-  cuvs::cluster::kmeans::params recluster_params;
-  recluster_params.n_clusters = params.n_clusters;
-  recluster_params.init       = cuvs::cluster::kmeans::params::InitMethod::Array;
-  recluster_params.n_init     = 1;
+  cuvs::cluster::kmeans::params recluster_params = params;
+  recluster_params.init   = cuvs::cluster::kmeans::params::InitMethod::Array;
+  recluster_params.n_init = 1;
```
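The suggested fix is just a struct copy followed by targeted overrides. A minimal sketch with a stand-in `Params` struct (illustrative fields and encodings only, not the real `cuvs::cluster::kmeans::params`):

```cpp
#include <cassert>

// Stand-in Params struct. The point: copy the caller's params wholesale,
// then override only what reclustering must change, so settings like
// max_iter and tol survive the reclustering step.
struct Params {
  int n_clusters = 8;
  int n_init     = 3;
  int max_iter   = 300;
  double tol     = 1e-4;
  int init       = 0;  // 0 = KMeansPlusPlus, 2 = Array (illustrative encoding)
};

Params make_recluster_params(const Params& params)
{
  Params recluster_params = params;  // keeps max_iter, tol, ... intact
  recluster_params.init   = 2;       // InitMethod::Array
  recluster_params.n_init = 1;
  return recluster_params;
}
```

Default-constructing `recluster_params` instead would silently reset every field the caller tuned; copy-then-override makes the divergence explicit and minimal.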
+ recluster_params.n_init = 1;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cuvs::cluster::kmeans::params recluster_params; | |
| recluster_params.n_clusters = params.n_clusters; | |
| recluster_params.init = cuvs::cluster::kmeans::params::InitMethod::Array; | |
| recluster_params.n_init = 1; | |
| auto weight_opt = std::make_optional(raft::make_const_mdspan(weight.view())); | |
| cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>( | |
| handle, | |
| recluster_params, | |
| raft::make_const_mdspan(potentialCentroids), | |
| weight_opt, | |
| centroidsRawData, | |
| inertia.view(), | |
| n_iter.view(), | |
| std::ref(workspace)); | |
| cuvs::cluster::kmeans::params recluster_params = params; | |
| recluster_params.init = cuvs::cluster::kmeans::params::InitMethod::Array; | |
| recluster_params.n_init = 1; | |
| auto weight_opt = std::make_optional(raft::make_const_mdspan(weight.view())); | |
| cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>( | |
| handle, | |
| recluster_params, | |
| raft::make_const_mdspan(potentialCentroids), | |
| weight_opt, | |
| centroidsRawData, | |
| inertia.view(), | |
| n_iter.view(), | |
| std::ref(workspace)); |
Actionable comments posted: 3
```cpp
IndexT default_init_size =
  std::min(static_cast<IndexT>(std::int64_t{3} * n_clusters), n_samples);
IndexT init_sample_size = pams.init_size > 0
                            ? std::min(static_cast<IndexT>(pams.init_size), n_samples)
                            : default_init_size;

if (pams.init_size <= 0 && init_sample_size < n_samples) {
  RAFT_LOG_WARN(
    "KMeans.fit: KMeans++ initialization is using a random subsample of %zu/%zu host rows "
    "(params.init_size=0 defaults to min(3 * n_clusters, n_samples) for host data). "
    "Set params.init_size to n_samples to use the full dataset for seeding.",
    static_cast<size_t>(init_sample_size),
    static_cast<size_t>(n_samples));
}

auto centroidsRawData = raft::make_device_matrix<DataT, IndexT>(handle, n_clusters, n_features);
init_sample = raft::make_device_matrix<DataT, IndexT>(handle, init_sample_size, n_features);
```
Reject KMeans++ init samples smaller than n_clusters.

When `params.init_size > 0` but less than `n_clusters`, host KMeans++ samples too few rows and then tries to seed `n_clusters` centroids from that smaller sample. Reject this early to avoid degenerate duplicate/zero-probability seeding.

🐛 Proposed fix

```diff
   IndexT init_sample_size = pams.init_size > 0
                               ? std::min(static_cast<IndexT>(pams.init_size), n_samples)
                               : default_init_size;
+  RAFT_EXPECTS(init_sample_size >= n_clusters,
+               "invalid parameter (init_size must be 0 or >= n_clusters)");
   if (pams.init_size <= 0 && init_sample_size < n_samples) {
```
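The resolution logic plus the proposed guard can be sketched as a small host function (`resolve_init_sample_size` is a hypothetical name, not a cuvs symbol): `init_size == 0` falls back to `min(3 * n_clusters, n_samples)`, an explicit value is clamped to `n_samples`, and anything that would leave fewer rows than clusters is rejected:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch of init_size resolution with the guard applied.
std::int64_t resolve_init_sample_size(std::int64_t init_size,
                                      std::int64_t n_clusters,
                                      std::int64_t n_samples)
{
  std::int64_t default_init_size = std::min(std::int64_t{3} * n_clusters, n_samples);
  std::int64_t init_sample_size =
    init_size > 0 ? std::min(init_size, n_samples) : default_init_size;
  if (init_sample_size < n_clusters) {
    throw std::invalid_argument("init_size must be 0 or >= n_clusters");
  }
  return init_sample_size;
}
```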
```cpp
raft::random::RngState random_state(iter_params.rng_state.seed);

if (iter_params.init == cuvs::cluster::kmeans::params::InitMethod::Random) {
  raft::matrix::sample_rows(handle, random_state, X, centroidsRawData);
```
Preserve the configured RNG type for row sampling.
This constructs RngState from the seed only, unlike the KMeans++ and KMeans++ subsampling paths at lines 144 and 316 that preserve params.rng_state.type. Non-default RNG configurations will be ignored for the Random initialization method.
Proposed fix

```diff
-  raft::random::RngState random_state(iter_params.rng_state.seed);
+  raft::random::RngState random_state(iter_params.rng_state.seed, iter_params.rng_state.type);
```
+ raft::random::RngState random_state(iter_params.rng_state.seed, iter_params.rng_state.type);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| raft::random::RngState random_state(iter_params.rng_state.seed); | |
| if (iter_params.init == cuvs::cluster::kmeans::params::InitMethod::Random) { | |
| raft::matrix::sample_rows(handle, random_state, X, centroidsRawData); | |
| raft::random::RngState random_state(iter_params.rng_state.seed, iter_params.rng_state.type); | |
| if (iter_params.init == cuvs::cluster::kmeans::params::InitMethod::Random) { | |
| raft::matrix::sample_rows(handle, random_state, X, centroidsRawData); |
```cpp
bool need_compute_norms = metric == cuvs::distance::DistanceType::L2Expanded ||
                          metric == cuvs::distance::DistanceType::L2SqrtExpanded ||
                          metric == cuvs::distance::DistanceType::CosineExpanded;
bool use_norm_cache = need_compute_norms && !data_on_device;
std::vector<DataT> h_norm_cache;
if (use_norm_cache) { h_norm_cache.resize(n_samples); }
bool norms_cached = false;

auto compute_batch_norms = [&](const DataT* batch_ptr, IndexT batch_size) {
  auto batch_view =
    raft::make_device_matrix_view<const DataT, IndexT>(batch_ptr, batch_size, n_features);
  auto norm_view =
    raft::make_device_vector_view<DataT, IndexT>(L2NormBatch.data_handle(), batch_size);
  if (metric == cuvs::distance::DistanceType::CosineExpanded) {
    raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
      handle, batch_view, norm_view, raft::sqrt_op{});
  } else {
    raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
      handle, batch_view, norm_view);
```
Make cosine norm handling consistent or reject cosine here.

`compute_batch_norms` uses `sqrt_op` for `CosineExpanded`, but centroid norms are computed without `sqrt_op`. If cosine is supported in this path, centroid norms should use the same convention; if not, reject `CosineExpanded` before entering the loop.

🐛 Proposed fix if cosine is intended to be supported

```diff
   if (need_compute_norms) {
-    raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
-      handle, centroids_const, centroid_norms_buf.view());
+    if (metric == cuvs::distance::DistanceType::CosineExpanded) {
+      raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
+        handle, centroids_const, centroid_norms_buf.view(), raft::sqrt_op{});
+    } else {
+      raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
+        handle, centroids_const, centroid_norms_buf.view());
+    }
     centroid_norms_opt = raft::make_device_vector_view<const DataT, IndexT>(
       centroid_norms_buf.data_handle(), n_clusters);
   }
```

Also applies to: 802-809
```cpp
for (n_current_iter = 1; n_current_iter <= iter_params.max_iter; ++n_current_iter) {
  if (n_current_iter > 1) {
    RAFT_CUDA_TRY(cudaEventSynchronize(convergence_event));
```
The convergence event is recorded right after the inertia is computed, but we check it here before proceeding to the next iteration so that the operations issued after the convergence check are not blocked. Using an event also means that only the operations enqueued before the event record need to complete for the synchronization to return.
Combine batched and regular k-means implementations

- Combined `fit` into a single `kmeans_fit` template that works with both host and device mdspans via `batch_load_iterator` / `init_centroids`
- Removed the `inertia_check` parameter — inertia-based convergence checking now always runs. Zero clustering cost (perfect fit) logs a warning instead of asserting. This is needed because spectral clustering can cause all points to converge on the cluster centroids itself.
- Added an `init_size` parameter to control how many samples are drawn for KMeansPlusPlus initialization. Defaults to `n_samples` for device data, `(3 * n_clusters)` for host data
- Replaced `raft::copy` with `std::swap` of buffer pointers
- `process_batch` no longer computes norms internally
- Replaced the `cudaPointerGetAttributes` call with `raft::memory_type_from_pointer`
- Replaced `cub::DeviceReduce::Sum` calls with `raft::linalg::mapThenSumReduce`
- Compute `(w / wt_sum) * n_samples` via a composed op instead of precomputing a scale, so very small `wt_sum` values don't produce inf
- Renamed `checkWeight` to `weightSum` and made it mdspan-based with an `Accessor` template: device reduce for device weights, host loop for host weights. Callers apply the scaling themselves
- Removed the `batch_sums`/`batch_counts` scratch buffers by accumulating directly into `centroid_sums`/`weight_per_cluster` via `reset_sums=false` in `reduce_rows_by_key`/`reduce_cols_by_key`, removing two per-batch `raft::linalg::add` kernels
- Removed the `update_centroids` helpers (both the `detail` and public template) — no remaining callers after the `fit_main` consolidation
- Removed `raft::sync_stream` calls and added a CUDA event to record whether the convergence criterion is met. The convergence check is now done on device. Average per-iteration time with the mandatory inertia check now matches previous benchmarks even when the inertia check was previously disabled.
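The composed-op point in the summary is a pure floating-point argument, sketched here with stand-in helper names (not the cuvs code): precomputing `scale = n_samples / wt_sum` can overflow to inf when `wt_sum` is tiny, while computing `w / wt_sum` first keeps the intermediate well-scaled.

```cpp
#include <cassert>
#include <cmath>

// Precomputed-scale variant: the intermediate n_samples / wt_sum can
// overflow to inf for very small wt_sum.
double scale_precomputed(double w, double wt_sum, double n_samples)
{
  double scale = n_samples / wt_sum;  // may be inf
  return w * scale;
}

// Composed variant: the ratio w / wt_sum is computed first, so comparable
// magnitudes cancel before multiplying by n_samples.
double scale_composed(double w, double wt_sum, double n_samples)
{
  return (w / wt_sum) * n_samples;
}
```

With a subnormal total like 1e-320 and n_samples = 10, the precomputed scale overflows to inf while the composed form returns the intended finite value.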