
[Cleanup] Combine Batched and Regular KMeans Impl #2015

Open
tarang-jain wants to merge 40 commits into rapidsai:main from tarang-jain:combine-batch

Conversation

@tarang-jain
Contributor

@tarang-jain tarang-jain commented Apr 10, 2026

Combine batched and regular k-means implementations

  • Unified the batched (host-data) and regular (device-data) k-means fit into a single kmeans_fit template that works with both host and device mdspans via batch_load_iterator
  • Unified the device and host initialization paths in init_centroids
  • Removed the inertia_check parameter — inertia-based convergence checking now always runs. Zero clustering cost (a perfect fit) logs a warning instead of asserting; this is needed because spectral clustering can cause all points to converge on the cluster centroids themselves.
  • Added init_size parameter to control how many samples are drawn for KMeansPlusPlus initialization. Defaults to n_samples for device data and min(3 * n_clusters, n_samples) for host data
  • Replaced per-iteration centroid raft::copy with std::swap of buffer pointers
  • For streaming fit, precompute data norms once and cache them: host norms cached to a host buffer on the first iteration and copied back for subsequent iterations. process_batch no longer computes norms internally
  • Replaced raw cudaPointerGetAttributes call with raft::memory_type_from_pointer
  • Replaced cub::DeviceReduce::Sum calls with raft::linalg::mapThenSumReduce
  • Guarded weight normalization against overflow: apply (w / wt_sum) * n_samples via a composed op instead of precomputing a scale, so very small wt_sum values don't produce inf
  • Renamed checkWeight to weightSum and made it mdspan-based with an Accessor template: device reduce for device weights, host loop for host weights. Callers apply the scaling themselves
  • Eliminated batch_sums / batch_counts scratch buffers by accumulating directly into centroid_sums / weight_per_cluster via reset_sums=false in reduce_rows_by_key / reduce_cols_by_key, removing two per-batch raft::linalg::add kernels
  • Removed dead update_centroids helpers (both the detail and public template) — no remaining callers after the fit_main consolidation
  • Perf: removed multiple raft::sync_stream calls and added a CUDA event to record whether the convergence criterion is met; the convergence check is now done on device. Average per-iteration time with the now-mandatory inertia check matches previous benchmarks, including runs where the inertia check used to be disabled.

@copy-pr-bot

copy-pr-bot Bot commented Apr 10, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@tarang-jain tarang-jain self-assigned this Apr 10, 2026
@tarang-jain tarang-jain added improvement Improves an existing functionality non-breaking Introduces a non-breaking change cpp labels Apr 10, 2026
@tarang-jain tarang-jain marked this pull request as ready for review April 14, 2026 01:10
@tarang-jain tarang-jain requested review from a team as code owners April 14, 2026 01:10
Comment thread c/include/cuvs/cluster/kmeans.h
Contributor

@viclafargue viclafargue left a comment


Thanks! Here are some comments.


auto minClusterAndDistance = raft::make_device_vector<raft::KeyValuePair<IndexT, DataT>, IndexT>(
handle, streaming_batch_size);
auto L2NormBatch = raft::make_device_vector<DataT, IndexT>(handle, streaming_batch_size);
Contributor


params.streaming_batch_size = 0 by default in the data-on-device case, but nothing prevents a user from setting a value. This would allocate an L2NormBatch smaller than n_samples, causing out-of-bounds writes (and later reads) during norm computation.

We should probably guard this with a check:
RAFT_EXPECTS(streaming_batch_size == n_samples || !data_on_device, ...)

Contributor Author

@tarang-jain tarang-jain Apr 21, 2026


I have updated this so that for device arrays, we simply ignore the streaming_batch_size and use the entire dataset always.

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
Comment on lines +661 to +663
auto init_sample =
raft::make_device_matrix<DataT, IndexT>(handle, init_sample_size, n_features);
raft::matrix::sample_rows(handle, random_state, X, init_sample.view());
Contributor


Suggested change
auto init_sample =
raft::make_device_matrix<DataT, IndexT>(handle, init_sample_size, n_features);
raft::matrix::sample_rows(handle, random_state, X, init_sample.view());
if (init_sample_size == n_samples && data_on_device) {
auto init_sample_const = raft::make_device_matrix_view<const DataT, IndexT>(X.data_handle(), n_samples, n_features);
// pass directly to kmeansPlusPlus / initScalableKMeansPlusPlus
} else {
auto init_sample = raft::make_device_matrix<DataT, IndexT>(handle, init_sample_size, n_features);
raft::matrix::sample_rows(handle, random_state, X, init_sample.view());
// pass init_sample to kmeansPlusPlus / initScalableKMeansPlusPlus
}

If init_size = 0 in the data on device path, we basically double memory use by copying the dataset over. Let's skip this by creating a view on the dataset.

Contributor Author


I completely skipped the sampling for the device path. That is how it was being done earlier. The init size is only used if the data is on host.

Comment on lines +731 to +732
auto batch_workspace = rmm::device_uvector<char>(
current_batch_sz, stream, raft::resource::get_workspace_resource(handle));
Contributor


Every call to process_batch allocates both this workspace and the device scalar below. Both buffers could be instantiated out of the process_batch function.

Contributor Author


Moved the workspace buffer allocation outside the process_batch function.

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
raft::matrix::sample_rows(handle, random_state, X, centroidsRawData);
} else if (iter_params.init == cuvs::cluster::kmeans::params::InitMethod::KMeansPlusPlus) {
IndexT default_init_size =
data_on_device ? n_samples : std::min(static_cast<IndexT>(3 * n_clusters), n_samples);
Contributor


Unlikely to be an actual issue, but n_clusters could be cast before the multiplication to avoid any risk of integer overflow.

Contributor Author


I have gotten rid of the batching for the device path. So when a user sets a batch size for device mdspan, we just set it to n_samples and warn the user. We should definitely not be creating a new buffer just for the init sample if we can accommodate the entire input matrix on device already.

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
Comment on lines +876 to +881
DataT curClusteringCost = DataT{0};
raft::copy(&curClusteringCost, clustering_cost.data_handle(), 1, stream);
raft::resource::sync_stream(handle, stream);

if (curClusteringCost == DataT{0}) {
RAFT_LOG_WARN("Zero clustering cost detected: all points coincide with their centroids.");
Contributor


Going from ASSERT to RAFT_LOG_WARN may indeed be useful for the spectral clustering case. However, removing the inertia_check option forces the sync at every iteration. Do we truly need to drop this option?

Member


I'm not sure we need the log and an assert might be better?

Early stopping (aka skipping iterations) is ultimately going to be the best way to extract perf here. Whether it's by explicitly computing inertia or just looking at the residuals of the centroids from the prior iteration.

Seems like inertia check / residuals could be done on gpu if we had to in order to avoid syncing so we would only need to sync in the final iteration, right?

Contributor Author

@tarang-jain tarang-jain Apr 20, 2026


Seems like inertia check / residuals could be done on gpu if we had to in order to avoid syncing so we would only need to sync in the final iteration, right?

Until the iteration has completed, the CPU should not start the next iteration. So all the operations on the GPU stream must complete to finish the iteration.

Contributor Author


Seems like inertia check / residuals could be done on gpu if we had to in order to avoid syncing so we would only need to sync in the final iteration, right?

Yes, but this is throwing an error in the spectral clustering case wherein all the points converge on the centroids themselves. This is happening in one of the spectral tests and an assertion here is leading to an error, where instead it should simply return those centroids directly.

Contributor Author


I'm not sure we need the log and an assert might be better?

Therefore, I had to change it to a warning instead of an assertion. Earlier those spectral tests were skipping the inertia check which was avoiding the assertion.

Comment on lines +638 to +646
} else {
std::vector<DataT> h_weights(n_samples);
auto d_view = raft::make_device_vector_view<const DataT, IndexT>(weight_ptr, n_samples);
auto h_view = raft::make_host_vector_view<DataT, IndexT>(h_weights.data(), n_samples);
raft::copy(handle, h_view, d_view);
raft::resource::sync_stream(handle);
for (IndexT i = 0; i < n_samples; ++i) {
wt_sum += h_weights[i];
}
Contributor


In the data-on-device case, since the data is already on device, it would be much faster to sum-reduce with cub::DeviceReduce::Sum or raft::linalg::reduce. The summation would also have better precision, since it is done in a tree fashion (rounding error grows as O(log N) rather than O(N)).

Contributor Author


When it's device-accessible, I have changed that to a raft::linalg::mapThenSumReduce. I have also removed this function and directly updated checkWeight (renamed to weightSum). We do the scaling after the weight sum is computed.

@@ -33,9 +33,10 @@ cdef extern from "cuvs/cluster/kmeans.h" nogil:
int batch_samples,
int batch_centroids,
bool inertia_check,
Contributor


Maybe add a comment saying the field is present but deprecated.

Contributor Author


I don't think it's necessary to add a comment here in the .pxd. The C header already has that information, and this file will be updated along with the C headers / src files.

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
@tarang-jain
Contributor Author

tarang-jain commented Apr 22, 2026

@viclafargue ABI won't break — we are allowed to add new members to structs and enums. However, I have added the breaking label to this PR because downstream users of the C++ API will be affected directly.

@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Deprecates C/API inertia_check (kept for ABI), adds init_size to params and C/Python bindings, removes legacy host-batched implementation, and refactors KMeans into a unified mdspan-based streaming-batched pipeline with host overloads, weight-normalization changes, and related utility/signature updates.

Changes

Cohort / File(s) Summary
C API header
c/include/cuvs/cluster/kmeans.h
Marked inertia_check deprecated/ignored (ABI kept), fixed spacing, added int64_t init_size, and introduced struct cuvsKMeansParams_v1 (omits legacy inertia_check).
C API impl
c/src/cluster/kmeans.cpp
Conversion maps cuvsKMeansParams::init_size into internal params and forces internal inertia_check = false during C→C++ param translation.
C++ public params
cpp/include/cuvs/cluster/kmeans.hpp
Removed params::inertia_check; added params::init_size (int64_t, default 0) with host/device semantics and documented behavior.
Unified mdspan fit & host overloads
cpp/src/cluster/detail/kmeans.cuh, cpp/src/cluster/kmeans_impl.cuh
Introduced unified detail::kmeans_fit using mdspan, streaming-batched loop (ping‑pong centroids), host overloads, and new public host-matrix fit forwarding to mdspan implementation.
Removed batched impl
cpp/src/cluster/detail/kmeans_batched.cuh
Deleted legacy host-batched KMeans implementation (host init, weight scaling, per-batch accumulation, and previous fit driver).
Common utilities & batch processing
cpp/src/cluster/detail/kmeans_common.cuh, cpp/src/cluster/detail/minClusterDistanceCompute.cu
Replaced checkWeight with weightSum; added process_batch; extended min-cluster funcs to accept optional precomputed_centroid_norms; adjusted sync usage and related signatures.
Multi‑GPU changes
cpp/src/cluster/detail/kmeans_mg.cuh
Replaced two-pass CUB reduction with mapThenSumReduce; changed weight-normalization to composed ops; always compute and all-reduce clustering cost each iteration (warn on zero); removed inertia_check gating.
Removed helper
cpp/src/cluster/kmeans.cuh
Deleted update_centroids template helper (workspace/alloc wrapper removed).
Frontend instantiations
cpp/src/cluster/kmeans_fit_float.cu, cpp/src/cluster/kmeans_fit_double.cu
Removed include of batched header; host-matrix overloads now call public cuvs::cluster::kmeans::fit instead of internal batched fit.
Tests
cpp/tests/cluster/kmeans.cu
Batched test removed explicit params.inertia_check = true; uses default behavior for inertia checking.
Python bindings
python/cuvs/cuvs/cluster/kmeans/kmeans.pxd, python/cuvs/cuvs/cluster/kmeans/kmeans.pyx
Reordered C struct fields and added init_size; inertia_check treated as deprecated/ignored (emits warning); added init_size ctor arg and property; removed inertia_check property.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check ✅ Passed — the title accurately captures the main objective: combining batched and regular KMeans implementations into a unified implementation.
  • Linked Issues check ✅ Passed — check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed — check skipped because no linked issues were found for this pull request.
  • Description check ✅ Passed — the PR description accurately describes the key changes: unifying batched and regular k-means implementations, removing the inertia_check parameter, adding the init_size parameter, and various optimization and refactoring changes documented in the raw summary.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
cpp/src/cluster/detail/kmeans.cuh (1)

591-599: ⚠️ Potential issue | 🟡 Minor

Exact float comparison for wt_sum == n_samples — inconsistent with kmeans_predict.

kmeans_predict (lines 998-999) uses a relative-tolerance check:

const DataT rel_tol = n_samples * std::numeric_limits<DataT>::epsilon();
if (std::abs(wt_sum - n_samples) > rel_tol) { ... }

But kmeans_fit still compares with != on both line 595 (to emit the debug message) and line 696 (inside prepare_batch_weights, which governs whether the per-batch scaling op runs). With a large n_samples, summation imprecision means the != branch will almost always trigger even when user weights genuinely sum to n_samples, causing an unnecessary map/divide on every batch and every outer iteration.

🛠️ Proposed fix — mirror the predict path
-  DataT wt_sum = sample_weight.has_value() ? weightSum(handle, sample_weight.value())
-                                           : static_cast<DataT>(n_samples);
-  if (sample_weight.has_value() && wt_sum != static_cast<DataT>(n_samples)) {
+  DataT wt_sum = sample_weight.has_value() ? weightSum(handle, sample_weight.value())
+                                           : static_cast<DataT>(n_samples);
+  const DataT wt_rel_tol = static_cast<DataT>(n_samples) * std::numeric_limits<DataT>::epsilon();
+  const bool needs_wt_norm =
+    sample_weight.has_value() && std::abs(wt_sum - static_cast<DataT>(n_samples)) > wt_rel_tol;
+  if (needs_wt_norm) {
     RAFT_LOG_DEBUG(
       "[Warning!] KMeans: normalizing the user provided sample weight to sum up to %zu samples",
       static_cast<size_t>(n_samples));
   }

Then replace wt_sum != static_cast<DataT>(n_samples) on line 696 with the captured needs_wt_norm flag.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: bc3ad388-89a6-4de0-87f7-fd8710a65b58

📥 Commits

Reviewing files that changed from the base of the PR and between f2bffb6 and 410092c.

📒 Files selected for processing (15)
  • c/include/cuvs/cluster/kmeans.h
  • c/src/cluster/kmeans.cpp
  • cpp/include/cuvs/cluster/kmeans.hpp
  • cpp/src/cluster/detail/kmeans.cuh
  • cpp/src/cluster/detail/kmeans_batched.cuh
  • cpp/src/cluster/detail/kmeans_common.cuh
  • cpp/src/cluster/detail/kmeans_mg.cuh
  • cpp/src/cluster/detail/minClusterDistanceCompute.cu
  • cpp/src/cluster/kmeans.cuh
  • cpp/src/cluster/kmeans_fit_double.cu
  • cpp/src/cluster/kmeans_fit_float.cu
  • cpp/src/cluster/kmeans_impl.cuh
  • cpp/tests/cluster/kmeans.cu
  • python/cuvs/cuvs/cluster/kmeans/kmeans.pxd
  • python/cuvs/cuvs/cluster/kmeans/kmeans.pyx
💤 Files with no reviewable changes (2)
  • cpp/src/cluster/kmeans.cuh
  • cpp/src/cluster/detail/kmeans_batched.cuh

Comment on lines +123 to 199
/**
* @brief Hyper-parameters for the kmeans algorithm
*/
struct cuvsKMeansParams_v1 {
  cuvsDistanceType metric;

  /**
   * The number of clusters to form as well as the number of centroids to generate (default: 8).
   */
  int n_clusters;

  /**
   * Method for initialization, defaults to k-means++:
   *  - cuvsKMeansInitMethod::KMeansPlusPlus (k-means++): Use scalable k-means++ algorithm
   *    to select the initial cluster centers.
   *  - cuvsKMeansInitMethod::Random (random): Choose 'n_clusters' observations (rows) at
   *    random from the input data for the initial centroids.
   *  - cuvsKMeansInitMethod::Array (ndarray): Use 'centroids' as initial cluster centers.
   */
  cuvsKMeansInitMethod init;

  /**
   * Maximum number of iterations of the k-means algorithm for a single run.
   */
  int max_iter;

  /**
   * Relative tolerance with regards to inertia to declare convergence.
   */
  double tol;

  /**
   * Number of instances the k-means algorithm will be run with different seeds.
   */
  int n_init;

  /**
   * Oversampling factor for use in the k-means|| algorithm.
   */
  double oversampling_factor;

  /**
   * batch_samples and batch_centroids are used to tile the 1NN computation, which is
   * useful to optimize/control the memory footprint.
   * Default tile is [batch_samples x n_clusters], i.e. when batch_centroids is 0
   * the centroids are not tiled.
   */
  int batch_samples;

  /**
   * if 0 then batch_centroids = n_clusters
   */
  int batch_centroids;

  /**
   * Whether to use hierarchical (balanced) kmeans or not.
   */
  bool hierarchical;

  /**
   * For hierarchical k-means, defines the number of training iterations.
   */
  int hierarchical_n_iters;

  /**
   * Number of samples to process per GPU batch for the batched (host-data) API.
   * When set to 0, defaults to n_samples (process all at once).
   */
  int64_t streaming_batch_size;

  /**
   * Number of samples to draw for KMeansPlusPlus initialization.
   * When set to 0, uses the heuristic min(3 * n_clusters, n_samples) for host data,
   * or n_samples for device data.
   */
  int64_t init_size;
};


⚠️ Potential issue | 🟠 Major

cuvsKMeansParams_v1 is declared but not reachable through the standard C API pattern.

The existing API exposes cuvsKMeansParams via cuvsKMeansParams_t typedef and cuvsKMeansParamsCreate/cuvsKMeansParamsDestroy. The new cuvsKMeansParams_v1 struct has:

  • No corresponding cuvsKMeansParams_v1_t typedef.
  • No cuvsKMeansParamsCreate_v1 / cuvsKMeansParamsDestroy_v1 (or equivalent) declarations.

As a result, C consumers cannot allocate/initialize/free a cuvsKMeansParams_v1 using the established pattern, and the struct effectively becomes dead surface area until 26.08. Please either add the companion typedef and lifecycle APIs now, or defer adding the v1 struct until the matching helpers land together.

As per coding guidelines, c/include/cuvs/**/*: public C API headers must include Doxygen documentation for all public functions/classes, and API changes require deprecation warnings and migration guide updates.


/**
* @brief Hyper-parameters for the kmeans algorithm
*/
struct cuvsKMeansParams_v1 {


⚠️ Potential issue | 🟡 Minor

Stray leading whitespace before struct cuvsKMeansParams_v1.

-/**
- * @brief Hyper-parameters for the kmeans algorithm
- */
- struct cuvsKMeansParams_v1 {
+/**
+ * @brief Hyper-parameters for the kmeans algorithm
+ */
+struct cuvsKMeansParams_v1 {

Comment on lines 115 to +123
/**
* If true, check inertia during iterations for early convergence.
* Number of samples to randomly draw for the KMeansPlusPlus initialization
* step. A random subset of this size is used for centroid seeding.
* When set to 0 the default depends on the data location:
* - Device data: n_samples (use the full dataset).
* - Host data: min(3 * n_clusters, n_samples).
* Default: 0.
*/
bool inertia_check = false;
int64_t init_size = 0;


⚠️ Potential issue | 🟠 Major

Breaking removal of inertia_check from the public C++ params struct lacks deprecation/migration notes.

The cuvs::cluster::kmeans::params::inertia_check field has been removed outright (replaced by init_size). Any downstream C++ code that sets or reads params.inertia_check will now fail to compile with a hard error, despite the PR being labeled non-breaking. The C API preserves the field for ABI compatibility (marked deprecated/ignored), but the C++ API does not offer the same grace period.

Please either:

  1. Retain a deprecated [[deprecated("ignored; use init_size instead")]] bool inertia_check = false; stub in the struct for one release cycle so existing call sites keep compiling with a warning, and
  2. Add a migration note in the project docs/CHANGELOG explaining that inertia_check is gone and that inertia-based convergence is now always on.

As per coding guidelines: "API changes flagged for docs/ updates" and "Breaking changes require deprecation warnings and migration guide updates".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/cuvs/cluster/kmeans.hpp` around lines 115 - 123, Restore a
deprecated stub for the removed field in the cuvs::cluster::kmeans::params
struct by reintroducing a [[deprecated("ignored; use init_size instead")]] bool
inertia_check = false; member so existing C++ call sites compile with a warning
(reference cuvs::cluster::kmeans::params and inertia_check), and update the
project docs/CHANGELOG with a short migration note stating inertia_check is
deprecated/ignored and that inertia-based convergence is now always on (point
users to use init_size instead); keep the C API behavior unchanged.
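For reference, the init_size default selection described in the struct comment can be sketched as a small helper (hypothetical, not part of the cuvs API):

```python
def default_init_size(n_clusters: int, n_samples: int, on_device: bool) -> int:
    """Resolve init_size == 0 to the documented data-location default:
    the full dataset for device data, min(3 * n_clusters, n_samples)
    for host (out-of-core) data."""
    return n_samples if on_device else min(3 * n_clusters, n_samples)
```

An explicit positive init_size would simply take precedence over this heuristic (clamped to n_samples).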

Comment on lines +79 to +82
init_size : int
Number of samples to draw for KMeansPlusPlus initialization with
host (out-of-core) data. When set to 0, uses the heuristic
min(3 * n_clusters, n_samples). Default: 0.

⚠️ Potential issue | 🟡 Minor

Docstring understates init_size scope and Python API drops inertia_check without deprecation.

Two concerns:

  1. The docstring says init_size applies only to "host (out-of-core) data", but cpp/include/cuvs/cluster/kmeans.hpp documents it for both device and host data (defaulting to n_samples on device vs. min(3 * n_clusters, n_samples) on host when set to 0). Users on device data will not realize they can tune this parameter.

  2. The previously accepted inertia_check= keyword on KMeansParams.__init__ was removed outright. Existing Python code passing KMeansParams(inertia_check=...) will now raise TypeError: unexpected keyword argument. Consider accepting the kwarg for one release cycle (warn via DeprecationWarning and ignore) to match the ABI-preservation strategy used for the C struct.

📝 Suggested docstring adjustment
     init_size : int
-        Number of samples to draw for KMeansPlusPlus initialization with
-        host (out-of-core) data. When set to 0, uses the heuristic
-        min(3 * n_clusters, n_samples). Default: 0.
+        Number of samples to draw for the KMeansPlusPlus initialization
+        step. When set to 0, the default depends on data location:
+        ``n_samples`` for device data, ``min(3 * n_clusters, n_samples)``
+        for host data. Default: 0.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
init_size : int
Number of samples to draw for KMeansPlusPlus initialization with
host (out-of-core) data. When set to 0, uses the heuristic
min(3 * n_clusters, n_samples). Default: 0.
init_size : int
Number of samples to draw for the KMeansPlusPlus initialization
step. When set to 0, the default depends on data location:
``n_samples`` for device data, ``min(3 * n_clusters, n_samples)``
for host data. Default: 0.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuvs/cuvs/cluster/kmeans/kmeans.pyx` around lines 79 - 82, Update the
KMeans docstring for init_size to state that it applies to both device and host
data and document the differing defaults/behaviors (on device default =
n_samples when init_size==0; on host default = min(3 * n_clusters, n_samples))
referencing the same semantics as cpp/include/cuvs/cluster/kmeans.hpp; also
restore acceptance of the inertia_check keyword in KMeansParams.__init__ (class
KMeansParams) so existing callers do not raise TypeError, emit a
DeprecationWarning and ignore the provided value for one release (no-op) to
match the C-struct ABI-preservation strategy.
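A transition shim along the lines the review suggests could look like this (a Python sketch with hypothetical names, mirroring the DeprecationWarning strategy rather than the actual Cython binding):

```python
import warnings

class KMeansParamsSketch:
    """Hypothetical stand-in for KMeansParams: accept the removed
    inertia_check keyword for one release, warn, and ignore it."""

    def __init__(self, n_clusters=8, init_size=0, inertia_check=None):
        if inertia_check is not None:
            warnings.warn(
                "inertia_check is deprecated and ignored; inertia-based "
                "convergence checking now always runs",
                DeprecationWarning,
                stacklevel=2,
            )
        self.n_clusters = n_clusters
        self.init_size = init_size
```

Existing callers passing inertia_check=... keep working with a warning instead of a TypeError, and the kwarg can be dropped in the following release.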

@tarang-jain tarang-jain added breaking Introduces a breaking change and removed non-breaking Introduces a non-breaking change labels Apr 22, 2026
Member

@dantegd dantegd left a comment

This PR improves things nicely; the code gets a significant cleanup. Love to see this!

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
Comment on lines +857 to +866
DataT curClusteringCost = DataT{0};
raft::copy(&curClusteringCost, clustering_cost.data_handle(), 1, stream);
raft::resource::sync_stream(handle, stream);

if (curClusteringCost == DataT{0}) {
RAFT_LOG_WARN("Zero clustering cost detected: all points coincide with their centroids.");
} else if (n_current_iter > 1) {
DataT delta = curClusteringCost / priorClusteringCost;
if (delta > 1 - iter_params.tol) done = true;
}
Member

Seems that this block was gated by params.inertia_check, which defaulted to false; now it runs unconditionally, which forces a device-to-host copy on every iteration. Is this change intentional? I think we could avoid this copy, or make it conditional, or at least explicitly mention this change if it's intended.

Contributor Author

This was intended. According to my discussion with @cjnolet, we should always be early stopping. Yes, this does mean we now always have to wait for one iteration to complete before starting the next, because the CPU needs to know whether to proceed with another iteration. Keeping this thread open for discussion if others have input.
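The convergence test in the block above amounts to the following (a sketch of the assumed semantics; note that in the real code the zero-cost branch only logs a warning and does not set done):

```python
import logging

def check_convergence(prior_cost: float, cur_cost: float,
                      tol: float, iteration: int) -> bool:
    """Return True when the relative improvement in clustering cost
    drops below tol, i.e. cur / prior > 1 - tol."""
    if cur_cost == 0.0:
        # Perfect fit (e.g. spectral clustering collapsing points onto
        # the centroids); warn instead of asserting, keep iterating.
        logging.warning("Zero clustering cost detected")
        return False
    if iteration > 1:
        return cur_cost / prior_cost > 1.0 - tol
    return False
```

The device-to-host copy of the cost scalar is what makes each iteration wait on the previous one, as discussed above.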

Comment on lines +718 to +736
comm.allreduce(&(clusterCostD.data()->value),
&(clusterCostD.data()->value),
1,
raft::comms::op_t::SUM,
stream);

DataT curClusteringCost = 0;
raft::copy(handle,
raft::make_host_scalar_view(&curClusteringCost),
raft::make_device_scalar_view(&(clusterCostD.data()->value)));

ASSERT(comm.sync_stream(stream) == raft::comms::status_t::SUCCESS,
"An error occurred in the distributed operation. This can result "
"from a failed rank");
if (curClusteringCost == (DataT)0.0) {
RAFT_LOG_WARN("Zero clustering cost detected: all points coincide with their centroids.");
} else if (n_iter[0] > 1) {
DataT delta = curClusteringCost / priorClusteringCost;
if (delta > 1 - params.tol) done = true;
Member

We have the same behavior change here, with this now running unconditionally, so we need to do the same thing as in the single-GPU change I mentioned before.

Contributor Author

@viclafargue will be taking care of the iteration loop in kmeans_mg in #2017

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
run_kmeanspp(X);
} else {
IndexT default_init_size =
std::min(static_cast<IndexT>(std::int64_t{3} * n_clusters), n_samples);
Member

The reasoning for the change makes sense to me, but we should add a log/warning to alert users of the behavior change, which will otherwise be silent for users that don't set init_size, and add it to the param doc in kmeans.hpp.

Comment thread cpp/src/cluster/detail/kmeans.cuh
Comment on lines +879 to +906
auto centroids_const = raft::make_device_matrix_view<const DataT, IndexT>(
cur_centroids_ptr, n_clusters, n_features);

iter_inertia = DataT{0};
data_batches.reset();
weight_batches.reset();
auto wt_it = weight_batches.begin();
for (const auto& data_batch : data_batches) {
IndexT cur_batch_size = static_cast<IndexT>(data_batch.size());
const auto& wt_batch = *wt_it;
++wt_it;

auto batch_data_view = raft::make_device_matrix_view<const DataT, IndexT>(
data_batch.data(), cur_batch_size, n_features);

std::optional<raft::device_vector_view<const DataT, IndexT>> batch_sw = std::nullopt;
if (weight_ptr != nullptr) { batch_sw = prepare_batch_weights(wt_batch, cur_batch_size); }

DataT batch_cost = DataT{0};
cuvs::cluster::kmeans::cluster_cost(handle,
batch_data_view,
centroids_const,
raft::make_host_scalar_view(&batch_cost),
batch_sw);

iter_inertia += batch_cost;
}
}
Member

Now that clustering_cost is accumulated on device every iteration inside the Lloyd loop, the final accepted iteration's clustering_cost is exactly the inertia to report, no?

This extra post-loop batched pass (one full read of X per n_init) duplicates work that was just done, if I'm not mistaken. If so, consider copying the last clustering_cost scalar to iter_inertia.

Contributor Author

The final inertia was already being computed before this PR, with a cuvs::cluster::kmeans::cluster_cost call. The thing is that the centroids are modified at the end, after the inertia computation, so one extra pass with the final centroids is needed.

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
Comment thread cpp/src/cluster/detail/kmeans_common.cuh
params.inertia_check = true;
params.max_iter = 20;
params.init = cuvs::cluster::kmeans::params::Array;
params.max_iter = 20;
Member

A few new tests could be useful to consider:

  • a test that runs fit with host input and device input on the same data/seed and asserts bit-identical centroids/inertia
  • a test that exercises init_size (including 0 default and an explicit value)
  • a test for the zero-cost warning path (e.g., n_samples == n_clusters with distinct points), to lock down the new behavior

Contributor Author

The first one (bit-identical centroids and inertia with streaming host data and data on device) was already being tested.
I added tests for init_size and zero-cost paths as well as n_init > 0.

self.params.batch_centroids = batch_centroids
if inertia_check is not None:
self.params.inertia_check = inertia_check
if init_size is not None:
Member

Existing Python code that passes inertia_check=... will now raise TypeError. We should consider accepting it and emitting a deprecation warning for one release, then removing it. The same applies to the docstring update.

Contributor Author

Because the C API still has it, this won't raise an error, but I added a deprecation warning.

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
cpp/src/cluster/detail/kmeans.cuh (1)

591-602: ⚠️ Potential issue | 🟠 Major

Missing wt_sum > 0 guard in kmeans_fit; inconsistent with kmeans_predict.

If the caller passes a sample_weight whose elements all sum to 0, wt_sum == 0 here, needs_wt_rescale becomes true (since |0 - n_samples| > wt_rel_tol), and prepare_batch_weights at lines 716-720 applies div_const_op<DataT>{wt_sum} → every batch weight becomes inf/NaN. Because wt_sum is captured once and reused for every batch of every Lloyd iteration across all n_init seeds, the corruption silently propagates into centroid_sums / weight_per_cluster and the final centroids, with no diagnostic.

Note that the peer kmeans_predict function already guards this at Line 1015 with RAFT_EXPECTS(wt_sum > DataT{0}, ...) — please apply the same guard here for consistent behavior between training and inference.

🛡️ Proposed fix
   DataT wt_sum           = sample_weight.has_value() ? weightSum(handle, sample_weight.value())
                                                      : static_cast<DataT>(n_samples);
+  RAFT_EXPECTS(wt_sum > DataT{0},
+               "KMeans: sum of sample weights must be strictly positive");
   const DataT wt_rel_tol = n_samples * std::numeric_limits<DataT>::epsilon();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/cluster/detail/kmeans.cuh` around lines 591 - 602, In kmeans_fit,
guard against a zero total sample weight like kmeans_predict does: after
computing wt_sum (and before using it / before setting needs_wt_rescale and
calling prepare_batch_weights), add a precondition check (e.g.,
RAFT_EXPECTS(wt_sum > DataT{0}, "KMeans: sample_weight must sum to > 0")) so
that wt_sum==0 is rejected early; this prevents prepare_batch_weights (and the
div_const_op captured wt_sum) from producing inf/NaN and keeps behavior
consistent with kmeans_predict.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@cpp/src/cluster/detail/kmeans.cuh`:
- Around line 591-602: In kmeans_fit, guard against a zero total sample weight
like kmeans_predict does: after computing wt_sum (and before using it / before
setting needs_wt_rescale and calling prepare_batch_weights), add a precondition
check (e.g., RAFT_EXPECTS(wt_sum > DataT{0}, "KMeans: sample_weight must sum to
> 0")) so that wt_sum==0 is rejected early; this prevents prepare_batch_weights
(and the div_const_op captured wt_sum) from producing inf/NaN and keeps behavior
consistent with kmeans_predict.
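Taken together with the overflow-safe rescale mentioned in the PR description, the guard would behave like this sketch (a hypothetical Python stand-in for the device-side ops):

```python
def normalized_weights(weights, n_samples):
    """Validate the total weight, then rescale each weight so the sum
    equals n_samples. The rescale is composed per element as
    (w / wt_sum) * n_samples instead of precomputing n_samples / wt_sum,
    so a tiny wt_sum cannot overflow the scale to inf."""
    wt_sum = sum(weights)
    if not wt_sum > 0:
        raise ValueError("sum of sample weights must be strictly positive")
    return [(w / wt_sum) * n_samples for w in weights]
```

Rejecting wt_sum <= 0 up front is what keeps the captured divisor from ever producing inf/NaN in the per-batch path.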

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ec785235-cad6-4fe9-9a5c-1a941f244525

📥 Commits

Reviewing files that changed from the base of the PR and between 410092c and c515c1e.

📒 Files selected for processing (5)
  • cpp/include/cuvs/cluster/kmeans.hpp
  • cpp/src/cluster/detail/kmeans.cuh
  • cpp/src/cluster/detail/kmeans_common.cuh
  • cpp/src/cluster/detail/kmeans_mg.cuh
  • python/cuvs/cuvs/cluster/kmeans/kmeans.pyx
🚧 Files skipped from review as they are similar to previous changes (2)
  • python/cuvs/cuvs/cluster/kmeans/kmeans.pyx
  • cpp/src/cluster/detail/kmeans_common.cuh

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/cluster/detail/kmeans_common.cuh`:
- Around line 144-165: The function weightSum currently only checks the final
wt_sum > 0 but must validate each weight is finite and non-negative; update
weightSum to (1) in the host path iterate weights(i) and RAFT_EXPECTS each value
is finite and >= 0 before summing, and (2) in the device path add a device-side
reduction/check (e.g., use a device kernel/thrust/raft primitive) to verify all
weight.data_handle() elements are finite and >= 0 (and return an error via
RAFT_EXPECTS if any fail) before performing raft::linalg::mapThenSumReduce; keep
the final positive-sum RAFT_EXPECTS message but add a specific error for invalid
per-sample weights.

In `@cpp/src/cluster/detail/kmeans.cuh`:
- Around line 703-714: The code currently dereferences the weight_batches
iterator (e.g., using const auto& wt_batch = *wt_it) even when weight_ptr ==
nullptr; move all iterator access of weight_batches (and any wt_it / wt_batch
dereferences) inside branches guarded by weight_ptr != nullptr and, for the
unweighted case, use the prefilled batch_weights_buf produced by
raft::matrix::fill and prepare_batch_weights lambda instead; update the loops
around the weight handling (the weight_batches iterator creation,
prepare_batch_weights lambda, and the loops that currently dereference wt_it —
including the similar sections at the other occurrences referenced in the
comment) so that when weight_ptr is null you never call *wt_it and always
operate on the filled batch_weights_buf.
- Around line 481-495: The reclustering block currently constructs a fresh
cuvs::cluster::kmeans::params (recluster_params) which loses caller settings
like metric, max_iter, tol, batch_samples, batch_centroids and verbosity; change
it to start from the existing params (copy params into recluster_params) and
then only override the fields that must differ (n_clusters, init, n_init) so
that when you call cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(...)
the original metric, max_iter, tol, batch_samples, batch_centroids and verbosity
are preserved during the weighted reclustering step.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 38e8e433-d81d-440d-b687-b4bfce7012bd

📥 Commits

Reviewing files that changed from the base of the PR and between e8e63ab and 30c457c.

📒 Files selected for processing (2)
  • cpp/src/cluster/detail/kmeans.cuh
  • cpp/src/cluster/detail/kmeans_common.cuh

Comment on lines +144 to +165
template <typename DataT, typename IndexT, typename Accessor>
DataT weightSum(
raft::resources const& handle,
raft::mdspan<const DataT, raft::vector_extent<IndexT>, raft::layout_right, Accessor> weight)
{
cudaStream_t stream = raft::resource::get_cuda_stream(handle);
auto wt_aggr = raft::make_device_scalar<DataT>(handle, 0);
auto n_samples = weight.extent(0);

size_t temp_storage_bytes = 0;
RAFT_CUDA_TRY(cub::DeviceReduce::Sum(
nullptr, temp_storage_bytes, weight.data_handle(), wt_aggr.data_handle(), n_samples, stream));

workspace.resize(temp_storage_bytes, stream);

RAFT_CUDA_TRY(cub::DeviceReduce::Sum(workspace.data(),
temp_storage_bytes,
weight.data_handle(),
wt_aggr.data_handle(),
n_samples,
stream));
DataT wt_sum = 0;
raft::copy(handle,
raft::make_host_scalar_view(&wt_sum),
raft::make_device_scalar_view(wt_aggr.data_handle()));
raft::resource::sync_stream(handle, stream);

if (wt_sum != n_samples) {
RAFT_LOG_DEBUG(
"[Warning!] KMeans: normalizing the user provided sample weight to "
"sum up to %d samples",
n_samples);

auto scale = static_cast<DataT>(n_samples) / wt_sum;
raft::linalg::map(
handle, weight, raft::mul_const_op<DataT>{scale}, raft::make_const_mdspan(weight));
auto n_samples = weight.extent(0);

DataT wt_sum = DataT{0};
if constexpr (raft::is_device_mdspan_v<decltype(weight)>) {
auto stream = raft::resource::get_cuda_stream(handle);
auto d_wt_sum = raft::make_device_scalar<DataT>(handle, DataT{0});
raft::linalg::mapThenSumReduce(
d_wt_sum.data_handle(), n_samples, raft::identity_op{}, stream, weight.data_handle());
raft::copy(&wt_sum, d_wt_sum.data_handle(), 1, stream);
raft::resource::sync_stream(handle);
} else {
for (IndexT i = 0; i < n_samples; ++i) {
wt_sum += weight(i);
}
}
RAFT_EXPECTS(wt_sum > DataT{0}, "invalid parameter (sum of sample weights must be positive)");
return wt_sum;

⚠️ Potential issue | 🟠 Major

Reject negative sample weights, not just non-positive totals.

wt_sum > 0 still allows inputs like [-1, 2], which can make weight_per_cluster, centroids, and inertia invalid downstream. Please validate each weight is finite/non-negative before returning the sum.

🛡️ Suggested direction
   DataT wt_sum = DataT{0};
+  bool has_invalid_weight = false;
   if constexpr (raft::is_device_mdspan_v<decltype(weight)>) {
     auto stream   = raft::resource::get_cuda_stream(handle);
     auto d_wt_sum = raft::make_device_scalar<DataT>(handle, DataT{0});
+    auto d_invalid_count = raft::make_device_scalar<int>(handle, int{0});
     raft::linalg::mapThenSumReduce(
       d_wt_sum.data_handle(), n_samples, raft::identity_op{}, stream, weight.data_handle());
+    raft::linalg::mapThenSumReduce(
+      d_invalid_count.data_handle(),
+      n_samples,
+      [] __device__(DataT w) { return !(w >= DataT{0}) ? int{1} : int{0}; },
+      stream,
+      weight.data_handle());
     raft::copy(&wt_sum, d_wt_sum.data_handle(), 1, stream);
+    int invalid_count = 0;
+    raft::copy(&invalid_count, d_invalid_count.data_handle(), 1, stream);
     raft::resource::sync_stream(handle);
+    has_invalid_weight = invalid_count != 0;
   } else {
     for (IndexT i = 0; i < n_samples; ++i) {
+      has_invalid_weight |= !(weight(i) >= DataT{0});
       wt_sum += weight(i);
     }
   }
+  RAFT_EXPECTS(!has_invalid_weight, "invalid parameter (sample weights must be non-negative)");
   RAFT_EXPECTS(wt_sum > DataT{0}, "invalid parameter (sum of sample weights must be positive)");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/cluster/detail/kmeans_common.cuh` around lines 144 - 165, The
function weightSum currently only checks the final wt_sum > 0 but must validate
each weight is finite and non-negative; update weightSum to (1) in the host path
iterate weights(i) and RAFT_EXPECTS each value is finite and >= 0 before
summing, and (2) in the device path add a device-side reduction/check (e.g., use
a device kernel/thrust/raft primitive) to verify all weight.data_handle()
elements are finite and >= 0 (and return an error via RAFT_EXPECTS if any fail)
before performing raft::linalg::mapThenSumReduce; keep the final positive-sum
RAFT_EXPECTS message but add a specific error for invalid per-sample weights.

Comment on lines +481 to +495
cuvs::cluster::kmeans::params recluster_params;
recluster_params.n_clusters = params.n_clusters;
recluster_params.init = cuvs::cluster::kmeans::params::InitMethod::Array;
recluster_params.n_init = 1;

auto weight_opt = std::make_optional(raft::make_const_mdspan(weight.view()));
cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(
handle,
recluster_params,
raft::make_const_mdspan(potentialCentroids),
weight_opt,
centroidsRawData,
inertia.view(),
n_iter.view(),
std::ref(workspace));

⚠️ Potential issue | 🟠 Major

Preserve user parameters during KMeans|| reclustering.

Starting from default kmeans::params drops caller settings such as metric, max_iter, tol, batch_samples, batch_centroids, and verbosity during the weighted reclustering step.

🐛 Proposed fix
-    cuvs::cluster::kmeans::params recluster_params;
-    recluster_params.n_clusters = params.n_clusters;
-    recluster_params.init       = cuvs::cluster::kmeans::params::InitMethod::Array;
-    recluster_params.n_init     = 1;
+    cuvs::cluster::kmeans::params recluster_params = params;
+    recluster_params.init                          = cuvs::cluster::kmeans::params::InitMethod::Array;
+    recluster_params.n_init                        = 1;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cuvs::cluster::kmeans::params recluster_params;
recluster_params.n_clusters = params.n_clusters;
recluster_params.init = cuvs::cluster::kmeans::params::InitMethod::Array;
recluster_params.n_init = 1;
auto weight_opt = std::make_optional(raft::make_const_mdspan(weight.view()));
cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(
handle,
recluster_params,
raft::make_const_mdspan(potentialCentroids),
weight_opt,
centroidsRawData,
inertia.view(),
n_iter.view(),
std::ref(workspace));
cuvs::cluster::kmeans::params recluster_params = params;
recluster_params.init = cuvs::cluster::kmeans::params::InitMethod::Array;
recluster_params.n_init = 1;
auto weight_opt = std::make_optional(raft::make_const_mdspan(weight.view()));
cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(
handle,
recluster_params,
raft::make_const_mdspan(potentialCentroids),
weight_opt,
centroidsRawData,
inertia.view(),
n_iter.view(),
std::ref(workspace));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/cluster/detail/kmeans.cuh` around lines 481 - 495, The reclustering
block currently constructs a fresh cuvs::cluster::kmeans::params
(recluster_params) which loses caller settings like metric, max_iter, tol,
batch_samples, batch_centroids and verbosity; change it to start from the
existing params (copy params into recluster_params) and then only override the
fields that must differ (n_clusters, init, n_init) so that when you call
cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(...) the original
metric, max_iter, tol, batch_samples, batch_centroids and verbosity are
preserved during the weighted reclustering step.
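The fix the review asks for is just copy-then-override. In Python terms (a hypothetical dataclass mirroring a few of the C++ params fields):

```python
import dataclasses

@dataclasses.dataclass
class ParamsSketch:
    # Hypothetical mirror of a subset of cuvs::cluster::kmeans::params.
    n_clusters: int = 8
    max_iter: int = 300
    tol: float = 1e-4
    init: str = "KMeansPlusPlus"
    n_init: int = 1

def recluster_params_from(params: ParamsSketch) -> ParamsSketch:
    # Start from the caller's params so max_iter/tol/etc. survive,
    # then override only what the reclustering step requires.
    return dataclasses.replace(params, init="Array", n_init=1)
```

In the C++ code the same effect is a copy-initialization of recluster_params from params followed by the two overrides.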

Comment thread cpp/src/cluster/detail/kmeans.cuh
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
cpp/src/cluster/detail/kmeans.cuh (1)

481-495: ⚠️ Potential issue | 🟠 Major

Preserve caller parameters during KMeans|| reclustering.

recluster_params starts from defaults, so weighted reclustering drops caller settings such as metric, max_iter, tol, batching, and verbosity. Start from params and only override the fields required for reclustering. This matches an earlier unresolved review thread.

🐛 Proposed fix
-    cuvs::cluster::kmeans::params recluster_params;
-    recluster_params.n_clusters = params.n_clusters;
-    recluster_params.init       = cuvs::cluster::kmeans::params::InitMethod::Array;
-    recluster_params.n_init     = 1;
+    cuvs::cluster::kmeans::params recluster_params = params;
+    recluster_params.n_clusters                    = params.n_clusters;
+    recluster_params.init = cuvs::cluster::kmeans::params::InitMethod::Array;
+    recluster_params.n_init = 1;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/cluster/detail/kmeans.cuh` around lines 481 - 495, recluster_params
is being constructed from defaults which drops caller-specified options (metric,
max_iter, tol, batching, verbosity, etc.); instead initialize recluster_params
from the incoming params (e.g., copy params into recluster_params) and then only
override the fields required for reclustering: set recluster_params.n_clusters =
params.n_clusters, recluster_params.init =
cuvs::cluster::kmeans::params::InitMethod::Array, and recluster_params.n_init =
1 before calling cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(...);
keep the weight handling (weight_opt) and the rest of the arguments unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/cluster/detail/kmeans.cuh`:
- Around line 741-759: The cosine norm convention is inconsistent:
compute_batch_norms applies raft::sqrt_op for DistanceType::CosineExpanded while
centroid norms are computed without sqrt; either make both use the same
convention or disallow CosineExpanded here. Fix by updating the centroid norm
computation (the code that computes norms for centroids/centroid_norms) to also
use raft::sqrt_op when metric == cuvs::distance::DistanceType::CosineExpanded,
or alternatively add an early check before the batching loop to throw/return if
metric == CosineExpanded; ensure you reference compute_batch_norms, L2NormBatch,
and the centroid norm computation path so both sides use identical sqrt behavior
or the metric is rejected consistently.
- Around line 622-637: The KMeans++ host init can proceed with fewer samples
than n_clusters when params.init_size > 0, causing degenerate seeding; before
allocating init_sample in kmeans.cuh, add a guard that if pams.init_size > 0 and
init_sample_size < n_clusters you reject early by throwing a clear exception
(e.g., throw std::invalid_argument) with a message referencing params.init_size
and required n_clusters; place this check just before the call to
raft::make_device_matrix<DataT, IndexT>(handle, init_sample_size, n_features) so
the logic using init_sample never runs with too few rows.
- Around line 651-654: The Random init path constructs a new RngState from only
the seed, losing any non-default RNG type; instead, construct the RNG using
iter_params.rng_state (preserving its type) like the KMeans++ paths do, and then
pass that RNG into raft::matrix::sample_rows when sampling rows into
centroidsRawData from X (i.e., replace raft::random::RngState
random_state(iter_params.rng_state.seed) with constructing/initializing RngState
from iter_params.rng_state so sample_rows uses the configured rng type).

---

Duplicate comments:
In `@cpp/src/cluster/detail/kmeans.cuh`:
- Around line 481-495: recluster_params is being constructed from defaults which
drops caller-specified options (metric, max_iter, tol, batching, verbosity,
etc.); instead initialize recluster_params from the incoming params (e.g., copy
params into recluster_params) and then only override the fields required for
reclustering: set recluster_params.n_clusters = params.n_clusters,
recluster_params.init = cuvs::cluster::kmeans::params::InitMethod::Array, and
recluster_params.n_init = 1 before calling
cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(...); keep the weight
handling (weight_opt) and the rest of the arguments unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 44e402a4-4ebe-4e58-be48-8589e58d62cc

📥 Commits

Reviewing files that changed from the base of the PR and between 30c457c and 269f23c.

📒 Files selected for processing (2)
  • cpp/src/cluster/detail/kmeans.cuh
  • python/cuvs/cuvs/cluster/kmeans/kmeans.pyx

Comment on lines +622 to +637
IndexT default_init_size =
std::min(static_cast<IndexT>(std::int64_t{3} * n_clusters), n_samples);
IndexT init_sample_size = pams.init_size > 0
? std::min(static_cast<IndexT>(pams.init_size), n_samples)
: default_init_size;

if (pams.init_size <= 0 && init_sample_size < n_samples) {
RAFT_LOG_WARN(
"KMeans.fit: KMeans++ initialization is using a random subsample of %zu/%zu host rows "
"(params.init_size=0 defaults to min(3 * n_clusters, n_samples) for host data). "
"Set params.init_size to n_samples to use the full dataset for seeding.",
static_cast<size_t>(init_sample_size),
static_cast<size_t>(n_samples));
}

auto centroidsRawData = raft::make_device_matrix<DataT, IndexT>(handle, n_clusters, n_features);
init_sample = raft::make_device_matrix<DataT, IndexT>(handle, init_sample_size, n_features);

⚠️ Potential issue | 🟠 Major

Reject KMeans++ init samples smaller than n_clusters.

When params.init_size > 0 but less than n_clusters, host KMeans++ samples too few rows and then tries to seed n_clusters centroids from that smaller sample. Reject this early to avoid degenerate duplicate/zero-probability seeding.

🐛 Proposed fix
       IndexT init_sample_size = pams.init_size > 0
                                   ? std::min(static_cast<IndexT>(pams.init_size), n_samples)
                                   : default_init_size;
+      RAFT_EXPECTS(init_sample_size >= n_clusters,
+                   "invalid parameter (init_size must be 0 or >= n_clusters)");
 
       if (pams.init_size <= 0 && init_sample_size < n_samples) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/cluster/detail/kmeans.cuh` around lines 622-637: the KMeans++ host
init can proceed with fewer samples than n_clusters when params.init_size > 0,
causing degenerate seeding. Before allocating init_sample in kmeans.cuh, add a
guard that rejects early (e.g., throw std::invalid_argument) when
pams.init_size > 0 and init_sample_size < n_clusters, with a clear message
referencing params.init_size and the required n_clusters. Place this check
just before the call to raft::make_device_matrix<DataT, IndexT>(handle,
init_sample_size, n_features) so the logic using init_sample never runs with
too few rows.
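The fallback-then-clamp logic above, plus the guard the review proposes, can be sketched as one standalone helper (hypothetical function name, plain C++ stand-in for the IndexT arithmetic in kmeans.cuh):

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>

// init_size == 0 falls back to min(3 * n_clusters, n_samples); a positive
// init_size is clamped to n_samples. The guard suggested in the review then
// rejects samples too small to seed n_clusters distinct centroids.
std::int64_t resolve_init_sample_size(std::int64_t init_size,
                                      std::int64_t n_clusters,
                                      std::int64_t n_samples)
{
  std::int64_t default_init_size = std::min(3 * n_clusters, n_samples);
  std::int64_t resolved =
    init_size > 0 ? std::min(init_size, n_samples) : default_init_size;
  if (resolved < n_clusters) {
    throw std::invalid_argument("init_size must be 0 or >= n_clusters");
  }
  return resolved;
}
```

With init_size = 0, n_clusters = 100, and n_samples = 10000, this resolves to 300; an explicit init_size of 50 for 100 clusters is rejected rather than producing duplicate or zero-probability seeds.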

Comment on lines +651 to +654
    raft::random::RngState random_state(iter_params.rng_state.seed);

    if (iter_params.init == cuvs::cluster::kmeans::params::InitMethod::Random) {
      raft::matrix::sample_rows(handle, random_state, X, centroidsRawData);

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Inspect RngState construction patterns around KMeans initialization.
rg -n -C3 'RngState random_state|RngState rng' --iglob '*kmeans*.cuh' --iglob '*kmeans*.cu'

Repository: rapidsai/cuvs

Length of output: 3467


🏁 Script executed:

#!/bin/bash
# Check RngState constructor signature
find . -type f \( -name "*.hpp" -o -name "*.h" -o -name "*.cuh" \) | xargs grep -l "RngState" | head -5

Repository: rapidsai/cuvs

Length of output: 266


🏁 Script executed:

#!/bin/bash
# Find and inspect RngState class definition
fd -e hpp -e h "rng" | xargs grep -l "class RngState" 2>/dev/null | head -3

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Look at the context around line 651 to understand iter_params
sed -n '640,665p' cpp/src/cluster/detail/kmeans.cuh

Repository: rapidsai/cuvs

Length of output: 1307


🏁 Script executed:

#!/bin/bash
# Search for RngState class definition in raft
find . -type f \( -name "*.hpp" -o -name "*.cuh" \) -path "*/raft/*" | xargs grep -l "class RngState" 2>/dev/null | head -5

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Look for RngState constructor definition
rg "class RngState|struct RngState" -A 20 --type cpp --type cuh 2>/dev/null | head -60

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Check the rng_state struct in kmeans params
rg "struct.*RngState|class.*RngState" -B 2 -A 10 cpp/include/cuvs/cluster/kmeans.hpp

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Check the rng_state field in kmeans params and its type
rg "rng_state" cpp/include/cuvs/cluster/kmeans.hpp -B 2 -A 2

Repository: rapidsai/cuvs

Length of output: 294


🏁 Script executed:

#!/bin/bash
# Check what RngState is and its definition location
rg "include.*random" cpp/src/cluster/detail/kmeans.cuh | head -20

Repository: rapidsai/cuvs

Length of output: 105


🏁 Script executed:

#!/bin/bash
# Verify the context of all RngState constructions to see the pattern
rg -n "RngState.*seed.*type" cpp/src/cluster/detail/kmeans.cuh

Repository: rapidsai/cuvs

Length of output: 216


🏁 Script executed:

#!/bin/bash
# Check the RngState default member initialization to see what type it has
rg "RngState.*rng_state" cpp/include/cuvs/cluster/kmeans.hpp -B 3 -A 3

Repository: rapidsai/cuvs

Length of output: 232


🏁 Script executed:

#!/bin/bash
# Look at how sample_rows is called with random_state to understand expectations
rg -n "sample_rows.*random_state" cpp/src/cluster/detail/kmeans.cuh -B 2 -A 2

Repository: rapidsai/cuvs

Length of output: 639


🏁 Script executed:

#!/bin/bash
# Check if there are any other places where Random init path is used
rg -n "InitMethod::Random" cpp/src/cluster/detail/kmeans.cuh -B 5 -A 5

Repository: rapidsai/cuvs

Length of output: 1171


Preserve the configured RNG type for row sampling.

This constructs RngState from the seed only, unlike the KMeans++ and KMeans++ subsampling paths at lines 144 and 316 that preserve params.rng_state.type. Non-default RNG configurations will be ignored for the Random initialization method.

Proposed fix
-    raft::random::RngState random_state(iter_params.rng_state.seed);
+    raft::random::RngState random_state(iter_params.rng_state.seed, iter_params.rng_state.type);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    raft::random::RngState random_state(iter_params.rng_state.seed);
+    raft::random::RngState random_state(iter_params.rng_state.seed, iter_params.rng_state.type);
     if (iter_params.init == cuvs::cluster::kmeans::params::InitMethod::Random) {
       raft::matrix::sample_rows(handle, random_state, X, centroidsRawData);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/cluster/detail/kmeans.cuh` around lines 651-654: the Random init
path constructs a new RngState from only the seed, losing any non-default RNG
type. Instead, construct the RNG from iter_params.rng_state (preserving its
type), as the KMeans++ paths do, and pass that RNG into
raft::matrix::sample_rows when sampling rows from X into centroidsRawData;
i.e., replace raft::random::RngState random_state(iter_params.rng_state.seed)
with a construction that also forwards iter_params.rng_state.type so
sample_rows uses the configured RNG type.
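A minimal mock illustrating why the seed-only constructor loses configuration. The struct below is an illustrative stand-in, not raft's actual raft::random::RngState definition; the generator-type names are assumptions:

```cpp
#include <cstdint>

// Illustrative stand-in for a generator-type enum and an RngState that,
// like the real one, defaults the type when constructed from a seed alone.
enum class GeneratorType { PhiloxGenerator, PCGenerator };

struct RngState {
  std::uint64_t seed;
  GeneratorType type;
  // Seed-only constructor silently falls back to the default generator --
  // this is how the Random init path drops a non-default configuration.
  explicit RngState(std::uint64_t s) : seed(s), type(GeneratorType::PhiloxGenerator) {}
  RngState(std::uint64_t s, GeneratorType t) : seed(s), type(t) {}
};

// The fix from the review: forward both seed and type instead of seed alone.
RngState make_sampling_rng(const RngState& configured)
{
  return RngState(configured.seed, configured.type);
}
```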

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
Comment on lines +741 to +759
  bool need_compute_norms = metric == cuvs::distance::DistanceType::L2Expanded ||
                            metric == cuvs::distance::DistanceType::L2SqrtExpanded ||
                            metric == cuvs::distance::DistanceType::CosineExpanded;
  bool use_norm_cache = need_compute_norms && !data_on_device;
  std::vector<DataT> h_norm_cache;
  if (use_norm_cache) { h_norm_cache.resize(n_samples); }
  bool norms_cached = false;

  auto compute_batch_norms = [&](const DataT* batch_ptr, IndexT batch_size) {
    auto batch_view =
      raft::make_device_matrix_view<const DataT, IndexT>(batch_ptr, batch_size, n_features);
    auto norm_view =
      raft::make_device_vector_view<DataT, IndexT>(L2NormBatch.data_handle(), batch_size);
    if (metric == cuvs::distance::DistanceType::CosineExpanded) {
      raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
        handle, batch_view, norm_view, raft::sqrt_op{});
    } else {
      raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
        handle, batch_view, norm_view);

⚠️ Potential issue | 🟠 Major

Make cosine norm handling consistent or reject cosine here.

compute_batch_norms uses sqrt_op for CosineExpanded, but centroid norms are computed without sqrt_op. If cosine is supported in this path, centroid norms should use the same convention; if not, reject CosineExpanded before entering the loop.

🐛 Proposed fix if cosine is intended to be supported
       if (need_compute_norms) {
-        raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
-          handle, centroids_const, centroid_norms_buf.view());
+        if (metric == cuvs::distance::DistanceType::CosineExpanded) {
+          raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
+            handle, centroids_const, centroid_norms_buf.view(), raft::sqrt_op{});
+        } else {
+          raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
+            handle, centroids_const, centroid_norms_buf.view());
+        }
         centroid_norms_opt = raft::make_device_vector_view<const DataT, IndexT>(
           centroid_norms_buf.data_handle(), n_clusters);
       }

Also applies to: 802-809

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/cluster/detail/kmeans.cuh` around lines 741-759: the cosine norm
convention is inconsistent. compute_batch_norms applies raft::sqrt_op for
DistanceType::CosineExpanded while centroid norms are computed without sqrt;
either make both use the same convention or disallow CosineExpanded here. Fix
by updating the centroid norm computation to also use raft::sqrt_op when
metric == cuvs::distance::DistanceType::CosineExpanded, or add an early check
before the batching loop that rejects metric == CosineExpanded. Either way,
compute_batch_norms, L2NormBatch, and the centroid norm path must use
identical sqrt behavior, or the metric must be rejected consistently.
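A host-side stand-in showing the two norm conventions in play. The review's point is that batch norms and centroid norms must pick the same one; the function name below is illustrative, not part of the cuvs API:

```cpp
#include <cmath>
#include <vector>

// Expanded L2 distances consume squared row norms, while the cosine path
// wants true L2 norms (hence sqrt_op in compute_batch_norms). If the data
// side takes the sqrt and the centroid side does not, the resulting
// distances mix norm^2 with norm -- the inconsistency the review flags.
double row_norm(const std::vector<double>& row, bool take_sqrt)
{
  double sq = 0.0;
  for (double v : row) { sq += v * v; }
  return take_sqrt ? std::sqrt(sq) : sq;
}
```

For the row {3, 4} the squared convention yields 25 and the sqrt convention yields 5; whichever one a metric uses, both sides of the distance computation must use it.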

@tarang-jain tarang-jain requested a review from dantegd April 24, 2026 21:51

  for (n_current_iter = 1; n_current_iter <= iter_params.max_iter; ++n_current_iter) {
    if (n_current_iter > 1) {
      RAFT_CUDA_TRY(cudaEventSynchronize(convergence_event));
Contributor Author

The convergence event is recorded right after the inertia is computed, but we check it here, at the top of the next iteration, so that the work launched after the convergence check is not blocked. Using an event also means only the operations enqueued before the event must complete for the synchronization to return.
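The deferred check can be sketched on the host with a plain array standing in for the CUDA event (illustrative names; per-iteration inertia values are taken as given):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Iteration i's inertia is "recorded" at the end of that iteration
// (cudaEventRecord analogue) but only examined at the top of iteration
// i + 1 (cudaEventSynchronize analogue), so work queued after the check
// never stalls on the inertia reduction.
int run_until_converged(const std::vector<double>& inertia_per_iter, double tol)
{
  int iters = 0;
  for (std::size_t i = 0; i < inertia_per_iter.size(); ++i) {
    // Loop-head check: compare the two most recently *completed* inertia
    // values before starting iteration i's work.
    if (i >= 2 &&
        std::abs(inertia_per_iter[i - 1] - inertia_per_iter[i - 2]) < tol) {
      break;
    }
    // ... iteration i's assignment + centroid update would run here;
    // its inertia (inertia_per_iter[i]) is recorded at the end ...
    ++iters;
  }
  return iters;
}
```

The trade-off, as the author notes, is overlap: convergence is detected one iteration late, in exchange for never serializing the stream on the check.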


Labels

breaking (Introduces a breaking change), cpp, improvement (Improves an existing functionality)


4 participants