
[Cleanup] Combine Batched and Regular KMeans Impl #2015

Open
tarang-jain wants to merge 40 commits into rapidsai:main from tarang-jain:combine-batch

Conversation

@tarang-jain
Contributor

@tarang-jain tarang-jain commented Apr 10, 2026

Combine batched and regular k-means implementations

  • Unified the batched (host-data) and regular (device-data) k-means fit into a single kmeans_fit template that works with both host and device mdspans via batch_load_iterator
  • Unified the device and host initialization paths in init_centroids
  • Removed the inertia_check parameter — inertia-based convergence checking now always runs. Zero clustering cost (a perfect fit) logs a warning instead of asserting; this is needed because spectral clustering can cause all points to converge on the cluster centroids themselves.
  • Added init_size parameter to control how many samples are drawn for KMeansPlusPlus initialization. Defaults to n_samples for device data and min(3 * n_clusters, n_samples) for host data
  • Replaced per-iteration centroid raft::copy with std::swap of buffer pointers
  • For streaming fit, precompute data norms once and cache them: host norms cached to a host buffer on the first iteration and copied back for subsequent iterations. process_batch no longer computes norms internally
  • Replaced raw cudaPointerGetAttributes call with raft::memory_type_from_pointer
  • Replaced cub::DeviceReduce::Sum calls with raft::linalg::mapThenSumReduce
  • Guarded weight normalization against overflow: apply (w / wt_sum) * n_samples via a composed op instead of precomputing a scale, so very small wt_sum values don't produce inf
  • Renamed checkWeight to weightSum and made it mdspan-based with an Accessor template: device reduce for device weights, host loop for host weights. Callers apply the scaling themselves
  • Eliminated batch_sums / batch_counts scratch buffers by accumulating directly into centroid_sums / weight_per_cluster via reset_sums=false in reduce_rows_by_key / reduce_cols_by_key, removing two per-batch raft::linalg::add kernels
  • Removed dead update_centroids helpers (both the detail and public template) — no remaining callers after the fit_main consolidation
  • Perf: removed multiple raft::sync_stream calls and added a CUDA event to record whether the convergence criterion is met; the convergence check is now done on device. Average per-iteration time with the now-mandatory inertia check matches previous benchmarks, including runs where the inertia check used to be disabled.

@copy-pr-bot

copy-pr-bot Bot commented Apr 10, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@tarang-jain tarang-jain self-assigned this Apr 10, 2026
@tarang-jain tarang-jain added improvement Improves an existing functionality non-breaking Introduces a non-breaking change cpp labels Apr 10, 2026
@tarang-jain tarang-jain marked this pull request as ready for review April 14, 2026 01:10
@tarang-jain tarang-jain requested review from a team as code owners April 14, 2026 01:10
Comment thread c/include/cuvs/cluster/kmeans.h
Contributor

@viclafargue viclafargue left a comment


Thanks! Here are some comments.


auto minClusterAndDistance = raft::make_device_vector<raft::KeyValuePair<IndexT, DataT>, IndexT>(
handle, streaming_batch_size);
auto L2NormBatch = raft::make_device_vector<DataT, IndexT>(handle, streaming_batch_size);
Contributor


params.streaming_batch_size = 0 by default in the data-on-device case, but nothing prevents a user from setting a value. This would allocate an L2NormBatch smaller than n_samples, causing out-of-bounds writes (and later reads) during norm computation.

We should probably guard this with a check:
RAFT_EXPECTS(streaming_batch_size == n_samples || !data_on_device, ...)

Contributor Author

@tarang-jain tarang-jain Apr 21, 2026


I have updated this so that for device arrays, we simply ignore the streaming_batch_size and use the entire dataset always.

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
Comment on lines +661 to +663
auto init_sample =
raft::make_device_matrix<DataT, IndexT>(handle, init_sample_size, n_features);
raft::matrix::sample_rows(handle, random_state, X, init_sample.view());
Contributor


Suggested change
auto init_sample =
raft::make_device_matrix<DataT, IndexT>(handle, init_sample_size, n_features);
raft::matrix::sample_rows(handle, random_state, X, init_sample.view());
if (init_sample_size == n_samples && data_on_device) {
auto init_sample_const = raft::make_device_matrix_view<const DataT, IndexT>(X.data_handle(), n_samples, n_features);
// pass directly to kmeansPlusPlus / initScalableKMeansPlusPlus
} else {
auto init_sample = raft::make_device_matrix<DataT, IndexT>(handle, init_sample_size, n_features);
raft::matrix::sample_rows(handle, random_state, X, init_sample.view());
// pass init_sample to kmeansPlusPlus / initScalableKMeansPlusPlus
}

If init_size = 0 in the data on device path, we basically double memory use by copying the dataset over. Let's skip this by creating a view on the dataset.

Contributor Author


I completely skipped the sampling for the device path. That is how it was being done earlier. The init size is only used if the data is on host.

Comment on lines +731 to +732
auto batch_workspace = rmm::device_uvector<char>(
current_batch_sz, stream, raft::resource::get_workspace_resource(handle));
Contributor


Every call to process_batch allocates both this workspace and the device scalar below. Both buffers could be instantiated out of the process_batch function.

Contributor Author


Moved the workspace buffer allocation outside the process_batch function.

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
raft::matrix::sample_rows(handle, random_state, X, centroidsRawData);
} else if (iter_params.init == cuvs::cluster::kmeans::params::InitMethod::KMeansPlusPlus) {
IndexT default_init_size =
data_on_device ? n_samples : std::min(static_cast<IndexT>(3 * n_clusters), n_samples);
Contributor


Unlikely to be an actual issue, but n_clusters could be cast before the multiplication to avoid any risk of integer overflow.

Contributor Author


I have gotten rid of the batching for the device path. So when a user sets a batch size for device mdspan, we just set it to n_samples and warn the user. We should definitely not be creating a new buffer just for the init sample if we can accommodate the entire input matrix on device already.

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
Comment on lines +876 to +881
DataT curClusteringCost = DataT{0};
raft::copy(&curClusteringCost, clustering_cost.data_handle(), 1, stream);
raft::resource::sync_stream(handle, stream);

if (curClusteringCost == DataT{0}) {
RAFT_LOG_WARN("Zero clustering cost detected: all points coincide with their centroids.");
Contributor


Going from ASSERT to RAFT_LOG_WARN may indeed be useful for the spectral clustering case. However, removing the inertia_check option forces the sync at every iteration. Do we truly need to drop this option?

Member


I'm not sure we need the log and an assert might be better?

Early stopping (aka skipping iterations) is ultimately going to be the best way to extract perf here. Whether it's by explicitly computing inertia or just looking at the residuals of the centroids from the prior iteration.

Seems like inertia check / residuals could be done on gpu if we had to in order to avoid syncing so we would only need to sync in the final iteration, right?

Contributor Author

@tarang-jain tarang-jain Apr 20, 2026


Seems like inertia check / residuals could be done on gpu if we had to in order to avoid syncing so we would only need to sync in the final iteration, right?

Until the iteration has completed, the CPU should not start the next iteration. So all the operations on the GPU stream must complete to finish the iteration.

Contributor Author


Seems like inertia check / residuals could be done on gpu if we had to in order to avoid syncing so we would only need to sync in the final iteration, right?

Yes, but this is throwing an error in the spectral clustering case wherein all the points converge on the centroids themselves. This is happening in one of the spectral tests and an assertion here is leading to an error, where instead it should simply return those centroids directly.

Contributor Author


I'm not sure we need the log and an assert might be better?

Therefore, I had to change it to a warning instead of an assertion. Earlier those spectral tests were skipping the inertia check which was avoiding the assertion.

Comment on lines +638 to +646
} else {
std::vector<DataT> h_weights(n_samples);
auto d_view = raft::make_device_vector_view<const DataT, IndexT>(weight_ptr, n_samples);
auto h_view = raft::make_host_vector_view<DataT, IndexT>(h_weights.data(), n_samples);
raft::copy(handle, h_view, d_view);
raft::resource::sync_stream(handle);
for (IndexT i = 0; i < n_samples; ++i) {
wt_sum += h_weights[i];
}
Contributor


In the data-on-device case, since the data is already on device, it would be much faster to sum-reduce with cub::DeviceReduce::Sum or raft::linalg::reduce. The summation would also have better precision, since it is done in a tree fashion (rounding error grows as O(log N) rather than O(N)).

Contributor Author


When it's device-accessible, I have changed that to a raft::linalg::mapThenSumReduce. I have also removed this function and directly updated checkWeight (renamed to weightSum). We do the scaling after the weight sum is computed.

@@ -33,9 +33,10 @@ cdef extern from "cuvs/cluster/kmeans.h" nogil:
int batch_samples,
int batch_centroids,
bool inertia_check,
Contributor


Maybe add a comment saying the field is present but deprecated.

Contributor Author


I don't think it's necessary to add a comment here in the .pxd. The C header already has that information, and this file will be updated along with the C headers / src files.

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
@tarang-jain
Contributor Author

tarang-jain commented Apr 22, 2026

@viclafargue ABI won't break — we are allowed to add new members to structs and enums. However, I have added the breaking label to this PR because downstream users of the C++ API will be affected directly.

@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Deprecates C/API inertia_check (kept for ABI), adds init_size to params and C/Python bindings, removes legacy host-batched implementation, and refactors KMeans into a unified mdspan-based streaming-batched pipeline with host overloads, weight-normalization changes, and related utility/signature updates.

Changes

Cohort / File(s) Summary
C API header
c/include/cuvs/cluster/kmeans.h
Marked inertia_check deprecated/ignored (ABI kept), fixed spacing, added int64_t init_size, and introduced struct cuvsKMeansParams_v1 (omits legacy inertia_check).
C API impl
c/src/cluster/kmeans.cpp
Conversion maps cuvsKMeansParams::init_size into internal params and forces internal inertia_check = false during C→C++ param translation.
C++ public params
cpp/include/cuvs/cluster/kmeans.hpp
Removed params::inertia_check; added params::init_size (int64_t, default 0) with host/device semantics and documented behavior.
Unified mdspan fit & host overloads
cpp/src/cluster/detail/kmeans.cuh, cpp/src/cluster/kmeans_impl.cuh
Introduced unified detail::kmeans_fit using mdspan, streaming-batched loop (ping‑pong centroids), host overloads, and new public host-matrix fit forwarding to mdspan implementation.
Removed batched impl
cpp/src/cluster/detail/kmeans_batched.cuh
Deleted legacy host-batched KMeans implementation (host init, weight scaling, per-batch accumulation, and previous fit driver).
Common utilities & batch processing
cpp/src/cluster/detail/kmeans_common.cuh, cpp/src/cluster/detail/minClusterDistanceCompute.cu
Replaced checkWeight with weightSum; added process_batch; extended min-cluster funcs to accept optional precomputed_centroid_norms; adjusted sync usage and related signatures.
Multi‑GPU changes
cpp/src/cluster/detail/kmeans_mg.cuh
Replaced two-pass CUB reduction with mapThenSumReduce; changed weight-normalization to composed ops; always compute and all-reduce clustering cost each iteration (warn on zero); removed inertia_check gating.
Removed helper
cpp/src/cluster/kmeans.cuh
Deleted update_centroids template helper (workspace/alloc wrapper removed).
Frontend instantiations
cpp/src/cluster/kmeans_fit_float.cu, cpp/src/cluster/kmeans_fit_double.cu
Removed include of batched header; host-matrix overloads now call public cuvs::cluster::kmeans::fit instead of internal batched fit.
Tests
cpp/tests/cluster/kmeans.cu
Batched test removed explicit params.inertia_check = true; uses default behavior for inertia checking.
Python bindings
python/cuvs/cuvs/cluster/kmeans/kmeans.pxd, python/cuvs/cuvs/cluster/kmeans/kmeans.pyx
Reordered C struct fields and added init_size; inertia_check treated as deprecated/ignored (emits warning); added init_size ctor arg and property; removed inertia_check property.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check ✅ Passed — the title accurately captures the main objective: combining batched and regular KMeans implementations into a unified implementation.
  • Linked Issues check ✅ Passed — check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed — check skipped because no linked issues were found for this pull request.
  • Description check ✅ Passed — the PR description accurately describes the key changes: unifying batched and regular k-means implementations, removing the inertia_check parameter, adding the init_size parameter, and various optimization and refactoring changes documented in the raw summary.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
cpp/src/cluster/detail/kmeans.cuh (1)

591-599: ⚠️ Potential issue | 🟡 Minor

Exact float comparison for wt_sum == n_samples — inconsistent with kmeans_predict.

kmeans_predict (lines 998-999) uses a relative-tolerance check:

const DataT rel_tol = n_samples * std::numeric_limits<DataT>::epsilon();
if (std::abs(wt_sum - n_samples) > rel_tol) { ... }

But kmeans_fit still compares with != on both line 595 (to emit the debug message) and line 696 (inside prepare_batch_weights, which governs whether the per-batch scaling op runs). With a large n_samples, summation imprecision means the != branch will almost always trigger even when user weights genuinely sum to n_samples, causing an unnecessary map/divide on every batch and every outer iteration.

🛠️ Proposed fix — mirror the predict path
-  DataT wt_sum = sample_weight.has_value() ? weightSum(handle, sample_weight.value())
-                                           : static_cast<DataT>(n_samples);
-  if (sample_weight.has_value() && wt_sum != static_cast<DataT>(n_samples)) {
+  DataT wt_sum = sample_weight.has_value() ? weightSum(handle, sample_weight.value())
+                                           : static_cast<DataT>(n_samples);
+  const DataT wt_rel_tol = static_cast<DataT>(n_samples) * std::numeric_limits<DataT>::epsilon();
+  const bool needs_wt_norm =
+    sample_weight.has_value() && std::abs(wt_sum - static_cast<DataT>(n_samples)) > wt_rel_tol;
+  if (needs_wt_norm) {
     RAFT_LOG_DEBUG(
       "[Warning!] KMeans: normalizing the user provided sample weight to sum up to %zu samples",
       static_cast<size_t>(n_samples));
   }

Then replace wt_sum != static_cast<DataT>(n_samples) on line 696 with the captured needs_wt_norm flag.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: bc3ad388-89a6-4de0-87f7-fd8710a65b58

📥 Commits

Reviewing files that changed from the base of the PR and between f2bffb6 and 410092c.

📒 Files selected for processing (15)
  • c/include/cuvs/cluster/kmeans.h
  • c/src/cluster/kmeans.cpp
  • cpp/include/cuvs/cluster/kmeans.hpp
  • cpp/src/cluster/detail/kmeans.cuh
  • cpp/src/cluster/detail/kmeans_batched.cuh
  • cpp/src/cluster/detail/kmeans_common.cuh
  • cpp/src/cluster/detail/kmeans_mg.cuh
  • cpp/src/cluster/detail/minClusterDistanceCompute.cu
  • cpp/src/cluster/kmeans.cuh
  • cpp/src/cluster/kmeans_fit_double.cu
  • cpp/src/cluster/kmeans_fit_float.cu
  • cpp/src/cluster/kmeans_impl.cuh
  • cpp/tests/cluster/kmeans.cu
  • python/cuvs/cuvs/cluster/kmeans/kmeans.pxd
  • python/cuvs/cuvs/cluster/kmeans/kmeans.pyx
💤 Files with no reviewable changes (2)
  • cpp/src/cluster/kmeans.cuh
  • cpp/src/cluster/detail/kmeans_batched.cuh

Comment on lines +123 to 199
/**
* @brief Hyper-parameters for the kmeans algorithm
*/
struct cuvsKMeansParams_v1 {
  cuvsDistanceType metric;

  /**
   * The number of clusters to form as well as the number of centroids to generate (default: 8).
   */
  int n_clusters;

  /**
   * Method for initialization, defaults to k-means++:
   *  - cuvsKMeansInitMethod::KMeansPlusPlus (k-means++): Use scalable k-means++ algorithm
   *    to select the initial cluster centers.
   *  - cuvsKMeansInitMethod::Random (random): Choose 'n_clusters' observations (rows) at
   *    random from the input data for the initial centroids.
   *  - cuvsKMeansInitMethod::Array (ndarray): Use 'centroids' as initial cluster centers.
   */
  cuvsKMeansInitMethod init;

  /**
   * Maximum number of iterations of the k-means algorithm for a single run.
   */
  int max_iter;

  /**
   * Relative tolerance with regards to inertia to declare convergence.
   */
  double tol;

  /**
   * Number of instances the k-means algorithm will be run with different seeds.
   */
  int n_init;

  /**
   * Oversampling factor for use in the k-means|| algorithm.
   */
  double oversampling_factor;

  /**
   * batch_samples and batch_centroids are used to tile the 1NN computation, which is
   * useful to optimize/control the memory footprint.
   * Default tile is [batch_samples x n_clusters], i.e. when batch_centroids is 0
   * the centroids are not tiled.
   */
  int batch_samples;

  /**
   * if 0 then batch_centroids = n_clusters
   */
  int batch_centroids;

  /**
   * Whether to use hierarchical (balanced) kmeans or not.
   */
  bool hierarchical;

  /**
   * For hierarchical k-means, defines the number of training iterations.
   */
  int hierarchical_n_iters;

  /**
   * Number of samples to process per GPU batch for the batched (host-data) API.
   * When set to 0, defaults to n_samples (process all at once).
   */
  int64_t streaming_batch_size;

  /**
   * Number of samples to draw for KMeansPlusPlus initialization.
   * When set to 0, uses the heuristic min(3 * n_clusters, n_samples) for host data,
   * or n_samples for device data.
   */
  int64_t init_size;
};


⚠️ Potential issue | 🟠 Major

cuvsKMeansParams_v1 is declared but not reachable through the standard C API pattern.

The existing API exposes cuvsKMeansParams via cuvsKMeansParams_t typedef and cuvsKMeansParamsCreate/cuvsKMeansParamsDestroy. The new cuvsKMeansParams_v1 struct has:

  • No corresponding cuvsKMeansParams_v1_t typedef.
  • No cuvsKMeansParamsCreate_v1 / cuvsKMeansParamsDestroy_v1 (or equivalent) declarations.

As a result, C consumers cannot allocate/initialize/free a cuvsKMeansParams_v1 using the established pattern, and the struct effectively becomes dead surface area until 26.08. Please either add the companion typedef and lifecycle APIs now, or defer adding the v1 struct until the matching helpers land together.

As per coding guidelines, c/include/cuvs/**/*: public C API headers must include Doxygen documentation for all public functions/classes, and API changes require deprecation warnings and migration guide updates.


/**
* @brief Hyper-parameters for the kmeans algorithm
*/
struct cuvsKMeansParams_v1 {


⚠️ Potential issue | 🟡 Minor

Stray leading whitespace before struct cuvsKMeansParams_v1.

-/**
- * @brief Hyper-parameters for the kmeans algorithm
- */
- struct cuvsKMeansParams_v1 {
+/**
+ * @brief Hyper-parameters for the kmeans algorithm
+ */
+struct cuvsKMeansParams_v1 {

Comment on lines 115 to +123
/**
* If true, check inertia during iterations for early convergence.
* Number of samples to randomly draw for the KMeansPlusPlus initialization
* step. A random subset of this size is used for centroid seeding.
* When set to 0 the default depends on the data location:
* - Device data: n_samples (use the full dataset).
* - Host data: min(3 * n_clusters, n_samples).
* Default: 0.
*/
bool inertia_check = false;
int64_t init_size = 0;


⚠️ Potential issue | 🟠 Major

Breaking removal of inertia_check from the public C++ params struct lacks deprecation/migration notes.

The cuvs::cluster::kmeans::params::inertia_check field has been removed outright (replaced by init_size). Any downstream C++ code that sets or reads params.inertia_check will now fail to compile with a hard error, despite the PR being labeled non-breaking. The C API preserves the field for ABI compatibility (marked deprecated/ignored), but the C++ API does not offer the same grace period.

Please either:

  1. Retain a deprecated [[deprecated("ignored; use init_size instead")]] bool inertia_check = false; stub in the struct for one release cycle so existing call sites keep compiling with a warning, and
  2. Add a migration note in the project docs/CHANGELOG explaining that inertia_check is gone and that inertia-based convergence is now always on.

As per coding guidelines: "API changes flagged for docs/ updates" and "Breaking changes require deprecation warnings and migration guide updates".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/cuvs/cluster/kmeans.hpp` around lines 115 - 123, Restore a
deprecated stub for the removed field in the cuvs::cluster::kmeans::params
struct by reintroducing a [[deprecated("ignored; use init_size instead")]] bool
inertia_check = false; member so existing C++ call sites compile with a warning
(reference cuvs::cluster::kmeans::params and inertia_check), and update the
project docs/CHANGELOG with a short migration note stating inertia_check is
deprecated/ignored and that inertia-based convergence is now always on (point
users to use init_size instead); keep the C API behavior unchanged.
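For reference, the init_size default selection described in the struct comment can be sketched as a small helper (hypothetical, not part of the cuvs API):

```python
def default_init_size(n_clusters: int, n_samples: int, on_device: bool) -> int:
    """Resolve init_size == 0 to the documented data-location default:
    the full dataset for device data, min(3 * n_clusters, n_samples)
    for host (out-of-core) data."""
    return n_samples if on_device else min(3 * n_clusters, n_samples)
```

An explicit positive init_size would simply take precedence over this heuristic (clamped to n_samples).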

Comment on lines +79 to +82
init_size : int
Number of samples to draw for KMeansPlusPlus initialization with
host (out-of-core) data. When set to 0, uses the heuristic
min(3 * n_clusters, n_samples). Default: 0.

⚠️ Potential issue | 🟡 Minor

Docstring understates init_size scope and Python API drops inertia_check without deprecation.

Two concerns:

  1. The docstring says init_size applies only to "host (out-of-core) data", but cpp/include/cuvs/cluster/kmeans.hpp documents it for both device and host data (defaulting to n_samples on device vs. min(3 * n_clusters, n_samples) on host when set to 0). Users on device data will not realize they can tune this parameter.

  2. The previously accepted inertia_check= keyword on KMeansParams.__init__ was removed outright. Existing Python code passing KMeansParams(inertia_check=...) will now raise TypeError: unexpected keyword argument. Consider accepting the kwarg for one release cycle (warn via DeprecationWarning and ignore) to match the ABI-preservation strategy used for the C struct.

📝 Suggested docstring adjustment
     init_size : int
-        Number of samples to draw for KMeansPlusPlus initialization with
-        host (out-of-core) data. When set to 0, uses the heuristic
-        min(3 * n_clusters, n_samples). Default: 0.
+        Number of samples to draw for the KMeansPlusPlus initialization
+        step. When set to 0, the default depends on data location:
+        ``n_samples`` for device data, ``min(3 * n_clusters, n_samples)``
+        for host data. Default: 0.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
init_size : int
Number of samples to draw for KMeansPlusPlus initialization with
host (out-of-core) data. When set to 0, uses the heuristic
min(3 * n_clusters, n_samples). Default: 0.
init_size : int
Number of samples to draw for the KMeansPlusPlus initialization
step. When set to 0, the default depends on data location:
``n_samples`` for device data, ``min(3 * n_clusters, n_samples)``
for host data. Default: 0.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuvs/cuvs/cluster/kmeans/kmeans.pyx` around lines 79 - 82, Update the
KMeans docstring for init_size to state that it applies to both device and host
data and document the differing defaults/behaviors (on device default =
n_samples when init_size==0; on host default = min(3 * n_clusters, n_samples))
referencing the same semantics as cpp/include/cuvs/cluster/kmeans.hpp; also
restore acceptance of the inertia_check keyword in KMeansParams.__init__ (class
KMeansParams) so existing callers do not raise TypeError, emit a
DeprecationWarning and ignore the provided value for one release (no-op) to
match the C-struct ABI-preservation strategy.
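A transition shim along the lines the review suggests could look like this (a Python sketch with hypothetical names, mirroring the DeprecationWarning strategy rather than the actual Cython binding):

```python
import warnings

class KMeansParamsSketch:
    """Hypothetical stand-in for KMeansParams: accept the removed
    inertia_check keyword for one release, warn, and ignore it."""

    def __init__(self, n_clusters=8, init_size=0, inertia_check=None):
        if inertia_check is not None:
            warnings.warn(
                "inertia_check is deprecated and ignored; inertia-based "
                "convergence checking now always runs",
                DeprecationWarning,
                stacklevel=2,
            )
        self.n_clusters = n_clusters
        self.init_size = init_size
```

Existing callers passing inertia_check=... keep working with a warning instead of a TypeError, and the kwarg can be dropped in the following release.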

@tarang-jain tarang-jain added breaking Introduces a breaking change and removed non-breaking Introduces a non-breaking change labels Apr 22, 2026
Member

@dantegd dantegd left a comment

This PR improves things nicely; the code gets a significant cleanup. Love to see this!

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
Comment on lines +857 to +866
DataT curClusteringCost = DataT{0};
raft::copy(&curClusteringCost, clustering_cost.data_handle(), 1, stream);
raft::resource::sync_stream(handle, stream);

if (curClusteringCost == DataT{0}) {
RAFT_LOG_WARN("Zero clustering cost detected: all points coincide with their centroids.");
} else if (n_current_iter > 1) {
DataT delta = curClusteringCost / priorClusteringCost;
if (delta > 1 - iter_params.tol) done = true;
}
Member

Seems that this block was gated by params.inertia_check, which defaulted to false; now it runs unconditionally, which forces a device-to-host copy on every iteration. Is this change intentional? I think we could avoid this copy, or make it conditional, or at least explicitly mention this change if it's intended.

Contributor Author

This was intended. According to my discussion with @cjnolet, we should always be early stopping. Yes, this does mean we now always have to wait for one iteration to complete before starting the next, because the CPU needs to know whether to proceed with another iteration. Keeping this thread open for discussion if others have input.
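The convergence test in the block above amounts to the following (a sketch of the assumed semantics; note that in the real code the zero-cost branch only logs a warning and does not set done):

```python
import logging

def check_convergence(prior_cost: float, cur_cost: float,
                      tol: float, iteration: int) -> bool:
    """Return True when the relative improvement in clustering cost
    drops below tol, i.e. cur / prior > 1 - tol."""
    if cur_cost == 0.0:
        # Perfect fit (e.g. spectral clustering collapsing points onto
        # the centroids); warn instead of asserting, keep iterating.
        logging.warning("Zero clustering cost detected")
        return False
    if iteration > 1:
        return cur_cost / prior_cost > 1.0 - tol
    return False
```

The device-to-host copy of the cost scalar is what makes each iteration wait on the previous one, as discussed above.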

Comment on lines +718 to +736
comm.allreduce(&(clusterCostD.data()->value),
&(clusterCostD.data()->value),
1,
raft::comms::op_t::SUM,
stream);

DataT curClusteringCost = 0;
raft::copy(handle,
raft::make_host_scalar_view(&curClusteringCost),
raft::make_device_scalar_view(&(clusterCostD.data()->value)));

ASSERT(comm.sync_stream(stream) == raft::comms::status_t::SUCCESS,
"An error occurred in the distributed operation. This can result "
"from a failed rank");
if (curClusteringCost == (DataT)0.0) {
RAFT_LOG_WARN("Zero clustering cost detected: all points coincide with their centroids.");
} else if (n_iter[0] > 1) {
DataT delta = curClusteringCost / priorClusteringCost;
if (delta > 1 - params.tol) done = true;
Member

We have the same behavior change here, with this now running unconditionally, so we need to do the same thing as in the single-GPU change I mentioned before.

Contributor Author

@viclafargue will be taking care of the iteration loop in kmeans_mg in #2017

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
run_kmeanspp(X);
} else {
IndexT default_init_size =
std::min(static_cast<IndexT>(std::int64_t{3} * n_clusters), n_samples);
Member

The reasoning for the change makes sense to me, but we should add a log/warning to alert users of the behavior change, which will otherwise be silent for users that don't set init_size, and add it to the param doc in kmeans.hpp.

Comment thread cpp/src/cluster/detail/kmeans.cuh
Comment on lines +879 to +906
auto centroids_const = raft::make_device_matrix_view<const DataT, IndexT>(
cur_centroids_ptr, n_clusters, n_features);

iter_inertia = DataT{0};
data_batches.reset();
weight_batches.reset();
auto wt_it = weight_batches.begin();
for (const auto& data_batch : data_batches) {
IndexT cur_batch_size = static_cast<IndexT>(data_batch.size());
const auto& wt_batch = *wt_it;
++wt_it;

auto batch_data_view = raft::make_device_matrix_view<const DataT, IndexT>(
data_batch.data(), cur_batch_size, n_features);

std::optional<raft::device_vector_view<const DataT, IndexT>> batch_sw = std::nullopt;
if (weight_ptr != nullptr) { batch_sw = prepare_batch_weights(wt_batch, cur_batch_size); }

DataT batch_cost = DataT{0};
cuvs::cluster::kmeans::cluster_cost(handle,
batch_data_view,
centroids_const,
raft::make_host_scalar_view(&batch_cost),
batch_sw);

iter_inertia += batch_cost;
}
}
Member

Now that clustering_cost is accumulated on device every iteration inside the Lloyd loop, the final accepted iteration's clustering_cost is exactly the inertia to report, no?

This extra post-loop batched pass (one full read of X per n_init) duplicates work that was just done, if I'm not mistaken. If so, consider copying the last clustering_cost scalar to iter_inertia.

Contributor Author

The final inertia was already being computed before this PR, with a cuvs::cluster::kmeans::cluster_cost call. The thing is that the centroids are modified at the end, after the inertia computation, so one extra pass with the final centroids is needed.

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
Comment thread cpp/src/cluster/detail/kmeans_common.cuh
params.inertia_check = true;
params.max_iter = 20;
params.init = cuvs::cluster::kmeans::params::Array;
params.max_iter = 20;
Member

A few new tests could be useful to consider:

  • a test that runs fit with host input and device input on the same data/seed and asserts bit-identical centroids/inertia
  • a test that exercises init_size (including 0 default and an explicit value)
  • a test for the zero-cost warning path (e.g., n_samples == n_clusters with distinct points), to lock down the new behavior

Contributor Author

The first one (bit-identical centroids and inertia with streaming host data and data on device) was already being tested.
I added tests for init_size and zero-cost paths as well as n_init > 0.

self.params.batch_centroids = batch_centroids
if inertia_check is not None:
self.params.inertia_check = inertia_check
if init_size is not None:
Member

Existing Python code that passes inertia_check=... will now raise TypeError. We should consider accepting it and emitting a deprecation warning for one release, then removing it. The same applies to the docstring update.

Contributor Author

Because the C API still has it, this won't raise an error, but I added a deprecation warning.

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
cpp/src/cluster/detail/kmeans.cuh (1)

591-602: ⚠️ Potential issue | 🟠 Major

Missing wt_sum > 0 guard in kmeans_fit; inconsistent with kmeans_predict.

If the caller passes a sample_weight whose elements all sum to 0, wt_sum == 0 here, needs_wt_rescale becomes true (since |0 - n_samples| > wt_rel_tol), and prepare_batch_weights at lines 716-720 applies div_const_op<DataT>{wt_sum} → every batch weight becomes inf/NaN. Because wt_sum is captured once and reused for every batch of every Lloyd iteration across all n_init seeds, the corruption silently propagates into centroid_sums / weight_per_cluster and the final centroids, with no diagnostic.

Note that the peer kmeans_predict function already guards this at Line 1015 with RAFT_EXPECTS(wt_sum > DataT{0}, ...) — please apply the same guard here for consistent behavior between training and inference.

🛡️ Proposed fix
   DataT wt_sum           = sample_weight.has_value() ? weightSum(handle, sample_weight.value())
                                                      : static_cast<DataT>(n_samples);
+  RAFT_EXPECTS(wt_sum > DataT{0},
+               "KMeans: sum of sample weights must be strictly positive");
   const DataT wt_rel_tol = n_samples * std::numeric_limits<DataT>::epsilon();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/cluster/detail/kmeans.cuh` around lines 591 - 602, In kmeans_fit,
guard against a zero total sample weight like kmeans_predict does: after
computing wt_sum (and before using it / before setting needs_wt_rescale and
calling prepare_batch_weights), add a precondition check (e.g.,
RAFT_EXPECTS(wt_sum > DataT{0}, "KMeans: sample_weight must sum to > 0")) so
that wt_sum==0 is rejected early; this prevents prepare_batch_weights (and the
div_const_op captured wt_sum) from producing inf/NaN and keeps behavior
consistent with kmeans_predict.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@cpp/src/cluster/detail/kmeans.cuh`:
- Around line 591-602: In kmeans_fit, guard against a zero total sample weight
like kmeans_predict does: after computing wt_sum (and before using it / before
setting needs_wt_rescale and calling prepare_batch_weights), add a precondition
check (e.g., RAFT_EXPECTS(wt_sum > DataT{0}, "KMeans: sample_weight must sum to
> 0")) so that wt_sum==0 is rejected early; this prevents prepare_batch_weights
(and the div_const_op captured wt_sum) from producing inf/NaN and keeps behavior
consistent with kmeans_predict.
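Taken together with the overflow-safe rescale mentioned in the PR description, the guard would behave like this sketch (a hypothetical Python stand-in for the device-side ops):

```python
def normalized_weights(weights, n_samples):
    """Validate the total weight, then rescale each weight so the sum
    equals n_samples. The rescale is composed per element as
    (w / wt_sum) * n_samples instead of precomputing n_samples / wt_sum,
    so a tiny wt_sum cannot overflow the scale to inf."""
    wt_sum = sum(weights)
    if not wt_sum > 0:
        raise ValueError("sum of sample weights must be strictly positive")
    return [(w / wt_sum) * n_samples for w in weights]
```

Rejecting wt_sum <= 0 up front is what keeps the captured divisor from ever producing inf/NaN in the per-batch path.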

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ec785235-cad6-4fe9-9a5c-1a941f244525

📥 Commits

Reviewing files that changed from the base of the PR and between 410092c and c515c1e.

📒 Files selected for processing (5)
  • cpp/include/cuvs/cluster/kmeans.hpp
  • cpp/src/cluster/detail/kmeans.cuh
  • cpp/src/cluster/detail/kmeans_common.cuh
  • cpp/src/cluster/detail/kmeans_mg.cuh
  • python/cuvs/cuvs/cluster/kmeans/kmeans.pyx
🚧 Files skipped from review as they are similar to previous changes (2)
  • python/cuvs/cuvs/cluster/kmeans/kmeans.pyx
  • cpp/src/cluster/detail/kmeans_common.cuh

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/cluster/detail/kmeans_common.cuh`:
- Around line 144-165: The function weightSum currently only checks the final
wt_sum > 0 but must validate each weight is finite and non-negative; update
weightSum to (1) in the host path iterate weights(i) and RAFT_EXPECTS each value
is finite and >= 0 before summing, and (2) in the device path add a device-side
reduction/check (e.g., use a device kernel/thrust/raft primitive) to verify all
weight.data_handle() elements are finite and >= 0 (and return an error via
RAFT_EXPECTS if any fail) before performing raft::linalg::mapThenSumReduce; keep
the final positive-sum RAFT_EXPECTS message but add a specific error for invalid
per-sample weights.

In `@cpp/src/cluster/detail/kmeans.cuh`:
- Around line 703-714: The code currently dereferences the weight_batches
iterator (e.g., using const auto& wt_batch = *wt_it) even when weight_ptr ==
nullptr; move all iterator access of weight_batches (and any wt_it / wt_batch
dereferences) inside branches guarded by weight_ptr != nullptr and, for the
unweighted case, use the prefilled batch_weights_buf produced by
raft::matrix::fill and prepare_batch_weights lambda instead; update the loops
around the weight handling (the weight_batches iterator creation,
prepare_batch_weights lambda, and the loops that currently dereference wt_it —
including the similar sections at the other occurrences referenced in the
comment) so that when weight_ptr is null you never call *wt_it and always
operate on the filled batch_weights_buf.
- Around line 481-495: The reclustering block currently constructs a fresh
cuvs::cluster::kmeans::params (recluster_params) which loses caller settings
like metric, max_iter, tol, batch_samples, batch_centroids and verbosity; change
it to start from the existing params (copy params into recluster_params) and
then only override the fields that must differ (n_clusters, init, n_init) so
that when you call cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(...)
the original metric, max_iter, tol, batch_samples, batch_centroids and verbosity
are preserved during the weighted reclustering step.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 38e8e433-d81d-440d-b687-b4bfce7012bd

📥 Commits

Reviewing files that changed from the base of the PR and between e8e63ab and 30c457c.

📒 Files selected for processing (2)
  • cpp/src/cluster/detail/kmeans.cuh
  • cpp/src/cluster/detail/kmeans_common.cuh

Comment on lines +144 to +165
template <typename DataT, typename IndexT, typename Accessor>
DataT weightSum(
raft::resources const& handle,
raft::mdspan<const DataT, raft::vector_extent<IndexT>, raft::layout_right, Accessor> weight)
{
cudaStream_t stream = raft::resource::get_cuda_stream(handle);
auto wt_aggr = raft::make_device_scalar<DataT>(handle, 0);
auto n_samples = weight.extent(0);

size_t temp_storage_bytes = 0;
RAFT_CUDA_TRY(cub::DeviceReduce::Sum(
nullptr, temp_storage_bytes, weight.data_handle(), wt_aggr.data_handle(), n_samples, stream));

workspace.resize(temp_storage_bytes, stream);

RAFT_CUDA_TRY(cub::DeviceReduce::Sum(workspace.data(),
temp_storage_bytes,
weight.data_handle(),
wt_aggr.data_handle(),
n_samples,
stream));
DataT wt_sum = 0;
raft::copy(handle,
raft::make_host_scalar_view(&wt_sum),
raft::make_device_scalar_view(wt_aggr.data_handle()));
raft::resource::sync_stream(handle, stream);

if (wt_sum != n_samples) {
RAFT_LOG_DEBUG(
"[Warning!] KMeans: normalizing the user provided sample weight to "
"sum up to %d samples",
n_samples);

auto scale = static_cast<DataT>(n_samples) / wt_sum;
raft::linalg::map(
handle, weight, raft::mul_const_op<DataT>{scale}, raft::make_const_mdspan(weight));
auto n_samples = weight.extent(0);

DataT wt_sum = DataT{0};
if constexpr (raft::is_device_mdspan_v<decltype(weight)>) {
auto stream = raft::resource::get_cuda_stream(handle);
auto d_wt_sum = raft::make_device_scalar<DataT>(handle, DataT{0});
raft::linalg::mapThenSumReduce(
d_wt_sum.data_handle(), n_samples, raft::identity_op{}, stream, weight.data_handle());
raft::copy(&wt_sum, d_wt_sum.data_handle(), 1, stream);
raft::resource::sync_stream(handle);
} else {
for (IndexT i = 0; i < n_samples; ++i) {
wt_sum += weight(i);
}
}
RAFT_EXPECTS(wt_sum > DataT{0}, "invalid parameter (sum of sample weights must be positive)");
return wt_sum;

⚠️ Potential issue | 🟠 Major

Reject negative sample weights, not just non-positive totals.

wt_sum > 0 still allows inputs like [-1, 2], which can make weight_per_cluster, centroids, and inertia invalid downstream. Please validate each weight is finite/non-negative before returning the sum.

🛡️ Suggested direction
   DataT wt_sum = DataT{0};
+  bool has_invalid_weight = false;
   if constexpr (raft::is_device_mdspan_v<decltype(weight)>) {
     auto stream   = raft::resource::get_cuda_stream(handle);
     auto d_wt_sum = raft::make_device_scalar<DataT>(handle, DataT{0});
+    auto d_invalid_count = raft::make_device_scalar<int>(handle, int{0});
     raft::linalg::mapThenSumReduce(
       d_wt_sum.data_handle(), n_samples, raft::identity_op{}, stream, weight.data_handle());
+    raft::linalg::mapThenSumReduce(
+      d_invalid_count.data_handle(),
+      n_samples,
+      [] __device__(DataT w) { return !(w >= DataT{0}) ? int{1} : int{0}; },
+      stream,
+      weight.data_handle());
     raft::copy(&wt_sum, d_wt_sum.data_handle(), 1, stream);
+    int invalid_count = 0;
+    raft::copy(&invalid_count, d_invalid_count.data_handle(), 1, stream);
     raft::resource::sync_stream(handle);
+    has_invalid_weight = invalid_count != 0;
   } else {
     for (IndexT i = 0; i < n_samples; ++i) {
+      has_invalid_weight |= !(weight(i) >= DataT{0});
       wt_sum += weight(i);
     }
   }
+  RAFT_EXPECTS(!has_invalid_weight, "invalid parameter (sample weights must be non-negative)");
   RAFT_EXPECTS(wt_sum > DataT{0}, "invalid parameter (sum of sample weights must be positive)");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/cluster/detail/kmeans_common.cuh` around lines 144 - 165, The
function weightSum currently only checks the final wt_sum > 0 but must validate
each weight is finite and non-negative; update weightSum to (1) in the host path
iterate weights(i) and RAFT_EXPECTS each value is finite and >= 0 before
summing, and (2) in the device path add a device-side reduction/check (e.g., use
a device kernel/thrust/raft primitive) to verify all weight.data_handle()
elements are finite and >= 0 (and return an error via RAFT_EXPECTS if any fail)
before performing raft::linalg::mapThenSumReduce; keep the final positive-sum
RAFT_EXPECTS message but add a specific error for invalid per-sample weights.

Comment on lines +481 to +495
cuvs::cluster::kmeans::params recluster_params;
recluster_params.n_clusters = params.n_clusters;
recluster_params.init = cuvs::cluster::kmeans::params::InitMethod::Array;
recluster_params.n_init = 1;

auto weight_opt = std::make_optional(raft::make_const_mdspan(weight.view()));
cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(
handle,
recluster_params,
raft::make_const_mdspan(potentialCentroids),
weight_opt,
centroidsRawData,
inertia.view(),
n_iter.view(),
std::ref(workspace));

⚠️ Potential issue | 🟠 Major

Preserve user parameters during KMeans|| reclustering.

Starting from default kmeans::params drops caller settings such as metric, max_iter, tol, batch_samples, batch_centroids, and verbosity during the weighted reclustering step.

🐛 Proposed fix
-    cuvs::cluster::kmeans::params recluster_params;
-    recluster_params.n_clusters = params.n_clusters;
-    recluster_params.init       = cuvs::cluster::kmeans::params::InitMethod::Array;
-    recluster_params.n_init     = 1;
+    cuvs::cluster::kmeans::params recluster_params = params;
+    recluster_params.init                          = cuvs::cluster::kmeans::params::InitMethod::Array;
+    recluster_params.n_init                        = 1;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cuvs::cluster::kmeans::params recluster_params;
recluster_params.n_clusters = params.n_clusters;
recluster_params.init = cuvs::cluster::kmeans::params::InitMethod::Array;
recluster_params.n_init = 1;
auto weight_opt = std::make_optional(raft::make_const_mdspan(weight.view()));
cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(
handle,
recluster_params,
raft::make_const_mdspan(potentialCentroids),
weight_opt,
centroidsRawData,
inertia.view(),
n_iter.view(),
std::ref(workspace));
cuvs::cluster::kmeans::params recluster_params = params;
recluster_params.init = cuvs::cluster::kmeans::params::InitMethod::Array;
recluster_params.n_init = 1;
auto weight_opt = std::make_optional(raft::make_const_mdspan(weight.view()));
cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(
handle,
recluster_params,
raft::make_const_mdspan(potentialCentroids),
weight_opt,
centroidsRawData,
inertia.view(),
n_iter.view(),
std::ref(workspace));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/cluster/detail/kmeans.cuh` around lines 481 - 495, The reclustering
block currently constructs a fresh cuvs::cluster::kmeans::params
(recluster_params) which loses caller settings like metric, max_iter, tol,
batch_samples, batch_centroids and verbosity; change it to start from the
existing params (copy params into recluster_params) and then only override the
fields that must differ (n_clusters, init, n_init) so that when you call
cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(...) the original
metric, max_iter, tol, batch_samples, batch_centroids and verbosity are
preserved during the weighted reclustering step.
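The fix the review asks for is just copy-then-override. In Python terms (a hypothetical dataclass mirroring a few of the C++ params fields):

```python
import dataclasses

@dataclasses.dataclass
class ParamsSketch:
    # Hypothetical mirror of a subset of cuvs::cluster::kmeans::params.
    n_clusters: int = 8
    max_iter: int = 300
    tol: float = 1e-4
    init: str = "KMeansPlusPlus"
    n_init: int = 1

def recluster_params_from(params: ParamsSketch) -> ParamsSketch:
    # Start from the caller's params so max_iter/tol/etc. survive,
    # then override only what the reclustering step requires.
    return dataclasses.replace(params, init="Array", n_init=1)
```

In the C++ code the same effect is a copy-initialization of recluster_params from params followed by the two overrides.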

Comment thread cpp/src/cluster/detail/kmeans.cuh
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
cpp/src/cluster/detail/kmeans.cuh (1)

481-495: ⚠️ Potential issue | 🟠 Major

Preserve caller parameters during KMeans|| reclustering.

recluster_params starts from defaults, so weighted reclustering drops caller settings such as metric, max_iter, tol, batching, and verbosity. Start from params and only override the fields required for reclustering. This matches an earlier unresolved review thread.

🐛 Proposed fix
-    cuvs::cluster::kmeans::params recluster_params;
-    recluster_params.n_clusters = params.n_clusters;
-    recluster_params.init       = cuvs::cluster::kmeans::params::InitMethod::Array;
-    recluster_params.n_init     = 1;
+    cuvs::cluster::kmeans::params recluster_params = params;
+    recluster_params.n_clusters                    = params.n_clusters;
+    recluster_params.init = cuvs::cluster::kmeans::params::InitMethod::Array;
+    recluster_params.n_init = 1;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/cluster/detail/kmeans.cuh` around lines 481 - 495, recluster_params
is being constructed from defaults which drops caller-specified options (metric,
max_iter, tol, batching, verbosity, etc.); instead initialize recluster_params
from the incoming params (e.g., copy params into recluster_params) and then only
override the fields required for reclustering: set recluster_params.n_clusters =
params.n_clusters, recluster_params.init =
cuvs::cluster::kmeans::params::InitMethod::Array, and recluster_params.n_init =
1 before calling cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(...);
keep the weight handling (weight_opt) and the rest of the arguments unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/cluster/detail/kmeans.cuh`:
- Around line 741-759: The cosine norm convention is inconsistent:
compute_batch_norms applies raft::sqrt_op for DistanceType::CosineExpanded while
centroid norms are computed without sqrt; either make both use the same
convention or disallow CosineExpanded here. Fix by updating the centroid norm
computation (the code that computes norms for centroids/centroid_norms) to also
use raft::sqrt_op when metric == cuvs::distance::DistanceType::CosineExpanded,
or alternatively add an early check before the batching loop to throw/return if
metric == CosineExpanded; ensure you reference compute_batch_norms, L2NormBatch,
and the centroid norm computation path so both sides use identical sqrt behavior
or the metric is rejected consistently.
- Around line 622-637: The KMeans++ host init can proceed with fewer samples
than n_clusters when params.init_size > 0, causing degenerate seeding; before
allocating init_sample in kmeans.cuh, add a guard that if pams.init_size > 0 and
init_sample_size < n_clusters you reject early by throwing a clear exception
(e.g., throw std::invalid_argument) with a message referencing params.init_size
and required n_clusters; place this check just before the call to
raft::make_device_matrix<DataT, IndexT>(handle, init_sample_size, n_features) so
the logic using init_sample never runs with too few rows.
- Around line 651-654: The Random init path constructs a new RngState from only
the seed, losing any non-default RNG type; instead, construct the RNG using
iter_params.rng_state (preserving its type) like the KMeans++ paths do, and then
pass that RNG into raft::matrix::sample_rows when sampling rows into
centroidsRawData from X (i.e., replace raft::random::RngState
random_state(iter_params.rng_state.seed) with constructing/initializing RngState
from iter_params.rng_state so sample_rows uses the configured rng type).

---

Duplicate comments:
In `@cpp/src/cluster/detail/kmeans.cuh`:
- Around line 481-495: recluster_params is being constructed from defaults which
drops caller-specified options (metric, max_iter, tol, batching, verbosity,
etc.); instead initialize recluster_params from the incoming params (e.g., copy
params into recluster_params) and then only override the fields required for
reclustering: set recluster_params.n_clusters = params.n_clusters,
recluster_params.init = cuvs::cluster::kmeans::params::InitMethod::Array, and
recluster_params.n_init = 1 before calling
cuvs::cluster::kmeans::detail::kmeans_fit<DataT, IndexT>(...); keep the weight
handling (weight_opt) and the rest of the arguments unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 44e402a4-4ebe-4e58-be48-8589e58d62cc

📥 Commits

Reviewing files that changed from the base of the PR and between 30c457c and 269f23c.

📒 Files selected for processing (2)
  • cpp/src/cluster/detail/kmeans.cuh
  • python/cuvs/cuvs/cluster/kmeans/kmeans.pyx

Comment on lines +622 to +637
IndexT default_init_size =
std::min(static_cast<IndexT>(std::int64_t{3} * n_clusters), n_samples);
IndexT init_sample_size = pams.init_size > 0
? std::min(static_cast<IndexT>(pams.init_size), n_samples)
: default_init_size;

if (pams.init_size <= 0 && init_sample_size < n_samples) {
RAFT_LOG_WARN(
"KMeans.fit: KMeans++ initialization is using a random subsample of %zu/%zu host rows "
"(params.init_size=0 defaults to min(3 * n_clusters, n_samples) for host data). "
"Set params.init_size to n_samples to use the full dataset for seeding.",
static_cast<size_t>(init_sample_size),
static_cast<size_t>(n_samples));
}

auto centroidsRawData = raft::make_device_matrix<DataT, IndexT>(handle, n_clusters, n_features);
init_sample = raft::make_device_matrix<DataT, IndexT>(handle, init_sample_size, n_features);

⚠️ Potential issue | 🟠 Major

Reject KMeans++ init samples smaller than n_clusters.

When params.init_size > 0 but less than n_clusters, host KMeans++ samples too few rows and then tries to seed n_clusters centroids from that smaller sample. Reject this early to avoid degenerate duplicate/zero-probability seeding.

🐛 Proposed fix
       IndexT init_sample_size = pams.init_size > 0
                                   ? std::min(static_cast<IndexT>(pams.init_size), n_samples)
                                   : default_init_size;
+      RAFT_EXPECTS(init_sample_size >= n_clusters,
+                   "invalid parameter (init_size must be 0 or >= n_clusters)");
 
       if (pams.init_size <= 0 && init_sample_size < n_samples) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/cluster/detail/kmeans.cuh` around lines 622-637: the KMeans++ host
init can proceed with fewer samples than n_clusters when params.init_size > 0,
causing degenerate seeding. Before allocating init_sample in kmeans.cuh, add a
guard that rejects early (e.g., throw std::invalid_argument) when
pams.init_size > 0 and init_sample_size < n_clusters, with a clear message
referencing params.init_size and the required n_clusters. Place this check
just before the call to raft::make_device_matrix<DataT, IndexT>(handle,
init_sample_size, n_features) so the logic using init_sample never runs with
too few rows.
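The fallback-then-clamp logic above, plus the guard the review proposes, can be sketched as one standalone helper (hypothetical function name, plain C++ stand-in for the IndexT arithmetic in kmeans.cuh):

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>

// init_size == 0 falls back to min(3 * n_clusters, n_samples); a positive
// init_size is clamped to n_samples. The guard suggested in the review then
// rejects samples too small to seed n_clusters distinct centroids.
std::int64_t resolve_init_sample_size(std::int64_t init_size,
                                      std::int64_t n_clusters,
                                      std::int64_t n_samples)
{
  std::int64_t default_init_size = std::min(3 * n_clusters, n_samples);
  std::int64_t resolved =
    init_size > 0 ? std::min(init_size, n_samples) : default_init_size;
  if (resolved < n_clusters) {
    throw std::invalid_argument("init_size must be 0 or >= n_clusters");
  }
  return resolved;
}
```

With init_size = 0, n_clusters = 100, and n_samples = 10000, this resolves to 300; an explicit init_size of 50 for 100 clusters is rejected rather than producing duplicate or zero-probability seeds.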

Comment on lines +651 to +654
    raft::random::RngState random_state(iter_params.rng_state.seed);

    if (iter_params.init == cuvs::cluster::kmeans::params::InitMethod::Random) {
      raft::matrix::sample_rows(handle, random_state, X, centroidsRawData);

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Inspect RngState construction patterns around KMeans initialization.
rg -n -C3 'RngState random_state|RngState rng' --iglob '*kmeans*.cuh' --iglob '*kmeans*.cu'

Repository: rapidsai/cuvs

Length of output: 3467


🏁 Script executed:

#!/bin/bash
# Check RngState constructor signature
find . -type f \( -name "*.hpp" -o -name "*.h" -o -name "*.cuh" \) | xargs grep -l "RngState" | head -5

Repository: rapidsai/cuvs

Length of output: 266


🏁 Script executed:

#!/bin/bash
# Find and inspect RngState class definition
fd -e hpp -e h "rng" | xargs grep -l "class RngState" 2>/dev/null | head -3

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Look at the context around line 651 to understand iter_params
sed -n '640,665p' cpp/src/cluster/detail/kmeans.cuh

Repository: rapidsai/cuvs

Length of output: 1307


🏁 Script executed:

#!/bin/bash
# Search for RngState class definition in raft
find . -type f \( -name "*.hpp" -o -name "*.cuh" \) -path "*/raft/*" | xargs grep -l "class RngState" 2>/dev/null | head -5

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Look for RngState constructor definition
rg "class RngState|struct RngState" -A 20 --type cpp --type cuh 2>/dev/null | head -60

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Check the rng_state struct in kmeans params
rg "struct.*RngState|class.*RngState" -B 2 -A 10 cpp/include/cuvs/cluster/kmeans.hpp

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Check the rng_state field in kmeans params and its type
rg "rng_state" cpp/include/cuvs/cluster/kmeans.hpp -B 2 -A 2

Repository: rapidsai/cuvs

Length of output: 294


🏁 Script executed:

#!/bin/bash
# Check what RngState is and its definition location
rg "include.*random" cpp/src/cluster/detail/kmeans.cuh | head -20

Repository: rapidsai/cuvs

Length of output: 105


🏁 Script executed:

#!/bin/bash
# Verify the context of all RngState constructions to see the pattern
rg -n "RngState.*seed.*type" cpp/src/cluster/detail/kmeans.cuh

Repository: rapidsai/cuvs

Length of output: 216


🏁 Script executed:

#!/bin/bash
# Check the RngState default member initialization to see what type it has
rg "RngState.*rng_state" cpp/include/cuvs/cluster/kmeans.hpp -B 3 -A 3

Repository: rapidsai/cuvs

Length of output: 232


🏁 Script executed:

#!/bin/bash
# Look at how sample_rows is called with random_state to understand expectations
rg -n "sample_rows.*random_state" cpp/src/cluster/detail/kmeans.cuh -B 2 -A 2

Repository: rapidsai/cuvs

Length of output: 639


🏁 Script executed:

#!/bin/bash
# Check if there are any other places where Random init path is used
rg -n "InitMethod::Random" cpp/src/cluster/detail/kmeans.cuh -B 5 -A 5

Repository: rapidsai/cuvs

Length of output: 1171


Preserve the configured RNG type for row sampling.

This constructs RngState from the seed only, unlike the KMeans++ and KMeans++ subsampling paths at lines 144 and 316 that preserve params.rng_state.type. Non-default RNG configurations will be ignored for the Random initialization method.

Proposed fix
-    raft::random::RngState random_state(iter_params.rng_state.seed);
+    raft::random::RngState random_state(iter_params.rng_state.seed, iter_params.rng_state.type);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    raft::random::RngState random_state(iter_params.rng_state.seed);
+    raft::random::RngState random_state(iter_params.rng_state.seed, iter_params.rng_state.type);
     if (iter_params.init == cuvs::cluster::kmeans::params::InitMethod::Random) {
       raft::matrix::sample_rows(handle, random_state, X, centroidsRawData);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/cluster/detail/kmeans.cuh` around lines 651-654: the Random init
path constructs a new RngState from only the seed, losing any non-default RNG
type. Instead, construct the RNG from iter_params.rng_state (preserving its
type), as the KMeans++ paths do, and pass that RNG into
raft::matrix::sample_rows when sampling rows from X into centroidsRawData;
i.e., replace raft::random::RngState random_state(iter_params.rng_state.seed)
with a construction that also forwards iter_params.rng_state.type so
sample_rows uses the configured RNG type.
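A minimal mock illustrating why the seed-only constructor loses configuration. The struct below is an illustrative stand-in, not raft's actual raft::random::RngState definition; the generator-type names are assumptions:

```cpp
#include <cstdint>

// Illustrative stand-in for a generator-type enum and an RngState that,
// like the real one, defaults the type when constructed from a seed alone.
enum class GeneratorType { PhiloxGenerator, PCGenerator };

struct RngState {
  std::uint64_t seed;
  GeneratorType type;
  // Seed-only constructor silently falls back to the default generator --
  // this is how the Random init path drops a non-default configuration.
  explicit RngState(std::uint64_t s) : seed(s), type(GeneratorType::PhiloxGenerator) {}
  RngState(std::uint64_t s, GeneratorType t) : seed(s), type(t) {}
};

// The fix from the review: forward both seed and type instead of seed alone.
RngState make_sampling_rng(const RngState& configured)
{
  return RngState(configured.seed, configured.type);
}
```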

Comment thread cpp/src/cluster/detail/kmeans.cuh Outdated
Comment on lines +741 to +759
  bool need_compute_norms = metric == cuvs::distance::DistanceType::L2Expanded ||
                            metric == cuvs::distance::DistanceType::L2SqrtExpanded ||
                            metric == cuvs::distance::DistanceType::CosineExpanded;
  bool use_norm_cache = need_compute_norms && !data_on_device;
  std::vector<DataT> h_norm_cache;
  if (use_norm_cache) { h_norm_cache.resize(n_samples); }
  bool norms_cached = false;

  auto compute_batch_norms = [&](const DataT* batch_ptr, IndexT batch_size) {
    auto batch_view =
      raft::make_device_matrix_view<const DataT, IndexT>(batch_ptr, batch_size, n_features);
    auto norm_view =
      raft::make_device_vector_view<DataT, IndexT>(L2NormBatch.data_handle(), batch_size);
    if (metric == cuvs::distance::DistanceType::CosineExpanded) {
      raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
        handle, batch_view, norm_view, raft::sqrt_op{});
    } else {
      raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
        handle, batch_view, norm_view);

⚠️ Potential issue | 🟠 Major

Make cosine norm handling consistent or reject cosine here.

compute_batch_norms uses sqrt_op for CosineExpanded, but centroid norms are computed without sqrt_op. If cosine is supported in this path, centroid norms should use the same convention; if not, reject CosineExpanded before entering the loop.

🐛 Proposed fix if cosine is intended to be supported
       if (need_compute_norms) {
-        raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
-          handle, centroids_const, centroid_norms_buf.view());
+        if (metric == cuvs::distance::DistanceType::CosineExpanded) {
+          raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
+            handle, centroids_const, centroid_norms_buf.view(), raft::sqrt_op{});
+        } else {
+          raft::linalg::norm<raft::linalg::L2Norm, raft::Apply::ALONG_ROWS>(
+            handle, centroids_const, centroid_norms_buf.view());
+        }
         centroid_norms_opt = raft::make_device_vector_view<const DataT, IndexT>(
           centroid_norms_buf.data_handle(), n_clusters);
       }

Also applies to: 802-809

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/cluster/detail/kmeans.cuh` around lines 741-759: the cosine norm
convention is inconsistent. compute_batch_norms applies raft::sqrt_op for
DistanceType::CosineExpanded while centroid norms are computed without sqrt;
either make both use the same convention or disallow CosineExpanded here. Fix
by updating the centroid norm computation to also use raft::sqrt_op when
metric == cuvs::distance::DistanceType::CosineExpanded, or add an early check
before the batching loop that rejects metric == CosineExpanded. Either way,
compute_batch_norms, L2NormBatch, and the centroid norm path must use
identical sqrt behavior, or the metric must be rejected consistently.
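A host-side stand-in showing the two norm conventions in play. The review's point is that batch norms and centroid norms must pick the same one; the function name below is illustrative, not part of the cuvs API:

```cpp
#include <cmath>
#include <vector>

// Expanded L2 distances consume squared row norms, while the cosine path
// wants true L2 norms (hence sqrt_op in compute_batch_norms). If the data
// side takes the sqrt and the centroid side does not, the resulting
// distances mix norm^2 with norm -- the inconsistency the review flags.
double row_norm(const std::vector<double>& row, bool take_sqrt)
{
  double sq = 0.0;
  for (double v : row) { sq += v * v; }
  return take_sqrt ? std::sqrt(sq) : sq;
}
```

For the row {3, 4} the squared convention yields 25 and the sqrt convention yields 5; whichever one a metric uses, both sides of the distance computation must use it.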

@tarang-jain tarang-jain requested a review from dantegd April 24, 2026 21:51

  for (n_current_iter = 1; n_current_iter <= iter_params.max_iter; ++n_current_iter) {
    if (n_current_iter > 1) {
      RAFT_CUDA_TRY(cudaEventSynchronize(convergence_event));
Contributor Author

The convergence event is recorded right after the inertia is computed, but we check it here, at the top of the next iteration, so that the work launched after the convergence check is not blocked. Using an event also means only the operations enqueued before the event must complete for the synchronization to return.
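The deferred check can be sketched on the host with a plain array standing in for the CUDA event (illustrative names; per-iteration inertia values are taken as given):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Iteration i's inertia is "recorded" at the end of that iteration
// (cudaEventRecord analogue) but only examined at the top of iteration
// i + 1 (cudaEventSynchronize analogue), so work queued after the check
// never stalls on the inertia reduction.
int run_until_converged(const std::vector<double>& inertia_per_iter, double tol)
{
  int iters = 0;
  for (std::size_t i = 0; i < inertia_per_iter.size(); ++i) {
    // Loop-head check: compare the two most recently *completed* inertia
    // values before starting iteration i's work.
    if (i >= 2 &&
        std::abs(inertia_per_iter[i - 1] - inertia_per_iter[i - 2]) < tol) {
      break;
    }
    // ... iteration i's assignment + centroid update would run here;
    // its inertia (inertia_per_iter[i]) is recorded at the end ...
    ++iters;
  }
  return iters;
}
```

The trade-off, as the author notes, is overlap: convergence is detected one iteration late, in exchange for never serializing the stream on the check.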


Labels

breaking (Introduces a breaking change), cpp, improvement (Improves an existing functionality)


4 participants