Skip to content

feat(index): parallel segment search with rayon#115

Open
poyrazK wants to merge 6 commits into
mainfrom
feature/autocomplete-suggest
Open

feat(index): parallel segment search with rayon#115
poyrazK wants to merge 6 commits into
mainfrom
feature/autocomplete-suggest

Conversation

@poyrazK

@poyrazK poyrazK commented Jun 11, 2026

Copy link
Copy Markdown
Owner

Summary

  • Add rayon dependency for parallel document scoring in search()
  • Parallelize the scoring hot loop: doc_ids are iterated in parallel, source lookups happen inside each task via &self (avoiding Send+Sync requirement on IndexDocument)
  • Only (score, doc_id) tuples cross thread boundaries; source is looked up sequentially after scoring (O(hits), not O(all_docs))

Changes

  • 2 commits on top of feature/autocomplete-suggest (which is at main-level for the suggest/MLT work)

Summary by CodeRabbit

  • New Features

    • Improved search performance through parallel document scoring.
  • Documentation

    • Added architecture decision record for parallel segment search implementation.
    • Updated product roadmap with Phase 1b vertical scaling initiatives.
  • Tests

    • Added comprehensive benchmark suite measuring search performance across multiple query types and document volumes.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@poyrazK, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 34 minutes. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 32dc2832-20ad-45df-b5cf-b5830291802b

📥 Commits

Reviewing files that changed from the base of the PR and between 4f164da and bb47405.

📒 Files selected for processing (3)
  • rust/Cargo.toml
  • rust/crates/cloudsearch-index/benches/search.rs
  • rust/crates/cloudsearch-index/src/lib.rs
📝 Walkthrough

Walkthrough

This PR implements parallel BM25 document scoring in the search engine using Rayon, parallelizing scoring across document IDs while keeping source lookups sequential. It includes an ADR documenting the approach, workspace dependencies, a comprehensive Criterion benchmark suite, and the core parallel search implementation.

Changes

Parallel Segment Search Using Rayon

Layer / File(s) Summary
ADR and Roadmap
docs/adr/0005-parallel-segment-search.md, docs/roadmap.md
ADR 0005 documents the parallel scoring approach, design constraints, consequences, alternatives considered, and benchmark observations. Roadmap adds "Phase 1b - Vertical Scaling" positioning this work.
Dependencies and Benchmark Infrastructure
rust/Cargo.toml, rust/crates/cloudsearch-index/Cargo.toml, rust/crates/cloudsearch-index/benches/search.rs (setup)
Adds rayon 1.5 and criterion 0.5 to workspace. Configures cloudsearch-index to depend on rayon and criterion (dev). Declares search benchmark target and implements build_index helper for realistic index construction via Tokio single-threaded runtime.
Parallel Search Implementation
rust/crates/cloudsearch-index/src/lib.rs
Refactors IndexHandle::search to parallelize BM25 scoring over doc_ids using rayon's par_iter, filtering expired and MLT-excluded docs in parallel. Sequentially rehydrates scored hits with sources, computes aggregations, applies sorting and pagination, and updates SearchHit construction to compute highlights and sort values.
Search Performance Benchmarks
rust/crates/cloudsearch-index/benches/search.rs (tests)
Criterion benchmarks for MatchAll, term query, multi-field best-fields query, and range filter queries across 1K and 10K document indexes, with black_box hit capture and scaling expectations.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Search as IndexHandle::search
    participant Parallel as Rayon ParallelIterator
    participant Storage as Document Store
    
    Client->>Search: search(query)
    Search->>Parallel: par_iter(doc_ids) filter and score
    Parallel->>Parallel: BM25 scoring
    Parallel->>Search: scores and doc_ids
    Search->>Storage: Sequential source lookup
    Storage->>Search: sources with scores
    Search->>Search: Aggregations, sort, search_after
    Search->>Client: SearchResponse
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • poyrazK/cloudSearch#95: Both PRs modify IndexHandle::search's scoring pipeline; #95 implements end-to-end BM25 relevance scoring, while this PR parallelizes the scoring execution using Rayon.

Poem

🐰 With Rayon's threads, our searches now fly—
Doc scores computed in parallel, oh my!
Sources stay safe on the main thread's domain,
Performance gains flowing like spring's gentle rain. 🚀

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title 'feat(index): parallel segment search with rayon' directly and clearly summarizes the main change: introducing parallel document scoring in search using the rayon library.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/autocomplete-suggest

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Parallelizes the hot loop in search() that scores documents against the
query. Only doc_id (String, Send+Sync) crosses thread boundaries; source
lookups happen inside each parallel task via &self, avoiding the
Send+Sync requirement on IndexDocument (whose serde_json::Value field
contains RefCell).

The scoring phase now uses into_par_iter() on doc_ids, filtering expired
docs and MLT exclusions in parallel, then lookups source sequentially
after scoring (O(hits) not O(all_docs)).
@poyrazK poyrazK force-pushed the feature/autocomplete-suggest branch from f4bc042 to 2f4361f Compare June 11, 2026 18:09
poyrazK added 3 commits June 11, 2026 21:12
Benchmarks for search() across MatchAll, term, multi_match, and range
queries at 1k and 10k document counts.

Run with: cargo bench -p cloudsearch-index
@poyrazK poyrazK changed the base branch from feat/wal-storage-foundation to main June 11, 2026 18:38
poyrazK added a commit that referenced this pull request Jun 11, 2026
poyrazK added a commit that referenced this pull request Jun 11, 2026
@poyrazK poyrazK force-pushed the feature/autocomplete-suggest branch from ccd38de to 04689be Compare June 11, 2026 19:21
poyrazK added a commit that referenced this pull request Jun 11, 2026
@poyrazK poyrazK force-pushed the feature/autocomplete-suggest branch from 04689be to b3dcad0 Compare June 11, 2026 19:23
poyrazK added a commit that referenced this pull request Jun 11, 2026
@poyrazK poyrazK force-pushed the feature/autocomplete-suggest branch from b3dcad0 to 4f164da Compare June 11, 2026 19:31
poyrazK added a commit that referenced this pull request Jun 11, 2026
@poyrazK poyrazK force-pushed the feature/autocomplete-suggest branch from 4f164da to 6ecd339 Compare June 11, 2026 19:38

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
rust/crates/cloudsearch-index/src/lib.rs (1)

1029-1044: ⚡ Quick win

Avoid full source clones inside sort comparator.

Line 1029-1044 clones full documents for each comparison; compare on borrowed sort field values and doc_id instead to keep sort overhead low.

Suggested refactor
-            scored.sort_by(|(_, l_id, l_src), (_, r_id, r_src)| {
-                let lh = SearchHit {
-                    id: l_id.clone(),
-                    source: l_src.clone(),
-                    score: None,
-                    highlight: None,
-                    sort_values: None,
-                };
-                let rh = SearchHit {
-                    id: r_id.clone(),
-                    source: r_src.clone(),
-                    score: None,
-                    highlight: None,
-                    sort_values: None,
-                };
-                compare_hits(&lh, &rh, sort)
-            });
+            scored.sort_by(|(_, l_id, l_src), (_, r_id, r_src)| {
+                let l_val = l_src.get(&sort.field).and_then(comparable_value);
+                let r_val = r_src.get(&sort.field).and_then(comparable_value);
+
+                let ordering = match (l_val, r_val) {
+                    (Some(l), Some(r)) => compare_sort_values(&l, &r, sort),
+                    (None, Some(_)) => std::cmp::Ordering::Greater,
+                    (Some(_), None) => std::cmp::Ordering::Less,
+                    (None, None) => l_id.cmp(r_id),
+                };
+
+                if ordering == std::cmp::Ordering::Equal {
+                    l_id.cmp(r_id)
+                } else {
+                    ordering
+                }
+            });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 1029 - 1044, The
comparator passed to scored.sort_by is cloning entire SearchHit.source for every
comparison (in the closure that builds lh/rh and calls compare_hits), which is
expensive; change the comparator to compare only the doc_id and the specific
sort-field values by borrowing them instead of cloning the whole source.
Concretely, in the scored.sort_by closure avoid constructing full SearchHit with
cloned source—either extract the needed sort values and ids from l_src/r_src and
call a new helper like compare_hits_by_values(&l_id, &r_id, l_sort_vals,
r_sort_vals, sort) or change compare_hits to accept borrowed inputs (e.g., &str
id and &Value or &Vec<Value> sort_values) so you only clone the doc_id when
necessary and borrow sort fields from l_src/r_src. Ensure references to the
existing symbols: scored.sort_by closure, SearchHit construction site,
compare_hits (or a new compare_hits_by_values), and the sort variable are
updated accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/adr/0005-parallel-segment-search.md`:
- Around line 13-14: Replace the incorrect rationale in the ADR: remove the
claim that serde_json::Value “uses RefCell internally” and is not Send/Sync, and
instead state that serde_json::Value does implement Send + Sync; correct the
Rayon bound by noting that into_par_iter() on a Vec<T> requires T: Send (not T:
Send + Sync). Update the discussion around IndexDocument::source to explain the
actual constraint driving the design (e.g., that IndexDocument::source is
intentionally not moved across threads in current architecture or that
cloning/moving the Value is avoided for performance/ownership reasons), and
ensure the text aligns with the later statement that serde_json::Value “never
crosses thread boundaries.”

In `@rust/crates/cloudsearch-index/benches/search.rs`:
- Around line 58-67: The benchmark setup currently ignores Results from async
calls (e.g., catalog.initialize(), catalog.create_index(...), and the later
indexing calls), allowing silent failures; change these to propagate or fail
fast by handling the Result (use the ? operator if the surrounding function
returns Result, or call .expect("...")/.unwrap_or_else(|e| panic!(...)) with a
clear message) so any error from catalog.initialize(),
catalog.create_index(...), and the subsequent indexing calls is surfaced
immediately and aborts the benchmark.

---

Nitpick comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 1029-1044: The comparator passed to scored.sort_by is cloning
entire SearchHit.source for every comparison (in the closure that builds lh/rh
and calls compare_hits), which is expensive; change the comparator to compare
only the doc_id and the specific sort-field values by borrowing them instead of
cloning the whole source. Concretely, in the scored.sort_by closure avoid
constructing full SearchHit with cloned source—either extract the needed sort
values and ids from l_src/r_src and call a new helper like
compare_hits_by_values(&l_id, &r_id, l_sort_vals, r_sort_vals, sort) or change
compare_hits to accept borrowed inputs (e.g., &str id and &Value or &Vec<Value>
sort_values) so you only clone the doc_id when necessary and borrow sort fields
from l_src/r_src. Ensure references to the existing symbols: scored.sort_by
closure, SearchHit construction site, compare_hits (or a new
compare_hits_by_values), and the sort variable are updated accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a68d3fee-9724-4a58-b2fa-bb23b7915587

📥 Commits

Reviewing files that changed from the base of the PR and between a3fc0e4 and 4f164da.

⛔ Files ignored due to path filters (1)
  • rust/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • docs/adr/0005-parallel-segment-search.md
  • docs/roadmap.md
  • rust/Cargo.toml
  • rust/crates/cloudsearch-index/Cargo.toml
  • rust/crates/cloudsearch-index/benches/search.rs
  • rust/crates/cloudsearch-index/src/lib.rs

Comment on lines +13 to +14
The naive fix — using `into_par_iter()` on a `Vec<IndexDocument>` — is blocked by `serde_json::Value`. The `IndexDocument::source` field is `serde_json::Value`, which uses `RefCell` internally and is therefore neither `Send` nor `Sync`. Rayon requires all data processed in parallel to be `Send + Sync`.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

Is Rust serde_json::Value SendandSync? Please provide official rustdoc/crate-source evidence.

💡 Result:

Yes. serde_json::Value is both Send and Sync. Rustdoc evidence: the serde_json::Value docs list auto-trait implementations “impl Send for Value” and “impl Sync for Value”. [1] Crate-source evidence (structure that supports these auto traits): Value is a recursive enum whose variants only contain bool, Number, String, Vec<Value>, and Map<String, Value> (where Map is the crate’s map type). [2] So the compiler can derive/implement Send and Sync for serde_json::Value, and the docs.rs page confirms this explicitly. [1]

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '1,120p' docs/adr/0005-parallel-segment-search.md

Repository: poyrazK/cloudSearch

Length of output: 4842


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "struct IndexDocument|IndexDocument|source: serde_json::Value|serde_json::Value" -S .

Repository: poyrazK/cloudSearch

Length of output: 44756


🌐 Web query:

serde_json Value impl Send Sync auto trait

💡 Result:

The serde_json::Value enum implements both Send and Sync auto traits [1][2]. These implementations allow Value to be safely transferred between threads (Send) and accessed concurrently by multiple threads through shared references (Sync). Because Value is composed of standard types that themselves implement these traits—such as String, BTreeMap, and Vec—it automatically satisfies the conditions for Send and Sync [3].

Citations:


🌐 Web query:

docs.rs serde_json::Value Send Sync

💡 Result:

serde_json::Value is Send and Sync. Docs.rs’ rendered trait impls show “impl Send for Value” and “impl Sync for Value” for the serde_json::value::Value type [1]. The serde-rs/json source defines serde_json::Value as a pub enum Value with #[derive(Clone, Eq, PartialEq, Hash)], consistent with the documented Send/Sync implementations [2].

Citations:


🌐 Web query:

rayon Vec into_par_iter Send bound T: Send

💡 Result:

Yes, the Rayon trait IntoParallelIterator and its associated ParallelIterator trait require the items being processed to implement the Send trait [1][2]. When you call into_par_iter() on a collection such as a Vec<T>, the implementation of IntoParallelIterator for that type requires that T (the item type) satisfies T: Send [1][3]. This is because Rayon's parallel iterators work by splitting the collection and potentially moving items to different threads within the Rayon thread pool to process them in parallel [4][5]. If your type T does not implement Send, you will not be able to call into_par_iter() on a Vec<T>, and the compiler will typically report that the trait bound T: Send is not satisfied [6][5]. Key points regarding Rayon's thread safety requirements: - Send: Required for into_par_iter() (which consumes the collection) and par_iter_mut() (which iterates over mutable references) because items are moved or sent between threads [5]. - Sync: Required for par_iter() (which iterates over shared references) because items are shared across threads [5]. If you are encountering errors with custom types or trait objects, you must ensure they implement Send (and Sync if using par_iter()) to satisfy these bounds [6][5]. For trait objects, this is often done by using Box<dyn Trait + Send> or by adding Send and Sync as supertraits to your trait definition [6][5].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '90,160p' rust/crates/cloudsearch-common/src/lib.rs

Repository: poyrazK/cloudSearch

Length of output: 1785


🏁 Script executed:

#!/bin/bash
set -euo pipefail
nl -ba docs/adr/0005-parallel-segment-search.md | sed -n '1,90p'

Repository: poyrazK/cloudSearch

Length of output: 106


🏁 Script executed:

#!/bin/bash
set -euo pipefail
nl -ba docs/adr/0005-parallel-segment-search.md | sed -n '8,20p'

Repository: poyrazK/cloudSearch

Length of output: 106


🏁 Script executed:

#!/bin/bash
set -euo pipefail
nl -ba rust/crates/cloudsearch-common/src/lib.rs | sed -n '110,160p'

Repository: poyrazK/cloudSearch

Length of output: 106


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "RefCell|Send\\b|Sync\\b|serde_json::Value" docs/adr/0005-parallel-segment-search.md
echo "---- ADR lines 1-30 ----"
cat -n docs/adr/0005-parallel-segment-search.md | sed -n '1,40p'
echo "---- ADR lines 30-80 ----"
cat -n docs/adr/0005-parallel-segment-search.md | sed -n '30,90p'

Repository: poyrazK/cloudSearch

Length of output: 6140


Fix the ADR’s incorrect serde_json::Value / Rayon thread-safety rationale.
docs/adr/0005-parallel-segment-search.md line 13 states that serde_json::Value “uses RefCell internally” and is neither Send nor Sync, but serde_json::Value implements both Send and Sync (per serde-rs/json source / docs.rs). Line 13 also says “Rayon requires all data processed in parallel to be Send + Sync”; for into_par_iter() on a Vec<T>, the key bound is T: Send (not necessarily Sync). Update the ADR to remove/replace the incorrect RefCell/Send+Sync blocker and describe the actual constraint driving the design (it should be consistent with the later claim that serde_json::Value “never crosses thread boundaries” for the real reason).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/adr/0005-parallel-segment-search.md` around lines 13 - 14, Replace the
incorrect rationale in the ADR: remove the claim that serde_json::Value “uses
RefCell internally” and is not Send/Sync, and instead state that
serde_json::Value does implement Send + Sync; correct the Rayon bound by noting
that into_par_iter() on a Vec<T> requires T: Send (not T: Send + Sync). Update
the discussion around IndexDocument::source to explain the actual constraint
driving the design (e.g., that IndexDocument::source is intentionally not moved
across threads in current architecture or that cloning/moving the Value is
avoided for performance/ownership reasons), and ensure the text aligns with the
later statement that serde_json::Value “never crosses thread boundaries.”

Comment on lines +58 to +67
let _: cloudsearch_common::Result<()> = catalog.initialize().await;
let _: cloudsearch_common::Result<cloudsearch_common::IndexMetadata> = catalog
.create_index(
"test",
CreateIndexRequest {
settings: IndexSettings::default(),
..Default::default()
},
)
.await;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on setup/indexing errors in benchmark initialization.

Lines 58-67 and Line 83/Line 86/Line 87 ignore Results, so setup failures can silently invalidate benchmark data and performance conclusions.

Suggested fix
-        let _: cloudsearch_common::Result<()> = catalog.initialize().await;
-        let _: cloudsearch_common::Result<cloudsearch_common::IndexMetadata> = catalog
+        catalog.initialize().await.expect("initialize catalog");
+        catalog
             .create_index(
                 "test",
                 CreateIndexRequest {
                     settings: IndexSettings::default(),
                     ..Default::default()
                 },
             )
-            .await;
+            .await
+            .expect("create index");

@@
-            let _: cloudsearch_common::Result<u64> = handle.index_document(doc).await;
+            handle.index_document(doc).await.expect("index document");
@@
-        let _: cloudsearch_common::Result<usize> = handle.refresh().await;
-        let _: cloudsearch_common::Result<cloudsearch_common::FlushResponse> = handle.flush().await;
+        handle.refresh().await.expect("refresh index");
+        handle.flush().await.expect("flush index");

Also applies to: 83-87

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cloudsearch-index/benches/search.rs` around lines 58 - 67, The
benchmark setup currently ignores Results from async calls (e.g.,
catalog.initialize(), catalog.create_index(...), and the later indexing calls),
allowing silent failures; change these to propagate or fail fast by handling the
Result (use the ? operator if the surrounding function returns Result, or call
.expect("...")/.unwrap_or_else(|e| panic!(...)) with a clear message) so any
error from catalog.initialize(), catalog.create_index(...), and the subsequent
indexing calls is surfaced immediately and aborts the benchmark.

poyrazK added a commit that referenced this pull request Jun 11, 2026
@poyrazK poyrazK force-pushed the feature/autocomplete-suggest branch from 6ecd339 to 0a363b8 Compare June 11, 2026 19:45
@poyrazK poyrazK force-pushed the feature/autocomplete-suggest branch from 0a363b8 to bb47405 Compare June 11, 2026 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant