Merged
77 changes: 77 additions & 0 deletions docs/adr/0004-autocomplete-suggest-index.md
@@ -0,0 +1,77 @@
# ADR 0004: Autocomplete Suggest Index

## Status
Accepted

## Date
2026-06-09

## Context

Users need fast autocomplete / as-you-type suggestions — returning completion candidates as the user types, before they finish a word. This is a read-heavy, latency-sensitive path (sub-10ms target). The feature should return term completions, not full documents.

Requirements:
- Sub-10ms latency for prefix lookup
- Multi-field support with per-field weights
- Score by term popularity across the corpus
- Completion suggestions (terms/phrases), not document results

## Decision

### Binary Serialization with Sorted Term Arrays

Each field's vocabulary is stored as a sorted `Vec<SuggestEntry>` (term, doc_freq, score) in a binary sidecar file (`suggest_{segment:020}.bin`). The format:

```
MAGIC (4) + VERSION (1) + PADDING (3) + FIELD_COUNT (4)
Per field: FIELD_NAME_LEN (4) + FIELD_NAME + TERM_COUNT (4)
Per term: STR_LEN (4) + TERM_BYTES + DOC_FREQ (4) + SCORE (4)
```
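
The layout above can be sketched as a small encoder. This is only an illustration under stated assumptions: the ADR does not give the magic bytes, the version value, or the byte order, so `SUGG`, version `1`, and little-endian integers are placeholders, and `encode_sidecar` is a hypothetical name. `SuggestEntry` follows the (term, doc_freq, score) triple described above, with the score assumed to be an `f32`.

```rust
/// One vocabulary entry, mirroring the (term, doc_freq, score) triple.
#[derive(Debug, Clone, PartialEq)]
pub struct SuggestEntry {
    pub term: String,
    pub doc_freq: u32,
    pub score: f32,
}

// Assumed values: the ADR does not state the magic bytes or the version.
const MAGIC: [u8; 4] = *b"SUGG";
const VERSION: u8 = 1;

/// Encodes per-field vocabularies in the documented order:
/// header, then one record per field, then one record per term.
/// Byte order (little-endian) is an assumption.
pub fn encode_sidecar(fields: &[(String, Vec<SuggestEntry>)]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&MAGIC); // MAGIC (4)
    out.push(VERSION); // VERSION (1)
    out.extend_from_slice(&[0u8; 3]); // PADDING (3)
    out.extend_from_slice(&(fields.len() as u32).to_le_bytes()); // FIELD_COUNT (4)
    for (name, entries) in fields {
        out.extend_from_slice(&(name.len() as u32).to_le_bytes()); // FIELD_NAME_LEN (4)
        out.extend_from_slice(name.as_bytes()); // FIELD_NAME
        out.extend_from_slice(&(entries.len() as u32).to_le_bytes()); // TERM_COUNT (4)
        for entry in entries {
            out.extend_from_slice(&(entry.term.len() as u32).to_le_bytes()); // STR_LEN (4)
            out.extend_from_slice(entry.term.as_bytes()); // TERM_BYTES
            out.extend_from_slice(&entry.doc_freq.to_le_bytes()); // DOC_FREQ (4)
            out.extend_from_slice(&entry.score.to_le_bytes()); // SCORE (4)
        }
    }
    out
}
```

With this layout the header is always 12 bytes, and each term costs 12 bytes plus the length of its UTF-8 encoding.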

O(log n + m) prefix lookup via binary search (`find_first_prefix`) where n = vocabulary size, m = matching terms. No in-memory index build at query time.
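
A minimal sketch of the lookup, assuming the vocabulary is held as a byte-wise sorted slice of terms. `find_first_prefix` is the name used in the ADR, but the body here is a guess at its behavior rather than the actual implementation, and `prefix_matches` is a hypothetical helper.

```rust
/// Returns the index of the first term that is >= `prefix` in byte order.
/// Every term starting with `prefix` sorts at or after this position,
/// so the binary search costs O(log n).
pub fn find_first_prefix(terms: &[&str], prefix: &str) -> usize {
    terms.partition_point(|term| *term < prefix)
}

/// Collects the m matching terms by scanning forward from the first
/// candidate until a term no longer starts with `prefix`.
pub fn prefix_matches<'a>(terms: &[&'a str], prefix: &str) -> Vec<&'a str> {
    // An empty prefix would match the whole vocabulary, so return nothing.
    if prefix.is_empty() {
        return Vec::new();
    }
    let start = find_first_prefix(terms, prefix);
    terms[start..]
        .iter()
        .copied()
        .take_while(|term| term.starts_with(prefix))
        .collect()
}
```

Because matching terms are contiguous in a sorted array, the forward scan stops at the first non-matching term, which is where the `+ m` in the bound comes from.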

### Atomic Writes via .tmp then Rename

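Flush writes to `suggest_{segment:020}.tmp`, then renames the file to `suggest_{segment:020}.bin`. Readers load from `.bin` only, never from `.tmp`. Because a rename within one filesystem is atomic on POSIX systems, readers see either no sidecar or a complete one, never a partially written file.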
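
The write-then-rename pattern can be sketched with the standard library alone. `sidecar_path` and `write_sidecar_atomically` are hypothetical names, and the `sync_all` call is an assumed durability step that the ADR does not mention.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Builds `suggest_{segment:020}.{ext}` inside `dir`.
pub fn sidecar_path(dir: &Path, segment: u64, ext: &str) -> PathBuf {
    dir.join(format!("suggest_{segment:020}.{ext}"))
}

/// Writes the bytes to the `.tmp` file first, then renames it to `.bin`.
/// Readers that only open `.bin` never observe a half-written file.
pub fn write_sidecar_atomically(dir: &Path, segment: u64, bytes: &[u8]) -> io::Result<PathBuf> {
    let tmp = sidecar_path(dir, segment, "tmp");
    let dst = sidecar_path(dir, segment, "bin");

    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?;
    // Assumption: flush file contents to disk before publishing the file.
    file.sync_all()?;

    // The rename replaces the destination in one step on the same filesystem.
    fs::rename(&tmp, &dst)?;
    Ok(dst)
}
```

The temporary file lives in the same directory as the final file so that the rename never crosses a filesystem boundary.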

### Per-Segment Sidecar Files

Each segment has its own suggest sidecar file. At query time, all segment sidecars are queried and results merged. This avoids rebuilding the entire suggest index on every flush — only the new segment's sidecar is written.
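
A rough sketch of the query-time merge, assuming each segment returns `(term, doc_freq)` pairs for the prefix. The ADR does not say how duplicates are combined, so summing `doc_freq` (valid only if each document lives in exactly one segment) and ranking by the combined count are assumptions, and `merge_segment_hits` is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Merges per-segment hits for one prefix into a single ranked list.
/// Assumption: a term's doc_freq values are summed across segments.
pub fn merge_segment_hits(per_segment: &[Vec<(String, u32)>], size: usize) -> Vec<(String, u32)> {
    // Deduplicate terms across segments in an in-memory BTreeMap.
    let mut merged: BTreeMap<&str, u32> = BTreeMap::new();
    for hits in per_segment {
        for (term, doc_freq) in hits {
            *merged.entry(term.as_str()).or_insert(0) += *doc_freq;
        }
    }

    let mut out: Vec<(String, u32)> = merged
        .into_iter()
        .map(|(term, doc_freq)| (term.to_string(), doc_freq))
        .collect();
    // Highest combined doc_freq first; ties fall back to lexical order.
    out.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    out.truncate(size);
    out
}
```

The merge cost grows with the number of segments and matches, which is the price of not rewriting the full vocabulary on every flush.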

### doc_freq = Unique Document Count per Term

`doc_freq` counts the number of **unique documents** that contain each term in a field, not the number of token occurrences. This is the standard document-frequency semantics used by search engines.
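
The difference from counting token occurrences can be shown with a short sketch. It assumes each document has already been tokenized into a list of terms for one field; `doc_freqs` is a hypothetical helper name.

```rust
use std::collections::{BTreeMap, HashSet};

/// Counts, for each term, how many documents contain it at least once.
/// Repeated occurrences inside one document count only once.
pub fn doc_freqs(docs: &[Vec<&str>]) -> BTreeMap<String, u32> {
    let mut freqs = BTreeMap::new();
    for tokens in docs {
        // Collapse repeated tokens within this document.
        let unique: HashSet<&str> = tokens.iter().copied().collect();
        for term in unique {
            *freqs.entry(term.to_string()).or_insert(0u32) += 1;
        }
    }
    freqs
}
```

For example, a document containing "elastic" twice contributes 1 to that term's `doc_freq`, not 2.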

### Segment Reader Management

- On `open_index`: all suggest sidecars loaded into `suggest_readers` Vec
- On `flush`: all suggest sidecars reloaded from manifest (not just the new one) to avoid reader accumulation
- On `merge`: all suggest sidecars reloaded after manifest update since old sidecars are invalidated

## Consequences

### Positive
- Supports the sub-10ms target: prefix lookup on sorted arrays is O(log n + m), with no in-memory index build at query time
- Atomic writes prevent readers from seeing partial data
- Per-segment sidecars mean flush only writes one new file, not the full vocabulary
- doc_freq semantics match standard IR practice

### Negative
- doc_freq is frozen at flush time — doesn't account for deletes or updates until next flush
- Each segment's suggest data is independent; cross-segment deduplication happens at query time (in-memory BTreeMap)
- An empty prefix would match every term in lexical order (potentially a very large result), so the lookup is guarded to return no results instead

### Neutral
- The suggest sidecar is separate from the positions sidecar — two separate files per segment
- Segment readers are held in memory; memory usage grows with segment count × vocabulary size

## Alternatives Considered

### Alternative 1: In-Memory Trie
**Why rejected:** A trie would require rebuilding the entire suggest index on every flush, which becomes expensive with large vocabularies. The sorted-array binary search gives a comparable O(log n + m) prefix lookup (a trie lookup is O(k + m) for a prefix of length k) while allowing per-segment incremental updates.

### Alternative 2: Generic B-Tree Index (e.g., RedBTree)
**Why rejected:** Adds a heavy dependency for a read-heavy, append-mostly workload. The binary serialized sorted arrays are simpler, have no runtime dependency, and serialize/deserialize cheaply.

### Alternative 3: Store suggest data inline in segment snapshot
**Why rejected:** Suggests are built during flush from the full document set; storing them inline in the segment snapshot would require re-reading all documents to rebuild suggests on every merge. Separate sidecar files allow incremental rebuilds from the merged document set only.
50 changes: 50 additions & 0 deletions docs/api-v1.md
@@ -50,6 +50,56 @@ Current implementation notes:

- `POST /{index}/_search` — search with JSON body
- `GET /{index}/_search?q=...` — search with query string
- `POST /{index}/_suggest` — autocomplete suggestions for a prefix

### Suggest Request Shape

```json
{
  "prefix": "elast",
  "fields": { "title": 1.0, "body": 0.5 },
  "size": 10,
  "fuzzy": { "fuzziness": "AUTO" }
}
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `prefix` | `string` | required | The prefix to suggest completions for |
| `fields` | `map<string, float>` | `{}` | Fields to search with their weights (0 = excluded) |
| `size` | `integer` | `10` | Maximum number of suggestions to return |
| `fuzzy` | `object` | none | Optional fuzzy matching; omit for exact prefix only |

### Suggest Response Shape

```json
{
  "suggestions": [
    {
      "text": "elastic",
      "score": 0.6667,
      "doc_freq": 2,
      "field": "title"
    }
  ]
}
```

| Field | Type | Description |
|-------|------|-------------|
| `text` | `string` | The completion suggestion (tokenized, lowercase) |
| `score` | `float` | Normalized popularity score (`doc_freq / n_docs`) |
| `doc_freq` | `integer` | Number of documents containing this term |
| `field` | `string?` | Which field contributed this suggestion |

### Implementation Notes

- Suggestions are built during flush from indexed text fields (type `keyword`)
- Each field's vocabulary is sorted and stored in a binary sidecar file for O(log n + m) prefix lookup
- `doc_freq` counts **unique documents** per term, not token occurrences
- Empty prefix (`""`) returns no results
- Scores are computed as `doc_freq / n_docs` where `n_docs` is the total documents at flush time
- Fuzzy matching uses edit distance when `fuzzy` is provided
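
The scoring rule can be sketched as follows. `popularity_score` implements the documented `doc_freq / n_docs` formula. How per-field weights enter the final score is not stated in these notes, so the multiplication in `weighted_score` is purely an assumption, as are both function names.

```rust
/// Normalized popularity: doc_freq / n_docs, with n_docs taken at flush time.
/// Returns 0.0 for an empty segment to avoid dividing by zero.
pub fn popularity_score(doc_freq: u32, n_docs: u32) -> f32 {
    if n_docs == 0 {
        0.0
    } else {
        doc_freq as f32 / n_docs as f32
    }
}

/// Applies a per-field weight from the request's `fields` map.
/// A weight of 0 excludes the field, as the request table states.
/// Assumption: the weight multiplies the popularity score.
pub fn weighted_score(doc_freq: u32, n_docs: u32, field_weight: f32) -> Option<f32> {
    if field_weight <= 0.0 {
        return None;
    }
    Some(field_weight * popularity_score(doc_freq, n_docs))
}
```

With `doc_freq = 2` and `n_docs = 3`, `popularity_score` gives roughly 0.6667, matching the sample response above.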

## Observability API

23 changes: 23 additions & 0 deletions rust/crates/cloudsearch-api/src/lib.rs
@@ -252,6 +252,7 @@ pub fn router_with_registry(registry: Arc<IndexRegistry>) -> Router {
            "/{index}/_search",
            get(search_index_get).post(search_index).put(search_index),
        )
        .route("/{index}/_suggest", post(suggest_index))
        .route("/{index}/_settings", put(update_index_settings))
        .route("/{index}/_snapshot", get(list_snapshots))
        .route(
@@ -519,6 +520,28 @@ async fn multi_search(
    Ok((StatusCode::OK, Json(MultiSearchResponse { responses })))
}

async fn suggest_index(
    State(state): State<ApiState>,
    Path(index): Path<String>,
    Json(request): Json<cloudsearch_common::SuggestRequest>,
) -> Result<impl IntoResponse, ApiError> {
    let started_at = Instant::now();

    let handle = state.registry.index_handle(&index).await?;
    let handle = handle.lock().await;

    let result = handle.suggest(&request);

    state.metrics().record_request(
        "suggest",
        "POST",
        StatusCode::OK,
        started_at.elapsed().as_secs_f64(),
    );

    Ok((StatusCode::OK, Json(result)))
}

async fn search_index_get(
    State(state): State<ApiState>,
    Path(index): Path<String>,
29 changes: 29 additions & 0 deletions rust/crates/cloudsearch-common/src/lib.rs
@@ -355,6 +355,35 @@ pub enum Fuzziness {
    Exact(usize),
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct SuggestRequest {
    pub prefix: String,
    #[serde(default)]
    pub fields: BTreeMap<String, f32>,
    #[serde(default = "default_suggest_size")]
    pub size: usize,
    #[serde(skip_serializing_if = "Option::is_none", default)]
    pub fuzzy: Option<Fuzziness>,
}

fn default_suggest_size() -> usize {
    10
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct SuggestResponse {
    pub suggestions: Vec<Suggestion>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct Suggestion {
    pub text: String,
    pub score: f32,
    pub doc_freq: u32,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub field: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct TermQuery {
    pub field: String,