feat: Multi-Bucket Split Sharding #36

Open
Darkheir wants to merge 1 commit into sekoia from feat/multi_buckets

@Darkheir Darkheir (Collaborator) commented Apr 28, 2026

Summary

Splits for a single index can now be distributed across multiple storage buckets. A new extra_index_uris configuration option specifies additional storage URIs alongside the existing index_uri. New splits are written to buckets using a round-robin strategy, and each split records which bucket it was stored in, so that reads, merges, and garbage collection work correctly regardless of how the list of configured URIs changes over time.

Motivation

Previously, all splits for an index were stored under a single index_uri. This change enables spreading data across multiple buckets for improved write throughput, storage isolation, or operational flexibility.

Configuration

version: 0.8
index_id: my-index
index_uri: s3://bucket-a/my-index
extra_index_uris:
  - s3://bucket-b/my-index
  - s3://bucket-c/my-index

  • index_uri remains required and acts as the primary storage location.
  • extra_index_uris is optional (defaults to empty, so the change is fully backward compatible).
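
To illustrate how the two settings combine, here is a minimal sketch of a helper that returns every configured URI with the primary one first. The PR mentions an all_index_uris helper, but its real signature is not shown here; the struct name IndexConfigSketch, the use of plain String values instead of a URI type, and the primary-first ordering are all assumptions made for illustration.

```rust
/// Hypothetical, simplified stand-in for the index config (not the PR's type).
#[derive(Debug, Clone)]
pub struct IndexConfigSketch {
    /// Required primary storage location.
    pub index_uri: String,
    /// Optional additional storage locations; empty by default.
    pub extra_index_uris: Vec<String>,
}

impl IndexConfigSketch {
    /// Returns the primary URI followed by the extra URIs in declaration order.
    /// The ordering is an assumption of this sketch.
    pub fn all_index_uris(&self) -> Vec<&str> {
        std::iter::once(self.index_uri.as_str())
            .chain(self.extra_index_uris.iter().map(String::as_str))
            .collect()
    }
}
```

With an empty extra_index_uris, the helper returns a single-element list, which matches the single-bucket behavior that existed before this change.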

How it works

  • Write path: The IndexingSplitStore holds all resolved storages and a BucketSelector (round-robin by default). Each new split is assigned a target bucket before staging. The chosen URI is persisted in SplitMetadata.storage_uri.
  • Read path: SearchJob and FetchDocsJob carry the per-split storage URI. Leaf requests are grouped by (index_uid, storage_uri) so splits in different buckets get separate requests. No proto changes were needed.
  • Merge: fetch_and_open_split takes &SplitMetadata and resolves the correct bucket via effective_storage_uri(). Merged output splits are assigned a bucket by the selector.
  • Garbage collection: Splits are grouped by their effective storage URI before deletion. Each group is resolved to the correct storage independently.
  • Backward compatibility: Existing splits have storage_uri: None and continue to be read from index_uri. No database migration is required — the field lives inside the existing split_metadata_json column.
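
The write-path selection and the read-path fallback described above can be sketched roughly as follows. This is not the PR's code: RoundRobinSelector and SplitMetadataSketch are hypothetical names, URIs are plain strings, and the fallback takes the index URI as an explicit argument, which may differ from the real effective_storage_uri() signature.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin selector over a fixed list of bucket URIs.
pub struct RoundRobinSelector {
    uris: Vec<String>,
    next: AtomicUsize,
}

impl RoundRobinSelector {
    /// Panics if `uris` is empty, since the primary index_uri is always expected.
    pub fn new(uris: Vec<String>) -> Self {
        assert!(!uris.is_empty(), "at least the primary index_uri is required");
        Self { uris, next: AtomicUsize::new(0) }
    }

    /// Returns the URI for the next new split, cycling through the list.
    pub fn select(&self) -> &str {
        let idx = self.next.fetch_add(1, Ordering::Relaxed) % self.uris.len();
        &self.uris[idx]
    }
}

/// Simplified stand-in for SplitMetadata with only the fields relevant here.
pub struct SplitMetadataSketch {
    pub split_id: String,
    /// `None` for splits written before the feature existed.
    pub storage_uri: Option<String>,
}

impl SplitMetadataSketch {
    /// Uses the recorded bucket if present, otherwise falls back to index_uri.
    pub fn effective_storage_uri<'a>(&'a self, index_uri: &'a str) -> &'a str {
        self.storage_uri.as_deref().unwrap_or(index_uri)
    }
}
```

Because each split carries its own URI, changing the configured list later only affects where new splits are written; splits already recorded keep resolving to the bucket they were written to.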

Breaking changes

  • Indexes using extra_index_uris cannot be read by older Quickwit versions (the field is omitted from serialized JSON when empty, so indexes not using the feature are unaffected).

@Darkheir Darkheir changed the title from "feat: Multi buckets for an index" to "feat: Multi-Bucket Split Sharding" Apr 28, 2026
@Darkheir Darkheir force-pushed the feat/multi_buckets branch 2 times, most recently from cf655d7 to 3ce26eb Compare April 28, 2026 13:05
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
@Darkheir Darkheir force-pushed the feat/multi_buckets branch from 3ce26eb to 636ae07 Compare April 28, 2026 14:05
@Darkheir Darkheir marked this pull request as ready for review April 28, 2026 14:21
Copilot AI review requested due to automatic review settings April 28, 2026 14:21
Copilot AI left a comment

Pull request overview

Adds multi-bucket split sharding to Quickwit indexes by introducing extra_index_uris and persisting the chosen bucket per split (SplitMetadata.storage_uri) so search, merge, GC, and tooling can always resolve the correct storage location.

Changes:

  • Add extra_index_uris to index config/template + metastore update flow, and persist per-split storage_uri with a fallback helper (effective_storage_uri).
  • Update indexing, merge, search/list APIs, CLI, janitor, and garbage collection to read/write/delete splits using the per-split effective storage URI.
  • Add round-robin bucket selection and an end-to-end integration test + docs updates.
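
As a rough illustration of the per-URI grouping that search, list, and garbage-collection paths rely on, the sketch below groups split IDs by their effective storage URI. The function name group_by_effective_uri and the tuple representation of a split are assumptions made for this example; the PR's actual grouping helper is comparator-based and is not reproduced here.

```rust
use std::collections::BTreeMap;

/// Groups split IDs by effective storage URI.
/// Each input entry is (split_id, storage_uri); a `None` URI denotes a legacy
/// split that is stored under the primary index_uri.
pub fn group_by_effective_uri<'a>(
    index_uri: &'a str,
    splits: &'a [(String, Option<String>)],
) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut groups: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (split_id, storage_uri) in splits {
        // Fall back to the primary URI when the split has no recorded bucket.
        let uri = storage_uri.as_deref().unwrap_or(index_uri);
        groups.entry(uri).or_default().push(split_id.as_str());
    }
    groups
}
```

Each resulting group can then be resolved to its own storage and handled independently, for example by one bulk delete or one leaf request per group.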

Reviewed changes

Copilot reviewed 43 out of 43 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| quickwit/quickwit-serve/src/lib.rs | Treat indexes as file-backed if any configured index URI (primary or extra) uses file/ram storage. |
| quickwit/quickwit-search/src/search_job_placer.rs | Group jobs by (index_uid, storage_uri); refactor grouping helper to comparator-based API. |
| quickwit/quickwit-search/src/root.rs | Carry per-split storage_uri through search + fetch-docs job paths and leaf request building. |
| quickwit/quickwit-search/src/list_terms.rs | Route list-terms leaf requests per (index_uid, storage_uri) group. |
| quickwit/quickwit-search/src/list_fields.rs | Route list-fields leaf requests per (index_uid, storage_uri) group. |
| quickwit/quickwit-proto/src/codegen/quickwit/quickwit.metastore.rs | Add extra_index_uris field to UpdateIndexRequest codegen. |
| quickwit/quickwit-proto/protos/quickwit/metastore.proto | Add extra_index_uris to metastore UpdateIndexRequest proto. |
| quickwit/quickwit-metastore/src/tests/index.rs | Update metastore update-index tests to pass extra_index_uris. |
| quickwit/quickwit-metastore/src/split_metadata_version.rs | Extend split metadata v0.8 serialization with optional storage_uri. |
| quickwit/quickwit-metastore/src/split_metadata.rs | Add storage_uri to SplitMetadata + effective_storage_uri helper. |
| quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs | Deserialize and apply extra_index_uris during update-index. |
| quickwit/quickwit-metastore/src/metastore/mod.rs | Add (de)serialization support for extra_index_uris in UpdateIndexRequestExt. |
| quickwit/quickwit-metastore/src/metastore/index_metadata/mod.rs | Persist extra_index_uris updates in index metadata; add unit test. |
| quickwit/quickwit-metastore/src/metastore/file_backed/mod.rs | Deserialize and apply extra_index_uris during update-index. |
| quickwit/quickwit-metastore/src/metastore/file_backed/file_backed_index/mod.rs | Thread extra_index_uris through file-backed index config updates. |
| quickwit/quickwit-janitor/src/actors/garbage_collector.rs | Pass storage resolver through GC plumbing; adjust mocks. |
| quickwit/quickwit-janitor/src/actors/delete_task_service.rs | Build an IndexingSplitStore with multiple storages + selector for delete pipeline. |
| quickwit/quickwit-janitor/src/actors/delete_task_planner.rs | Build SearchJob using split effective storage URI. |
| quickwit/quickwit-janitor/src/actors/delete_task_pipeline.rs | Use IndexingSplitStore instead of a single Storage in delete pipeline. |
| quickwit/quickwit-integration-tests/src/tests/multi_bucket_tests.rs | New end-to-end integration test covering multi-bucket ingest + search. |
| quickwit/quickwit-integration-tests/src/tests/mod.rs | Register the new multi-bucket test module. |
| quickwit/quickwit-indexing/src/split_store/mod.rs | Export bucket selector API. |
| quickwit/quickwit-indexing/src/split_store/indexing_split_store.rs | Support multiple storages + per-split read/write routing using effective URI. |
| quickwit/quickwit-indexing/src/split_store/bucket_selector.rs | New round-robin bucket selector + tests. |
| quickwit/quickwit-indexing/src/models/split_attrs.rs | Initialize new SplitMetadata.storage_uri field. |
| quickwit/quickwit-indexing/src/mature_merge.rs | Resolve all configured storages and write merged outputs via selector. |
| quickwit/quickwit-indexing/src/lib.rs | Re-export split-store cache and selector helpers. |
| quickwit/quickwit-indexing/src/actors/uploader.rs | Select bucket per new split and persist SplitMetadata.storage_uri. |
| quickwit/quickwit-indexing/src/actors/merge_split_downloader.rs | Fetch splits using split metadata (effective storage URI). |
| quickwit/quickwit-indexing/src/actors/indexing_service.rs | Build multi-storage IndexingSplitStore for indexing pipelines. |
| quickwit/quickwit-indexing/src/actors/indexing_pipeline.rs | Remove direct Storage from params; rely on IndexingSplitStore. |
| quickwit/quickwit-index-management/src/index.rs | Validate connectivity for extra storages; pass resolver into GC flows. |
| quickwit/quickwit-index-management/src/garbage_collection.rs | Group deletions per effective storage URI; resolve per-bucket storage for bulk delete. |
| quickwit/quickwit-config/src/index_template/serialize.rs | Add extra_index_uris to index template (de)serialization. |
| quickwit/quickwit-config/src/index_template/mod.rs | Add extra_index_uris to templates + validation; propagate into index configs. |
| quickwit/quickwit-config/src/index_config/serialize.rs | Add extra_index_uris to index config schema; enforce “no removals” on update. |
| quickwit/quickwit-config/src/index_config/mod.rs | Add extra_index_uris field + helper all_index_uris; include in fingerprinting. |
| quickwit/quickwit-common/src/uri.rs | Add ordering to Protocol and Uri to enable grouping/sorting by URI. |
| quickwit/quickwit-cli/src/tool.rs | Resolve the correct storage URI for a specific split when extracting. |
| quickwit/quickwit-cli/src/lib.rs | Checklist now validates connectivity for extra index storages too. |
| docs/reference/rest-api.md | Document extra_index_uris in create/update index REST payloads. |
| docs/configuration/storage-config.md | Mention extra_index_uris as another place storage URIs can be used. |
| docs/configuration/index-config.md | Document extra_index_uris and the multi-bucket split sharding behavior/caveat. |

Comment thread quickwit/quickwit-integration-tests/src/tests/multi_bucket_tests.rs
Comment thread quickwit/quickwit-search/src/root.rs
Comment thread quickwit/quickwit-search/src/root.rs
Comment thread quickwit/quickwit-search/src/list_terms.rs
Comment thread quickwit/quickwit-search/src/list_fields.rs
Comment thread quickwit/quickwit-indexing/src/split_store/indexing_split_store.rs
Comment on lines +502 to +513
let failed_split_paths = all_storage_failures
    .iter()
    .map(|split_info| split_info.file_name.as_path())
    .collect::<Vec<_>>();
error!(
    error=?bulk_delete_error.error,
    index_id=index_uid.index_id,
    storage_uri=%uri,
    "failed to delete split file(s) {:?} from storage",
    PrettySample::new(&failed_split_paths, 5),
);
combined_storage_error = Some(bulk_delete_error);
Comment thread quickwit/quickwit-config/src/index_config/serialize.rs