
enable partitioning for metrics data #6340

Open
mattmkim wants to merge 7 commits into main from matthew.kim/metrics-partitioning

Conversation

Contributor

@mattmkim mattmkim commented Apr 24, 2026

Description

Enables partitioning of metrics (points and sketches) data. A partition_id will be added to split metadata (the split_metadata_json field).

Summary of functionality (basically mirrors logs):

  • Evaluates the index partition_key routing expression against RecordBatch rows.
  • Splits incoming metrics/sketch Arrow batches into partition-local batches in ParquetDocProcessor.
  • Carries all partition-local batches from one RawDocBatch in a single ProcessedParquetBatch, with one checkpoint delta covering the whole source batch.
  • Updates ParquetIndexer to maintain per-partition accumulators and flush the whole workbench together on threshold, force commit, timeout, or shutdown. (A potential downside is small partition batches/small files, but this greatly simplifies the implementation and matches what logs does.)
  • Enforces max_num_partitions for metrics partitions and routes overflow partitions to OTHER_PARTITION_ID.
  • Updates ParquetPackager to write one split per flushed partition batch and persist partition_id into split metadata.
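As a rough illustration of the batch-splitting step above (a simplified sketch: the real code evaluates the partition_key routing expression against Arrow RecordBatch columns, whereas here the per-row keys are precomputed integers):

```rust
use std::collections::HashMap;

// Simplified sketch: `partition_keys[row]` stands in for the result of
// evaluating the index's partition_key routing expression on that row.
// The real implementation would use these index groups to `take`
// partition-local RecordBatches out of the incoming Arrow batch.
fn split_by_partition(partition_keys: &[u64]) -> HashMap<u64, Vec<usize>> {
    let mut groups: HashMap<u64, Vec<usize>> = HashMap::new();
    for (row, &key) in partition_keys.iter().enumerate() {
        groups.entry(key).or_default().push(row);
    }
    groups
}

fn main() {
    // Rows 0, 2, 3 share partition key 7; row 1 has key 9.
    let groups = split_by_partition(&[7, 9, 7, 7]);
    assert_eq!(groups[&7], vec![0, 2, 3]);
    assert_eq!(groups[&9], vec![1]);
}
```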

This PR keeps the new Arrow-backed routing and metrics pipeline changes behind the existing metrics feature. quickwit-doc-mapper now exposes its Arrow RoutingExprContext only when quickwit-doc-mapper/metrics is enabled, and quickwit-indexing/metrics wires that feature in alongside its existing optional arrow and quickwit-parquet-engine dependencies.

For @g-talbot: partition_id is not a column in Postgres, it's inside a JSON column, so we can't effectively filter/group by partition_id in Postgres. But looking at how logs does compaction, it doesn't filter by partition_id in Postgres either; it just groups files in memory by partition_id after fetching all the metadata back.

How was this PR tested?

Describe how you tested this PR.

@mattmkim mattmkim marked this pull request as ready for review April 28, 2026 14:33
@mattmkim mattmkim changed the title [draft] partitioning for metrics enable partitioning for metrics data Apr 28, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fdafdcb627

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +63 to +67

    DataType::Dictionary(_, value_type) if value_type.as_ref() == &DataType::Utf8 => {
        let dict = column
            .as_any()
            .downcast_ref::<arrow::array::DictionaryArray<Int32Type>>()
            .expect("dictionary column should be DictionaryArray<Int32>");


P2: Avoid panicking on non-Int32 dictionary keys

When a referenced routing column arrives as a valid Arrow Dictionary with a non-Int32 key type, this match accepts it because it only checks the value type, but the subsequent downcast to DictionaryArray<Int32Type> returns None and the expect panics. This can take down the doc processor for otherwise string-valued tag columns encoded with UInt8, Int16, etc.; either match DataType::Dictionary(Int32, Utf8) explicitly or handle the supported key types instead of treating all dictionaries as Int32.
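A minimal sketch of the suggested fix, using stand-in enums rather than the real arrow::datatypes::DataType: dispatch on the dictionary key type explicitly and return an error value for unsupported keys, instead of relying on a downcast that panics.

```rust
// Stand-ins for arrow's DataType / key types, just to illustrate the
// dispatch pattern; the real fix would match on arrow::datatypes::DataType.
#[derive(Debug, PartialEq)]
enum KeyType { Int8, Int16, Int32, UInt8 }

#[derive(Debug, PartialEq)]
enum DataType { Utf8, Dictionary(KeyType, Box<DataType>) }

/// Returns the key width for a supported string dictionary, or None for
/// anything else, so the caller can surface an error instead of panicking.
fn dictionary_key_width(dt: &DataType) -> Option<u8> {
    match dt {
        DataType::Dictionary(key, value) if **value == DataType::Utf8 => match key {
            KeyType::Int8 | KeyType::UInt8 => Some(1),
            KeyType::Int16 => Some(2),
            KeyType::Int32 => Some(4),
        },
        _ => None,
    }
}

fn main() {
    let int32_dict = DataType::Dictionary(KeyType::Int32, Box::new(DataType::Utf8));
    assert_eq!(dictionary_key_width(&int32_dict), Some(4));
    // A UInt8-keyed dictionary is handled instead of panicking on a downcast.
    let u8_dict = DataType::Dictionary(KeyType::UInt8, Box::new(DataType::Utf8));
    assert_eq!(dictionary_key_width(&u8_dict), Some(1));
    assert_eq!(dictionary_key_width(&DataType::Utf8), None);
}
```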


Contributor

@g-talbot g-talbot left a comment


LGTM'ing to keep you unblocked, but there's a codex comment and it's a real one that I had to deal with in the merger too.

g-talbot added a commit that referenced this pull request Apr 28, 2026
Now that ParquetSplitMetadata has partition_id (from Matt's PR #6340),
include it in CompactionScope so splits with different partitions are
never merged together. Adds 2 new scope tests for partition isolation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
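The idea in that follow-up commit could be sketched like this (field names are illustrative, not the actual CompactionScope definition): make partition_id part of the scope key, so splits from different partitions hash into different merge groups.

```rust
// Hypothetical sketch: partition_id is part of the compaction scope key,
// so splits with different partitions can never land in one merge group.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct CompactionScope {
    index_uid: String,
    partition_id: u64,
}

fn scope_of(index_uid: &str, partition_id: u64) -> CompactionScope {
    CompactionScope { index_uid: index_uid.to_string(), partition_id }
}

fn main() {
    // Same index, different partitions: different scopes, never merged.
    assert_ne!(scope_of("metrics", 1), scope_of("metrics", 2));
    // Same index and partition: same scope, eligible for merging.
    assert_eq!(scope_of("metrics", 1), scope_of("metrics", 1));
}
```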

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

https://github.com/quickwit-oss/quickwit/blob/0ec77d339ba7d04f2f6466a2b029421167aa0124/quickwit-indexing/src/actors/metrics_pipeline/parquet_doc_processor.rs#L414-L418
P2: Preserve partition ids across Arrow batches

When one workbench receives multiple Arrow IPC batches, applying this max_num_partitions cap independently inside partition_batch can misroute rows for partitions that are already open in the indexer. For example with max_num_partitions = 2, if the first IPC batch opens partitions A and B, and a later IPC batch first sees C, D, then A, the A rows are converted to OTHER_PARTITION_ID here before the indexer sees them; the indexer can no longer recover that they belonged to A, so the same routing key is split between its real partition and OTHER. The cap needs to be enforced with workbench/global partition state, or the doc processor should forward the true partition ids and let the indexer do the overflow routing.
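A sketch of the first option the review suggests (names are hypothetical): keep the set of open partitions in workbench-scoped state, so a key that opened a partition in an earlier batch keeps routing to it, and only genuinely new keys overflow once the cap is hit.

```rust
use std::collections::HashMap;

// Hypothetical sentinel mirroring OTHER_PARTITION_ID in the PR.
const OTHER_PARTITION_ID: u64 = u64::MAX;

/// Workbench-scoped router: `open` persists across Arrow batches, so the
/// max_num_partitions cap is enforced consistently for the whole workbench
/// rather than independently per batch.
struct PartitionRouter {
    open: HashMap<u64, u64>,
    max_num_partitions: usize,
}

impl PartitionRouter {
    fn new(max_num_partitions: usize) -> Self {
        Self { open: HashMap::new(), max_num_partitions }
    }

    fn route(&mut self, key: u64) -> u64 {
        if let Some(&pid) = self.open.get(&key) {
            return pid; // already open, even if opened by an earlier batch
        }
        if self.open.len() < self.max_num_partitions {
            self.open.insert(key, key);
            key
        } else {
            OTHER_PARTITION_ID
        }
    }
}

fn main() {
    let mut router = PartitionRouter::new(2);
    // Batch 1 opens partitions A(=1) and B(=2).
    assert_eq!(router.route(1), 1);
    assert_eq!(router.route(2), 2);
    // Batch 2: C(=3) overflows, but A(=1) still routes to its own partition.
    assert_eq!(router.route(3), OTHER_PARTITION_ID);
    assert_eq!(router.route(1), 1);
}
```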

