
enable partitioning for metrics data #6340

Open
mattmkim wants to merge 7 commits into main from matthew.kim/metrics-partitioning

Conversation

Contributor

@mattmkim mattmkim commented Apr 24, 2026

Description

Enables partitioning of metrics (points and sketches) data. A partition_id will be added to split metadata (the split_metadata_json field).

Summary of functionality (basically mirrors logs):

  • Evaluates the index partition_key routing expression against RecordBatch rows.
  • Splits incoming metrics/sketch Arrow batches into partition-local batches in ParquetDocProcessor.
  • Carries all partition-local batches from one RawDocBatch in a single ProcessedParquetBatch, with one checkpoint delta covering the whole source batch.
  • Updates ParquetIndexer to maintain per-partition accumulators and flush the whole workbench together on threshold, force commit, timeout, or shutdown. (A potential downside is small partition batches/small files, but this greatly simplifies the implementation and matches what logs does.)
  • Enforces max_num_partitions for metrics partitions and routes overflow partitions to OTHER_PARTITION_ID.
  • Updates ParquetPackager to write one split per flushed partition batch and persist partition_id into split metadata.
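As a rough illustration of the batch-splitting step above (a simplified sketch: the real code evaluates the partition_key routing expression against Arrow RecordBatch columns, whereas here the per-row keys are precomputed integers):

```rust
use std::collections::HashMap;

// Simplified sketch: `partition_keys[row]` stands in for the result of
// evaluating the index's partition_key routing expression on that row.
// The real implementation would use these index groups to `take`
// partition-local RecordBatches out of the incoming Arrow batch.
fn split_by_partition(partition_keys: &[u64]) -> HashMap<u64, Vec<usize>> {
    let mut groups: HashMap<u64, Vec<usize>> = HashMap::new();
    for (row, &key) in partition_keys.iter().enumerate() {
        groups.entry(key).or_default().push(row);
    }
    groups
}

fn main() {
    // Rows 0, 2, 3 share partition key 7; row 1 has key 9.
    let groups = split_by_partition(&[7, 9, 7, 7]);
    assert_eq!(groups[&7], vec![0, 2, 3]);
    assert_eq!(groups[&9], vec![1]);
}
```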

This PR keeps the new Arrow-backed routing and metrics pipeline changes behind the existing metrics feature. quickwit-doc-mapper now exposes its Arrow RoutingExprContext only when quickwit-doc-mapper/metrics is enabled, and quickwit-indexing/metrics wires that feature in alongside its existing optional arrow and quickwit-parquet-engine dependencies.

For @g-talbot: partition_id is not a column in Postgres, it's inside a JSON column, so we can't effectively filter/group by partition_id in Postgres. But looking at how logs does compaction, it doesn't filter by partition_id in Postgres either; it just groups files in memory by partition_id after fetching all the metadata back.

How was this PR tested?

Describe how you tested this PR.

@mattmkim mattmkim marked this pull request as ready for review April 28, 2026 14:33
@mattmkim mattmkim changed the title [draft] partitioning for metrics enable partitioning for metrics data Apr 28, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fdafdcb627

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +63 to +67

    DataType::Dictionary(_, value_type) if value_type.as_ref() == &DataType::Utf8 => {
        let dict = column
            .as_any()
            .downcast_ref::<arrow::array::DictionaryArray<Int32Type>>()
            .expect("dictionary column should be DictionaryArray<Int32>");


P2: Avoid panicking on non-Int32 dictionary keys

When a referenced routing column arrives as a valid Arrow Dictionary with a non-Int32 key type, this match accepts it because it only checks the value type, but the subsequent downcast to DictionaryArray<Int32Type> returns None and the expect panics. This can take down the doc processor for otherwise string-valued tag columns encoded with UInt8, Int16, etc.; either match DataType::Dictionary(Int32, Utf8) explicitly or handle the supported key types instead of treating all dictionaries as Int32.
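A minimal sketch of the suggested fix, using stand-in enums rather than the real arrow::datatypes::DataType: dispatch on the dictionary key type explicitly and return an error value for unsupported keys, instead of relying on a downcast that panics.

```rust
// Stand-ins for arrow's DataType / key types, just to illustrate the
// dispatch pattern; the real fix would match on arrow::datatypes::DataType.
#[derive(Debug, PartialEq)]
enum KeyType { Int8, Int16, Int32, UInt8 }

#[derive(Debug, PartialEq)]
enum DataType { Utf8, Dictionary(KeyType, Box<DataType>) }

/// Returns the key width for a supported string dictionary, or None for
/// anything else, so the caller can surface an error instead of panicking.
fn dictionary_key_width(dt: &DataType) -> Option<u8> {
    match dt {
        DataType::Dictionary(key, value) if **value == DataType::Utf8 => match key {
            KeyType::Int8 | KeyType::UInt8 => Some(1),
            KeyType::Int16 => Some(2),
            KeyType::Int32 => Some(4),
        },
        _ => None,
    }
}

fn main() {
    let int32_dict = DataType::Dictionary(KeyType::Int32, Box::new(DataType::Utf8));
    assert_eq!(dictionary_key_width(&int32_dict), Some(4));
    // A UInt8-keyed dictionary is handled instead of panicking on a downcast.
    let u8_dict = DataType::Dictionary(KeyType::UInt8, Box::new(DataType::Utf8));
    assert_eq!(dictionary_key_width(&u8_dict), Some(1));
    assert_eq!(dictionary_key_width(&DataType::Utf8), None);
}
```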


Contributor

@g-talbot g-talbot left a comment


LGTM'ing to keep you unblocked, but there's a codex comment and it's a real one that I had to deal with in the merger too.

g-talbot added a commit that referenced this pull request Apr 28, 2026
Now that ParquetSplitMetadata has partition_id (from Matt's PR #6340),
include it in CompactionScope so splits with different partitions are
never merged together. Adds 2 new scope tests for partition isolation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
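The idea in that follow-up commit could be sketched like this (field names are illustrative, not the actual CompactionScope definition): make partition_id part of the scope key, so splits from different partitions hash into different merge groups.

```rust
// Hypothetical sketch: partition_id is part of the compaction scope key,
// so splits with different partitions can never land in one merge group.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct CompactionScope {
    index_uid: String,
    partition_id: u64,
}

fn scope_of(index_uid: &str, partition_id: u64) -> CompactionScope {
    CompactionScope { index_uid: index_uid.to_string(), partition_id }
}

fn main() {
    // Same index, different partitions: different scopes, never merged.
    assert_ne!(scope_of("metrics", 1), scope_of("metrics", 2));
    // Same index and partition: same scope, eligible for merging.
    assert_eq!(scope_of("metrics", 1), scope_of("metrics", 1));
}
```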

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

https://github.com/quickwit-oss/quickwit/blob/0ec77d339ba7d04f2f6466a2b029421167aa0124/quickwit-indexing/src/actors/metrics_pipeline/parquet_doc_processor.rs#L414-L418
P2: Preserve partition ids across Arrow batches

When one workbench receives multiple Arrow IPC batches, applying this max_num_partitions cap independently inside partition_batch can misroute rows for partitions that are already open in the indexer. For example with max_num_partitions = 2, if the first IPC batch opens partitions A and B, and a later IPC batch first sees C, D, then A, the A rows are converted to OTHER_PARTITION_ID here before the indexer sees them; the indexer can no longer recover that they belonged to A, so the same routing key is split between its real partition and OTHER. The cap needs to be enforced with workbench/global partition state, or the doc processor should forward the true partition ids and let the indexer do the overflow routing.
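A sketch of the first option the review suggests (names are hypothetical): keep the set of open partitions in workbench-scoped state, so a key that opened a partition in an earlier batch keeps routing to it, and only genuinely new keys overflow once the cap is hit.

```rust
use std::collections::HashMap;

// Hypothetical sentinel mirroring OTHER_PARTITION_ID in the PR.
const OTHER_PARTITION_ID: u64 = u64::MAX;

/// Workbench-scoped router: `open` persists across Arrow batches, so the
/// max_num_partitions cap is enforced consistently for the whole workbench
/// rather than independently per batch.
struct PartitionRouter {
    open: HashMap<u64, u64>,
    max_num_partitions: usize,
}

impl PartitionRouter {
    fn new(max_num_partitions: usize) -> Self {
        Self { open: HashMap::new(), max_num_partitions }
    }

    fn route(&mut self, key: u64) -> u64 {
        if let Some(&pid) = self.open.get(&key) {
            return pid; // already open, even if opened by an earlier batch
        }
        if self.open.len() < self.max_num_partitions {
            self.open.insert(key, key);
            key
        } else {
            OTHER_PARTITION_ID
        }
    }
}

fn main() {
    let mut router = PartitionRouter::new(2);
    // Batch 1 opens partitions A(=1) and B(=2).
    assert_eq!(router.route(1), 1);
    assert_eq!(router.route(2), 2);
    // Batch 2: C(=3) overflows, but A(=1) still routes to its own partition.
    assert_eq!(router.route(3), OTHER_PARTITION_ID);
    assert_eq!(router.route(1), 1);
}
```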

