Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fdafdcb627
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| DataType::Dictionary(_, value_type) if value_type.as_ref() == &DataType::Utf8 => { | ||
| let dict = column | ||
| .as_any() | ||
| .downcast_ref::<arrow::array::DictionaryArray<Int32Type>>() | ||
| .expect("dictionary column should be DictionaryArray<Int32>"); |
There was a problem hiding this comment.
Avoid panicking on non-Int32 dictionary keys
When a referenced routing column arrives as a valid Arrow Dictionary with a non-Int32 key type, this match accepts it because it only checks the value type, but the subsequent downcast to DictionaryArray<Int32Type> returns None and the expect panics. This can take down the doc processor for otherwise string-valued tag columns encoded with UInt8, Int16, etc.; either match DataType::Dictionary(Int32, Utf8) explicitly or handle the supported key types instead of treating all dictionaries as Int32.
Useful? React with 👍 / 👎.
g-talbot
left a comment
There was a problem hiding this comment.
LGTM'ing to keep you unblocked, but there's a codex comment and it's a real one that I had to deal with in the merger too.
Now that ParquetSplitMetadata has partition_id (from Matt's PR #6340), include it in CompactionScope so splits with different partitions are never merged together. Adds 2 new scope tests for partition isolation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-partitioning' into matthew.kim/metrics-partitioning
There was a problem hiding this comment.
💡 Codex Review
https://github.com/quickwit-oss/quickwit/blob/0ec77d339ba7d04f2f6466a2b029421167aa0124/quickwit-indexing/src/actors/metrics_pipeline/parquet_doc_processor.rs#L414-L418
Preserve partition ids across Arrow batches
When one workbench receives multiple Arrow IPC batches, applying this max_num_partitions cap independently inside partition_batch can misroute rows for partitions that are already open in the indexer. For example with max_num_partitions = 2, if the first IPC batch opens partitions A and B, and a later IPC batch first sees C, D, then A, the A rows are converted to OTHER_PARTITION_ID here before the indexer sees them; the indexer can no longer recover that they belonged to A, so the same routing key is split between its real partition and OTHER. The cap needs to be enforced with workbench/global partition state, or the doc processor should forward the true partition ids and let the indexer do the overflow routing.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Description
Enables partitioning of metric (points and sketches) data. A
partition_idwill be added to split metadata (thesplit_metadata_jsonfield).Summary of functionality (basically mirrors logs):
partition_keyrouting expression against RecordBatch rows.This PR keeps the new Arrow-backed routing and metrics pipeline changes behind the existing
metricsfeature.quickwit-doc-mappernow exposes its ArrowRoutingExprContextonly whenquickwit-doc-mapper/metricsis enabled, andquickwit-indexing/metricswires that feature in alongside its existing optionalarrowandquickwit-parquet-enginedependencies.for @g-talbot :
partition_idis not a column in postgres, it's in a JSON column, so we can't effectively filter/group by partition_id in postgres. but looking at how logs does compaction, they don't filter bypartition_idfrom postgres, they just group file sin memory bypartition_idafter getting all the metadata back.How was this PR tested?
Describe how you tested this PR.