feat: legacy multi-RG input adapter for ColumnPageStream (PR-5)#6408
Draft
g-talbot wants to merge 2 commits intogtt/column-page-stream-traitfrom
Draft
feat: legacy multi-RG input adapter for ColumnPageStream (PR-5)#6408g-talbot wants to merge 2 commits intogtt/column-page-stream-traitfrom
g-talbot wants to merge 2 commits intogtt/column-page-stream-traitfrom
Conversation
af78c8f to
2714921
Compare
e169002 to
1af3a64
Compare
2714921 to
61b6310
Compare
1af3a64 to
65a686d
Compare
61b6310 to
4ae07e7
Compare
Buffers a legacy parquet file fully, decodes via Arrow, concatenates into a single RecordBatch preserving order, and re-encodes as a single-row-group parquet stream that StreamingParquetReader can serve through the ColumnPageStream contract. Carries forward the original file's key_value_metadata and sorting_columns so downstream consumers (merge engine, metadata readers) see an identical logical view. This unblocks the merge engine's column-major streaming path on files where the original RG layout is misaligned with the sort prefix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`test_data_roundtrip_through_adapter` checks row count and schema names through the streaming path; that catches dropped rows but not value-level corruption (e.g. a hypothetical dictionary key XOR or column-value swap during the decode/concat/re-encode chain). Adds `test_reencode_preserves_arrays_byte_equal` calling `reencode_as_single_row_group` directly against a multi-RG fixture that includes dict-encoded columns and nulls, and asserts each column equals the oracle byte-for-byte. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
65a686d to
d2330a0
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
LegacyInputAdapterexposes legacy multi-row-group parquet files through PR-5a'sColumnPageStreamso PR-6's merge engine doesn't have to special-case them.ParquetRecordBatchReaderBuilder→arrow::compute::concat_batches(preserves order — does NOT re-sort) → re-encode as single-row-group viaStreamingParquetWriter(set_max_row_group_row_count(num_rows + 1)) → wrap in privateInMemoryByteSource→ expose via innerStreamingParquetReader.key_value_metadata(qh.*keys) and RG0sorting_columnsto the consolidated output.MAX_LEGACY_INPUT_BYTES) — production legacy files are well under 1 GiB.qh.rg_partition_prefix_len == 0 && num_row_groups > 1. (PR-6 implements the dispatch; this PR provides the adapter as one of the trait impls.)Stack
Test plan
test_empty_multi_rg_input— empty 2-RG fixture consolidates to 0 rowstest_multi_rg_consolidates_to_single_rg— 3-RG fixture → 1 RG, all pages atrg_idx == 0test_data_roundtrip_through_adapter— adapter exposes oracle row count + schematest_reencode_preserves_arrays_byte_equal— byte-equal data round trip throughreencode_as_single_row_group(added on top of agent's initial commit to close a value-corruption test gap)test_kv_metadata_preserved—qh.sort_fieldsandqh.window_start_secscarried throughtest_sorting_columns_preserved— RG0'ssorting_columnspreserved on the consolidated RGtest_dict_and_null_columns_preserved— data-pagenum_valuessums match per column for dict + nullable columnstest_io_failure_surfaces_as_io_error— buffered GET failure surfaces asLegacyAdapterError::Iocarrying the originalio::Errortest_satisfies_column_page_stream_trait— drains via&mut dyn ColumnPageStreamand confirms idempotent EOFCI gates locally green: clippy
--workspace --all-features --testswith-Dwarnings, nightlyfmt --check,cargo doc --no-deps,cargo machete, license headers, log format, typos. 9/9 adapter tests + 401/401 crate tests pass.🤖 Generated with Claude Code