parquet: optimize CachedArrayReader byte-array coalescing by ClSlaid · Pull Request #9743 · apache/arrow-rs

ClSlaid · 2026-04-16T17:00:00Z

When CachedArrayReader builds output from multiple cached batches, the old path materialized filtered byte arrays and then concatenated them. Replace that path for Utf8/Binary arrays with a direct coalescer that builds offsets, values, and validity in one output array, while keeping the existing generic MutableArrayData path for other types.
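To illustrate the idea of coalescing directly into one output array, here is a minimal std-only sketch. It is not the code in this PR and does not use the arrow-rs types: `ByteBatch`, `Coalesced`, and `coalesce_selected` are hypothetical names, and the per-batch `Vec<bool>` selection mask is an assumed stand-in for however the reader tracks which cached rows are selected.

```rust
/// Hypothetical stand-in for one cached Utf8/Binary batch (not an arrow-rs type).
struct ByteBatch {
    offsets: Vec<i32>,   // rows + 1 entries; row i spans offsets[i]..offsets[i + 1]
    values: Vec<u8>,     // concatenated bytes of all rows
    validity: Vec<bool>, // true = non-null
}

/// Coalesced output in the same offsets/values/validity layout.
#[derive(Debug, PartialEq)]
struct Coalesced {
    offsets: Vec<i32>,
    values: Vec<u8>,
    validity: Vec<bool>,
}

/// Copy the selected rows of every batch straight into one output,
/// without building a filtered array per batch and concatenating afterwards.
/// Assumes each mask has one entry per row of its batch and that the
/// total number of selected bytes fits in an i32 offset.
fn coalesce_selected(inputs: &[(ByteBatch, Vec<bool>)]) -> Coalesced {
    // Pass 1: count selected rows and bytes so each buffer is allocated once.
    let (mut rows, mut bytes) = (0usize, 0usize);
    for (batch, mask) in inputs {
        for (i, &keep) in mask.iter().enumerate() {
            if keep {
                rows += 1;
                if batch.validity[i] {
                    bytes += (batch.offsets[i + 1] - batch.offsets[i]) as usize;
                }
            }
        }
    }

    let mut out = Coalesced {
        offsets: Vec::with_capacity(rows + 1),
        values: Vec::with_capacity(bytes),
        validity: Vec::with_capacity(rows),
    };
    out.offsets.push(0);

    // Pass 2: append bytes, the new end offset, and the validity bit per selected row.
    for (batch, mask) in inputs {
        for (i, &keep) in mask.iter().enumerate() {
            if !keep {
                continue;
            }
            let valid = batch.validity[i];
            if valid {
                let start = batch.offsets[i] as usize;
                let end = batch.offsets[i + 1] as usize;
                out.values.extend_from_slice(&batch.values[start..end]);
            }
            // A null row contributes no bytes, so its offset repeats the previous one.
            out.validity.push(valid);
            out.offsets.push(out.values.len() as i32);
        }
    }
    out
}
```

In this sketch the counting pass trades an extra scan of the masks for a single allocation per buffer; whether the PR makes the same trade-off is not stated in the description.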

Add a dedicated CachedArrayReader benchmark and a sparse string regression test so this path is measured directly and covered independently of broader parquet reader benchmarks.

Benchmark vs main:

- cached_array_reader/utf8_sparse_cross_batch_4m_rows/consume_batch: 11.949 ms -> 4.153 ms (-65.2%)
- arrow_reader_clickbench/sync/Q24 (same filter/projection as ClickBench Q26): 28.377 ms -> 28.443 ms (+0.2%, no measurable change)

Which issue does this PR close?

Closes [parquet] reduce the time spent in CachedArrayReader #9060.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

When CachedArrayReader builds output from multiple cached batches, the old path materialized filtered byte arrays and then concatenated them. Replace that path for Utf8/Binary arrays with a direct coalescer that builds offsets, values, and validity in one output array, while keeping the existing generic MutableArrayData path for other types.

Add a dedicated CachedArrayReader benchmark and a sparse string regression test so this path is measured directly and covered independently of broader parquet reader benchmarks.

Benchmark vs main:
- cached_array_reader/utf8_sparse_cross_batch_4m_rows/consume_batch: 11.949 ms -> 4.153 ms (-65.2%)
- arrow_reader_clickbench/sync/Q24 (same filter/projection as ClickBench Q26): 28.377 ms -> 28.443 ms (+0.2%, no measurable change)

Signed-off-by: cl <cailue@apache.org>

ClSlaid · 2026-04-17T09:00:43Z

@alamb I tried optimizing this with GPT 5.4, but the improvement was not obvious in the original test case you gave, so I had it write a new benchmark and optimized against that.

However, I'm still not fully confident in the current implementation, so please take a look.

github-actions bot added the parquet (Changes to the parquet crate) label on Apr 16, 2026

ClSlaid wants to merge 1 commit into apache:main from ClSlaid:issue-9060-cached-array-reader-byte-coalescer
