Resolve MIN/MAX from Parquet metadata for Single-mode aggregates and CAST projections#21651
Resolve MIN/MAX from Parquet metadata for Single-mode aggregates and CAST projections#21651Dandandan wants to merge 7 commits intoapache:mainfrom
Conversation
…CAST projections
Two changes enable the AggregateStatistics optimizer to resolve
MIN/MAX (and COUNT) from file metadata instead of scanning data:
1. `take_optimizable` now handles `Single`/`SinglePartitioned`
aggregate modes, not just `Final` wrapping `Partial`. This matters
for single-partition scans where the planner skips the
Partial→Final split.
2. `project_statistics` now propagates min/max statistics through
CAST expressions (recursively), so that projections like
`CAST(CAST(col AS Int32) AS Date32)` preserve the underlying
column statistics instead of returning unknown.
Together these turn ClickBench Q6
(`SELECT MIN("EventDate"), MAX("EventDate") FROM hits`) from a full
column scan into a metadata-only lookup (PlaceholderRowExec).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
run benchmark clickbench_partitioned |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing aggregate-stats-single-mode-and-cast (38736c2) to 5c653be (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing aggregate-stats-single-mode-and-cast (aa1db03) to 5c653be (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing aggregate-stats-single-mode-and-cast (aa1db03) to 5c653be (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing aggregate-stats-single-mode-and-cast (aa1db03) to 5c653be (merge-base) diff using: tpch File an issue against this benchmark runner |
…T-specific code Replace the CAST-specific statistics propagation with a generic approach using PhysicalExpr::evaluate_bounds(). This works for any expression that implements evaluate_bounds — including CAST, negation, and arithmetic with literals. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aa1db03 to
077c595
Compare
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
The PhysicalExpr trait no longer exposes as_any() after upstream changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
alamb
left a comment
There was a problem hiding this comment.
Thanks @Dandandan - I think this transformation is not correct for some cases (see inlined comments)
…essions Interval arithmetic treats each column reference as an independent value, so applying evaluate_bounds to multi-column expressions (e.g. `a - b`) or to an expression with the same column referenced twice (e.g. `col * col`) produces intervals that are not valid min/max of the expression over the actual data. Folding MIN/MAX aggregates from those intervals — as the aggregate statistics optimizer does for Single-mode aggregates — returns incorrect results. Only propagate bounds when the expression contains at most one Column node in its tree. Also simplify the propagation helper now that exactness is determined by the single column's stats. Adds unit and sqllogictest coverage (the multi-column test from apache#21681 plus a same-column-twice variant). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds sqllogictest coverage for monotonic single-column single-occurrence projections that the aggregate-statistics optimizer now folds from parquet metadata: `-col`, `col + 1`, and `CAST(col AS BIGINT)`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
I believe it is now correct, I added a number of tests to check correctness. |
Which issue does this PR close?
Related to improving ClickBench performance (metadata-only query resolution)
Rationale for this change
ClickBench Q6 (
SELECT MIN("EventDate"), MAX("EventDate") FROM hits) was doing a full column scan despite the answer being available in Parquet row group statistics. Two issues prevented theAggregateStatisticsoptimizer from firing:take_optimizablemissedSinglemode — it only matched theFinal → Partialpair, not single-partition scans.project_statisticsreturnedunknownfor any non-Column/Literal expression, discarding Parquet min/max through casts likeCAST(CAST(EventDate AS Int32) AS Date32).This now avoids scanning any columns, going from ~6ms to ~1.5ms
What changes are included in this PR?
aggregate_statistics.rs:take_optimizablenow also matchesSingle/SinglePartitionedaggregates.projection.rs: Addedproject_column_statistics_through_expr()which propagates min/max statistics throughCastExpr.Result: Q6 now resolves entirely from Parquet metadata (zero I/O).
Are these changes tested?
Yes — existing tests pass, ClickBench sqllogictest updated with new expected plan for Q6.
Are there any user-facing changes?
Scalar
MIN/MAXaggregates over CAST projections now resolve from file metadata when statistics are available, avoiding unnecessary I/O.🤖 Generated with Claude Code