Column projection pushdown for GROUP BY #162

Closed
poyrazK wants to merge 5 commits into main from column-projection-pushdown

poyrazK commented Jun 11, 2026

Owner

Summary

  • Column projection pushdown: scan reads only required columns for GROUP BY queries
  • Q1 GROUP BY: 161k -> 2.68M rows/s (16-17x speedup)
  • Gap vs DuckDB narrows from 385x to 21x (10k)

Commits

  1. Add read_batch(col_indices) overload declaration to ColumnarTable
  2. Implement projection-aware read_batch(col_indices) in ColumnarTable
  3. Add set_required_columns() to VectorizedSeqScanOperator for column projection
  4. Propagate required column indices from GROUP BY to scan in build_vectorized_plan
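
The overload from commits 1 and 2 is not shown in this conversation. As a rough illustration of the idea only, the sketch below uses an in-memory stand-in: `ToyColumnarTable`, `ToyBatch`, and the `int64_t`-only columns are assumptions made for this example, and the real `ColumnarTable` reads from disk with its own batch type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-ins, not the PR's types.
using Column = std::vector<int64_t>;

struct ToyBatch {
    std::vector<Column> columns;  // one entry per projected column
    std::size_t row_count = 0;
};

class ToyColumnarTable {
public:
    explicit ToyColumnarTable(std::vector<Column> columns)
        : columns_(std::move(columns)) {}

    std::size_t row_count() const {
        return columns_.empty() ? 0 : columns_[0].size();
    }

    // Copies rows [start_row, start_row + batch_size) of the requested
    // columns only, in the order given by col_indices.
    // Returns false when start_row is past the end of the table.
    bool read_batch(std::size_t start_row, std::size_t batch_size,
                    const std::vector<std::size_t>& col_indices,
                    ToyBatch& out) const {
        if (start_row >= row_count()) return false;
        const std::size_t end_row = std::min(start_row + batch_size, row_count());
        out.columns.clear();
        for (std::size_t idx : col_indices) {
            assert(idx < columns_.size() && "column index out of range");
            const Column& src = columns_[idx];
            out.columns.emplace_back(
                src.begin() + static_cast<std::ptrdiff_t>(start_row),
                src.begin() + static_cast<std::ptrdiff_t>(end_row));
        }
        out.row_count = end_row - start_row;
        return true;
    }

private:
    std::vector<Column> columns_;
};
```

Calling `read_batch(1, 2, {2, 0}, out)` on a three-column table touches only columns 2 and 0 and places them in that order in `out`; the untouched column is never copied, which is the saving the projection is after.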

Tests

  • All 39 vectorized_operator_tests pass
  • All 50 cloudSQL_tests pass

Summary by CodeRabbit

  • Refactor
    • Optimized vectorized query execution to selectively read only necessary columns during GROUP BY and aggregate operations, reducing disk I/O overhead and improving query performance.

coderabbitai Bot commented Jun 11, 2026


Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 93346df9-7408-48f0-971a-57312b6e803f

📥 Commits

Reviewing files that changed from the base of the PR and between e46a1f6 and 5cfb3a9.

📒 Files selected for processing (4)
  • include/executor/vectorized_operator.hpp
  • include/storage/columnar_table.hpp
  • src/executor/query_executor.cpp
  • src/storage/columnar_table.cpp

Walkthrough

This PR implements column projection for vectorized scans. The storage layer gains a selective read_batch overload accepting column indices. The scan operator conditionally uses this overload in sequential and parallel paths when required columns are configured. Query planning derives which columns are needed for GROUP BY aggregates and configures the scan operator accordingly.

Changes

Column Projection for Vectorized Scans

| Layer / File(s) | Summary |
| --- | --- |
| Storage layer selective column read interface: include/storage/columnar_table.hpp, src/storage/columnar_table.cpp | New read_batch overload accepts a list of column indices and reads only those columns from disk, appending them into a pre-initialized reduced-schema output batch in the subset's order. |
| Scan operator column filtering state and execution: include/executor/vectorized_operator.hpp | VectorizedSeqScanOperator stores required column indices and reduced output schema. Sequential scan conditionally initializes batches from reduced schema and calls column-selective read_batch. Parallel scan allocates per-thread batches from either full or reduced schema and submits tasks with either full or selective read_batch overload based on whether column filtering is active. |
| Query planning column requirement derivation: src/executor/query_executor.cpp | build_vectorized_plan casts the vectorized scan root to VectorizedSeqScanOperator and during GROUP BY/aggregate planning, derives required column indices from group-by and aggregate input expressions, constructs the output schema with aggregate columns, and calls set_required_columns on the scan operator to enforce column projection. |
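
The planning step described for build_vectorized_plan can be pictured as collecting every column referenced by the group-by keys and aggregate inputs, dropping duplicates, and recording where each column ends up in the reduced batch. The sketch below is a guess at that shape, not the PR's code: `ProjectionPlan`, `derive_required_columns`, and the representation of expressions as plain column indices are all hypothetical, and the index remap is an assumption about how downstream operators would address the narrower batch.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical result type for this example.
struct ProjectionPlan {
    // Original table column indices, in first-seen order.
    std::vector<std::size_t> required;
    // Original column index -> position inside the reduced batch.
    std::unordered_map<std::size_t, std::size_t> remap;
};

// Group-by keys are visited first, then aggregate inputs. A column that is
// referenced more than once is read only once.
ProjectionPlan derive_required_columns(
    const std::vector<std::size_t>& group_by_cols,
    const std::vector<std::size_t>& aggregate_input_cols,
    std::size_t table_column_count) {
    (void)table_column_count;  // only used by the assert below
    ProjectionPlan plan;
    auto add = [&](std::size_t col) {
        assert(col < table_column_count && "expression references an unknown column");
        // emplace reports whether the column was newly inserted.
        if (plan.remap.emplace(col, plan.required.size()).second) {
            plan.required.push_back(col);
        }
    };
    for (std::size_t c : group_by_cols) add(c);
    for (std::size_t c : aggregate_input_cols) add(c);
    return plan;
}
```

For a six-column table with a group-by on column 4 and aggregates over columns 1, 2, and 1 again, this yields `required = [4, 1, 2]`, so the scan would read three columns instead of six, and an aggregate that referred to original column 2 would look at position 2 of the reduced batch.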

Sequence Diagram

sequenceDiagram
  participant QueryExecutor
  participant VectorizedSeqScanOperator
  participant VectorBatch
  participant ColumnarTable
  QueryExecutor->>VectorizedSeqScanOperator: set_required_columns(col_indices)
  QueryExecutor->>VectorizedSeqScanOperator: next_batch()
  VectorizedSeqScanOperator->>VectorBatch: initialize from reduced_schema_
  VectorizedSeqScanOperator->>ColumnarTable: read_batch(start_row, batch_size, col_indices)
  ColumnarTable->>VectorBatch: append filtered columns
  VectorizedSeqScanOperator->>QueryExecutor: return reduced-width batch
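
A minimal sketch of the operator side of this sequence, assuming an in-memory table and treating an empty column list as "no projection". `ToyScan` and `Batch` are hypothetical names for this example; the actual VectorizedSeqScanOperator also has a parallel path and a reduced schema object that are not modeled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Column = std::vector<int64_t>;
using Batch = std::vector<Column>;  // hypothetical: one vector per output column

class ToyScan {
public:
    ToyScan(Batch table, std::size_t batch_size)
        : table_(std::move(table)), batch_size_(batch_size) {
        assert(batch_size_ > 0);
    }

    // An empty list means every column is read (the pre-projection behavior).
    void set_required_columns(std::vector<std::size_t> cols) {
        for (std::size_t c : cols) {
            assert(c < table_.size() && "column index out of range");
            (void)c;
        }
        required_ = std::move(cols);
    }

    // Fills `out` with the next slice of rows and advances the cursor.
    // Returns false, leaving `out` untouched, once the table is exhausted.
    bool next_batch(Batch& out) {
        const std::size_t rows = table_.empty() ? 0 : table_[0].size();
        if (next_row_ >= rows) return false;
        const std::size_t end = std::min(next_row_ + batch_size_, rows);
        out.clear();
        if (required_.empty()) {
            for (const Column& c : table_) out.push_back(slice(c, next_row_, end));
        } else {
            for (std::size_t idx : required_) {
                out.push_back(slice(table_[idx], next_row_, end));
            }
        }
        next_row_ = end;
        return true;
    }

private:
    static Column slice(const Column& c, std::size_t begin, std::size_t end) {
        return Column(c.begin() + static_cast<std::ptrdiff_t>(begin),
                      c.begin() + static_cast<std::ptrdiff_t>(end));
    }

    Batch table_;
    std::size_t batch_size_;
    std::vector<std::size_t> required_;
    std::size_t next_row_ = 0;
};
```

The point of keeping the fallback branch is that queries without a configured projection behave exactly as before; only plans that call `set_required_columns` receive reduced-width batches.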

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A scan so smart, it reads with care,
Only columns needed, nothing spare,
From storage vaults to operator's hand,
Projection flows across the land!
~Yours in vectorized efficiency 🌟


poyrazK commented Jun 11, 2026

Owner Author

Closed; using a v2 branch based on main instead.

poyrazK closed this Jun 11, 2026