Add strategy-focused InList benchmarks by geoffreyclaude · Pull Request #21648 · apache/datafusion

geoffreyclaude · 2026-04-15T18:08:21Z

Which issue does this PR close?

Part of #19241 (Further improve performance of IN list evaluation).
This PR was originally proposed as the first commit of the broader IN LIST optimization series in #19390 (IN LIST optims).

Rationale for this change

IN LIST has become the target of several specialized execution strategies, but the existing benchmark coverage in datafusion/physical-expr/benches/in_list.rs is mostly end-to-end and historical in nature. That broad coverage is useful for regression tracking, but it is not ideal for answering more focused questions such as:

- how a specific strategy behaves as the list size crosses a threshold
- whether the fast path helps both hit-heavy and miss-heavy workloads
- how null handling affects the strategy-specific implementations
- how two-stage string filters behave in their worst-case collision patterns
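
To make the last question concrete, here is a minimal, std-only sketch of what a worst-case collision pattern could look like for a two-stage string filter. It assumes a hypothetical design in which stage 1 compares a fixed 4-byte prefix and stage 2 falls back to a full comparison. The names `prefix_key`, `two_stage_contains`, and `colliding_inputs` are illustrative assumptions and are not taken from DataFusion or from this PR's benchmark file.

```rust
use std::collections::HashSet;

/// Hypothetical stage-1 key: the first 4 bytes of the string, zero-padded.
fn prefix_key(s: &str) -> [u8; 4] {
    let mut key = [0u8; 4];
    let bytes = s.as_bytes();
    let n = bytes.len().min(4);
    key[..n].copy_from_slice(&bytes[..n]);
    key
}

/// Two-stage membership test: cheap prefix check first, full comparison second.
/// Returns (is_member, stage2_was_needed).
fn two_stage_contains(
    prefixes: &HashSet<[u8; 4]>,
    full: &HashSet<&str>,
    needle: &str,
) -> (bool, bool) {
    if !prefixes.contains(&prefix_key(needle)) {
        return (false, false); // rejected by stage 1 alone
    }
    (full.contains(needle), true)
}

/// Worst-case probes: every value shares the given prefix but differs in its tail.
fn colliding_inputs(prefix: &str, n: usize) -> Vec<String> {
    (0..n).map(|i| format!("{prefix}_miss_{i}")).collect()
}

fn main() {
    let list = ["abcd_0", "abcd_1", "abcd_2"];
    let prefixes: HashSet<[u8; 4]> = list.iter().map(|s| prefix_key(s)).collect();
    let full: HashSet<&str> = list.iter().copied().collect();

    // Every probe passes stage 1 and then misses in stage 2.
    let probes = colliding_inputs("abcd", 1000);
    let reached_stage2 = probes
        .iter()
        .filter(|p| two_stage_contains(&prefixes, &full, p.as_str()).1)
        .count();
    println!("{reached_stage2} of {} probes reached stage 2", probes.len());
}
```

Inputs built this way defeat the cheap first stage on every row, which is the kind of pattern a stress benchmark would want to isolate from the common case where most misses are rejected early.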

This PR adds a dedicated strategy benchmark harness for IN LIST so future performance work can be evaluated against a stable, repeatable, strategy-focused corpus.
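
As a rough illustration of what "strategy-focused" means here, the sketch below sweeps list size and hit rate over two simple membership strategies (linear scan and hash lookup). It is an assumption-laden stand-in: the actual harness uses Criterion and Arrow arrays, whereas this version uses only the standard library, `std::time::Instant`, and a tiny hand-written generator so that it stays self-contained. The names `Lcg`, `make_probes`, `count_linear`, and `count_hashed` are hypothetical.

```rust
use std::collections::HashSet;
use std::hint::black_box;
use std::time::Instant;

/// Tiny deterministic generator (an LCG) so runs are repeatable without extra crates.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Builds `rows` probe values; roughly `hit_pct` percent are drawn from `list`,
/// the rest are odd numbers. Callers pass a list of even numbers only,
/// so every odd probe is a guaranteed miss.
fn make_probes(list: &[i64], rows: usize, hit_pct: u64, rng: &mut Lcg) -> Vec<i64> {
    (0..rows)
        .map(|_| {
            if rng.next_u64() % 100 < hit_pct {
                list[(rng.next_u64() as usize) % list.len()]
            } else {
                (rng.next_u64() as i64) * 2 + 1
            }
        })
        .collect()
}

/// Strategy 1: linear scan over the list for every probe.
fn count_linear(list: &[i64], probes: &[i64]) -> usize {
    probes.iter().filter(|&&p| list.contains(&p)).count()
}

/// Strategy 2: hash lookup for every probe.
fn count_hashed(set: &HashSet<i64>, probes: &[i64]) -> usize {
    probes.iter().filter(|&&p| set.contains(&p)).count()
}

fn main() {
    let mut rng = Lcg(42);
    for list_size in [4usize, 32, 256] {
        let list: Vec<i64> = (0..list_size as i64).map(|v| v * 2).collect();
        let set: HashSet<i64> = list.iter().copied().collect();
        for hit_pct in [5u64, 95] {
            let probes = make_probes(&list, 100_000, hit_pct, &mut rng);

            let t = Instant::now();
            let a = black_box(count_linear(&list, black_box(&probes)));
            let linear = t.elapsed();

            let t = Instant::now();
            let b = black_box(count_hashed(&set, black_box(&probes)));
            let hashed = t.elapsed();

            assert_eq!(a, b); // both strategies must agree on the result
            println!("list={list_size:>3} hit%={hit_pct:>2} linear={linear:?} hashed={hashed:?}");
        }
    }
}
```

Single-shot `Instant` timings like these are noisy; a statistical harness such as Criterion is the better tool for real measurements. The point of the sketch is only the shape of the sweep: one axis for list size, one for hit rate, and the same probe data fed to each strategy.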

This PR does not change InList execution behavior. It only adds benchmark coverage.

What changes are included in this PR?

- Adds datafusion/physical-expr/benches/in_list_strategy.rs
- Registers the new benchmark target in datafusion/physical-expr/Cargo.toml
- Keeps the existing benches/in_list.rs benchmark suite intact for broader historical comparison
- Adds targeted Criterion coverage for the main IN LIST strategy families, including:
  - bitmap-style paths for narrow integer cases
  - branchless and hash/probe-style paths for primitive values at different list-size thresholds
  - reinterpretation-heavy cases such as floats and timestamps
  - string and string-view cases, including stage-2 / prefix-collision stress inputs
  - null-heavy and NOT IN scenarios
  - dictionary and fixed-size-binary coverage used by the broader IN LIST implementation
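
For readers less familiar with why null-heavy and NOT IN scenarios deserve their own cases, the sketch below spells out the standard SQL three-valued semantics that any IN LIST strategy has to preserve. It is a plain scalar model with `None` standing in for NULL, not DataFusion's vectorized implementation; `sql_in` and `sql_not_in` are hypothetical helper names.

```rust
/// SQL `value IN (list)` under three-valued logic; `None` stands for NULL.
fn sql_in(value: Option<i32>, list: &[Option<i32>]) -> Option<bool> {
    let v = value?; // a NULL probe yields NULL
    if list.iter().any(|item| *item == Some(v)) {
        return Some(true);
    }
    if list.iter().any(|item| item.is_none()) {
        None // no match, but a NULL in the list makes the answer unknown
    } else {
        Some(false)
    }
}

/// `NOT IN` negates a known answer and leaves NULL as NULL.
fn sql_not_in(value: Option<i32>, list: &[Option<i32>]) -> Option<bool> {
    sql_in(value, list).map(|b| !b)
}

fn main() {
    let list = [Some(1), Some(2), None];
    assert_eq!(sql_in(Some(1), &list), Some(true));
    assert_eq!(sql_in(Some(3), &list), None);
    assert_eq!(sql_in(None, &list), None);
    assert_eq!(sql_not_in(Some(1), &list), Some(false));
    assert_eq!(sql_not_in(Some(3), &list), None);
    assert_eq!(sql_in(Some(3), &[Some(1), Some(2)]), Some(false));
}
```

A miss against a list that contains NULL produces NULL rather than FALSE, so a fast path tuned only on non-null data can look correct and fast while taking a different branch on null-heavy inputs.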

Are these changes tested?

Yes. I validated this PR with:

cargo fmt --all
cargo clippy -p datafusion-physical-expr --all-targets --all-features -- -D warnings
cargo test -p datafusion-physical-expr in_list --lib

Are there any user-facing changes?

No user-facing changes. This PR only adds benchmark coverage for development and performance evaluation.

adriangb

The comments refer to optimizations which don't exist. Can we maybe reword this to be more general or refer to these as "cases" and things like "short strings, large list", "large strings, short list", etc.?

CI failures are unrelated.

geoffreyclaude · 2026-04-15T20:02:20Z

> The comments refer to optimizations which don't exist. Can we maybe reword this to be more general or refer to these as "cases" and things like "short strings, large list", "large strings, short list", etc.?
>
> CI failures are unrelated.

Good remark. I've reworked the comments and naming in a new commit to avoid referring to the still-hypothetical optimizations.

Add a new in_list_strategy benchmark file with targeted coverage of each optimization strategy, without replacing the existing in_list benchmarks which are kept intact for historical comparison. (cherry picked from commit d6e645d)

adriangb · 2026-04-15T22:03:22Z

Thanks! I've rebased and plan to merge this once CI passes

adriangb · 2026-04-16T18:34:02Z

Thanks @geoffreyclaude !

github-actions bot added the physical-expr label (Changes to the physical-expr crates) Apr 15, 2026

geoffreyclaude mentioned this pull request Apr 15, 2026

IN LIST optims #19390

Open

adriangb approved these changes Apr 15, 2026

geoffreyclaude force-pushed the perf/in_list_benchmarks branch 2 times, most recently from cffffd5 to 78c9d26, April 15, 2026 19:36

geoffreyclaude and others added 2 commits April 15, 2026 17:03

Add strategy-focused InList benchmarks

d512064

Clarify InList benchmark case descriptions

fa93228

adriangb force-pushed the perf/in_list_benchmarks branch from 78c9d26 to fa93228, April 15, 2026 22:03

adriangb added this pull request to the merge queue Apr 16, 2026

Merged via the queue into apache:main with commit ef9a80c Apr 16, 2026
38 checks passed

adriangb merged 2 commits into apache:main from geoffreyclaude:perf/in_list_benchmarks
