Initialize TopK from file / rowgroup / .. statistics #21691

@Dandandan

We could initialize the TopK from column statistics (at least for a single sort column) and make the initial threshold much tighter based on min/max statistics at the file / row group / page level:

  • We have a file / row group with more than K rows (K from TopK)
  • We have a single sort column (directly after the scan)
  • We can initialize/update the TopK using the max (or min) statistics
  • Also, if the new bound is tighter (smaller or larger, depending on the sort direction) than the current TopK bound, we could update it to the tighter bound

I think this could make the initial threshold much tighter, instead of having to read the first row groups in full with an uninitialized TopK.
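
To make the proposal concrete, here is a minimal sketch in plain Rust of how a threshold could be derived from min/max statistics. It is not DataFusion code: `ColumnStats`, `SortOrder`, `bound_from_stats`, `tighten`, and `initial_threshold` are hypothetical names, the sort column is assumed to be a single `i64` column, and nulls are assumed never to rank ahead of non-null values. The sketch accepts a container with at least K non-null rows, since that already guarantees the bound.

```rust
/// Hypothetical min/max statistics for one container
/// (a file, a row group, or a page). Not a DataFusion type.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: i64,
    max: i64,
    /// Number of non-null rows in the container.
    non_null_rows: usize,
}

/// Direction of the single TopK sort column.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SortOrder {
    Ascending,
    Descending,
}

/// Bound implied by a single container, if it holds at least `k` non-null rows.
///
/// Ascending: the k-th smallest value overall cannot exceed this container's max.
/// Descending: the k-th largest value overall cannot be below this container's min.
fn bound_from_stats(stats: &ColumnStats, k: usize, order: SortOrder) -> Option<i64> {
    if k == 0 || stats.non_null_rows < k {
        return None;
    }
    Some(match order {
        SortOrder::Ascending => stats.max,
        SortOrder::Descending => stats.min,
    })
}

/// Keep whichever bound is tighter for the given sort direction.
fn tighten(current: Option<i64>, candidate: Option<i64>, order: SortOrder) -> Option<i64> {
    match (current, candidate) {
        (Some(cur), Some(cand)) => Some(match order {
            SortOrder::Ascending => cur.min(cand),
            SortOrder::Descending => cur.max(cand),
        }),
        // At most one side is set: take whichever exists.
        (cur, cand) => cur.or(cand),
    }
}

/// Fold all containers into the tightest threshold known before reading any rows.
fn initial_threshold(containers: &[ColumnStats], k: usize, order: SortOrder) -> Option<i64> {
    containers
        .iter()
        .fold(None, |acc, s| tighten(acc, bound_from_stats(s, k, order), order))
}
```

For example, with K = 10 and an ascending sort, two containers with 1000 and 500 non-null rows and max values of 100 and 40 yield an initial threshold of 40, so rows with a sort value above 40 could be skipped from the start. A container with fewer than K rows contributes no bound.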

Originally posted by @Dandandan in #21580 (comment)

Labels: performance (Make DataFusion faster)