Initialize TopK from file / rowgroup / .. statistics

We could initialize the TopK threshold from column statistics (at least for single-column sorts), making the initial threshold much tighter based on min/max statistics at the file, row group, or page level:

* We have a file / row group with more than K rows (K from TopK)
* We have a single sort column (directly after the scan)
* We can then initialize or update the TopK threshold using the max (or min) statistics
* Also, if the new bound is tighter (smaller or bigger, depending on the sort direction) than the current TopK threshold, we could update it to the tighter bound

I think this might help make the initial threshold much tighter, instead of having to read all of the first row groups with an uninitialized TopK.
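
A minimal sketch of the idea above, not DataFusion's actual API: `RowGroupStats`, `SortOrder`, and `initial_threshold` are hypothetical names, the sort column is assumed to be a single `i64` column, and the row count is assumed to exclude nulls. The reasoning is that for an ascending sort, a row group with at least K rows and maximum `M` guarantees that the K-th smallest value is at most `M`, so `M` is a valid upper bound. The tightest bound is the smallest such maximum. For a descending sort, the same holds with the largest minimum.

```rust
/// Hypothetical per-row-group statistics for a single sort column.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: i64,
    max: i64,
    /// Assumption: number of non-null values of the sort column in this row group.
    non_null_rows: usize,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum SortOrder {
    Ascending,
    Descending,
}

/// Derives an initial TopK threshold from statistics alone.
///
/// Only row groups with at least `k` non-null rows can contribute a bound,
/// because only they guarantee `k` values on the right side of it.
/// Returns `None` when no row group qualifies (or when `k` is zero).
fn initial_threshold(stats: &[RowGroupStats], k: usize, order: SortOrder) -> Option<i64> {
    let qualifying = stats.iter().filter(|s| k > 0 && s.non_null_rows >= k);
    match order {
        // Ascending: values above the smallest qualifying max cannot be in the top K.
        SortOrder::Ascending => qualifying.map(|s| s.max).min(),
        // Descending: values below the largest qualifying min cannot be in the top K.
        SortOrder::Descending => qualifying.map(|s| s.min).max(),
    }
}
```

Taking the minimum (or maximum) across row groups corresponds to the last bullet: each new row group's statistics only replace the current bound when they are tighter. Rows equal to the bound may still belong to the result, so a filter built from it would need to be inclusive.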

_Originally posted by @Dandandan in https://github.com/apache/datafusion/issues/21580#issuecomment-4266553697_
            
