
Dynamic pruning filters from TopK state #15037

Open
adriangb opened this issue Mar 5, 2025 · 8 comments · May be fixed by #15301
Labels: enhancement (New feature or request)

Comments

@adriangb
Contributor

adriangb commented Mar 5, 2025

Is your feature request related to a problem or challenge?

From a discussion with @alamb yesterday, the idea came up of optimizing queries like `select * from data order by timestamp desc limit 10` for the case where the data is not perfectly sorted by timestamp but mostly follows a sorted pattern.
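
For context, this is roughly how the scenario looks with DataFusion's Rust API (a minimal sketch; the `./data/` path and the `data` table name are just placeholders):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register a directory of Parquet files as a single table.
    ctx.register_parquet("data", "./data/", ParquetReadOptions::default())
        .await?;
    // Today this reads the timestamp column of every file and feeds it into
    // the TopK operator; the idea here is to prune whole files instead.
    ctx.sql("SELECT * FROM data ORDER BY timestamp DESC LIMIT 10")
        .await?
        .show()
        .await?;
    Ok(())
}
```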

You can imagine this data getting created when multiple sources with clock skews, network delays, etc. are writing data and you don't do anything fancy to guarantee perfect sorting by timestamp (i.e. you naively write the data out to Parquet, maybe do some compaction, etc.). The point is that 99% of yesterday's files have timestamps smaller than 99% of today's files, but there may be a couple of seconds of overlap between files. To be concrete, let's say this is our data:

| file | min | max |
|------|-----|-----|
| 1    | 1   | 10  |
| 2    | 9   | 19  |
| 3    | 20  | 31  |
| 4    | 30  | 35  |

Currently DataFusion will exhaustively open each file, read the timestamp column and feed it into a TopK.
I think we can do a lot better if we:

  • Use file stats to decide which files to work on first. In this case it makes sense to start with files 4 and 3 (assuming we have a parallelism of 2).
  • Let's say that between those two we have 10 rows, so we've already filled up our TopK. The only way more rows can get added to our TopK is if they are greater than the smallest item already seen (let's say that's 20, the smallest value in file 3).
  • Now we know from statistics alone that we can skip files 2 and 1, because neither of them can contain any timestamp > 20 (a sketch of this pruning rule follows the list).
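
A minimal sketch of that pruning rule, assuming per-file min/max statistics for the sort column; `FileStats` and `top_k_threshold` are made-up names for illustration, not DataFusion code:

```rust
// Per-file statistics, mirroring the table above.
#[allow(dead_code)]
struct FileStats {
    name: &'static str,
    min: i64, // unused by the rule below; kept to mirror the stats table
    max: i64,
}

fn main() {
    let mut files = vec![
        FileStats { name: "1", min: 1, max: 10 },
        FileStats { name: "2", min: 9, max: 19 },
        FileStats { name: "3", min: 20, max: 31 },
        FileStats { name: "4", min: 30, max: 35 },
    ];

    // Step 1: visit files with the largest max first (descending sort).
    files.sort_by_key(|f| std::cmp::Reverse(f.max));

    // Step 2: suppose files 4 and 3 filled the heap with 10 rows; the
    // smallest value retained in the heap becomes the dynamic threshold.
    let top_k_threshold: i64 = 20;

    // Step 3: a remaining file whose max is <= the threshold cannot contain
    // any timestamp that would displace a heap entry, so skip it entirely.
    for f in &files {
        if f.max <= top_k_threshold {
            println!("skip file {} (max {} <= {})", f.name, f.max, top_k_threshold);
        } else {
            println!("scan file {} (max {} > {})", f.name, f.max, top_k_threshold);
        }
    }
}
```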

Extrapolating this to scenarios where you have years' worth / TBs of data and want a `limit 5`, I think this would yield orders-of-magnitude improvements.

@alamb mentioned this sounds similar to Dynamic Filters. I assume this must be a known technique (or my analysis may be completely wrong 😆), but I don't know what it would be called.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

adriangb added the enhancement (New feature or request) label Mar 5, 2025
@alamb
Contributor

alamb commented Mar 5, 2025

> @alamb mentioned this sounds similar to Dynamic Filters. I assume this must be a known technique (or my analysis may be completely wrong 😆), but I don't know what it would be called.

There was a talk at CIDR this year that mentioned this:

Sponsor Talk 3: The Fine Art of Work Skipping
Stefan Mandl, Snowflake

It seems they wrote a blog about it too here: https://www.snowflake.com/en/engineering-blog/optimizing-top-k-aggregation-snowflake/

@adriangb
Contributor Author

adriangb commented Mar 5, 2025

Nice to know I'm not totally off on the idea 😄

@alamb
Contributor

alamb commented Mar 5, 2025

> Nice to know I'm not totally off on the idea 😄

Not at all!

@alamb
Contributor

alamb commented Mar 12, 2025

BTW I am pretty sure DuckDB is using this technique, and that it is why they are so much faster on ClickBench Q23.

@adriangb
Contributor Author

Does anyone have a handle on how we might implement this? I was thinking we’d need to add a method to exec operators called apply_filter that basically sends down the additional filter; by default it gets forwarded to children until it hits an exec that knows what to do with it (e.g. DataSourceExec). But I’m not very clear beyond that. Very roughly, something like the sketch below.
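
A very hand-wavy sketch of the `apply_filter` idea; all of these traits and types are made up for illustration and are not DataFusion's actual ExecutionPlan API:

```rust
use std::sync::Arc;

/// Stand-in for a dynamic filter expression pushed down at runtime.
#[derive(Clone)]
struct DynamicFilter {
    description: String,
}

trait ExecNode {
    fn children(&self) -> Vec<Arc<dyn ExecNode>>;

    /// Default: operators that don't understand the filter just forward it.
    fn apply_filter(&self, filter: &DynamicFilter) {
        for child in self.children() {
            child.apply_filter(filter);
        }
    }
}

/// A scan-like leaf (think DataSourceExec) that knows what to do with it.
struct DataSource;

impl ExecNode for DataSource {
    fn children(&self) -> Vec<Arc<dyn ExecNode>> {
        vec![]
    }

    fn apply_filter(&self, filter: &DynamicFilter) {
        // A real scan would use this to prune files / row groups.
        println!("scan received dynamic filter: {}", filter.description);
    }
}

fn main() {
    let scan = DataSource;
    scan.apply_filter(&DynamicFilter { description: "timestamp > 20".to_string() });
}
```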

@alamb
Contributor

alamb commented Mar 18, 2025

> Does anyone have a handle on how we might implement this? I was thinking we’d need to add a method to exec operators called apply_filter that basically sends down the additional filter; by default it gets forwarded to children until it hits an exec that knows what to do with it (e.g. DataSourceExec). But I’m not very clear beyond that.

To begin with I would suggest:

  1. Make a new PhysicalExpr named something like TopKRuntimeFilter
  2. Add a physical optimizer pass that runs after all other passes (so the plan structure doesn't change) and that finds TopK nodes and tries to find the scans connected to them (start with some basic rules; don't try to go past joins, etc.)
  3. Add TopKRuntimeFilter to those scans

Then the trick will be to figure out how to share the TopKHeap created in the TopK operator with the TopKRuntimeFilter, and then to orchestrate concurrent access to it somehow. A sketch of one possible shape for that sharing is below.
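
Illustration only: a minimal way the shared state could work, assuming a descending sort on a single i64 column and an `Arc<RwLock<...>>` for the concurrent access (SharedThreshold and its methods are made-up names, not the actual TopKRuntimeFilter):

```rust
use std::sync::{Arc, RwLock};

/// State written by the TopK operator and read by the scan's filter.
#[derive(Default)]
struct SharedThreshold {
    /// Smallest value currently in the heap, once the heap is full.
    min_in_heap: RwLock<Option<i64>>,
}

impl SharedThreshold {
    /// Called by the TopK operator whenever its heap changes.
    fn update(&self, new_min: i64) {
        *self.min_in_heap.write().unwrap() = Some(new_min);
    }

    /// Called by the scan: can a file with this max value be skipped?
    fn can_skip(&self, file_max: i64) -> bool {
        match *self.min_in_heap.read().unwrap() {
            Some(threshold) => file_max <= threshold,
            None => false, // heap not full yet: every file must be read
        }
    }
}

fn main() {
    let shared = Arc::new(SharedThreshold::default());

    // TopK fills up after files 4 and 3; the smallest retained value is 20.
    shared.update(20);

    // The scan consults the shared state before opening each remaining file.
    assert!(shared.can_skip(19)); // file 2 can be pruned
    assert!(shared.can_skip(10)); // file 1 can be pruned
    assert!(!shared.can_skip(31)); // a file with max 31 must still be scanned
}
```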

adriangb added a commit to pydantic/datafusion that referenced this issue Mar 19, 2025
adriangb linked a pull request Mar 19, 2025 that will close this issue
@adriangb
Contributor Author

@alamb I implemented something like that in #15301

@alamb
Contributor

alamb commented Mar 20, 2025

Thanks @adriangb -- I will try to review it asap (hopefully tomorrow afternoon)

adriangb added a commit to pydantic/datafusion that referenced this issue Mar 20, 2025