-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Introduce selection vector repartitioning #15423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This starts to look nice @goldmedal |
87a84f9
to
3a151dd
Compare
After rebasing to the main, the hash join no longer uses |
#15339 It looks like the join plan is being changed. |
also c.c. @zebsme |
You should be able to get the test back by also setting |
Thanks. It works. I also added the test for the normal hash partition for RepartitionExec. |
I'm working on HashAggregate goldmedal#3 based on this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution! @goldmedal
Makes sense to me 👍 |
# TODO: The selection vector partitioning should be used for the hash join. | ||
# After fix https://github.com/apache/datafusion/issues/15382 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't implement the planner for the hash join to avoid making this PR huge and complex. I think #15382 will implement the required parts.
Isn't it always better partitioning on this selection vectors in case of hash-rep 🤔 What is the reason of keeping the old strategy ? |
I think to support this selection vector, the executors need to be updated to interpret an additional metadata column. However, since executors are part of the public interface that some downstream projects might depend on directly, it's better to ensure backward compatibility. |
I agree with @2010YOUY01. It would be a breaking change for |
The CI failure isn't related to this PR. It could be fixed by #15493 |
b28bf32
to
c6b33da
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces selection vector repartitioning to enable efficient hash partitioning without data copying. Key changes include:
- Adding a configuration option (prefer_hash_selection_vector_partitioning_agg) to control partitioning mode.
- Introducing a new physical hash partitioning variant (HashSelectionVector) and updating the related enums and expressions.
- Updating repartitioning, join, aggregate, optimizer and test code to support the new repartitioning mode.
Reviewed Changes
Copilot reviewed 64 out of 64 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
datafusion/physical-plan/src/windows/window_agg_exec.rs | Updated distribution logic to include the new HashPartitionMode parameter. |
datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs | Updated repartitioning to use the new HashPartitionMode parameter. |
datafusion/physical-plan/src/repartition/mod.rs | Added support for HashSelectionVector in repartitioning and updated tests. |
datafusion/physical-plan/src/joins/* | Updated join repartitioning to propagate the HashPartitionMode parameter. |
datafusion/physical-plan/src/aggregates/* and datafusion/physical-plan/src/aggregates/mod.rs | Adjusted aggregate mode pattern matches to account for the new mode parameter. |
datafusion/physical-optimizer/src/* | Updated optimizer and output requirements handling to include the new mode. |
datafusion/physical-expr/src/partitioning.rs | Enhanced display and partition count functions to support HashSelectionVector. |
datafusion/core/* | Updated tests and planner behavior to reflect the new repartitioning mechanism. |
datafusion/common/src/config.rs | Added new config flag to control hash selection vector partitioning in aggregates. |
Comments suppressed due to low confidence (1)
datafusion/physical-plan/src/repartition/mod.rs:391
- Using unwrap() here may lead to a panic if RecordBatch creation fails. Consider propagating the error using the '?' operator or handling the error explicitly.
let batch = RecordBatch::try_new_with_options(Arc::clone(&schema), columns, &options).unwrap();
I did the benchmarks for HashAggregate #15383 (comment) |
c6b33da
to
d10ec22
Compare
Which issue does this PR close?
Rationale for this change
It's a pre-work of #15382 and #15383.
What changes are included in this PR?
datafusion.optimizer.prefer_hash_selection_vector_partitioning
RepartitionExec
HashPartitionMode
to decide the hash partition behavior when planning physical plans.Are these changes tested?
Are there any user-facing changes?