feat: implement GroupsAccumulator for count(DISTINCT) aggr #15324
base: main
Signed-off-by: Ruihang Xia <[email protected]>
Thank you @waynexia, I'm planning to check it out by tomorrow at the latest. One question in advance, before reviewing: have you considered implementing groups accumulators for the specialized cases of DistinctCountAccumulator (primitive/native types and bytes)? I'm asking because it looks a bit odd to me (though I haven't rechecked the performance results, and perhaps GroupsAccumulatorAdapter introduces some insane overhead) that switching from native Rust types to ScalarValue still gives 5x faster execution, while the groups accumulator in this case, if I'm not mistaken, does basically the same thing as GroupsAccumulatorAdapter: storing separate states (hash sets) in a vector is already implemented in the adapter.
Here are my thoughts on why it is 5x faster even though the implementations are very similar:
#[derive(Debug)]
pub struct DistinctCountGroupsAccumulator {
    /// One HashSet per group to track distinct values
    distinct_sets: Vec<HashSet<ScalarValue, RandomState>>,
I wonder if a single HashSet<(u64, ScalarValue), RandomState> (i.e. also indexing by group id rather than creating a new HashSet per group) might be faster? It would use less memory and, intuitively, should be more cache friendly.
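To make the two layouts concrete, here is a minimal sketch using only the standard library. It is not the PR's code: the function names are hypothetical, plain `i64` values stand in for `ScalarValue`, and `usize` group ids stand in for the `u64` in the suggestion. The single-set variant keeps a per-group counter so the result can be read without scanning the set.

```rust
use std::collections::HashSet;

/// Layout A: one HashSet per group, as in the struct under review.
/// (Hypothetical helper; `i64` stands in for `ScalarValue`.)
fn count_distinct_per_group_sets(
    group_ids: &[usize],
    values: &[i64],
    num_groups: usize,
) -> Vec<usize> {
    let mut sets: Vec<HashSet<i64>> = vec![HashSet::new(); num_groups];
    for (&group, &value) in group_ids.iter().zip(values) {
        sets[group].insert(value);
    }
    sets.iter().map(|set| set.len()).collect()
}

/// Layout B: a single HashSet keyed by (group id, value).
/// A counter per group is bumped only when the pair is new.
fn count_distinct_single_set(
    group_ids: &[usize],
    values: &[i64],
    num_groups: usize,
) -> Vec<usize> {
    let mut seen: HashSet<(usize, i64)> = HashSet::new();
    let mut counts = vec![0usize; num_groups];
    for (&group, &value) in group_ids.iter().zip(values) {
        // `insert` returns true only for a pair not seen before.
        if seen.insert((group, value)) {
            counts[group] += 1;
        }
    }
    counts
}
```

Both functions return the same counts; whether layout B is actually faster would have to be measured, since it trades many small tables for one large one.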
@@ -752,10 +761,245 @@ impl Accumulator for DistinctCountAccumulator {
    }
}

/// GroupsAccumulator for COUNT DISTINCT operations
#[derive(Debug)]
pub struct DistinctCountGroupsAccumulator {
As a follow-up, this could be specialized for types as well (e.g. PrimitiveDistinctCountGroupsAccumulator).
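A type-specialized variant could look roughly like the sketch below. This is an illustration under assumptions, not DataFusion's API: the struct name and method signatures are hypothetical, a slice of `Option<T>` stands in for an Arrow primitive array, and nulls are skipped because COUNT(DISTINCT) ignores them.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical accumulator generic over a native type `T`,
/// so values are hashed directly instead of going through an enum.
struct PrimitiveDistinctCount<T: Eq + Hash + Copy> {
    /// One set of native values per group.
    distinct_sets: Vec<HashSet<T>>,
}

impl<T: Eq + Hash + Copy> PrimitiveDistinctCount<T> {
    fn new() -> Self {
        Self { distinct_sets: Vec::new() }
    }

    /// `values[i]` belongs to group `group_indices[i]`; `None` is a null.
    fn update_batch(
        &mut self,
        values: &[Option<T>],
        group_indices: &[usize],
        total_num_groups: usize,
    ) {
        // Grow the per-group state when new groups appear.
        if self.distinct_sets.len() < total_num_groups {
            self.distinct_sets.resize_with(total_num_groups, HashSet::new);
        }
        for (value, &group) in values.iter().zip(group_indices) {
            if let Some(v) = value {
                self.distinct_sets[group].insert(*v);
            }
        }
    }

    /// One distinct count per group, in group order.
    fn evaluate(&self) -> Vec<i64> {
        self.distinct_sets.iter().map(|set| set.len() as i64).collect()
    }
}
```

The point of the type parameter is that the compiler monomorphizes the hashing and comparison for each native type, which is where a specialized accumulator would be expected to gain over a `ScalarValue`-based one.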
Also, using the HashTable API would probably give some further gains: https://docs.rs/hashbrown/latest/hashbrown/struct.HashTable.html
Looks really nice 🚀 gave some hints for further performance improvements
It would be worthwhile to run the
I am running the benchmarks now and will report back. BTW, if there is no existing coverage, we could perhaps add the query from this PR description to clickbench_extended.
🤔 My measurements show Q3 getting quite a bit slower. I will rerun to check.
Ahh, I reproduced the same result. I also observed a regression on q0:
(BTW, how do you get the comparison output?)
I am using
I'll look into the regression this weekend. I suspect the reason is that the improvement from grouping is much smaller than the improvement from the specialized non-group accumulators 🤔
Let's try to get no regressions in the extended benchmark first.
Quick update: made some progress, but I still need a few days to refine it. I've implemented a primitive aggregator and it does work.
I (am very excited!) just realized we may have overcomplicated things: we specialize on array types to compute hashes and store the values, but we need neither a dedicated hash function (wrapped as an xxx set in the previous implementation) nor the original values. We only need to do two things for
Thus I tried another way to rewrite this aggregator: use a uniform accumulator for all types. Each update does one dispatch to select the actual hash implementation (this dispatch can be eliminated by extracting a type parameter for the accumulator). The original values are thrown away and only the hashes are stored in the state. This not only saves memory but also performs well:
P.S. I switched to a different machine to run them.
Some follow-up things:
})
Ok(Box::new(DistinctCountAccumulator {
    values: HashSet::default(),
    random_state: RandomState::with_seeds(1, 2, 3, 4),
Since we only store hashes now, we need a fixed random state so that hashes are reproducible across different accumulators. But I don't know how to choose a proper set of seeds...
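The need for a fixed state can be illustrated with the standard library alone. This sketch does not use the `RandomState::with_seeds` call from the diff; it assumes `BuildHasherDefault<DefaultHasher>` as a stand-in for "a hasher builder with fixed keys", and `hash_with` is a hypothetical helper.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{BuildHasher, BuildHasherDefault, Hash, Hasher};

/// Hash one value with the given hasher builder (hypothetical helper).
fn hash_with<S: BuildHasher>(state: &S, value: &str) -> u64 {
    let mut hasher = state.build_hasher();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Two independently created fixed-key builders, standing in for the
/// states held by two different accumulators.
fn fixed_states() -> (BuildHasherDefault<DefaultHasher>, BuildHasherDefault<DefaultHasher>) {
    (BuildHasherDefault::default(), BuildHasherDefault::default())
}
```

With fixed keys, `hash_with(&a, "x")` and `hash_with(&b, "x")` agree for the two builders from `fixed_states()`, so hash sets produced by different accumulators can be merged. Two `std::collections::hash_map::RandomState::new()` instances carry randomly chosen keys, so the same value would generally hash differently under each and merged states would overcount.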
Hm... I think we can't store only the hashes, as that won't account for hash collisions (two different values mapping to the same hash). There is no seed that prevents that, since the number of possible values of e.g. a u64 is much smaller than the number of possible values of e.g. a 20-byte string.
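The undercount caused by collisions can be shown with a deliberately tiny hash. This is an illustrative sketch, not the PR's code: the hash is truncated to 8 bits (an assumption made only so that collisions are guaranteed at small input sizes), and all function names are hypothetical. With a 64-bit hash the effect is far rarer, but the same pigeonhole argument applies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// An intentionally weak 8-bit hash: at most 256 distinct outputs.
fn tiny_hash(value: &str) -> u8 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish() as u8 // keep only the low 8 bits
}

/// Exact distinct count: stores the values themselves.
fn exact_distinct(values: &[String]) -> usize {
    values.iter().collect::<HashSet<_>>().len()
}

/// Hash-only distinct count: stores only the hashes, so two
/// different values with the same hash are counted once.
fn hash_only_distinct(values: &[String]) -> usize {
    values
        .iter()
        .map(|v| tiny_hash(v))
        .collect::<HashSet<u8>>()
        .len()
}
```

For 1000 distinct strings, `exact_distinct` returns 1000 while `hash_only_distinct` can return at most 256, whatever seed or hash function is chosen.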
I think this is still a work in progress, so I'm marking it as a draft to clean up the review queue.
Which issue does this PR close?
Related to #5472
Rationale for this change
Implement a groups accumulator for the distinct count aggregate function. On the hits.parquet dataset from ClickBench, it gives a ~5x performance improvement for queries like select "RegionID", COUNT("UserID"), COUNT(DISTINCT "UserID") as u FROM hits GROUP BY "RegionID" ORDER BY u DESC LIMIT 10;
After:
Before:
For queries with only one distinct count (like q5 from ClickBench), the optimizer rule single_distinct_to_groupby rewrites the distinct column into a group-by column, which avoids the need for this groups accumulator. For queries that the rule does not cover, this groups accumulator can help a lot.
What changes are included in this PR?
Implement GroupsAccumulator for distinct count.
Are these changes tested?
yes
Are there any user-facing changes?
no