-
Notifications
You must be signed in to change notification settings - Fork 1.4k
STRING_AGG missing functionality #14412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
59ac6aa
to
fb56b81
Compare
fb56b81
to
808e417
Compare
Close/reopen to rerun CI checks |
808e417
to
4abf8a0
Compare
@@ -5568,6 +5573,16 @@ SELECT STRING_AGG(x,',') FROM strings WHERE g > 100 | |||
---- | |||
NULL | |||
|
|||
query T | |||
SELECT STRING_AGG(DISTINCT x,',') FROM strings WHERE g > 100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: missing space between the x
and ','
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷♂️ It's the format that all the previous STRING_AGG tests were following, I kept it for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consistency always wins over correctness!
b36a9c5
to
c022671
Compare
e204059
to
f6be8e4
Compare
pub struct StringAgg { | ||
signature: Signature, | ||
array_agg: ArrayAgg, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I've seen this pattern much in DataFusion codebase about reusing another function as a building block. Any opinions about this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems good to me 👍
f6be8e4
to
babc94b
Compare
#[cfg(test)] | ||
mod tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm finding that having unit tests near the function definitions itself is significantly more ergonomic than writing sql logic tests for a couple of reasons:
- We can tests the accumulators in isolation, allowing for finer grained control about batch updating, state generation, merging different states from different accumulators, etc...
- The time it takes since a developer makes a code change, until the appropriate test is run is reduced significantly:
cargo test --lib string_agg::tests --manifest-path 1.40s user 1.53s system 132% cpu 2.204 total
VS
cargo test --test sqllogictests 33.61s user 7.17s system 365% cpu 11.164 total
Measured on a Mac M3
Not saying that we should not have sql logic tests, I think those are a must, but maybe having some testing tooling for folks to be able to contribute unit tests also here could improve DX.
I see that this is not an stablished pattern though, and I'm wondering what are people's take on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I think that we should strive that sqllogictests cover all the "end user" visible behavior -- since they function as integration style test and make sure the functionality is all hooked up together correctly (not just working in isolation)
Unit tests / in module tests are good to cover cases that are hard to cover in sqllogictests
I think there is some more back story here: https://datafusion.apache.org/contributor-guide/testing.html#sqllogictests-tests
In terms of cycle times, sqllogictests do have the benefit you can update them without any code changes (so writing / updating them is sometimes faster than code), though you are right that testing code changes requires a recompilation
@logan-keede has been working on improving the build performance recently. Hopefully this will get better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 it looks there's still some cards that can be played for making compilation times faster then. 👍 thanks for all that info!
@geoffreyclaude (who is helping review) when you think this PR is ready, can you please ping me and I'll give it a review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty complete to me, you've covered all my initial suggestions!
@alamb all good for me! You can give the final review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @gabotechs and @geoffreyclaude -- I think this implementation is very nice, clean and well tested. 🏆
pub struct StringAgg { | ||
signature: Signature, | ||
array_agg: ArrayAgg, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems good to me 👍
Which issue does this PR close?
distinct
andorder by
clause forstring_agg
aggregate function #8260.Rationale for this change
See #14413 first.
Complete the missing functionality of the STRING_AGG function.
What changes are included in this PR?
Adds support for DISTINCT and ORDER_BY clauses by reusing the existing ARRAY_AGG functionality and building the whole STRING_AGG aggregation function on top of it. This way, the full STRING_AGG functionality is automatically implemented [almost] for free.
The rationale for reusing the ARRAY_AGG functionality is because both functions are very similar, with just two minor diferences:
In order to have the full STRING_AGG functionality, some small addition is also needed for the ARRAY_AGG function, as the current implementation is missing support for DISTINCT + ORDER BY. See #14413.
Are these changes tested?
Yes, both in unit tests and sqllogictests.
Are there any user-facing changes?
Users will be able to issue STRING_AGG calls with DISTINCT and ORDER BY clauses.