STRING_AGG missing functionality #14412


Merged: 1 commit, Apr 15, 2025

Conversation

gabotechs
Contributor

@gabotechs gabotechs commented Feb 2, 2025

Which issue does this PR close?

Rationale for this change

See #14413 first.

Complete the missing functionality of the STRING_AGG function.

What changes are included in this PR?

Adds support for DISTINCT and ORDER BY clauses by reusing the existing ARRAY_AGG functionality and building the whole STRING_AGG aggregation function on top of it. This way, the full STRING_AGG functionality is automatically implemented [almost] for free.

The rationale for reusing the ARRAY_AGG functionality is that both functions are very similar, with just two minor differences:

  • STRING_AGG works only with strings, while ARRAY_AGG works with any type.
  • STRING_AGG returns the same values as ARRAY_AGG, but with the resulting array of strings joined by a delimiter.

To have the full STRING_AGG functionality, a small addition is also needed in the ARRAY_AGG function, as its current implementation is missing support for DISTINCT + ORDER BY. See #14413.
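
To make the relationship concrete, here is a minimal sketch of the idea in plain Rust. It is not the DataFusion implementation: `array_agg` and `string_agg` below are hypothetical free functions over string slices, the `distinct` and `order_by` flags stand in for the SQL clauses, and skipping NULL inputs is an assumption of this sketch.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for ARRAY_AGG restricted to strings.
/// Assumption: NULL inputs (`None`) are skipped and an empty result is NULL.
fn array_agg(values: &[Option<&str>], distinct: bool, order_by: bool) -> Option<Vec<String>> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut out: Vec<String> = Vec::new();
    for v in values.iter().flatten() {
        // With DISTINCT, keep only the first occurrence of each value.
        if !distinct || seen.insert(*v) {
            out.push((*v).to_string());
        }
    }
    if order_by {
        // Simplification: ORDER BY the aggregated value itself, ascending.
        out.sort();
    }
    if out.is_empty() {
        None
    } else {
        Some(out)
    }
}

/// STRING_AGG expressed as "ARRAY_AGG, then join with a delimiter".
fn string_agg(
    values: &[Option<&str>],
    delimiter: &str,
    distinct: bool,
    order_by: bool,
) -> Option<String> {
    array_agg(values, distinct, order_by).map(|items| items.join(delimiter))
}
```

In this sketch DISTINCT and ORDER BY live entirely in `array_agg`, so `string_agg` gets them without any extra logic of its own.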

Are these changes tested?

Yes, both in unit tests and sqllogictests.

Are there any user-facing changes?

Users will be able to issue STRING_AGG calls with DISTINCT and ORDER BY clauses.

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Feb 2, 2025
@gabotechs gabotechs force-pushed the string-agg-missing-functionality branch from 59ac6aa to fb56b81 Compare February 2, 2025 17:34
@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Feb 2, 2025
@gabotechs gabotechs force-pushed the string-agg-missing-functionality branch from fb56b81 to 808e417 Compare February 2, 2025 17:37
@gabotechs gabotechs changed the title String agg missing functionality STRING_AGG missing functionality Feb 3, 2025
@alamb
Contributor

alamb commented Feb 4, 2025

Close/reopen to rerun CI checks

@@ -5568,6 +5573,16 @@ SELECT STRING_AGG(x,',') FROM strings WHERE g > 100
----
NULL

query T
SELECT STRING_AGG(DISTINCT x,',') FROM strings WHERE g > 100
Contributor

nit: missing space between the x and ','.

Contributor Author

@gabotechs gabotechs Feb 20, 2025

🤷‍♂️ It's the format that all the previous STRING_AGG tests were following, I kept it for consistency

Contributor

consistency always wins over correctness!

@gabotechs gabotechs force-pushed the string-agg-missing-functionality branch 2 times, most recently from b36a9c5 to c022671 Compare February 20, 2025 16:42
@gabotechs gabotechs force-pushed the string-agg-missing-functionality branch 4 times, most recently from e204059 to f6be8e4 Compare April 3, 2025 07:55
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 3, 2025
Comment on lines 84 to 87
pub struct StringAgg {
signature: Signature,
array_agg: ArrayAgg,
}
Contributor Author

I don't think I've seen this pattern of reusing another function as a building block much in the DataFusion codebase. Any opinions about this approach?

Contributor

seems good to me 👍
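
For readers unfamiliar with the pattern under discussion, the following is a rough sketch of one function type holding another and delegating to it. Every type here is a simplified, hypothetical stand-in: the real `StringAgg` also carries a `signature: Signature` field, and the real accumulators work on Arrow arrays rather than `Option<&str>`.

```rust
/// Hypothetical, simplified ARRAY_AGG accumulator: collects non-NULL strings.
#[derive(Default)]
struct ArrayAggAccumulator {
    values: Vec<String>,
}

impl ArrayAggAccumulator {
    fn update(&mut self, value: Option<&str>) {
        if let Some(v) = value {
            self.values.push(v.to_string());
        }
    }

    fn evaluate(&self) -> Vec<String> {
        self.values.clone()
    }
}

/// Hypothetical stand-in for the ARRAY_AGG function definition.
struct ArrayAgg;

impl ArrayAgg {
    fn accumulator(&self) -> ArrayAggAccumulator {
        ArrayAggAccumulator::default()
    }
}

/// STRING_AGG owns an ARRAY_AGG and builds on top of it.
struct StringAgg {
    array_agg: ArrayAgg,
}

/// Wraps the inner accumulator and only adds the final join step.
struct StringAggAccumulator {
    inner: ArrayAggAccumulator,
    delimiter: String,
}

impl StringAgg {
    fn new() -> Self {
        Self { array_agg: ArrayAgg }
    }

    fn accumulator(&self, delimiter: &str) -> StringAggAccumulator {
        StringAggAccumulator {
            inner: self.array_agg.accumulator(),
            delimiter: delimiter.to_string(),
        }
    }
}

impl StringAggAccumulator {
    fn update(&mut self, value: Option<&str>) {
        // Pure delegation: all collection logic stays in ARRAY_AGG.
        self.inner.update(value);
    }

    fn evaluate(&self) -> Option<String> {
        let items = self.inner.evaluate();
        if items.is_empty() {
            None
        } else {
            Some(items.join(&self.delimiter))
        }
    }
}
```

The appeal of this shape is that any feature added to the inner accumulator is picked up by the wrapper without further changes.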

@gabotechs gabotechs force-pushed the string-agg-missing-functionality branch from f6be8e4 to babc94b Compare April 3, 2025 08:07
Comment on lines +251 to +252
#[cfg(test)]
mod tests {
Contributor Author

@gabotechs gabotechs Apr 3, 2025

I'm finding that having unit tests near the function definition itself is significantly more ergonomic than writing sql logic tests, for a couple of reasons:

  • We can test the accumulators in isolation, allowing finer-grained control over batch updating, state generation, merging different states from different accumulators, etc.
  • The time between a developer making a code change and the appropriate test running is reduced significantly:
cargo test --lib string_agg::tests --manifest-path   1.40s user 1.53s system 132% cpu 2.204 total

VS

cargo test --test sqllogictests  33.61s user 7.17s system 365% cpu 11.164 total

Measured on a Mac M3

Not saying that we should not have sql logic tests, I think those are a must, but maybe having some testing tooling for folks to be able to contribute unit tests also here could improve DX.

I see that this is not an established pattern though, and I'm wondering what people's take on this is.
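
As an illustration of what "testing the accumulators in isolation" can look like, here is a small sketch. The accumulator is a hypothetical, simplified type (it is not DataFusion's `Accumulator` trait), and in a real crate the test would normally sit inside a `#[cfg(test)] mod tests` block next to the implementation.

```rust
/// Hypothetical, simplified accumulator used only for this sketch.
#[derive(Default)]
struct StringAggAccumulator {
    parts: Vec<String>,
}

impl StringAggAccumulator {
    /// Consume one batch of input values, skipping NULLs (`None`).
    fn update_batch(&mut self, batch: &[Option<&str>]) {
        self.parts
            .extend(batch.iter().flatten().map(|s| s.to_string()));
    }

    /// Export the intermediate state so another accumulator can merge it.
    fn state(&self) -> Vec<String> {
        self.parts.clone()
    }

    /// Merge the intermediate state produced by another accumulator.
    fn merge(&mut self, state: Vec<String>) {
        self.parts.extend(state);
    }

    /// Produce the final value; an empty input yields NULL (`None`).
    fn evaluate(&self, delimiter: &str) -> Option<String> {
        if self.parts.is_empty() {
            None
        } else {
            Some(self.parts.join(delimiter))
        }
    }
}

#[test]
fn merges_partial_states_in_order() {
    // Two partial accumulators, fed with separate batches.
    let mut first = StringAggAccumulator::default();
    first.update_batch(&[Some("a"), None]);

    let mut second = StringAggAccumulator::default();
    second.update_batch(&[Some("b")]);
    second.update_batch(&[Some("c")]);

    // A final accumulator that only sees merged states.
    let mut merged = StringAggAccumulator::default();
    merged.merge(first.state());
    merged.merge(second.state());

    assert_eq!(merged.evaluate(","), Some("a,b,c".to_string()));
}
```

A test like this controls exactly how batches and partial states are fed in, which is hard to pin down from a SQL-level test.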

Contributor

In general I think that we should strive to have sqllogictests cover all the "end user" visible behavior -- since they function as integration-style tests and make sure the functionality is all hooked up together correctly (not just working in isolation)

Unit tests / in module tests are good to cover cases that are hard to cover in sqllogictests

I think there is some more back story here: https://datafusion.apache.org/contributor-guide/testing.html#sqllogictests-tests

In terms of cycle times, sqllogictests do have the benefit that you can update them without any code changes (so writing / updating them is sometimes faster than code), though you are right that testing code changes requires a recompilation

@logan-keede has been working on improving the build performance recently. Hopefully this will get better

Contributor Author

🤔 it looks like there are still some cards that can be played to make compilation times faster then. 👍 thanks for all that info!

@alamb
Contributor

alamb commented Apr 4, 2025

@geoffreyclaude (who is helping review) when you think this PR is ready, can you please ping me and I'll give it a review?

Contributor

@geoffreyclaude geoffreyclaude left a comment

Looks pretty complete to me, you've covered all my initial suggestions!

@geoffreyclaude
Contributor

@alamb all good for me! You can give the final review.

Contributor

@alamb alamb left a comment

Thank you @gabotechs and @geoffreyclaude -- I think this implementation is very nice, clean and well tested. 🏆

@alamb alamb merged commit cde8690 into apache:main Apr 15, 2025
29 checks passed
Labels
documentation Improvements or additions to documentation functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Implement distinct and order by clause for string_agg aggregate function
3 participants