Skip to content

Conversation

rishvin
Copy link
Contributor

@rishvin rishvin commented Sep 5, 2025

Which issue does this PR close?

Addresses Part of #1941

Rationale for this change

Introduces map_to_list which converts a MapArray to ListArray<Struct<K, V>.

This PR has been split from original PR #2221. Please see that for details.

What changes are included in this PR?

A new map_to_list physical function.

How are these changes tested?

Added unit tests.

@rishvin
Copy link
Contributor Author

rishvin commented Sep 5, 2025

@comphead / @andygrove/ @parthchandra split the second function.

@codecov-commenter
Copy link

codecov-commenter commented Sep 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.62%. Comparing base (f09f8af) to head (da1a397).
⚠️ Report is 488 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2312      +/-   ##
============================================
+ Coverage     56.12%   57.62%   +1.49%     
- Complexity      976     1297     +321     
============================================
  Files           119      147      +28     
  Lines         11743    13497    +1754     
  Branches       2251     2390     +139     
============================================
+ Hits           6591     7777    +1186     
- Misses         4012     4451     +439     
- Partials       1140     1269     +129     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@rishvin rishvin changed the title chore: [1941-Part2]: Introduce map_to_list scalar function feat: [1941-Part2]: Introduce map_to_list scalar function Sep 7, 2025
@comphead
Copy link
Contributor

comphead commented Sep 9, 2025

Sorry @rishvin it looks like some conflicts in there

@rishvin rishvin force-pushed the 1941-part2-introduce-map-to-list-function branch from d90c182 to da1a397 Compare September 10, 2025 03:23
@rishvin
Copy link
Contributor Author

rishvin commented Sep 10, 2025

Sorry @rishvin it looks like some conflicts in there

Thanks @comphead , resolved the conflict. I forced pushed to change my commit message from chore to feat.

@rishvin
Copy link
Contributor Author

rishvin commented Sep 11, 2025

Spark SQL Tests / spark-sql-sql_core-1/ubuntu-24.04/spark-3.5.6/java-11 (pull_request)

This failure looks unrelated to the PR changes.

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rishvin what exactly spark sql function this PR addresses?
I checked https://spark.apache.org/docs/latest/api/sql/search.html?q=map I dont see map_to_list

Are you referring to https://spark.apache.org/docs/latest/api/sql/#map_entries ?
If so this function already supported

@rishvin
Copy link
Contributor Author

rishvin commented Sep 12, 2025

Thanks @rishvin what exactly spark sql function this PR addresses? I checked https://spark.apache.org/docs/latest/api/sql/search.html?q=map I dont see map_to_list

Are you referring to https://spark.apache.org/docs/latest/api/sql/#map_entries ? If so this function already supported

Hey @comphead : This is not spark sql function. We are using it internally to support grouping on map type.

I know there has been some confusion around MapSort lately, hence providing the full context here regarding the original PR: #2221. There is also subtle difference between map_to_list and map_entries, mentioned later.

Spark4.0+ when doing grouping on Map type applies a logical rule to wrap the grouping column with MapSort expression. This expression class sorts the Map elements by keys, such that the output performed after evaluation is an ordered map. This ordered map can then be supplied for grouping because it will produce the correct result. For eg. these 2 map elements - {a: 1, b: 2} and {b:2, a:1}, regardless of ordering of keys in the map are essentially the same, hence should be part of the same group. The MapSort will sort them as {a: 1, b: 2} and groupby will put them in the same group {a: 1, b: 2}.

Today, writing only the equivalent MapSort expression in Comet along will not work. This is because DataFusion does not directly support grouping on Map type. If we attempt to pass a Map type to DataFusion, it will throw with Not implemented ... error. However, DataFusion does support grouping on a more canonical form like List. So, if we change the Map type to List type, then Datafusion will work and give the correct result. To do so, I added this internal function map_to_list that will convert the Map type to the List type before passing it to the grouping expression. This translation from map to list type is done here, when building the grouping expressions.

But now passing map_to_list to the grouping expression has one problem, the output grouping keys produced by the HashAggregate will be a List instead of Map. The result produced cannot be directly consumed by the downstream operators because they expect the grouping key type to be Map, however, we are passing them List type. We will hit schema-mismatch exception. So, we must convert the List type back to Map again. This translation is done by map_from_list internal function, which convert the output grouping keys back to Map type. This function is implemented in part3 here. And the conversion of List to Map type happens here by applying a projection on top of HashAggregate.

These conversion are totally done in native code, localized to HashAggregate. Since, we maintain the schema requirements, the result produced by the HashAggregate can be consumed downstream Spark operators also.

--
Regarding the second part of using map_entries:

I checked the implementation in Datafusion, here, which has one difference. It doesn't carry forward the field names. It is creating key field with name key and value field with name value. This means, if the original map has field names as keys and values (plural), they will not be retained when doing this conversion. Also, it does not carry forward sorted flag which is very specific to Map type.

The map_to_list addresses this by cloning the incoming Map's field and piggybacking the sorted flag in the metadata hashmap of the field. This is being done here.

And when converting the list back to map, the original field and metadata is consumed to construct the map type back. This is done here.

Retaining these metadata information is important, otherwise, we may run into schema mismatch issues. I ran into some schema related issue when implementing this.

So, I think we may be able to use map_entries but we should retain the field and map specific metadata.

Let me know if you have any further questions and your thoughts around this.

@comphead
Copy link
Contributor

Thanks @rishvin appreciate if you can take #2388 fill up the context and we can go through PRs once again

@rishvin rishvin marked this pull request as draft September 13, 2025 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants