
Conversation

@Yicong-Huang (Contributor)

What changes were proposed in this pull request?

This PR merges the Arrow conversion code paths between Spark Connect and Classic Spark by extracting shared logic into a reusable helper function _convert_arrow_table_to_pandas.

Why are the changes needed?

This unifies optimizations from two separate PRs:

  • [SPARK-53967] (Classic): Avoid intermediate pandas DataFrame creation by converting Arrow columns directly to Series
  • [SPARK-54183] (Connect): Same optimization implemented for Spark Connect

Does this PR introduce any user-facing change?

No. This is a pure refactoring with no API or behavior changes.

How was this patch tested?

Ran existing Arrow test suite: python/pyspark/sql/tests/arrow/test_arrow.py

Was this patch authored or co-authored using generative AI tooling?

Co-Generated-by Cursor with Claude 4.5 Sonnet

Extract shared conversion logic into _convert_arrow_table_to_pandas helper
function in conversion.py to avoid code duplication between Classic and Connect.

Key changes:
- Add _convert_arrow_table_to_pandas helper function in conversion.py
- Update Classic toPandas to handle empty tables explicitly (SPARK-51112)
- Only apply self_destruct options when table has rows
- Connect imports the shared helper from conversion.py

This unifies the optimizations from SPARK-53967 and SPARK-54183:
- Avoid intermediate pandas DataFrame during conversion
- Convert Arrow columns directly to Series with type converters
- Better memory efficiency with self_destruct on non-empty tables

Co-authored-by: cursor
pdf = _convert_arrow_table_to_pandas(
    table,
    schema.fields,
Contributor:

I would highly recommend avoiding positional arguments here. Maybe table is fine as it's obvious, but for the rest of the arguments I think it's better to pass their corresponding keywords.

Contributor Author:

Usually yes, adding the keyword is ideal. However, in this case all the parameters are named the same as the arguments, so adding the keywords would be a bit verbose.

pdf = _convert_arrow_table_to_pandas(
    table,
    schema_fields=schema.fields,
    temp_col_names=temp_col_names,
    timezone=timezone,
    struct_in_pandas=struct_in_pandas,
    error_on_duplicated_field_names=error_on_duplicated_field_names,
    pandas_options=pandas_options,
)

@gaogaotiantian (Contributor), Nov 13, 2025:

The problem is that in the future, when you need to add an extra field, you have to count where to add it. Also, it wouldn't be obvious if someone used the wrong order.

pdf = _convert_arrow_table_to_pandas(
    table,
    schema.fields,
    temp_col_names,
    timezone,
    error_on_duplicated_field_names,
    struct_in_pandas,
    pandas_options,
)

It's super unobvious that the code above is wrong - and when you look at it only, you don't know how to fix it.

Contributor Author:

Thanks! Changed to use and require keyword arguments.
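For illustration, keyword-only parameters in Python are enforced with a bare `*` in the signature; everything after it must be passed by name (a generic example, not the actual Spark signature):

```python
def convert(table, *, timezone=None, struct_in_pandas="dict"):
    # Parameters after the bare * can only be supplied as keywords.
    return table, timezone, struct_in_pandas

convert("tbl", timezone="UTC")  # OK
# convert("tbl", "UTC")         # TypeError: takes 1 positional argument
```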



def _convert_arrow_table_to_pandas(
*,
Contributor:

nit: I think it's okay to have table before *. Also, I don't feel it's necessary to have the * here, since it does not seem to be a pattern elsewhere. I think it's a good habit to call with keyword arguments, but enforcing it is normally only needed on user-facing interfaces where callers could easily make a mistake.

Contributor Author:

As it is an internal helper method, I think either way is fine?

Contributor Author:

I have removed *.

Contributor:

Yeah, I don't think it's a big deal, hence the nit. I personally prefer flexibility in the function itself and a stricter style at the call site.

Contributor Author:

Thanks! That's a good habit. I especially appreciate the nit comments!

for arrow_col, field in zip(table.columns, schema.fields)
],
axis="columns",
)
Contributor:

Could we not also unify the following logic?

    # Restore original column names (including duplicates)
    pdf.columns = schema.names
else:
    # empty columns
    pdf = table.to_pandas(**pandas_options)
