[SPARK-54317][PYTHON][CONNECT] Unify Arrow conversion logic for Classic and Connect toPandas #53045
Conversation
Extract shared conversion logic into a _convert_arrow_table_to_pandas helper function in conversion.py to avoid code duplication between Classic and Connect.

Key changes:
- Add _convert_arrow_table_to_pandas helper function in conversion.py
- Update Classic toPandas to handle empty tables explicitly (SPARK-51112)
- Only apply self_destruct options when the table has rows
- Connect imports the shared helper from conversion.py

This unifies the optimizations from SPARK-53967 and SPARK-54183:
- Avoid an intermediate pandas DataFrame during conversion
- Convert Arrow columns directly to Series with type converters
- Better memory efficiency with self_destruct on non-empty tables

Co-authored-by: cursor
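As a rough illustration of the per-column conversion pattern described above (a sketch under assumptions: the function name is hypothetical and the per-column type converters are omitted; this is not the actual Spark helper):

import pandas as pd
import pyarrow as pa

def _columns_to_pandas(table: pa.Table) -> pd.DataFrame:
    # Build the frame from per-column Series conversions rather than
    # materializing an intermediate table-wide DataFrame first; each
    # column can then be passed through its own type converter
    # (e.g. for timestamps or nested structs).
    return pd.concat(
        [arrow_col.to_pandas() for arrow_col in table.columns],
        axis="columns",
    )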
)
pdf = _convert_arrow_table_to_pandas(
    table,
    schema.fields,
I would highly recommend avoiding positional arguments here. Maybe table is fine as it's obvious, but for the rest of the arguments I think it's better to pass their corresponding keywords.
Usually, yes, adding the keywords is ideal. However, in this case all the parameters are named the same as the arguments, so adding the keywords would be a bit verbose:
pdf = _convert_arrow_table_to_pandas(
table,
schema_fields=schema.fields,
temp_col_names=temp_col_names,
timezone=timezone,
struct_in_pandas=struct_in_pandas,
error_on_duplicated_field_names=error_on_duplicated_field_names,
pandas_options=pandas_options,
)
The problem is that when you need to add an extra parameter in the future, you have to count to find where it should go. Also, it wouldn't be obvious if someone used the wrong order:
pdf = _convert_arrow_table_to_pandas(
table,
schema.fields,
temp_col_names,
timezone,
error_on_duplicated_field_names,
struct_in_pandas,
pandas_options,
)

It's super unobvious that the code above is wrong - and when you look at it alone, you don't know how to fix it.
Thanks! Changed to use and require keyword arguments.
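For illustration, a keyword-only signature (a sketch; the real parameter list in the PR may differ) makes a mis-ordered call like the one above fail loudly instead of silently binding values to the wrong parameters:

def _convert_arrow_table_to_pandas(
    table,
    *,
    schema_fields,
    temp_col_names,
    timezone,
    struct_in_pandas,
    error_on_duplicated_field_names,
    pandas_options,
):
    ...

# Passing the remaining arguments positionally, as in the wrong-order
# example above, now raises a TypeError instead of running with
# swapped values.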
def _convert_arrow_table_to_pandas(
    *,
nit: I think it's okay to have table before the *. Also, I don't feel it's necessary to have the * here, just because it does not seem like a pattern elsewhere. I think it's a good habit to call with keyword arguments, but enforcing it is normally only required on user-facing interfaces where users could easily make a mistake.
As it is an internal helper method, I think either way is fine?
I have removed the *.
Yeah, I don't think it's a big deal, hence the nit. I personally prefer flexibility in the function and a stricter style at the call site.
Thanks! That's a good habit. I especially appreciate the nit comments!
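To make the trade-off from this thread concrete, here is a sketch of the two styles discussed (parameter names abbreviated for illustration; neither is the exact signature in the PR):

# Enforce keywords in the signature itself: callers cannot get the
# order wrong, but the function is stricter than it strictly needs
# to be for an internal helper.
def convert_strict(table, *, timezone, pandas_options):
    ...

# Keep the signature flexible and rely on call-site discipline instead.
def convert_flexible(table, timezone, pandas_options):
    ...

# Either way, a disciplined call site looks the same:
# convert_strict(table, timezone=tz, pandas_options=opts)
# convert_flexible(table, timezone=tz, pandas_options=opts)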
        for arrow_col, field in zip(table.columns, schema.fields)
    ],
    axis="columns",
)
Could we not also unify this logic?

    # Restore original column names (including duplicates)
    pdf.columns = schema.names
else:  # empty columns
    pdf = table.to_pandas(**pandas_options)
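A rough sketch of what that unification might look like inside the helper (assuming the empty-table fallback described in the commit message; the function and argument names here are illustrative, not the actual Spark code):

import pandas as pd
import pyarrow as pa

def _to_pandas_unified(table: pa.Table, schema_names, pandas_options):
    if table.num_rows > 0:
        pdf = pd.concat(
            [arrow_col.to_pandas() for arrow_col in table.columns],
            axis="columns",
        )
        # Restore the original column names (including duplicates).
        pdf.columns = schema_names
    else:
        # Empty case: pyarrow's built-in conversion already preserves
        # the schema, so fall back to it directly.
        pdf = table.to_pandas(**pandas_options)
    return pdf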
What changes were proposed in this pull request?
This PR merges the Arrow conversion code paths between Spark Connect and Classic Spark by extracting shared logic into a reusable helper function, _convert_arrow_table_to_pandas.

Why are the changes needed?
This unifies the optimizations from two separate PRs (SPARK-53967 and SPARK-54183).
Does this PR introduce any user-facing change?
No. This is a pure refactoring with no API or behavior changes.
How was this patch tested?
Ran the existing Arrow test suite:
python/pyspark/sql/tests/arrow/test_arrow.py

Was this patch authored or co-authored using generative AI tooling?
Co-Generated-by Cursor with Claude 4.5 Sonnet