
Conversation

@Yicong-Huang (Contributor)

What changes were proposed in this pull request?

This PR merges the Arrow conversion code paths between Spark Connect and Classic Spark by extracting shared logic into a reusable helper function _convert_arrow_table_to_pandas.

Why are the changes needed?

This unifies optimizations from two separate PRs:

  • [SPARK-53967] (Classic): Avoid intermediate pandas DataFrame creation by converting Arrow columns directly to Series
  • [SPARK-54183] (Connect): Same optimization implemented for Spark Connect

Does this PR introduce any user-facing change?

No. This is a pure refactoring with no API or behavior changes.

How was this patch tested?

Ran existing Arrow test suite: python/pyspark/sql/tests/arrow/test_arrow.py

Was this patch authored or co-authored using generative AI tooling?

Co-Generated-by Cursor with Claude 4.5 Sonnet

Extract shared conversion logic into _convert_arrow_table_to_pandas helper
function in conversion.py to avoid code duplication between Classic and Connect.

Key changes:
- Add _convert_arrow_table_to_pandas helper function in conversion.py
- Update Classic toPandas to handle empty tables explicitly (SPARK-51112)
- Only apply self_destruct options when table has rows
- Connect imports the shared helper from conversion.py

This unifies the optimizations from SPARK-53967 and SPARK-54183:
- Avoid intermediate pandas DataFrame during conversion
- Convert Arrow columns directly to Series with type converters
- Better memory efficiency with self_destruct on non-empty tables

Co-authored-by: cursor
pdf = _convert_arrow_table_to_pandas(
    table,
    schema.fields,
Contributor:

I would highly recommend avoiding positional arguments here. Maybe table is fine as it's obvious, but for the rest of the arguments I think it's better to pass their corresponding keywords.

Contributor Author:

Usually yes, adding the keyword is ideal. However, in this case all the parameters are named the same as the arguments, so adding the keywords would be a bit verbose.

pdf = _convert_arrow_table_to_pandas(
    table,
    schema_fields=schema.fields,
    temp_col_names=temp_col_names,
    timezone=timezone,
    struct_in_pandas=struct_in_pandas,
    error_on_duplicated_field_names=error_on_duplicated_field_names,
    pandas_options=pandas_options,
)

@gaogaotiantian (Contributor), Nov 13, 2025:

The problem is that in the future, when you need to add an extra field, you have to count where to add it. Also, it wouldn't be obvious if someone used the wrong order.

pdf = _convert_arrow_table_to_pandas(
    table,
    schema.fields,
    temp_col_names,
    timezone,
    error_on_duplicated_field_names,
    struct_in_pandas,
    pandas_options,
)

It's super unobvious that the code above is wrong - and when you look at it only, you don't know how to fix it.

Contributor Author:

Thanks! Changed to use and require keyword arguments.
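For illustration, keyword-only parameters in Python are enforced with a bare `*` in the signature; everything after it must be passed by name (a generic example, not the actual Spark signature):

```python
def convert(table, *, timezone=None, struct_in_pandas="dict"):
    # Parameters after the bare * can only be supplied as keywords.
    return table, timezone, struct_in_pandas

convert("tbl", timezone="UTC")  # OK
# convert("tbl", "UTC")         # TypeError: takes 1 positional argument
```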



def _convert_arrow_table_to_pandas(
*,
Contributor:

nit: I think it's okay to have table before *. Also, I don't feel it's necessary to have the * here, since it does not seem to be a pattern elsewhere. I think it's a good habit to call with keyword arguments, but enforcing it is normally only needed on user-facing interfaces where callers could easily make a mistake.

Contributor Author:

As it is an internal helper method, I think either way is fine?

Contributor Author:

I have removed *.

Contributor:

Yeah, I don't think it's a big deal, hence the nit. I personally prefer flexibility in the function itself and a stricter style at the call site.

Contributor Author:

Thanks! That's a good habit. I especially appreciate the nit comments!

for arrow_col, field in zip(table.columns, schema.fields)
],
axis="columns",
)
Contributor:

Could we not also unify the following logic?

    # Restore original column names (including duplicates)
    pdf.columns = schema.names
else:
    # empty columns
    pdf = table.to_pandas(**pandas_options)
