
materialize-iceberg: serialize pyspark command inputs to a file #2499

Merged 1 commit into main on Mar 7, 2025

Conversation

@williamhbaker (Member) commented Mar 7, 2025

Description:

The maximum argument length for an EMR job is quite limited, around 10k characters, so if the command input for a job gets very long, the job will fail. This happens when a significant number of bindings is associated with a transaction, or even when a single binding has a large number of fields, since all of the fields and their types must be provided to the script in serialized form, in addition to the query to execute.

The fix here is to write out the input to a temporary cloud storage file and read that in the PySpark script. Rather than providing the input as an argument, the input is now a URI to the input file.
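The pattern described above can be sketched roughly as follows. This is a hypothetical illustration, not the connector's actual code: the function names, the input shape, and the use of a local temporary file standing in for cloud storage are all assumptions for demonstration purposes.

```python
import json
import tempfile

def write_job_input(input_obj):
    # Serialize the full command input (query plus field metadata) to a file.
    # In the real connector this would be a temporary cloud storage object;
    # a local temp file stands in for it here.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".json", delete=False
    ) as f:
        json.dump(input_obj, f)
        return f.name  # this URI is the only argument the EMR job receives

def read_job_input(uri):
    # The PySpark script loads the input from the URI instead of parsing
    # a (length-limited) command-line argument.
    with open(uri) as f:
        return json.load(f)

# A large input survives the round trip regardless of its serialized size.
job_input = {
    "query": "INSERT INTO tbl SELECT ...",
    "fields": [{"name": f"col_{i}", "type": "long"} for i in range(1000)],
}
uri = write_job_input(job_input)
assert read_job_input(uri) == job_input
```

The key point is that the argument passed to the job stays short and fixed-size (a URI) no matter how many bindings or fields the transaction involves.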

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)



@williamhbaker williamhbaker requested a review from Alex-Bair March 7, 2025 14:21
@Alex-Bair (Member) left a comment

LGTM

@williamhbaker williamhbaker merged commit d2eb328 into main Mar 7, 2025
56 of 58 checks passed
@williamhbaker williamhbaker deleted the wb/iceberg-cmd-file branch March 7, 2025 14:44