materialize-iceberg: serialize pyspark command inputs to a file #2499
Description:
The maximum argument length for an EMR job is quite limited, around 10k characters, so a job will fail if its command input gets too long. This can happen when a transaction involves a significant number of bindings, or even a single binding with a large number of fields, since all of the fields and their types must be provided to the script in serialized form, in addition to the query to execute.
The fix is to write the input to a temporary cloud storage file and have the PySpark script read it from there. Instead of receiving the input as an argument, the job now receives a URI pointing to the input file.
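As an illustrative sketch only (not the exact implementation in this PR), the script side of this pattern looks roughly like the following: the job is handed a short URI, and the PySpark script fetches and deserializes the staged input itself. The `load_input` helper, the JSON encoding, and the `query` field are all assumptions made for the example.

```python
# Hypothetical sketch: the job argument is a short URI to the serialized input,
# so the ~10k character EMR argument limit no longer constrains the input size.
import json
import sys
from urllib.parse import urlparse

import boto3
from pyspark.sql import SparkSession


def load_input(input_uri: str) -> dict:
    """Fetch and deserialize the job input from an s3:// URI (hypothetical helper)."""
    parsed = urlparse(input_uri)
    body = boto3.client("s3").get_object(
        Bucket=parsed.netloc, Key=parsed.path.lstrip("/")
    )["Body"].read()
    return json.loads(body)


if __name__ == "__main__":
    # argv[1] is now a URI rather than the full serialized input.
    job_input = load_input(sys.argv[1])

    spark = SparkSession.builder.getOrCreate()
    # The deserialized input carries the query and the field/type metadata that
    # previously had to fit into the command arguments.
    spark.sql(job_input["query"]).show()
```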
Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
(anything that might help someone review this PR)