materialize-iceberg: column type migrations #2540

Open
williamhbaker wants to merge 4 commits into main from wb/iceberg-migrations

Conversation

@williamhbaker commented Mar 16, 2025

Description:

Adds support for column type migrations to materialize-iceberg. See commit messages for more details.

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

I manually tested all of these migrations and added as much unit test coverage as I could. Our SQL materializations have a test suite that covers migration scenarios, but this is the first materialization that doesn't use the materialize-sql framework, so it can't use those same tests. We should make a more generalized set of tests for that, and I'll probably get around to it at some point, but I didn't do it now: I'm not totally sure what it would look like yet (I want to get another materialization or two using the Materializer interface first), and doing the equivalent of "INSERT INTO" on an Iceberg table to load test values is a big enough pain that it will be a project unto itself - perhaps for if/when we pursue a compute-less delta updates mode.



Mainly this adds support for column migrations as a concept in the materializer
`RunApply` function. The set of fields to be migrated is computed and included
in the struct of common updates to be applied for a binding.
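
For illustration, the shape of this is roughly the following - the type and
field names here are invented for this sketch rather than copied from the
materializer package:

```go
package materializer

// A migration pairs an existing column with the widened type it needs.
type columnMigration struct {
	Field    string // column name
	FromType string // current type in the destination table
	ToType   string // widened type required by the updated projection
}

// Common updates computed for a binding during RunApply.
type bindingUpdate struct {
	AddedColumns     []string          // brand-new columns to create
	NewlyNullable    []string          // existing columns that must become nullable
	ColumnMigrations []columnMigration // existing columns to widen in place
}
```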

A few other changes in this commit are related to some extent, and/or are
opportunistic changes I wanted to make that couldn't easily be split into
separate commits:

* For namespace creation, often there is a "default" namespace set in the
  endpoint configuration. Usually one of the bindings uses it, and it will be
  created for that reason if it doesn't exist, but sometimes no bindings use
  it at all because they all have an alternative namespace set. It can be
  important for this namespace to exist anyway - for example, if metadata
  tables must be created in it - so it needs some representation outside the
  InfoSchema and resource configurations. This changes the endpoint config
  itself to an interface that must be able to say what the default namespace
  is, if there is one.
* The prior handling for connectors that want to do extra work during `Apply`
  was a little ugly, and also insufficient for cases where they need to do
  that work _before_ the typical apply actions are run. I think it's common
  enough for these extra actions to be needed (creating a checkpoints table
  for some SQL materializations, or uploading PySpark scripts for Iceberg)
  that a specific method for doing them is warranted. So here it is, and it's
  called `Setup`; a rough sketch of it and the default-namespace interface
  follows this list.
* Field compatibility & descriptions are moved to a `MappedTyper` interface to
  separate concerns a little more.
* Fixed the generated "comment" for a field that is a projection of the root
  document since it looked weird having the pointer be empty.
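
Here's a rough sketch of the `Setup` and default-namespace interfaces
mentioned above - the names and signatures are assumptions for illustration,
not necessarily what the connector actually uses:

```go
package materializer

import "context"

// Reports the endpoint-level default namespace, if any, so it can be created
// even when no binding references it.
type defaultNamespacer interface {
	DefaultNamespace() string // "" if the endpoint has no default namespace
}

// Runs connector-specific actions (creating a checkpoints table, uploading
// PySpark scripts, etc.) before the standard apply actions are executed.
type setupRunner interface {
	Setup(ctx context.Context) error
}
```
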
Adds a random suffix to integration test table names so that more than one CI
job can run them at a time. Otherwise concurrent job runs delete the testing
tables out from under each other and cause the tests to fail.

This happens for many of our materialization integration tests, but it's
particularly bad for these Iceberg tests since they take so long to run.
Sometimes it can be legitimately difficult to get a build to pass CI if many
different branches are trying to run the tests.
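
For illustration, the suffixing amounts to something like this (the actual
tests may construct the name differently):

```go
package tests

import (
	"fmt"
	"math/rand"
)

// Append a short random hex suffix so concurrent CI job runs don't clobber
// each other's test tables.
func testTableName(base string) string {
	return fmt.Sprintf("%s_%06x", base, rand.Intn(1<<24))
}
```
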
I recently added this `getHighestFieldID` function since the
`(*Schema).HighestFieldID` did not work with schemas containing logical types
like `decimal`.

It is true that the custom version works where the other one doesn't, but
neither one is the correct approach for figuring out which field ID to start
with when adding a field to an Iceberg table schema. Fields may have existed
previously and since been deleted from the table, so the next field ID that
has never been used before can't be determined just by looking at the IDs in
the current schema.

There's an Iceberg table property that tracks the highest field ID on the table
itself though, and using that should always work.
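
For reference, that field is `last-column-id` in the Iceberg table metadata,
defined as the highest column ID ever assigned on the table. Here's a sketch
of allocating IDs for newly added fields from it - how the value is actually
read depends on the Iceberg library in use:

```go
package iceberg

// Assign IDs to newly added fields starting after the table's last-column-id,
// rather than scanning the current schema for its highest ID.
func nextFieldIDs(lastColumnID, numNewFields int) []int {
	ids := make([]int, numNewFields)
	for i := range ids {
		ids[i] = lastColumnID + 1 + i
	}
	return ids
}
```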

This commit removes the `getHighestFieldID` function - the next commit will
update the related schema calculation functions, among other things, as part of
supporting column migrations.

Adds support for materialize-iceberg to do column type migrations for all known
scenarios of schema "widening" without requiring a backfill to re-build the
table.

The Spark SQL queries for doing this are straightforward. A new PySpark script
called `exec.py` is added that just runs a single query, and it is used here
for running the migration queries. This kind of script might be useful for
other things in the future as well, which is why it isn't called `migrate.py`
or something like that.
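
As an illustration of the kind of statement involved, a widening migration
boils down to a Spark SQL `ALTER TABLE ... ALTER COLUMN ... TYPE ...` query.
A sketch of constructing one - the table and column names are made up, and
the connector's actual query building may differ:

```go
package iceberg

import "fmt"

// Build a Spark SQL statement that widens a column's type on an Iceberg table.
func widenColumnQuery(table, column, newType string) string {
	return fmt.Sprintf("ALTER TABLE %s ALTER COLUMN %s TYPE %s", table, column, newType)
}

// For example, widenColumnQuery("db.events", "count", "bigint") produces:
//   ALTER TABLE db.events ALTER COLUMN count TYPE bigint
```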

Most of the work for adding this capability comes from needing to initialize
all of the EMR job-running apparatus in `Apply` (to run migration Spark jobs)
in addition to `Open` (for running transactions). So a lot of code got moved
around to support that, on top of the core changes for supporting column
migrations.
@williamhbaker force-pushed the wb/iceberg-migrations branch from e0fd676 to ded4a69 on March 17, 2025
@williamhbaker marked this pull request as ready for review on March 17, 2025