-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Add iceberg.default-new-tables-gc.enabled config property #27359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
9b10773 to
c52b655
Compare
iceberg.default-new-tables-gc.enabled will set gc.enabled flag for new tables created by Trino.
c52b655 to
94dcc5d
Compare
| return defaultNewTablesGcEnabled; | ||
| } | ||
|
|
||
| @Config("iceberg.default-new-tables-gc.enabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iceberg.default-new-tables.gc.enabled
|
|
||
| // If user doesn't set gc.enabled, we need to set it to defaultNewTablesGcEnabled value | ||
| if (!extraProperties.containsKey(GC_ENABLED) && defaultNewTablesGcEnabled != GC_ENABLED_DEFAULT) { | ||
| propertiesBuilder.put(GC_ENABLED, Boolean.toString(defaultNewTablesGcEnabled)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to note about relying on GC_ENABLED property of iceberg is that it will also disallow expire_snapshots.
Is that something desirable ?
I can imagine that some users just want protection on DROP, but with this they can't run the regular maintenance operation of expiring old snapshots
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess they could enable GC for expire_snapshots. Users often have side spark anyway. Unless there is another property that would just limit DROP, then I think it's reasonable to expect data admins to know how to handle this situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a check in iceberg library org.apache.iceberg.RemoveSnapshots#RemoveSnapshots
ValidationException.check(
PropertyUtil.propertyAsBoolean(base.properties(), GC_ENABLED, GC_ENABLED_DEFAULT),
"Cannot expire snapshots: GC is disabled (deleting files may corrupt other tables)");
Using Spark or something else won't help with that.
I'm not sure how admin deals with this, adding and removing property before/after every expire_snapshots is cumbersome and removes the DROP protection for the duration of the command. Also, most ppl have automated maintenance services, and we wouldn't want ppl to have to go and modify that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like a breaking change to expire_snapshots?
But if other engine disable the gc flag, it's also a problem to Trino, user not able to expire snapshots due a unsupported or exposed property
| .put("iceberg.file-format", format.name()) | ||
| // Only allow some extra properties. Add "sorted_by" so that we can test that the property is disallowed by the connector explicitly. | ||
| .put("iceberg.allowed-extra-properties", "extra.property.one,extra.property.two,extra.property.three,sorted_by") | ||
| .put("iceberg.allowed-extra-properties", "extra.property.one,extra.property.two,extra.property.three,sorted_by,gc.enabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use BaseIcebergConnectorSmokeTest or BaseTrinoCatalogTest when we verify catalog behavior. I recommend adding at least one positive test case to BaseIcebergConnectorSmokeTest.
| } | ||
|
|
||
| @Test | ||
| public void testDefaultGcEnabledTablePropertyOnCreateAndCtas() throws IOException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Put throws IOException on a new line. Same for other tests.
| return defaultNewTablesGcEnabled; | ||
| } | ||
|
|
||
| @Config("iceberg.default-new-tables-gc.enabled") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you document this property in iceberg.md?
| private int planningThreads = Math.min(Runtime.getRuntime().availableProcessors(), 16); | ||
| private int fileDeleteThreads = Runtime.getRuntime().availableProcessors() * 2; | ||
| private List<String> allowedExtraProperties = ImmutableList.of(); | ||
| private boolean defaultNewTablesGcEnabled = GC_ENABLED_DEFAULT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The common approach is to restrain referencing in the config files values from dependent libraries because this may trigger a silent change in the default configuration of the connector.
Please consider using rather true instead
| } | ||
|
|
||
| @Config("iceberg.default-new-tables-gc.enabled") | ||
| @ConfigDescription("Default value for Iceberg property gc.enabled when creating new tables") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add as well corresponding documentation in iceberg.md as well.
I noticed that the documentation for gc.enabled (introduced by apache/iceberg#1796 ) table property is unfortunately missing as well on apache/iceberg - As a reference point you could use apache/iceberg#9231
Property to disable garbage collection operations such as expiring snapshots or removing orphan files
It seems that in the meantime the table property has as well repercussions on whether to delete dropped table's files as well.
| LOG.warn(e, "Failed to delete table data referenced by metadata"); | ||
| } | ||
| deleteTableDirectory(fileSystemFactory.create(session), schemaTableName, table.location()); | ||
| if (propertyAsBoolean(table.properties(), GC_ENABLED, GC_ENABLED_DEFAULT)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pls replace GC_ENABLED_DEFAULT with true - we don't have control on the value of GC_ENABLED_DEFAULT when the iceberg library gets updated.
| log.warn(e, "Failed to delete table data referenced by metadata"); | ||
| } | ||
| deleteTableDirectory(fileSystemFactory.create(session), schemaTableName, metastoreTable.getStorage().getLocation()); | ||
| if (propertyAsBoolean(metadata.properties(), GC_ENABLED, GC_ENABLED_DEFAULT)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (propertyAsBoolean(metadata.properties(), GC_ENABLED, GC_ENABLED_DEFAULT)) { | |
| if (propertyAsBoolean(metadata.properties(), GC_ENABLED, true)) { |
| LOG.warn(e, "Failed to delete table data referenced by metadata"); | ||
| } | ||
| deleteTableDirectory(fileSystemFactory.create(session), schemaTableName, table.location()); | ||
| if (propertyAsBoolean(table.properties(), GC_ENABLED, GC_ENABLED_DEFAULT)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (propertyAsBoolean(table.properties(), GC_ENABLED, GC_ENABLED_DEFAULT)) { | |
| if (propertyAsBoolean(table.properties(), GC_ENABLED, true)) { |
| PartitionSpec partitionSpec = parsePartitionFields(schema, getPartitioning(materializedViewProperties)); | ||
| SortOrder sortOrder = parseSortFields(schema, getSortOrder(materializedViewProperties)); | ||
| Map<String, String> properties = createTableProperties(new ConnectorTableMetadata(storageTableName, columns, materializedViewProperties, Optional.empty()), _ -> false); | ||
| Map<String, String> properties = createTableProperties(new ConnectorTableMetadata(storageTableName, columns, materializedViewProperties, Optional.empty()), _ -> false, GC_ENABLED_DEFAULT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Map<String, String> properties = createTableProperties(new ConnectorTableMetadata(storageTableName, columns, materializedViewProperties, Optional.empty()), _ -> false, GC_ENABLED_DEFAULT); | |
| Map<String, String> properties = createTableProperties(new ConnectorTableMetadata(storageTableName, columns, materializedViewProperties, Optional.empty()), _ -> false, true); |
| String tableLocation = getTableLocation(tableMetadata.getProperties()) | ||
| .orElseGet(() -> defaultTableLocation(session, tableMetadata.getTable())); | ||
| Transaction transaction = IcebergUtil.newCreateTableTransaction(this, tableMetadata, session, false, tableLocation, _ -> false); | ||
| Transaction transaction = IcebergUtil.newCreateTableTransaction(this, tableMetadata, session, false, tableLocation, _ -> false, GC_ENABLED_DEFAULT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Transaction transaction = IcebergUtil.newCreateTableTransaction(this, tableMetadata, session, false, tableLocation, _ -> false, GC_ENABLED_DEFAULT); | |
| Transaction transaction = IcebergUtil.newCreateTableTransaction(this, tableMetadata, session, false, tableLocation, _ -> false, true); |
| } | ||
|
|
||
| public static Transaction newCreateTableTransaction(TrinoCatalog catalog, ConnectorTableMetadata tableMetadata, ConnectorSession session, boolean replace, String tableLocation, Predicate<String> allowedExtraProperties) | ||
| public static Transaction newCreateTableTransaction(TrinoCatalog catalog, ConnectorTableMetadata tableMetadata, ConnectorSession session, boolean replace, String tableLocation, Predicate<String> allowedExtraProperties, boolean defaultNewTablesGcEnabled) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please separate this long list of parameters on new lines for increased readability.
| newDirectExecutorService(), | ||
| newDirectExecutorService()); | ||
| newDirectExecutorService(), | ||
| GC_ENABLED_DEFAULT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| GC_ENABLED_DEFAULT); | |
| true); |
| newDirectExecutorService(), | ||
| newDirectExecutorService()); | ||
| newDirectExecutorService(), | ||
| GC_ENABLED_DEFAULT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| GC_ENABLED_DEFAULT); | |
| true); |
Why should the Trino Iceberg connector be proactive in setting the Assuming that we find a good rationale on the question above, I suggest rather to handle the implementation of the config property rather differently.
|
It's not proactive. It only sets it when the config toggle is different than |
| assertThat(ctasMetadata.properties().getOrDefault("gc.enabled", "true")).isEqualTo("true"); | ||
|
|
||
| // collect data files and verify they are removed on drop | ||
| var ctasDataFiles = getAllDataFilesFromTableDirectory(ctasTable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid to use var ?
| String ctasTableLocation = getTableLocation(ctasTable); | ||
| String ctasMetadataLocation = getLatestMetadataLocation(fileSystem, ctasTableLocation); | ||
| TableMetadata ctasMetadata = TableMetadataParser.read(FILE_IO_FACTORY.create(fileSystem), ctasMetadataLocation); | ||
| assertThat(ctasMetadata.properties().getOrDefault("gc.enabled", "")).isEqualTo(""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use isEmpty()
|
|
||
| // If user doesn't set gc.enabled, we need to set it to defaultNewTablesGcEnabled value | ||
| if (!extraProperties.containsKey(GC_ENABLED) && defaultNewTablesGcEnabled != GC_ENABLED_DEFAULT) { | ||
| propertiesBuilder.put(GC_ENABLED, Boolean.toString(defaultNewTablesGcEnabled)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like a breaking change to expire_snapshots?
But if other engine disable the gc flag, it's also a problem to Trino, user not able to expire snapshots due a unsupported or exposed property
|
After discussion with @raunaqmorarka and @ebyhr we figured that Alternatives are:
|
iceberg.default-new-tables-gc.enabledwill setgc.enabledflag for new tables created by Trino.Description
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: