Add properties from TableMetadata into Table entity internalProperties #2735
Conversation
Hi @collado-mike, what would be the utility of persisting just these fields? I think in order to optimize `loadTable` / commit, we would really need to store the entire metadata somewhere.
The change looks reasonable to me, and the code looks good.
Would just prefer to use constants to avoid typos at call sites.
This doesn't aim to optimize those APIs, but simply adds the fields for other possible uses - e.g., within Polaris, knowing whether the schema has changed. Now, it's true that this could be used to help optimize the `loadTable` / commit path.
LGTM 👍
```diff
-    PolarisEntitySubType.ICEBERG_TABLE, tableIdentifier, newLocation)
+    PolarisEntitySubType.ICEBERG_TABLE,
+    tableIdentifier,
+    Map.of(),
```
nit: If we're using the builder pattern, why have rich constructor parameters? Why not call `.setABC()`? In this case the `Map` is empty, so this parameter is redundant?
The problem is in setting the `internalProperties` map in the builder methods, because we have helper methods, such as `setMetadataLocation`, that aren't fields themselves but modify entries in the map. If we call the following, we're fine:

```java
builder
    .setInternalProperties(props)
    .setMetadataLocation(newLocation)
    .build();
```

but if we reverse the order and call the following, we're broken:

```java
builder
    .setMetadataLocation(newLocation)
    .setInternalProperties(props)
    .build();
```

It's not obvious from the caller's perspective, but `setMetadataLocation` modifies the underlying map, and then the `setInternalProperties` call completely overwrites the map, losing the value set in the previous call.

With the existing constructor, it's impossible to mis-order the setting of the `metadataLocation` and the `internalProperties` via the builder methods. If we used the builder for all properties all the time, it would be fine, but because we pass the `newLocation` parameter as a constructor arg, if we set the `internalProperties` field using the builder method, we lose the location parameter we just passed in.
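The pitfall described above can be reproduced with a minimal, self-contained sketch (the `EntityBuilder` class and the `metadata-location` key here are illustrative stand-ins, not the actual Polaris classes or keys):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for the entity builder discussed above.
class EntityBuilder {
    private Map<String, String> internalProperties = new HashMap<>();

    // Helper that is not a field itself: it writes directly into the map.
    EntityBuilder setMetadataLocation(String location) {
        internalProperties.put("metadata-location", location);
        return this;
    }

    // Replaces the whole map, silently discarding earlier helper writes.
    EntityBuilder setInternalProperties(Map<String, String> props) {
        internalProperties = new HashMap<>(props);
        return this;
    }

    Map<String, String> build() {
        return internalProperties;
    }
}

public class BuilderOrderDemo {
    public static void main(String[] args) {
        // Safe order: properties first, then the helper.
        Map<String, String> ok = new EntityBuilder()
            .setInternalProperties(Map.of("k", "v"))
            .setMetadataLocation("s3://bucket/metadata.json")
            .build();

        // Broken order: setInternalProperties overwrites the map,
        // dropping the metadata location set just before.
        Map<String, String> broken = new EntityBuilder()
            .setMetadataLocation("s3://bucket/metadata.json")
            .setInternalProperties(Map.of("k", "v"))
            .build();

        System.out.println(ok.containsKey("metadata-location"));      // true
        System.out.println(broken.containsKey("metadata-location"));  // false
    }
}
```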
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, TIL 🤔 However, having this kind of side effect in the codebase is pretty risky from a long-term maintenance perspective, IMHO. Would it be reasonable to refactor the builders / related code to allow for more intuitive usage (in another PR, of course)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was thinking about making `metadataLocation` and other settable map entries into distinct fields in the Builder so that they can just be added to the map in the `build` call, but... yeah, a future PR
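That refactor idea could be sketched roughly as follows: keep the location as a distinct builder field and only merge it into the map in `build()`, so call order no longer matters. Class and key names here are hypothetical, not the real Polaris code:

```java
import java.util.HashMap;
import java.util.Map;

// Order-safe variant: helper values live in their own fields and are
// merged into the map only at build time.
class OrderSafeBuilder {
    private Map<String, String> internalProperties = new HashMap<>();
    private String metadataLocation;

    OrderSafeBuilder setMetadataLocation(String location) {
        this.metadataLocation = location;
        return this;
    }

    OrderSafeBuilder setInternalProperties(Map<String, String> props) {
        this.internalProperties = new HashMap<>(props);
        return this;
    }

    Map<String, String> build() {
        Map<String, String> merged = new HashMap<>(internalProperties);
        if (metadataLocation != null) {
            merged.put("metadata-location", metadataLocation);
        }
        return merged;
    }
}

public class OrderSafeDemo {
    public static void main(String[] args) {
        // Either call order now yields the same merged result.
        Map<String, String> props = new OrderSafeBuilder()
            .setMetadataLocation("s3://bucket/metadata.json")
            .setInternalProperties(Map.of("k", "v"))
            .build();
        System.out.println(props.get("metadata-location")); // s3://bucket/metadata.json
        System.out.println(props.get("k"));                 // v
    }
}
```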
In what API is it useful to specifically know if the schema has changed? We already have ETag support for `loadTable` to know if any metadata has changed.
There aren't any APIs that use this information today. As stated in the PR description, the intention is to be able to look at the persistence layer to understand the state of the table without loading the entire `TableMetadata`. In part, this helps move us toward being able to support `TableMetadata` caching, which you reference above, but this is useful in itself and doesn't need to wait on that implementation.
LGTM!
```java
catalog.buildTable(TABLE, SCHEMA).create();
catalog.loadTable(TABLE).newFastAppend().appendFile(FILE_A).commit();
Table afterAppend = catalog.loadTable(TABLE);
EntityResult schemaResult =
```
```suggestion
EntityResult namespaceResult =
```

[nit] Let's use `namespace` here to be consistent with Polaris' terminology
```java
}

@Test
public void testTableInternalPropertiesStoredOnCommit() {
```
[Non-blocker] Since we are testing the extraction of metadata fields, it would be good if we could add or parametrize this test with different format versions (1, 2, 3). Technically the `TableMetadata` should be format-version agnostic, but a sanity check would be useful in case there's some upstream bug : )
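The sweep suggested above could be sketched as a plain loop over format versions; `MetadataStub` and `extractProps` are illustrative stand-ins, not the actual Polaris test fixtures or extraction code:

```java
import java.util.Map;

// Run the same extraction assertions across Iceberg format versions 1-3.
public class FormatVersionSweep {
    record MetadataStub(int formatVersion, long currentSnapshotId) {}

    // Hypothetical extraction: copy primitive fields into a string map.
    static Map<String, String> extractProps(MetadataStub md) {
        return Map.of(
            "format-version", Integer.toString(md.formatVersion()),
            "current-snapshot-id", Long.toString(md.currentSnapshotId()));
    }

    public static void main(String[] args) {
        for (int version : new int[] {1, 2, 3}) {
            Map<String, String> props = extractProps(new MetadataStub(version, 7L));
            // The extraction should behave identically for every version.
            if (!props.get("format-version").equals(Integer.toString(version))) {
                throw new AssertionError("unexpected format-version for v" + version);
            }
        }
        System.out.println("all format versions passed");
    }
}
```

In the real test this would more naturally be a JUnit `@ParameterizedTest` over the version values.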
I'm still confused about this point. In the example you gave, the UI wants to load a table schema and these fields could help do that. How would that work? What API would the UI call? Wouldn't it make sense to design the API first before making persistence changes?
I'm not sure it makes sense to make entity changes that aren't visible to any API without any clear pathway to making them so.
What API? Not every field stored in the persistence layer serves a public-facing API. For example, we store the cleanup task id in the internal properties field, and it is never exposed through a public-facing API. If we serve `TableMetadata` from persistence rather than cloud storage, there will be zero impact to any public-facing API. The UI may or may not rely solely on existing Polaris management APIs. Or we may build custom APIs strictly for the UI and query the database directly. `internalProperties` is specifically used to serve internal purposes. I don't understand your question about which public-facing API needs this information.
@eric-maynard, what's the justification for blocking this PR? I didn't merge on Friday even though I had received approval, because I was addressing your questions, so I'm not really sure why the block. I see no concern about the change being unsafe or incorrect. Is there a technical concern here?
It's helpful to be able to look at the Table entities in persistence and immediately know the state of the current snapshot, current schema, partition scheme, etc. without having to load the whole `TableMetadata`. This copies some of the primitive properties from the `TableMetadata` structure and sticks them into the Table `PolarisEntity` as `internalProperties`.
Currently, there are no other properties being stored except the metadata file location, the parent namespace, and the last notification timestamp. As is, we'll drop any other properties that are added to the `internalProperties` map. I went with this approach to avoid the properties map always being additive, but I'm happy to hear if folks think we should instead always copy the previous map and then overwrite properties.
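The copy described in the PR summary could look roughly like the following sketch; `TableMetadataStub` and the property keys are hypothetical stand-ins, since the actual field names used by the PR aren't shown in this thread:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the primitive fields of Iceberg's TableMetadata.
record TableMetadataStub(long currentSnapshotId, int currentSchemaId,
                         int defaultSpecId, int formatVersion) {}

public class InternalPropsDemo {
    // Copy primitive metadata fields into a string map suitable for
    // storing on the Table entity as internalProperties.
    static Map<String, String> extract(TableMetadataStub md) {
        Map<String, String> props = new HashMap<>();
        props.put("current-snapshot-id", Long.toString(md.currentSnapshotId()));
        props.put("current-schema-id", Integer.toString(md.currentSchemaId()));
        props.put("default-spec-id", Integer.toString(md.defaultSpecId()));
        props.put("format-version", Integer.toString(md.formatVersion()));
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> props = extract(new TableMetadataStub(42L, 3, 0, 2));
        System.out.println(props.get("current-snapshot-id")); // 42
        System.out.println(props.get("format-version"));      // 2
    }
}
```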