Output metrics about remove_orphan_files execution #26661
base: master
Conversation
Pull Request Overview
This PR enhances the Iceberg connector's `remove_orphan_files` procedure to output execution metrics. Instead of returning no information, the procedure now returns a table with metrics including the manifest file processing count, valid files found, files scanned, and files deleted.
- Modified the `executeTableExecute` method signature across the SPI to return `Map<String, Long>` instead of `void`
- Enhanced the `remove_orphan_files` implementation to collect and return execution metrics (a rough sketch of the connector side follows below)
- Updated query result handling to display metrics as a two-column table with metric names and values
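As a rough illustration of the connector side (hypothetical; not the PR's actual code, and the orphan-file scan itself is omitted), the procedure can accumulate counters while it works and hand them back as the returned map:

```java
import com.google.common.collect.ImmutableMap;

import java.util.Map;

// Hypothetical sketch: counters collected while remove_orphan_files runs, returned as the
// Map<String, Long> that the engine renders as a two-column (metric_name, metric_value) result.
// The metric names match the ones shown in this PR's example output.
class RemoveOrphanFilesMetrics
{
    long processedManifests;
    long activeFiles;
    long scannedFiles;
    long deletedFiles;

    Map<String, Long> toMap()
    {
        return ImmutableMap.of(
                "processed_manifests_count", processedManifests,
                "active_files_count", activeFiles,
                "scanned_files_count", scannedFiles,
                "deleted_files_count", deletedFiles);
    }
}
```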
Reviewed Changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| core/trino-spi/src/main/java/io/trino/spi/connector/ConnectorMetadata.java | Changed `executeTableExecute` return type from `void` to `Map<String, Long>` |
| plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java | Implemented metrics collection and return for the `remove_orphan_files` procedure |
| core/trino-main/src/main/java/io/trino/operator/SimpleTableExecuteOperator.java | Updated operator to build result pages from returned metrics |
| core/trino-main/src/main/java/io/trino/sql/planner/plan/SimpleTableExecuteNode.java | Changed from single output symbol to list of symbols for metric name/value columns |
| docs/src/main/sphinx/connector/iceberg.md | Added documentation for the new metrics output format |
```diff
     void finishTableExecute(Session session, TableExecuteHandle handle, Collection<Slice> fragments, List<Object> tableExecuteState);

-    void executeTableExecute(Session session, TableExecuteHandle handle);
+    Map<String, Long> executeTableExecute(Session session, TableExecuteHandle handle);
```
Why Map<String, Long> instead of something more generic?
Like some connector-specific handle that has a method to get a Page and a list of symbols?
This method runs directly inside SimpleTableExecuteOperator, so we don't need anything complex.
There isn't a need for anything more generic here at the moment; Map<String, Long> covers the use case of all the Iceberg procedures where we want to add this.
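For context, here is a minimal sketch of how the engine side could turn that map into the two-column (metric_name, metric_value) page; the class and helper names are made up and the exact PR code may differ, but the trino-spi types used (Page, BlockBuilder, VARCHAR, BIGINT) are standard:

```java
import io.airlift.slice.Slices;
import io.trino.spi.Page;
import io.trino.spi.block.BlockBuilder;

import java.util.Map;

import static io.trino.spi.type.BigintType.BIGINT;
import static io.trino.spi.type.VarcharType.VARCHAR;

// Hypothetical helper: build a (VARCHAR, BIGINT) page from the metrics map
// returned by executeTableExecute.
final class MetricsPages
{
    private MetricsPages() {}

    static Page toMetricsPage(Map<String, Long> metrics)
    {
        BlockBuilder names = VARCHAR.createBlockBuilder(null, metrics.size());
        BlockBuilder values = BIGINT.createBlockBuilder(null, metrics.size());
        metrics.forEach((name, value) -> {
            VARCHAR.writeSlice(names, Slices.utf8Slice(name));
            BIGINT.writeLong(values, value);
        });
        return new Page(names.build(), values.build());
    }
}
```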
What if we want to expose additional metrics as a double or another type?
Sure, but it could be used for other cases as well. I can imagine that other table procedures will output some kind of information.
We've had one BIGINT field all these years and failed to even populate that properly.
So no, I don't think there are many use cases waiting for a more generic framework.
I've gone through all the iceberg procedures and everything worth showing in the output is an integer. See Outputs in https://iceberg.apache.org/docs/latest/spark-procedures/
There is nothing stopping anyone from using something more generic like io.trino.spi.metrics.Metric in the future. At the moment, I find it to be unnecessary.
Force-pushed from 3577a9d to b63db28.
```diff
      * and return procedure execution metrics that will be populated in the query output.
      */
-    default void executeTableExecute(ConnectorSession session, ConnectorTableExecuteHandle tableExecuteHandle)
+    default Map<String, Long> executeTableExecute(ConnectorSession session, ConnectorTableExecuteHandle tableExecuteHandle)
```
Are we planning to limit ourselves to Long, or can metrics be of an additional type like double?
Or should we return them in a Row type?
It's limited to long for now, as that's what the current procedures need.
```diff
         List<MaterializedRow> expectedRows = expectedResults.getMaterializedRows();

-        if (compareUpdate) {
+        if (compareUpdate && !actualResults.getUpdateType().equals(Optional.of("ALTER TABLE EXECUTE"))) {
```
Instead of skipping here, can we introduce something like assertAlter or a dedicated method which would set compareUpdate to false?
There should be no need for a special assertAlter method. We should be able to assert the more detailed output with assertUpdate itself.
This branch is written with the assumption that an update query can only produce a row count and nothing else. Since ALTER commands did not output anything before, this branch was never executed for them.
Now, with the more detailed output, we need to skip over this branch to allow the assertion on detailed results.
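For illustration, a hypothetical test sketch (not the PR's test code) of asserting the detailed output once the row-count-only assumption is lifted; assume it sits in a test class extending AbstractTestQueryFramework, the table name is made up, and only the metric names (not their values) are checked:

```java
// Hypothetical sketch; relies on computeActual() from AbstractTestQueryFramework and AssertJ.
// The metric names are the ones this PR emits.
@Test
public void testRemoveOrphanFilesOutputsMetrics()
{
    MaterializedResult result = computeActual(
            "ALTER TABLE orders EXECUTE remove_orphan_files(retention_threshold => '0d')");
    assertThat(result.getMaterializedRows())
            .extracting(row -> row.getField(0))
            .contains(
                    "processed_manifests_count",
                    "active_files_count",
                    "scanned_files_count",
                    "deleted_files_count");
}
```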
```text
  - Description
* - `processed_manifests_count`
  - The count of manifest files read by remove_orphan_files.
* - `valid_files_count`
```
I think the meaning of `valid` here may be unclear to users. It doesn't seem very meaningful to expose it directly; could we add more details to clarify what it represents/contains?
I've renamed it to `active_files_count` and clarified that these are files belonging to snapshots that have not been expired yet.
```text
        metric_name         | metric_value
----------------------------+--------------
 processed_manifests_count  |            2
```
I noticed the output in Spark looks different from this: https://iceberg.apache.org/docs/latest/spark-procedures/#output_7.
Yes, the Spark output is printing deleted file names. I don't want to do that, since that is potentially a large list. The main use case here is to provide a summary of what the procedure did. Right now you get no clue whether the procedure did anything at all, unless you run metadata queries before and after and search for differences in the table.
Force-pushed from b63db28 to c75a156.
```text
alter table lineitem execute remove_orphan_files(retention_threshold => '0d');

        metric_name         | metric_value
----------------------------+--------------
 processed_manifests_count  |            2
 active_files_count         |           98
 scanned_files_count        |           97
 deleted_files_count        |            0
```
Force-pushed from c75a156 to 0a72e6b.
Description
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: