Conversation

raunaqmorarka (Member) commented Sep 18, 2025

Description

alter table lineitem execute remove_orphan_files(retention_threshold => '0d');

            metric_name         | metric_value
    ----------------------------+--------------
     processed_manifests_count  |            2
     valid_files_count          |           98
     scanned_files_count        |           97
     deleted_files_count        |            0

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Iceberg
* Produce execution metrics in the output of the `remove_orphan_files` procedure. ({issue}`26661`)

Copilot (AI) left a comment

Pull Request Overview

This PR enhances the Iceberg connector's remove_orphan_files procedure to output execution metrics. Instead of returning no information, the procedure now returns a table with metrics including manifest file processing count, valid files found, files scanned, and files deleted.

  • Modified the executeTableExecute method signature across the SPI to return Map<String, Long> instead of void
  • Enhanced the remove_orphan_files implementation to collect and return execution metrics
  • Updated query result handling to display metrics as a two-column table with metric names and values
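The two-column display described above can be sketched as follows. This is an illustrative sketch only, not actual Trino code: it shows how a `Map<String, Long>` returned by `executeTableExecute` maps onto `(metric_name, metric_value)` rows, with all class and method names hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch (hypothetical names, not actual Trino code): render the
// metrics map returned by executeTableExecute as two-column rows.
public class MetricsOutputSketch {
    static String render(Map<String, Long> metrics) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : metrics.entrySet()) {
            // one output row per metric, preserving insertion order
            sb.append(e.getKey()).append(" | ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Long> metrics = new LinkedHashMap<>();
        metrics.put("processed_manifests_count", 2L);
        metrics.put("active_files_count", 98L);
        metrics.put("scanned_files_count", 97L);
        metrics.put("deleted_files_count", 0L);
        System.out.print(render(metrics));
    }
}
```

A `LinkedHashMap` is used here so the metric rows come out in the order the connector inserted them.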

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

Summary per file:

- core/trino-spi/src/main/java/io/trino/spi/connector/ConnectorMetadata.java: Changed executeTableExecute return type from void to Map<String, Long>
- plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java: Implemented metrics collection and return for the remove_orphan_files procedure
- core/trino-main/src/main/java/io/trino/operator/SimpleTableExecuteOperator.java: Updated the operator to build result pages from the returned metrics
- core/trino-main/src/main/java/io/trino/sql/planner/plan/SimpleTableExecuteNode.java: Changed from a single output symbol to a list of symbols for the metric name/value columns
- docs/src/main/sphinx/connector/iceberg.md: Added documentation for the new metrics output format


 void finishTableExecute(Session session, TableExecuteHandle handle, Collection<Slice> fragments, List<Object> tableExecuteState);

-void executeTableExecute(Session session, TableExecuteHandle handle);
+Map<String, Long> executeTableExecute(Session session, TableExecuteHandle handle);
Contributor:
Why Map<String, Long> instead of something more generic?

Contributor:

Like some connector-specific handle that has a method to get a Page and a list of symbols?

Member Author:

This method runs directly inside SimpleTableExecuteOperator, so we don't need anything complex.
There's no need for anything more generic here at the moment; Map<String, Long> satisfies the use case of all the Iceberg procedures where we want to add this.
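To make the connector-side shape concrete, here is a minimal sketch with hypothetical names (not the actual IcebergMetadata code): counters are collected while scanning candidate files, and returned as the `Map<String, Long>` that `executeTableExecute` now produces.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch (hypothetical names, not actual IcebergMetadata code):
// count scanned files, compare against the set of active files, and return
// the counters as the procedure's metrics map.
public class OrphanFileMetricsSketch {
    static Map<String, Long> removeOrphanFiles(Set<String> activeFiles, List<String> scannedFiles) {
        long deleted = 0;
        for (String file : scannedFiles) {
            if (!activeFiles.contains(file)) {
                deleted++; // a real implementation would delete the orphan file here
            }
        }
        Map<String, Long> metrics = new LinkedHashMap<>();
        metrics.put("active_files_count", (long) activeFiles.size());
        metrics.put("scanned_files_count", (long) scannedFiles.size());
        metrics.put("deleted_files_count", deleted);
        return metrics;
    }
}
```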

Member:

What if we wanted to expose additional metrics as double or some other type?

Contributor:

Sure, but it could be useful for other cases as well. I can imagine that other table procedures will output some kind of information.

Member Author:

We've had one BIGINT field all these years and failed to even populate that properly.
So no, I don't think there are many use cases waiting for a more generic framework.
I've gone through all the Iceberg procedures, and everything worth showing in the output is an integer; see the Outputs sections in https://iceberg.apache.org/docs/latest/spark-procedures/
There is nothing stopping anyone from using something more generic, like io.trino.spi.metrics.Metric, in the future. At the moment, I find it unnecessary.

 * and return procedure execution metrics that will be populated in the query output.
 */
-default void executeTableExecute(ConnectorSession session, ConnectorTableExecuteHandle tableExecuteHandle)
+default Map<String, Long> executeTableExecute(ConnectorSession session, ConnectorTableExecuteHandle tableExecuteHandle)
Member:

Are we planning to limit ourselves to Long, or can metrics be of another type, like double?

Member:

Or should we return them as a Row type?

Member Author:

It's limited to long for now, as that's what the current procedures need.

 List<MaterializedRow> expectedRows = expectedResults.getMaterializedRows();

-if (compareUpdate) {
+if (compareUpdate && !actualResults.getUpdateType().equals(Optional.of("ALTER TABLE EXECUTE"))) {
Member:

Instead of skipping here, can we introduce something like assertAlter, or a dedicated method that would set compareUpdate to false?

Member Author:

There should be no need for a special assertAlter method; we should be able to assert the more detailed output with assertUpdate itself.
This branch was written with the assumption that an update query can only produce a row-count output and nothing else. Since ALTER commands did not output anything before, this branch was never executed for them.
Now, with the more detailed output, we need to skip this branch to allow assertions on the detailed results.
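The guard in the diff above can be isolated as a small predicate. This is an illustrative sketch with a hypothetical method name, not the actual test-framework code: row-count comparison applies only when compareUpdate is set and the update type is not ALTER TABLE EXECUTE, so the detailed metric rows can be asserted like any other query result.

```java
import java.util.Optional;

// Illustrative sketch (hypothetical name, not actual test-framework code):
// the condition that decides whether to run the row-count comparison branch.
public class UpdateComparisonGuardSketch {
    static boolean shouldCompareRowCount(boolean compareUpdate, Optional<String> updateType) {
        // ALTER TABLE EXECUTE now returns metric rows, not a row count,
        // so it must bypass the row-count comparison.
        return compareUpdate && !updateType.equals(Optional.of("ALTER TABLE EXECUTE"));
    }
}
```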

ebyhr requested a review from martint, September 18, 2025 13:19
- Description
* - `processed_manifests_count`
- The count of manifest files read by remove_orphan_files.
* - `valid_files_count`
Contributor:

I think the meaning of "valid" here may be unclear to users. It doesn't seem very meaningful to expose it directly; could we add more details to clarify what it represents?

Member Author:

I've renamed it to active_files_count and clarified that these are files belonging to snapshots that have not been expired yet.

```text
        metric_name         | metric_value
----------------------------+--------------
 processed_manifests_count  |            2
```
Contributor:

I noticed the output in Spark looks different from this: https://iceberg.apache.org/docs/latest/spark-procedures/#output_7.

Member Author (raunaqmorarka) commented Sep 26, 2025:

Yes, the Spark output prints the deleted file names. I don't want to do that, since that is potentially a large list. The main use case here is to provide a summary of what the procedure did. Right now you get no clue whether the procedure did anything, unless you run metadata queries before and after and search for differences in the table.

alter table lineitem execute remove_orphan_files(retention_threshold => '0d');

        metric_name         | metric_value
----------------------------+--------------
 processed_manifests_count  |            2
 active_files_count         |           98
 scanned_files_count        |           97
 deleted_files_count        |            0