Skip to content

Equality delete column constraints are not enforced #12971

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
2 of 3 tasks
weijiii opened this issue May 5, 2025 · 8 comments
Open
2 of 3 tasks

Equality delete column constraints are not enforced #12971

weijiii opened this issue May 5, 2025 · 8 comments
Labels
bug Something isn't working Specification Issues that may introduce spec changes.

Comments

@weijiii
Copy link

weijiii commented May 5, 2025

Apache Iceberg version

1.8.1

Query engine

None

Please describe the bug 🐞

From the equality-delete-files doc, only primitive types are allowed to be used as equality delete columns, excluding FLOAT and DOUBLE types.

We bumped into a use case where downstream is generating delete files directly with DeleteWriteBuilder and the delete column is not supported. This issue was only discovered when reading the table.

It seems that DeleteWriteBuilder::buildEqualityWriter for all the formats should check on the constraints based on the schema and equalityFieldIds

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@weijiii weijiii added the bug Something isn't working label May 5, 2025
@manuzhang manuzhang added the Specification Issues that may introduce spec changes. label May 5, 2025
@JeonDaehong
Copy link
Contributor

Hello, just checking—are you planning to work on a PR for this yourself?

@RussellSpitzer
Copy link
Member

RussellSpitzer commented May 8, 2025 via email

@weijiii
Copy link
Author

weijiii commented May 8, 2025

Do we say in the spec you can’t use those types? I was checking and didn’t
see anything specific

@RussellSpitzer I was referencing the equality-delete doc which says it has the same constraints as the identifier-fields with the exception that optional columns and fields nested under optional structs are allowed. The identifier-fields doc says that only primitive fields are allowed, and I assume this means they should of primitive type (struct is a nested type)? Do I understand the doc correctly?

@weijiii
Copy link
Author

weijiii commented May 8, 2025

Hello, just checking—are you planning to work on a PR for this yourself?

@JeonDaehong Yes I would like to give it a try. Thanks for checking!

@stevenzwu
Copy link
Contributor

stevenzwu commented May 8, 2025

The problem is that the identifier fields rule (e.g. excluding FLOAT and DOUBLE types) is not enforced in your case? Yes, I would agree that it should be enforced, as floating numbers can exhibit differences across platforms and hence are not suitable for.

Schema class already validate the field types of identifier fields.
https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/Schema.java#L157

It seems that DeleteWriteBuilder::buildEqualityWriter for all the formats should check on the constraints based on the schema and equalityFieldIds

I am not sure we should add such enforcement to Parquet/Orc writer.

It seems that In your case someone is producing the equality delete file with custom code using Iceberg Java SDK. Is the identifier fields set for the table schema? Validation above should fail such schema change. If a writer is not conforming to the spec, it is an implementation bug of the writer.

@JeonDaehong
Copy link
Contributor

Hello, just checking—are you planning to work on a PR for this yourself?

@JeonDaehong Yes I would like to give it a try. Thanks for checking!

I was cheering for the PR back then ! :D

@weijiii
Copy link
Author

weijiii commented May 8, 2025

The problem is that the identifier fields rule (e.g. excluding FLOAT and DOUBLE types) is not enforced in your case? Yes, I would agree that it should be enforced, as floating numbers can exhibit differences across platforms and hence are not suitable for.

Schema class already validate the field types of identifier fields. https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/Schema.java#L157

It seems that DeleteWriteBuilder::buildEqualityWriter for all the formats should check on the constraints based on the schema and equalityFieldIds

I am not sure we should add such enforcement to Parquet/Orc writer.

It seems that In your case someone is producing the equality delete file with custom code using Iceberg Java SDK. Is the identifier fields set for the table schema? Validation above should fail such schema change. If a writer is not conforming to the spec, it is an implementation bug of the writer.

@stevenzwu If we directly use the file format writers to write the delete files, then yes I think FLOAT / DOUBLE would be allowed right now. The case I am talking about is a struct type though. Could you help clarify if I am understanding the spec correctly, that nested types are also not allowed for both the identifier fields and equality delete columns? From identifier-fields doc

A schema can optionally track the set of primitive fields that identify rows in a table, 

Or is it allowed when that struct type is optional?

Is it allowed to have an equality delete, e.g.

row  = (1, 2, 'string')

equality_ids=[1]

 1: row            | 
-------------------|
 (1, ,2, 'string') |

if column row is optional?

@weijiii
Copy link
Author

weijiii commented May 8, 2025

It seems that In your case someone is producing the equality delete file with custom code using Iceberg Java SDK. Is the identifier fields set for the table schema? Validation above should fail such schema change. If a writer is not conforming to the spec, it is an implementation bug of the writer.

Ah interesting I take it you mean that when we use RowDelta::addDeletes (assuming this is the correct API) to add a DeleteFile, the identifier-field-ids would also be configured if not already? And this would constitute a schema change? I am fairly new to the code base so I will have to read more to get familiar here 😄 Is it required to set the identifier fields for any equality deletes to work?

Schema class already validate the field types of identifier fields.
https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/Schema.java#L157

But I am curious isn't this method validating the constraints for the identifier fields? From the equality-delete spec it says there are some exceptions right?

The column restrictions for columns used in equality delete files are the same as those for [identifier fields](https://iceberg.apache.org/spec/?h=equality#identifier-field-ids) with the exception that optional columns and columns nested under optional structs are allowed (if a parent struct column is null it implies the leaf column is null).

FWIW the case I am seeing is directly using the delete writer, e.g. you can see this usage for Trino testing; I was working on this Trino PR to allow equality deletes on columns of struct type (row type in Trino).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Specification Issues that may introduce spec changes.
Projects
None yet
Development

No branches or pull requests

5 participants