
PyIceberg appending data creates snapshots incompatible with Athena/Spark #1424

Open

Samreay opened this issue Dec 11, 2024 · 5 comments
@Samreay
Contributor

Samreay commented Dec 11, 2024

Apache Iceberg version

0.8.0

Please describe the bug 🐞

We append data to our Iceberg table using the Table.overwrite function, and this saves out snapshots with IDs that cannot be parsed by Athena's OPTIMIZE command, or by Spark:

java.lang.IllegalArgumentException: Cannot parse to a long value: snapshot-id: 9223372036854775808
        at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:446)
        at org.apache.iceberg.util.JsonUtil.getLong(JsonUtil.java:139)
        at org.apache.iceberg.SnapshotParser.fromJson(SnapshotParser.java:116)
        at org.apache.iceberg.TableMetadataParser.fromJson(TableMetadataParser.java:478)

Java's long max value is 9223372036854775807.
PyIceberg (or something under the hood; it might not be PyIceberg) has created a snapshot with ID 9223372036854775808, literally 1 + MAX_VALUE.
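
For reference, a quick arithmetic check (not part of the original report) that the offending ID is exactly one past the signed 64-bit range:

    # Java's Long.MAX_VALUE is 2**63 - 1; the reported snapshot ID is exactly 2**63.
    JAVA_LONG_MAX = 2**63 - 1              # 9223372036854775807
    offending_id = 9223372036854775808
    assert offending_id == JAVA_LONG_MAX + 1
    assert offending_id == 2**63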

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@kevinjqliu
Contributor

hi @Samreay, thanks for reporting this issue! Very odd that it's 1 + MAX_VALUE.

I took a look at the write path and didn't see anything that stood out that would cause this issue.
Could you post the metadata json file associated with this snapshot-id so we can debug further?

@Fokko
Contributor

Fokko commented Dec 16, 2024

I'm also not seeing how this could happen. To test this, I ran this script:

Python 3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.26.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import uuid
   ...: 
   ...: while True:
   ...:     rnd_uuid = uuid.uuid4()
   ...:     snapshot_id = int.from_bytes(
   ...:         bytes(lhs ^ rhs for lhs, rhs in zip(rnd_uuid.bytes[0:8], rnd_uuid.bytes[8:16])), byteorder="little", signed=True
   ...:     )
   ...:     snapshot_id = snapshot_id if snapshot_id >= 0 else snapshot_id * -1
   ...:     if snapshot_id > 9223372036854775807:
   ...:         print("Boom")
   ...: 

I think we should add a check to ensure that the snapshot_id is in the range [0, 9223372036854775807], and raise a ValueError otherwise. Would you be interested in contributing that, @Samreay? I think this should happen in the SnapshotProducer.
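
A minimal sketch of the kind of guard described above; the helper name and placement here are illustrative, not the actual PyIceberg SnapshotProducer API:

    _MAX_SNAPSHOT_ID = 9223372036854775807  # Java's Long.MAX_VALUE, i.e. 2**63 - 1

    def _validated_snapshot_id(snapshot_id: int) -> int:
        # Hypothetical guard: reject anything Java cannot parse as a long.
        if not (0 <= snapshot_id <= _MAX_SNAPSHOT_ID):
            raise ValueError(f"Snapshot ID is out of range for a 64-bit long: {snapshot_id}")
        return snapshot_id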

@kevinjqliu
Contributor

The above replicates the logic of _generate_snapshot_id

def _generate_snapshot_id() -> int:
    """Generate a new Snapshot ID from a UUID.

    Returns: An 64 bit long
    """
    rnd_uuid = uuid.uuid4()
    snapshot_id = int.from_bytes(
        bytes(lhs ^ rhs for lhs, rhs in zip(rnd_uuid.bytes[0:8], rnd_uuid.bytes[8:16])), byteorder="little", signed=True
    )
    snapshot_id = snapshot_id if snapshot_id >= 0 else snapshot_id * -1
    return snapshot_id
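
One input that would make the above exceed Java's range (my reading, not something confirmed in the thread): if the XOR of the two UUID halves happens to decode to the most negative signed 64-bit value, Python's arbitrary-precision negation yields exactly 2**63, i.e. 1 + MAX_VALUE. It is an astronomically unlikely draw, but a possible one; a minimal sketch forcing that input:

    # Force the XORed 8 bytes to decode (little-endian, signed) to -2**63.
    forced = bytes(7) + b"\x80"
    snapshot_id = int.from_bytes(forced, byteorder="little", signed=True)
    assert snapshot_id == -(2**63)

    # The sign flip in _generate_snapshot_id does not overflow in Python,
    # so the result is 2**63 = 9223372036854775808, one past Long.MAX_VALUE.
    snapshot_id = snapshot_id if snapshot_id >= 0 else snapshot_id * -1
    assert snapshot_id == 9223372036854775808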

@Samreay
Contributor Author

Samreay commented Dec 17, 2024

I'll see if I can track down the snapshot metadata. I'm also not sure how this would happen, but we've been exclusively using PyIceberg to create, remove, and append data to our Iceberg tables. Granted, the creation is using the Glue catalog, so I suppose there's potential for Amazon to be muddying the waters here.

@kevinjqliu
Contributor

I think the snapshot id is generated on the client side, so it's possible only if Glue is also committing to the table.

If you can share the metadata json, that would be helpful!
