Describe the bug
I can write a table with a binary column whose entries together are larger than 2**31-1 bytes.
Reading the file back fails with an overflow error.
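For reference, a quick arithmetic check of the sizes used in the reproduction. This is only a sketch: `INT32_MAX` and `total_bytes` are names introduced here for illustration, and comparing against the signed 32-bit maximum assumes the 32-bit offsets of the Arrow `binary` type are the relevant limit, which the error message suggests but which I have not confirmed.

```python
# Total payload of the "html" column in the reproduction.
# Assumption: the limit being hit is the signed 32-bit maximum.
INT32_MAX = 2**31 - 1

sizes = [5068563] * 500
total_bytes = sum(sizes)

print(total_bytes)              # 2534281500
print(total_bytes > INT32_MAX)  # True
```

So the column holds about 2.53 GB of data, roughly 387 MB more than fits in a signed 32-bit offset.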
To Reproduce
from pathlib import Path
import tempfile

import pyarrow
import arro3.io

schema = pyarrow.schema([pyarrow.field("html", pyarrow.binary())])
sizes = [5068563] * 500
new_table = pyarrow.Table.from_pylist([{"html": b"0" * size} for size in sizes], schema)

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir) / "evil.parquet"
    arro3.io.write_parquet(new_table, tmp)
    print(len(arro3.io.read_parquet(tmp).read_all()))
This throws the following error:
Traceback (most recent call last):
  File "overflow.py", line 87, in <module>
    print(len(arro3.io.read_parquet(tmp).read_all()))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Exception: C Data interface error: Parquet argument error: Parquet error: index overflow decoding byte array
Expected behavior
Should print 500.
Additional context
arro3 is just a thin wrapper around arrow-rs, which I'm using for convenience.
I'm using arro3-io version 0.5.1 and Python 3.12.
arro3 uses arrow-rs 54.1.0 and our internal library uses arrow-rs 55.1.0. The problem is present with both.
pyarrow can read the file just fine, which makes me think this might be a bug in arrow-rs.