Error when reading row group larger than 2GB (total string length per 8k row batch exceeds 2GB) #7973

@ben-freist

Description

Describe the bug
I can write a table with a column whose entries together total more than 2**31 - 1 bytes.
Reading it back fails with an overflow error.

To Reproduce

from pathlib import Path
import pyarrow
import tempfile
import arro3.io

schema = pyarrow.schema([pyarrow.field("html", pyarrow.binary())])
sizes = [5068563] * 500
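# 500 values of ~5 MB each: 500 * 5068563 = 2,534,281,500 bytes of binary data,
# more than the 2**31 - 1 bytes a 32-bit offset buffer can address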

new_table = pyarrow.Table.from_pylist([{"html": b"0" * size} for size in sizes], schema)

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir) / "evil.parquet"
    arro3.io.write_parquet(new_table, tmp)
    print(len(arro3.io.read_parquet(tmp).read_all()))

This throws the following error:

Traceback (most recent call last):
  File "overflow.py", line 87, in <module>
    print(len(arro3.io.read_parquet(tmp).read_all()))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Exception: C Data interface error: Parquet argument error: Parquet error: index overflow decoding byte array

Expected behavior
Should print 500.

Additional context
arro3 is just a thin wrapper around arrow-rs, which I'm using for convenience.
I'm using arro3-io version 0.5.1 and Python 3.12.
arro3 uses arrow-rs 54.1.0 and our internal library uses arrow-rs 55.1.0. The problem is present with both.
pyarrow can read the file just fine, which is what makes me think this is an error in arrow-rs.
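
For comparison, here is a minimal sketch of the pyarrow read that succeeds on the same data (assuming the repro above is changed to keep the file at a fixed path such as evil.parquet instead of a temporary directory):

import pyarrow.parquet as pq

# Reading with pyarrow succeeds on the same file; the binary column comes back as a
# ChunkedArray, presumably split so that no single chunk needs offsets past 2**31 - 1.
table = pq.read_table("evil.parquet")  # path assumed from the repro above
print(table.num_rows)  # 500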
