Error when reading row group larger than 2GB (total string length per 8k row batch exceeds 2GB) #7973

@ben-freist

Description

Describe the bug
I can write a table with a column whose entries together total more than 2**31 - 1 bytes.
Reading it back fails with an overflow error.

To Reproduce

from pathlib import Path
import pyarrow
import tempfile
import arro3.io

schema = pyarrow.schema([pyarrow.field("html", pyarrow.binary())])
sizes = [5068563] * 500
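# 500 values of ~5 MB each: 500 * 5068563 = 2,534,281,500 bytes of binary data,
# more than the 2**31 - 1 bytes a 32-bit offset buffer can address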

new_table = pyarrow.Table.from_pylist([{"html": b"0" * size} for size in sizes], schema)

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir) / "evil.parquet"
    arro3.io.write_parquet(new_table, tmp)
    print(len(arro3.io.read_parquet(tmp).read_all()))

This throws the following error:

Traceback (most recent call last):
  File "overflow.py", line 87, in <module>
    print(len(arro3.io.read_parquet(tmp).read_all()))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Exception: C Data interface error: Parquet argument error: Parquet error: index overflow decoding byte array

Expected behavior
Should print 500.

Additional context
arro3 is just a thin wrapper around arrow-rs, which I'm using for convenience.
I'm using arro3-io version 0.5.1 and Python 3.12.
arro3 uses arrow-rs 54.1.0 and our internal library uses arrow-rs 55.1.0. The problem is present with both.
pyarrow can read the file just fine, which is what makes me think this is an error in arrow-rs.
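
For comparison, here is a minimal sketch of the pyarrow read that succeeds on the same data (assuming the repro above is changed to keep the file at a fixed path such as evil.parquet instead of a temporary directory):

import pyarrow.parquet as pq

# Reading with pyarrow succeeds on the same file; the binary column comes back as a
# ChunkedArray, presumably split so that no single chunk needs offsets past 2**31 - 1.
table = pq.read_table("evil.parquet")  # path assumed from the repro above
print(table.num_rows)  # 500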
