[Python][Parquet] 21.0.0 release introduced memory leak when reading parquet #47266

@Tom-Newton

Description

Describe the bug, including details regarding any error messages, version, and platform.

Since 21.0.0, memory usage grows substantially when repeatedly reading a Parquet dataset from local disk. With version 20.0.0 the memory usage increased much less.

Script to reproduce

import tempfile
import numpy
import pyarrow
import pyarrow.dataset
import pyarrow.parquet
from memory_profiler import profile

def test_memory_leak():
    num_columns = 10
    num_rows = 5_000_000

    # Build a ~400 MB table of random doubles (10 columns x 5M float64 rows).
    data = {f"col_{i}": numpy.random.rand(num_rows) for i in range(num_columns)}
    table = pyarrow.Table.from_pydict(data)

    with tempfile.TemporaryDirectory() as temp_dir:
        pyarrow.dataset.write_dataset(table, temp_dir, format="parquet")

        @profile
        def read():
            return pyarrow.dataset.dataset(temp_dir).to_table()

        # Re-read the same dataset repeatedly; memory should stay roughly flat.
        for _ in range(50):
            read()

if __name__ == "__main__":
    test_memory_leak()

Environment

Ubuntu 24.04.2 LTS
Tested python 3.10.15 and python 3.12.3

Python packages:

$ pip freeze
memory-profiler==0.61.0
numpy==2.3.2
psutil==7.0.0
pyarrow==21.0.0

When using pyarrow==21.0.0 the memory usage increases with the iterations: after the first read it's at about 1.5 GiB, and after the 50th read it's at about 20 GiB. If I run the same test with pyarrow==20.0.0 the memory usage still increases slightly with the iterations, but it's still less than 2 GiB after the 50th iteration.
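The same growth can be tracked without memory_profiler by sampling RSS with psutil (already in the environment above). A minimal sketch; `rss_gib` is a helper name I'm introducing here:

```python
import os
import psutil  # already listed in the pip freeze above

process = psutil.Process(os.getpid())

def rss_gib() -> float:
    """Resident set size of this process in GiB."""
    return process.memory_info().rss / 2**30

# Example: call this after each iteration of the repro loop.
print(f"RSS: {rss_gib():.2f} GiB")
```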

Debugging

I ran a git bisect and identified #45979 as the commit that introduced the regression. Building from the 21.0.0 release commit with ARROW_MIMALLOC=OFF also solves the problem.

Component(s)

C++, Python
