Describe the bug, including details regarding any error messages, version, and platform.
Since 21.0.0, memory usage grows substantially when repeatedly reading a Parquet dataset from local disk. With version 20.0.0, memory usage increased far less.
Script to reproduce
import tempfile

import numpy
import pyarrow
import pyarrow.dataset
import pyarrow.parquet
from memory_profiler import profile


def test_memory_leak():
    num_columns = 10
    num_rows = 5_000_000
    # ~400 MB of float64 data across 10 columns.
    data = {f"col_{i}": numpy.random.rand(num_rows) for i in range(num_columns)}
    table = pyarrow.Table.from_pydict(data)
    with tempfile.TemporaryDirectory() as temp_dir:
        pyarrow.dataset.write_dataset(table, temp_dir, format="parquet")

        @profile
        def read():
            # Read the whole dataset back into memory; the returned table
            # is discarded immediately, so memory should not accumulate.
            return pyarrow.dataset.dataset(temp_dir).to_table()

        for _ in range(50):
            read()


if __name__ == "__main__":
    test_memory_leak()
Environment
Ubuntu 24.04.2 LTS
Tested with Python 3.10.15 and Python 3.12.3
Python packages:
$ pip freeze
memory-profiler==0.61.0
numpy==2.3.2
psutil==7.0.0
pyarrow==21.0.0
When using pyarrow==21.0.0, memory usage increases with the iterations: after the first read it is at about 1.5 GiB, and after the 50th read it is at about 20 GiB. If I run the same test with pyarrow==20.0.0, memory usage still increases slightly with the iterations, but it stays below 2 GiB after the 50th iteration.
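For reference, the same growth can be tracked without memory_profiler by printing the process RSS after every read. Below is a minimal self-contained sketch using psutil (already pinned in the environment above); the helper name rss_gib is mine, not part of the original script:

import os
import tempfile

import numpy
import psutil
import pyarrow
import pyarrow.dataset


def rss_gib():
    # Resident set size of the current process, in GiB.
    return psutil.Process(os.getpid()).memory_info().rss / 2**30


def main():
    data = {f"col_{i}": numpy.random.rand(5_000_000) for i in range(10)}
    table = pyarrow.Table.from_pydict(data)
    with tempfile.TemporaryDirectory() as temp_dir:
        pyarrow.dataset.write_dataset(table, temp_dir, format="parquet")
        for i in range(50):
            # The returned table is dropped immediately; RSS should stay flat.
            pyarrow.dataset.dataset(temp_dir).to_table()
            print(f"iteration {i + 1}: rss = {rss_gib():.2f} GiB")


if __name__ == "__main__":
    main()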
Debugging
I ran a git bisect and identified #45979 as the commit that introduced the regression. Building from the 21.0.0 release commit with ARROW_MIMALLOC=OFF also eliminates the problem.
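One more runtime cross-check that does not require rebuilding: route the scan's allocations through a different memory pool. This is a sketch assuming the 21.0.0 wheel ships the system allocator as an alternative backend; dataset_dir is a hypothetical placeholder for the directory written by the reproduction script. Setting the environment variable ARROW_DEFAULT_MEMORY_POOL=system before starting Python should have a similar effect.

import pyarrow
import pyarrow.dataset

# Report which allocator backs the default pool ("mimalloc" on 21.0.0 wheels).
print("default pool backend:", pyarrow.default_memory_pool().backend_name)

# Hypothetical placeholder: the directory written by the reproduction script.
dataset_dir = "/path/to/parquet/dataset"

# Repeat the read loop, but allocate through the system allocator instead of
# the default (mimalloc) pool. If RSS stays flat here but grows with the
# default pool, the growth points at allocator retention rather than Arrow
# buffers that are never released.
pool = pyarrow.system_memory_pool()
for i in range(50):
    pyarrow.dataset.dataset(dataset_dir).to_table(memory_pool=pool)
    print(f"iteration {i + 1}: pool bytes allocated = {pool.bytes_allocated()}")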
Component(s)
C++, Python