Replies: 5 comments 3 replies
-
You might want to try creating
-
Hello @adamreeve, thanks for the input. I have tried to change my code to use the Dataset API:

```cpp
arrow::Result<std::shared_ptr<arrow::RecordBatchReader>> ReadFilteredParquetRecordReader(
    const std::string &parquet_path, const std::int64_t start, const std::int64_t end)
{
  // Create a local filesystem
  ARROW_ASSIGN_OR_RAISE(auto filesystem, fs::FileSystemFromUri("file:///"));

  // Set up the file info for a single file
  fs::FileInfo file_info(parquet_path, fs::FileType::File);

  // Create a Parquet file format
  auto format = std::make_shared<ds::ParquetFileFormat>();
  auto parquet_scan_options = std::make_shared<arrow::dataset::ParquetFragmentScanOptions>();

  // Configure general Parquet reader settings
  auto reader_properties = std::make_shared<parquet::ReaderProperties>(arrow::default_memory_pool());
  reader_properties->set_buffer_size(64 * 1024 * 1024);  // 64 MiB buffer size
  reader_properties->enable_buffered_stream();

  // Configure Arrow-specific Parquet reader settings
  auto arrow_reader_props = std::make_shared<parquet::ArrowReaderProperties>();
  arrow_reader_props->set_batch_size(10000);  // default is 64 * 1024
  arrow_reader_props->set_use_threads(true);  // Enable multithreading
  arrow_reader_props->set_pre_buffer(false);  // Disable pre-buffering

  parquet_scan_options->reader_properties = reader_properties;
  parquet_scan_options->arrow_reader_properties = arrow_reader_props;

  arrow::dataset::FileSystemFactoryOptions options;
  ARROW_ASSIGN_OR_RAISE(auto factory, arrow::dataset::FileSystemDatasetFactory::Make(
                                          filesystem, {file_info}, format, options));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

  // Create a scanner builder
  ARROW_ASSIGN_OR_RAISE(auto scan_builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(scan_builder->FragmentScanOptions(parquet_scan_options));

  // Set a filter: start <= "INDEX" <= end
  auto filter = cp::and_(cp::greater_equal(cp::field_ref("INDEX"), cp::literal(start)),
                         cp::less_equal(cp::field_ref("INDEX"), cp::literal(end)));
  ARROW_RETURN_NOT_OK(scan_builder->Filter(filter));

  // Specify the columns to read, e.g., "DATA"
  ARROW_RETURN_NOT_OK(scan_builder->Project({"DATA"}));

  // Finish the scanner
  ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder->Finish());
  scanner->options()->use_threads = true;      // Enable multithreading
  scanner->options()->cache_metadata = false;  // Do not cache fragment metadata
  // scanner->options()->batch_readahead = 10;    // Batch readahead
  // scanner->options()->fragment_readahead = 5;  // Fragment readahead

  // Return a streaming reader over the filtered, projected data
  return scanner->ToRecordBatchReader();
}
```

But the memory usage reduction was less pronounced than I was expecting. I was also able to use the Parquet statistics to determine which row groups I need; reading those row groups sequentially kept memory usage under control, but it was slower than using the Dataset API.
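For reference, a minimal sketch of that row-group approach (the row-group indices are assumed to come from inspecting the column statistics in the file metadata; exact reader signatures vary slightly across Arrow versions):

```cpp
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

// Sketch: read only the needed row groups, one at a time, to bound memory.
// `row_groups` is assumed to have been chosen from the file's column statistics.
arrow::Status ProcessSelectedRowGroups(const std::string &parquet_path,
                                       const std::vector<int> &row_groups)
{
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(parquet_path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(input, arrow::default_memory_pool(), &reader));
  for (int rg : row_groups) {
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->ReadRowGroup(rg, &table));  // one row group in memory at a time
    // ... process `table` ...
  }
  return arrow::Status::OK();
}
```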
-
You can also try using another `MemoryPool` to see if that helps. See https://arrow.apache.org/docs/cpp/env_vars.html#envvar-ARROW_DEFAULT_MEMORY_POOL
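For example, a sketch of switching the scan to the system allocator (`scan_builder` is assumed to be the `arrow::dataset::ScannerBuilder` from the code above; alternatively, set the `ARROW_DEFAULT_MEMORY_POOL` environment variable before starting the process):

```cpp
// Sketch: route the scan's allocations through the system allocator instead
// of the default pool (jemalloc or mimalloc, depending on the build).
ARROW_RETURN_NOT_OK(scan_builder->Pool(arrow::system_memory_pool()));
```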
-
Hello. The suggestion by @adamreeve to reduce `batch_readahead` was effective in reducing the memory consumption, with an increase in the time to read the file. What I found more unexpected is that the memory used (with a readahead of 16) was many times greater than the size of the uncompressed file: about 70 GB, while the Parquet file is 6.8 GB and the saved column is 7.6 GB uncompressed.
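For reference, these knobs can also be set on the `ScannerBuilder` before `Finish()`; a minimal sketch (values are illustrative):

```cpp
// Sketch: lower readahead values reduce peak memory at the cost of throughput.
ARROW_RETURN_NOT_OK(scan_builder->BatchReadahead(16));    // batches read ahead
ARROW_RETURN_NOT_OK(scan_builder->FragmentReadahead(1));  // fragments read ahead
```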
-
In addition to
-
Hello. I am trying to use the C++ Dataset API with predicate pushdown and projections to simplify reading a Parquet file. However, it appears not to matter (from the memory usage point of view) whether I materialize the scanner directly into a table using `scanner->ToTable()` or use `scanner->ToRecordBatchReader()`. I have two functions that differ only in their return type (`arrow::RecordBatchReader` or `arrow::Table`). I was expecting the `RecordBatchReader` approach to reduce the memory usage. Next I iterate over the batches as shown here:
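(A minimal sketch of the loop; `reader` is assumed to be the `arrow::RecordBatchReader` returned by the function.)

```cpp
// Stream batches one at a time instead of materializing the whole table.
std::shared_ptr<arrow::RecordBatch> batch;
while (true) {
  ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
  if (batch == nullptr) break;  // end of stream
  // ... process `batch` ...
}
```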
But the memory usage is almost the same as in the case of returning a table directly and iterating over its chunks.
Thanks =)