@paleolimbot paleolimbot commented Nov 20, 2025

This PR moves the process of collecting an array stream from R into C/C++. In R, the volume of preserve/protect calls made garbage collection very, very slow.

Doesn't quite solve #822 but it should help!

Reproducer for generating an IPC file with a lot of strings:

library(nanoarrow)

ascii_bytes <- vapply(letters, charToRaw, raw(1), USE.NAMES = FALSE)

random_string_array <- function(n = 1, n_chars = 16) {
  data_buffer <- sample(ascii_bytes, n_chars * n, replace = TRUE)
  offsets_buffer <- as.integer(seq(0, n * n_chars, length.out = n + 1))
  nanoarrow_array_modify(
    nanoarrow_array_init(na_string()),
    list(
      length = n,
      null_count = 0,
      buffers = list(NULL, offsets_buffer, data_buffer)
    )
  )
}

random_string_struct <- function(n_rows = 1024, n_cols = 1, n_chars = 16) {
  col_names <- sprintf("col%03d", seq_len(n_cols))
  col_types <- rep(list(na_string()), n_cols)
  names(col_types) <- col_names
  schema <- na_struct(col_types)
  
  columns <- lapply(
    col_names,
    function(...) random_string_array(n_rows, n_chars = n_chars)
  )
  
  nanoarrow_array_modify(
    nanoarrow_array_init(schema),
    list(
      length = n_rows,
      null_count = 0,
      children = columns
    )
  )  
}

random_string_batches <- function(n_batches = 1, n_rows = 1, n_cols = 1, n_chars = 16) {
  lapply(
    seq_len(n_batches),
    function(...) random_string_struct(n_rows, n_cols, n_chars)
  )
}

batches <- random_string_batches(n_batches = 100, n_cols = 160)
stream <- basic_array_stream(batches)
write_nanoarrow(stream, "many_strings.arrows")

...in a separate R session, the very slow GC runs seemed to go away (but it would be great to have a check!)

library(nanoarrow)

df <- read_nanoarrow("many_strings.arrows") |> 
  convert_array_stream()
nanoarrow:::preserved_count()
#> [1] 0
system.time(gc(), gcFirst = FALSE)
#> user  system elapsed 
#> 0.036   0.001   0.037

@paleolimbot paleolimbot marked this pull request as ready for review December 4, 2025 02:45
@paleolimbot paleolimbot requested a review from Copilot December 4, 2025 02:47
Copilot AI left a comment

Pull request overview

This PR refactors array stream collection to occur in C/C++ rather than R, addressing garbage collection performance issues caused by preserve/protect volume problems when handling large numbers of arrays in R.

Key changes:

  • Adds C++ implementation for collecting array stream batches into a vector
  • Introduces nanoarrow C++ header files with unique pointer wrappers and utility classes
  • Updates R code to use the new C-based collection approach
  • Removes the deletion of .hpp files from the bootstrap process
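
The first two changes above can be illustrated with a minimal sketch. It is not the code from this PR: `ToyArray`, `ToyStream`, `UniqueToyArray`, and `CollectStream` are hypothetical stand-ins, loosely modelled on the Arrow C stream interface convention that a released array (`release == nullptr`) marks the end of the stream. The point is only that each batch is owned by a move-only C++ wrapper inside a `std::vector`, so no per-batch R object has to be preserved while collecting.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the Arrow C data/stream structs; the real
// ArrowArray and ArrowArrayStream carry more fields than shown here.
struct ToyArray {
  int64_t length = 0;
  void (*release)(ToyArray*) = nullptr;
};

struct ToyStream {
  int (*get_next)(ToyStream*, ToyArray*) = nullptr;
  int remaining = 0;  // number of batches this toy stream will still produce
};

static int g_live_arrays = 0;  // arrays produced but not yet released

static void ToyRelease(ToyArray* array) {
  --g_live_arrays;
  array->release = nullptr;
}

static int ToyGetNext(ToyStream* stream, ToyArray* out) {
  if (stream->remaining == 0) {
    out->release = nullptr;  // a released array signals end of stream
    return 0;
  }
  --stream->remaining;
  out->length = 1024;
  out->release = &ToyRelease;
  ++g_live_arrays;
  return 0;
}

// Move-only owner that releases its array exactly once on destruction.
class UniqueToyArray {
 public:
  explicit UniqueToyArray(ToyArray array) : array_(array) {}
  UniqueToyArray(UniqueToyArray&& other) noexcept : array_(other.array_) {
    other.array_.release = nullptr;  // ownership moved; source must not release
  }
  UniqueToyArray& operator=(UniqueToyArray&& other) noexcept {
    if (this != &other) {
      reset();
      array_ = other.array_;
      other.array_.release = nullptr;
    }
    return *this;
  }
  UniqueToyArray(const UniqueToyArray&) = delete;
  UniqueToyArray& operator=(const UniqueToyArray&) = delete;
  ~UniqueToyArray() { reset(); }

  const ToyArray* get() const { return &array_; }

 private:
  void reset() {
    if (array_.release != nullptr) array_.release(&array_);
  }
  ToyArray array_;
};

// Pull batches until the stream reports an error or signals its end.
// Returns 0 on success, or the first non-zero code from get_next.
int CollectStream(ToyStream* stream, std::vector<UniqueToyArray>* out) {
  while (true) {
    ToyArray next;
    int code = stream->get_next(stream, &next);
    if (code != 0) return code;
    if (next.release == nullptr) return 0;  // end of stream
    out->emplace_back(next);                // the vector now owns the batch
  }
}
```

In this sketch, a caller would set `get_next` to `ToyGetNext`, pass the stream and an empty vector to `CollectStream`, and rely on the vector's destructor to release every collected batch.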

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| r/tools/make-callentries.R | Updates file pattern to include .cc extension |
| r/src/nanoarrow_ipc.hpp | New header defining IPC-related unique pointer wrappers |
| r/src/nanoarrow_cpp.cc | Implements nanoarrow_c_collect_array_stream function for batch collection |
| r/src/nanoarrow.hpp | New comprehensive header with C++ helpers for Arrow structures |
| r/src/init.c | Registers new C function for array stream collection |
| r/bootstrap.R | Removes cleanup of .hpp files |
| r/R/convert-array-stream.R | Refactors to use new C-based collection approach |

@paleolimbot paleolimbot merged commit 0af8925 into apache:main Dec 4, 2025
13 checks passed
@paleolimbot paleolimbot deleted the start-fix-convert branch December 4, 2025 02:58