
fix(arrow-csv): bound RecordDecoder::flush offset accumulation#9886

Merged
alamb merged 2 commits into apache:main from
masumi-ryugo:fix/arrow-csv-records-overflow
May 6, 2026
Conversation

@masumi-ryugo
Contributor

Closes #9885.

What

RecordDecoder::flush walks the per-row offsets emitted by csv_core::Reader, accumulating them so that after the loop each end offset is absolute over self.data. The accumulator was a plain usize and the loop body did *x += offset; on malformed input that drives csv_core to emit row-relative offsets large enough to wrap a usize, this:

  • panics with attempt to add with overflow in debug builds (and the cargo-fuzz csv_reader harness that found this is built with --debug-assertions);
  • silently wraps to a wildly out-of-bounds index in release builds, which then trips an unrelated assert! / unwrap somewhere downstream.
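Both behaviors in the bullets above follow directly from Rust's integer-overflow semantics; a minimal standalone illustration (using `wrapping_add` to show the release-mode result deterministically):

```rust
fn main() {
    let a: usize = usize::MAX - 1;
    // In a release build, `a + 3` silently wraps; in a debug build it
    // panics with "attempt to add with overflow".
    // `wrapping_add` reproduces the release-mode value deterministically:
    assert_eq!(a.wrapping_add(3), 1);
    // `checked_add` makes the same overflow observable as `None` instead:
    assert_eq!(a.checked_add(3), None);
}
```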

Fix

Switch the accumulator to checked_add and surface the overflow as an ArrowError::CsvError instead. The loop body becomes a plain for loop because the ? operator doesn't compose with the previous closure-based for_each form.

let mut row_offset: usize = 0;
for row in self.offsets[1..self.offsets_len].chunks_exact_mut(self.num_columns) {
    let offset = row_offset;
    for x in row.iter_mut() {
        *x = x.checked_add(offset).ok_or_else(|| {
            ArrowError::CsvError(
                "CSV record offsets overflowed usize while flushing".to_string(),
            )
        })?;
        row_offset = *x;
    }
}
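To see the patched loop in action outside arrow-csv, here is a self-contained sketch of the same rebase logic (the function name, the flat `&mut [usize]` offsets buffer, and a `String` error are stand-ins for illustration, not arrow-csv's real types):

```rust
// Rebase row-relative end offsets to absolute offsets, failing cleanly
// on usize overflow instead of panicking or wrapping.
fn rebase_offsets(offsets: &mut [usize], num_columns: usize) -> Result<(), String> {
    let mut row_offset: usize = 0;
    for row in offsets.chunks_exact_mut(num_columns) {
        // All offsets in this row are relative to the end of the previous row.
        let offset = row_offset;
        for x in row.iter_mut() {
            *x = x
                .checked_add(offset)
                .ok_or_else(|| "CSV record offsets overflowed usize while flushing".to_string())?;
            row_offset = *x;
        }
    }
    Ok(())
}

fn main() {
    // Well-formed input: two rows of two columns, row-relative end offsets.
    let mut ok = [1usize, 3, 2, 5];
    assert!(rebase_offsets(&mut ok, 2).is_ok());
    assert_eq!(ok, [1, 3, 5, 8]); // each end offset is now absolute

    // Malformed input: relative offsets near usize::MAX would wrap the
    // accumulator; the patched loop returns Err instead.
    let mut bad = [usize::MAX - 1, usize::MAX, 1, 1];
    assert!(rebase_offsets(&mut bad, 2).is_err());
}
```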

Repro

The cargo-fuzz csv_reader harness from fuzz/initial-harnesses (per #5332) reproduces this from an empty corpus in single-digit minutes. The minimized repro is 72 bytes:

0000  2e 22 3f 0a 31 0a 3f 3f  0a 3c 50 50 0a 3f 0a 31  |."?.1.??.<PP.?.1|
0010  0a 3f 38 0a 3c 0a 3f 0a  3c 50 50 0a 3f 0a 31 0a  |.?8.<.?.<PP.?.1.|
0020  3f 38 0a 0a 2e 22 3f 0a  31 0a 3f 3f 0a ce ce ce  |?8..."?.1.??....|
0030  b1 ce ce ce ce ce ce ce  ce 31 0a 3f 38 0a 3c 0a  |.........1.?8.<.|
0040  3f 0a 3c 0a 3f 0a 3f 69                            |?.<.?.?i|

Before this PR (run on main HEAD against the cargo-fuzz harness):

thread '<unnamed>' panicked at arrow-csv/src/reader/records.rs:207:21:
attempt to add with overflow

After this PR the same 72 bytes pass through the fuzz target in 40 ms with exit 0; the API now returns ArrowError::CsvError(...) for callers to handle.

Tests

Adds reader::records::tests::test_flush_offset_overflow_does_not_panic, which feeds the 72-byte fuzz repro through RecordDecoder::decode + flush and asserts the loop terminates cleanly instead of panicking. The existing 4 tests in that module continue to pass.

Alternatives considered

  • Cap by self.data_len: each emitted offset is supposed to be ≤ self.data_len, so an explicit cap would also turn the overflow into a clean error. I went with checked_add because it's the more targeted change — it doesn't add a new invariant on csv_core's output, only refuses to compute something that would have been arithmetically nonsensical anyway.
  • Use saturating_add: would silently truncate the offset and then mis-slice self.data, producing a confusing Encountered invalid UTF-8 data error or panic deeper in the call stack. Worse signal.
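A small sketch of why saturating_add would merely defer the failure (a toy buffer standing in for self.data, not arrow-csv's actual slicing code):

```rust
fn main() {
    let data: &[u8] = b"abcdef";
    // An overflow-prone relative offset, silently clamped by `saturating_add`:
    let end = (usize::MAX - 1).saturating_add(10);
    assert_eq!(end, usize::MAX); // the overflow is hidden, not reported
    // The bogus absolute offset only fails later, when slicing the buffer:
    assert!(data.get(..end).is_none());
}
```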

xref #5332 #9883 #9884

Closes apache#9885.

`RecordDecoder::flush` walks the per-row offsets emitted by
`csv_core::Reader` and accumulates them so each end offset is
absolute over `self.data` after the loop. The accumulator was a
plain `usize` and the loop body did `*x += offset`, which on
malformed input that drives `csv_core` to emit row-relative offsets
large enough to wrap a `usize` panics with `attempt to add with
overflow` in debug builds and silently wraps in release builds.

The cargo-fuzz `csv_reader` harness being prototyped for apache#5332
reproduces this in single-digit minutes from an empty corpus with a
72-byte input. After this patch the same input returns
`ArrowError::CsvError("CSV record offsets overflowed usize while
flushing")` instead of panicking.

Includes a regression test in `reader::records::tests` driving the
72-byte fuzz repro through `RecordDecoder::decode` + `flush`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions bot added the arrow (Changes to the arrow crate) label May 3, 2026

@alamb alamb left a comment


Thank you for this @masumi-ryugo

Comment thread arrow-csv/src/reader/records.rs Outdated
}
decoder.clear();
}
// Reaching this assertion at all means we did not panic.


there is no assertion here 🤔

Comment thread arrow-csv/src/reader/records.rs Outdated
}
input = &input[consumed..];
// The buggy version panics inside `flush()`; the patched version
// either returns rows or surfaces a clean `CsvError`.


it should always return an error, right? Can we change the test to follow similar pattern to existing tests like

let err = decoder.flush().unwrap_err();
assert_eq!("msg", err.to_string());

I think that will be more concise and clearer what the expected value is

Comment thread arrow-csv/src/reader/records.rs Outdated
- Restore the original two-line "what" comment on the offset rebase
  loop and drop the verbose "add with overflow / debug vs release"
  exposition; the code is self-explanatory.
- Switch back to the existing `try_for_each` chain pattern instead of
  nested `for` loops, so the structure matches the rest of the file.
- Replace the regression test's loop-and-discard structure with a
  white-box construction that stages the overflow directly. The test
  now uses the standard `flush().unwrap_err()` + `assert_eq` pattern
  (matching `test_invalid_fields`) and asserts the specific
  `Csv error: CSV record offsets overflowed usize while flushing`
  message rather than just "did not panic".

@alamb alamb left a comment


Thank you @masumi-ryugo

@alamb alamb merged commit ded985c into apache:main May 6, 2026
23 checks passed

Labels

arrow Changes to the arrow crate


Development

Successfully merging this pull request may close these issues.

arrow-csv: integer overflow panic in Reader::records::flush

2 participants