fix(arrow-csv): bound RecordDecoder::flush offset accumulation#9886
Merged
alamb merged 2 commits into apache:main on May 6, 2026
Conversation
Closes apache#9885.

`RecordDecoder::flush` walks the per-row offsets emitted by `csv_core::Reader` and accumulates them so each end offset is absolute over `self.data` after the loop. The accumulator was a plain `usize` and the loop body did `*x += offset`. On malformed input that drives `csv_core` to emit row-relative offsets large enough to wrap a `usize`, this panics with `attempt to add with overflow` in debug builds and silently wraps in release builds.

The cargo-fuzz `csv_reader` harness being prototyped for apache#5332 reproduces this in single-digit minutes from an empty corpus with a 72-byte input. After this patch the same input returns `ArrowError::CsvError("CSV record offsets overflowed usize while flushing")` instead of panicking.

Includes a regression test in `reader::records::tests` driving the 72-byte fuzz repro through `RecordDecoder::decode` + `flush`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
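To make the failure mode concrete, here is a minimal self-contained sketch of an offset rebase loop like the one described above. The function names and error string are illustrative stand-ins, not the actual arrow-csv internals; `wrapping_add` is used to model what the old `+=` does in release builds.

```rust
// Hypothetical simplification of the rebase loop in `RecordDecoder::flush`.
// The old code effectively did the unchecked version; the fix is the
// checked version, which surfaces overflow as an error.
fn rebase_offsets_unchecked(offsets: &mut [usize]) -> usize {
    let mut acc = 0usize;
    for x in offsets.iter_mut() {
        // In debug builds a plain `+` here panics on overflow; in release
        // builds it wraps silently, producing a bogus absolute offset.
        acc = acc.wrapping_add(*x);
        *x = acc;
    }
    acc
}

fn rebase_offsets_checked(offsets: &mut [usize]) -> Result<usize, String> {
    let mut acc = 0usize;
    for x in offsets.iter_mut() {
        acc = acc
            .checked_add(*x)
            .ok_or_else(|| "CSV record offsets overflowed usize while flushing".to_string())?;
        *x = acc;
    }
    Ok(acc)
}

fn main() {
    // Two row-relative offsets whose sum wraps a usize.
    let mut bad = [usize::MAX - 1, 3];
    assert!(rebase_offsets_checked(&mut bad.clone()).is_err());
    let wrapped = rebase_offsets_unchecked(&mut bad);
    assert_eq!(wrapped, 1); // (usize::MAX - 1) + 3 wraps around to 1
}
```

The wrapped value (1 here) is exactly the kind of nonsensical absolute offset that would later mis-slice `self.data`.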
alamb
reviewed
May 5, 2026
Contributor
alamb
left a comment
Thank you for this @masumi-ryugo
```rust
            }
            decoder.clear();
        }
        // Reaching this assertion at all means we did not panic.
    }
    input = &input[consumed..];
    // The buggy version panics inside `flush()`; the patched version
    // either returns rows or surfaces a clean `CsvError`.
```
Contributor
it should always return an error, right? Can we change the test to follow a similar pattern to existing tests, like:

```rust
let err = decoder.flush().unwrap_err();
assert_eq!("msg", err.to_string());
```

I think that will be more concise and clearer about what the expected value is.
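The suggested assertion pattern can be sketched in isolation. The decoder below is a hypothetical stand-in (the real test would call arrow-csv's `RecordDecoder`); only the `unwrap_err()` + `assert_eq!` shape is the point.

```rust
// Stand-in for a decoder whose `flush` fails like the patched
// `RecordDecoder::flush` does on the overflow input.
struct MockDecoder {
    overflowed: bool,
}

impl MockDecoder {
    fn flush(&mut self) -> Result<usize, String> {
        if self.overflowed {
            Err("Csv error: CSV record offsets overflowed usize while flushing".to_string())
        } else {
            Ok(0)
        }
    }
}

fn main() {
    let mut decoder = MockDecoder { overflowed: true };
    // Suggested pattern: unwrap the error and compare the full message,
    // rather than looping and merely asserting "we did not panic".
    let err = decoder.flush().unwrap_err();
    assert_eq!(
        "Csv error: CSV record offsets overflowed usize while flushing",
        err.to_string()
    );
}
```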
- Restore the original two-line "what" comment on the offset rebase loop and drop the verbose "add with overflow / debug vs release" exposition; the code is self-explanatory.
- Switch back to the existing `try_for_each` chain pattern instead of nested `for` loops, so the structure matches the rest of the file.
- Replace the regression test's loop-and-discard structure with a white-box construction that stages the overflow directly. The test now uses the standard `flush().unwrap_err()` + `assert_eq` pattern (matching `test_invalid_fields`) and asserts the specific `Csv error: CSV record offsets overflowed usize while flushing` message rather than just "did not panic".
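The `try_for_each` shape mentioned above can be sketched like this (simplified, hypothetical names; the real code lives in arrow-csv's `reader::records`). Unlike a plain `for_each` closure, the closure passed to `try_for_each` returns a `Result`, so `?` composes inside it.

```rust
// Rebase row-relative offsets to absolute offsets with a fallible
// iterator chain: `checked_add` turns overflow into an Err instead
// of a panic (debug) or silent wrap (release).
fn rebase_offsets(offsets: &mut [usize]) -> Result<(), String> {
    let mut acc = 0usize;
    offsets.iter_mut().try_for_each(|x| {
        acc = acc
            .checked_add(*x)
            .ok_or_else(|| "CSV record offsets overflowed usize while flushing".to_string())?;
        *x = acc;
        Ok(())
    })
}

fn main() {
    let mut ok = [2usize, 3, 4];
    rebase_offsets(&mut ok).unwrap();
    assert_eq!(ok, [2, 5, 9]); // running totals: 2, 2+3, 2+3+4
    let mut bad = [usize::MAX, 2];
    assert!(rebase_offsets(&mut bad).is_err());
}
```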
Closes #9885.
What

`RecordDecoder::flush` walks the per-row offsets emitted by `csv_core::Reader` and accumulates them so each end offset is absolute over `self.data` after the loop. The accumulator was a plain `usize` and the loop body did `*x += offset`, which on malformed input that drives `csv_core` to emit row-relative offsets large enough to wrap a `usize`:

- panics with `attempt to add with overflow` in debug builds (and the cargo-fuzz `csv_reader` harness that found this is built with `--debug-assertions`);
- silently wraps in release builds, which can later trip an `assert!`/`unwrap` somewhere downstream.

Fix

Switch the accumulator to `checked_add` and surface the overflow as `ArrowError::CsvError` instead. The body of the loop becomes a normal `for` loop because `?` doesn't compose with the previous closure form.

Repro

The cargo-fuzz `csv_reader` harness from `fuzz/initial-harnesses` (per #5332) reproduces this from an empty corpus in single-digit minutes. The minimized repro is 72 bytes.

Before this PR (run on `main` HEAD against the cargo-fuzz harness), the harness panics with `attempt to add with overflow`. After this PR the same 72 bytes pass through the fuzz target in 40 ms with exit 0; the API now returns `ArrowError::CsvError(...)` for callers to handle.

Tests

Adds `reader::records::tests::test_flush_offset_overflow_does_not_panic`, which feeds the 72-byte fuzz repro through `RecordDecoder::decode` + `flush` and asserts the loop terminates cleanly instead of panicking. The existing 4 tests in that module continue to pass.

Alternatives considered

- Cap against `self.data_len`: each emitted offset is supposed to be ≤ `self.data_len`, so an explicit cap would also turn the overflow into a clean error. I went with `checked_add` because it's the more targeted change: it doesn't add a new invariant on `csv_core`'s output, only refuses to compute something that would have been arithmetically nonsensical anyway.
- `saturating_add`: would silently truncate the offset and then mis-slice `self.data`, producing a confusing `Encountered invalid UTF-8 data` error or panic deeper in the call stack. Worse signal.

xref #5332 #9883 #9884
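The "worse signal" point about `saturating_add` can be seen directly in std's integer arithmetic, with no arrow types involved:

```rust
fn main() {
    let base = usize::MAX - 10;
    // checked_add refuses to produce a nonsensical value: the caller gets
    // a None it can turn into a clean, specific error message.
    assert_eq!(base.checked_add(100), None);
    // saturating_add silently pins at usize::MAX; a slice bound built from
    // this would mis-slice the buffer and fail somewhere downstream with a
    // confusing error, far from the arithmetic that actually went wrong.
    assert_eq!(base.saturating_add(100), usize::MAX);
}
```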