Symptom
When store_fs_create(root, /*unbuffered=*/1) is used with the streaming pipeline, the resulting Zarr v3 shard files are larger than data_bytes + index_bytes + 4 by up to one page (more if the shard receives multiple aggregator runs).
Concrete failure from the acquire-zarr shim integration tests after flipping FS-store to unbuffered:
| Test | Expected (bytes) | Linux (bytes) | macOS (bytes) |
| --- | --- | --- | --- |
| stream-raw-to-filesystem shard c/0/0/0/0/0 | 163972 | 167936 | 180224 |
| stream-with-ragged-final-shard | 31250260 | 31256576 | 31277056 |
| stream-append-nullptr (fs leg) | 12308 | 16384 | 32768 |
All deltas are explained by align_up(_, 4096) padding. (macOS sees more padding because F_NOCACHE is best-effort and the runtime still goes through chucky's aligned-write path; the gap count varies with batching.)
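The arithmetic can be checked directly. The following is a minimal sketch, assuming a 4096-byte page; `align_up` here is a stand-in with the conventional power-of-two definition, not necessarily the project's own helper, and `is_page_multiple` is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next multiple of a. Assumes a is a power of two. */
static uint64_t align_up(uint64_t n, uint64_t a) {
  assert(a != 0 && (a & (a - 1)) == 0);
  return (n + a - 1) & ~(a - 1);
}

/* True when a file size is a whole number of pages. */
static int is_page_multiple(uint64_t size, uint64_t page) {
  return size % page == 0;
}
```

Every observed size in the table is a multiple of 4096. On Linux, the first and third rows equal `align_up(expected, 4096)` exactly (163972 → 167936, 12308 → 16384), while the ragged-final-shard row is one page larger than that, which is consistent with more than one padded gap in the file.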
Root cause
Two coupled places in the unbuffered path pad the on-disk layout so O_DIRECT / FILE_FLAG_NO_BUFFERING constraints are met:
src/zarr/shard_delivery.c — per run, write_bytes = align_up(run_bytes, sa) and data_cursor += align_up(run_bytes, sa). The final index buffer is align_up(index_total_bytes, sa) with leading zeros so the actual [index][crc] lives at the very end of the file.
src/zarr/shard_pool_fs.c — the slot pwrites whatever the producer hands it; nothing trims the trailing alignment bytes on finalize.
The file is functionally a valid Zarr v3 shard (chunks at the offsets the index records; readers locating the index via file_size - index_total_bytes still hit it because of the leading-zero pad inside the index buffer). It is just not tightly packed, so fs::file_size(shard) == data + idx + crc no longer holds — and that is the assertion baseline acquire-zarr integration tests carry.
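To make the layout concrete, here is a rough model of the padded versus tight sizes. It assumes the usual Zarr v3 sharding index of one (offset, nbytes) pair of uint64 per chunk followed by a 4-byte crc32c, matching the `+ 4` above. All function names are hypothetical, and `sa` stands for the alignment used in shard_delivery.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t align_up(uint64_t n, uint64_t a) {
  assert(a != 0 && (a & (a - 1)) == 0); /* power-of-two alignment only */
  return (n + a - 1) & ~(a - 1);
}

/* [index][crc]: 16 bytes per chunk entry plus a 4-byte crc32c. */
static uint64_t index_total_bytes(uint64_t n_chunks) {
  return 16 * n_chunks + 4;
}

/* File size when every run and the index buffer are padded to sa. */
static uint64_t padded_size(const uint64_t *run_bytes, size_t n_runs,
                            uint64_t n_chunks, uint64_t sa) {
  uint64_t data_cursor = 0;
  for (size_t i = 0; i < n_runs; ++i)
    data_cursor += align_up(run_bytes[i], sa);
  return data_cursor + align_up(index_total_bytes(n_chunks), sa);
}

/* File size the integration tests expect: data + index + crc. */
static uint64_t tight_size(const uint64_t *run_bytes, size_t n_runs,
                           uint64_t n_chunks) {
  uint64_t data = 0;
  for (size_t i = 0; i < n_runs; ++i)
    data += run_bytes[i];
  return data + index_total_bytes(n_chunks);
}

/* Where a reader finds the index: always at the end of the file. */
static uint64_t index_offset(uint64_t file_size, uint64_t n_chunks) {
  return file_size - index_total_bytes(n_chunks);
}
```

With one 10000-byte run and 4 chunks, the tight size is 10068 bytes but the padded file is 16384 bytes. The index still sits at `16384 - 68 = 16316`, which is the padded index buffer's start (12288) plus its 4028 leading zero bytes, so readers find it even though the size assertion fails.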
Suggested direction
The aggregator should keep producing page-aligned aggregate buffers per shard (required_shard_alignment stays — upstream sizing depends on it). What needs to change is persistence-side: the FS slot owns the alignment, the producer hands it logical bytes.
Sketch:
- Add platform_ftruncate(fd, size):
  - POSIX/Darwin: ftruncate(fd, size)
  - Win32: SetFileInformationByHandle(fd, FileEndOfFileInfo, ...), which works on FILE_FLAG_NO_BUFFERING handles since it doesn't go through the file pointer
- In shard_pool_fs unbuffered slots, add per-slot state: tail_buf (page-aligned, one page), tail_len, logical_size, aligned_pos. fs_slot_write / fs_slot_write_direct coalesce:
  - drain tail_buf when it fills a page → aligned pwrite
  - middle full-page span → zero-copy pwrite_ref if the source is page-aligned, else copy+pwrite
  - remainder → into tail_buf
  - assumption: writes are monotonic per slot (already true, since shard_delivery only advances data_cursor)
- fs_slot_finalize (unbuffered): zero-fill tail_buf[tail_len..page], pwrite the final page, queue an ftruncate(fd, logical_size) job, then close. All on the io queue so they serialize behind outstanding pwrites.
- shard_delivery: drop the align_up(run_bytes, sa) calls and the index_total_bytes padding/leading-zero indirection; just hand the slot logical sizes/offsets.
  - Drop the per-entry zero-fill loop (lines 158–163) once persistence-side alignment is gone.
- required_shard_alignment stays: pad_shard_sizes and aggregate buffer sizing are still upstream of the slot.
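The coalescing and finalize steps above can be sketched as follows. This is a POSIX-only illustration under stated assumptions, not the shard_pool_fs implementation: `struct fs_slot`, `slot_open`, `slot_write`, `slot_finalize`, `pwrite_all`, and `demo_two_writes` are hypothetical names, the page size is fixed at 4096, writes are synchronous rather than queued on the io queue, and the demo uses an ordinary buffered temp file instead of an O_DIRECT handle.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* pwrite, mkstemp, posix_memalign */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

enum { PAGE = 4096 };

/* Hypothetical per-slot state, using the field names from the sketch. */
struct fs_slot {
  int fd;
  uint8_t *tail_buf;     /* page-aligned, one page */
  size_t tail_len;       /* valid bytes buffered in tail_buf */
  uint64_t logical_size; /* bytes the producer has handed over so far */
  uint64_t aligned_pos;  /* next page-aligned file offset to write */
};

static int pwrite_all(int fd, const uint8_t *p, size_t n, uint64_t off) {
  while (n > 0) {
    ssize_t w = pwrite(fd, p, n, (off_t)off);
    if (w <= 0) return -1;
    p += w; n -= (size_t)w; off += (uint64_t)w;
  }
  return 0;
}

static int slot_open(struct fs_slot *s, int fd) {
  void *p = NULL;
  if (posix_memalign(&p, PAGE, PAGE) != 0) return -1;
  *s = (struct fs_slot){ .fd = fd, .tail_buf = p };
  return 0;
}

/* Append n bytes; only whole pages ever reach the file. */
static int slot_write(struct fs_slot *s, const void *src, size_t n) {
  const uint8_t *p = src;
  assert(s->tail_len < PAGE);
  s->logical_size += n;
  if (s->tail_len > 0) { /* top up the partial page first */
    size_t take = PAGE - s->tail_len;
    if (take > n) take = n;
    memcpy(s->tail_buf + s->tail_len, p, take);
    s->tail_len += take; p += take; n -= take;
    if (s->tail_len < PAGE) return 0;
    if (pwrite_all(s->fd, s->tail_buf, PAGE, s->aligned_pos)) return -1;
    s->aligned_pos += PAGE;
    s->tail_len = 0;
  }
  size_t full = n - n % PAGE; /* middle span of whole pages */
  if (full > 0) {
    /* A real O_DIRECT slot must bounce this through an aligned buffer
       when p is not page-aligned; the buffered demo fd does not care. */
    if (pwrite_all(s->fd, p, full, s->aligned_pos)) return -1;
    s->aligned_pos += full; p += full; n -= full;
  }
  memcpy(s->tail_buf, p, n); /* remainder waits in tail_buf */
  s->tail_len = n;
  return 0;
}

/* Flush the zero-filled last page, then trim to the logical size.
   The caller closes the fd afterwards. */
static int slot_finalize(struct fs_slot *s) {
  int rc = 0;
  if (s->tail_len > 0) {
    memset(s->tail_buf + s->tail_len, 0, PAGE - s->tail_len);
    rc = pwrite_all(s->fd, s->tail_buf, PAGE, s->aligned_pos);
  }
  if (rc == 0 && ftruncate(s->fd, (off_t)s->logical_size) != 0) rc = -1;
  free(s->tail_buf);
  s->tail_buf = NULL;
  return rc;
}

/* Demo: write a then b bytes to a temp file, return the final size. */
static int64_t demo_two_writes(size_t a, size_t b) {
  char path[] = "/tmp/fs_slot_demo_XXXXXX";
  uint8_t *src = calloc(1, a + b + 1);
  int fd = mkstemp(path);
  int64_t size = -1;
  struct fs_slot s;
  struct stat st;
  if (fd >= 0 && src && slot_open(&s, fd) == 0) {
    int ok = slot_write(&s, src, a) == 0 && slot_write(&s, src + a, b) == 0;
    ok = slot_finalize(&s) == 0 && ok;
    if (ok && fstat(fd, &st) == 0) size = (int64_t)st.st_size;
  }
  if (fd >= 0) { close(fd); unlink(path); }
  free(src);
  return size;
}
```

`demo_two_writes(100, 5000)` yields a 5100-byte file. The first write only fills `tail_buf`, the second completes one page and leaves 1004 bytes buffered, and finalize writes the second page zero-padded before truncating from 8192 back to 5100. Keeping the truncate as the last step means the file is page-multiple in size until the slot is finalized.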
Open question — is incremental shard readability worth preserving?
The original layout (index at end with leading zero pad, padded data) was chosen so a reader could open a partially-written shard mid-stream and find the latest index. With the proposed coalescing the file is always tightly packed but only complete shards are readable (the index isn't on disk until finalize). Worth checking whether this is a regression for any consumer — my read is that array shape metadata in zarr.json is already gated to shard completion, so no reader currently benefits from the mid-stream index, but worth confirming.
Test gaps
- tests/test_store_fs.c::test_shard_pool_unbuffered only checks file existence; it should assert file_size == bytes_written and exercise a multi-write coalescing case (e.g. write 100 bytes then 5000 bytes; expect 5100, not 8192).
- A sharded zarr_array test under unbuffered=1 should round-trip and assert tight file size.