
Unbuffered FS shards leave sector-pad bytes in the file (breaks tight file-size invariant) #122

@nclack


Symptom

When store_fs_create(root, /*unbuffered=*/1) is used with the streaming pipeline, the resulting Zarr v3 shard files are larger than the tight size data_bytes + index_bytes + 4 (the trailing 4 bytes are the index CRC) by up to one page, and by more if the shard receives multiple aggregator runs.

Concrete failure from the acquire-zarr shim integration tests after flipping FS-store to unbuffered:

Test                                         Expected (bytes)   Linux (bytes)   macOS (bytes)
stream-raw-to-filesystem shard c/0/0/0/0/0   163972             167936          180224
stream-with-ragged-final-shard               31250260           31256576        31277056
stream-append-nullptr (fs leg)               12308              16384           32768

All deltas are explained by align_up(_, 4096) padding. (macOS sees more padding because F_NOCACHE is best-effort and the runtime still goes through chucky's aligned-write path; the gap count varies with batching.)
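As a quick check of the arithmetic, here is a minimal sketch of the rounding involved. The align_up name comes from this issue; the body below is the usual power-of-two formulation and is an assumption about how chucky implements it. pad_bytes is a hypothetical helper added only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next multiple of `alignment`.
 * Assumption: alignment is a power of two (4096 in the table above). */
static uint64_t align_up(uint64_t n, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical helper: padding a single aligned write adds beyond n. */
static uint64_t pad_bytes(uint64_t n, uint64_t alignment)
{
    return align_up(n, alignment) - n;
}
```

The Linux sizes in the first and third rows are exactly align_up(expected, 4096): 163972 rounds to 167936 and 12308 rounds to 16384. The ragged-final-shard row on Linux (31256576) is one page beyond align_up(31250260, 4096) = 31252480, which fits a shard that received more than one padded run. Every observed size, on both platforms, is a multiple of 4096.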

Root cause

Two coupled places in the unbuffered path pad the on-disk layout so O_DIRECT / FILE_FLAG_NO_BUFFERING constraints are met:

  • src/zarr/shard_delivery.c — per run, write_bytes = align_up(run_bytes, sa) and data_cursor += align_up(run_bytes, sa). The final index buffer is align_up(index_total_bytes, sa) with leading zeros so the actual [index][crc] lives at the very end of the file.
  • src/zarr/shard_pool_fs.c — the slot pwrites whatever the producer hands it; nothing trims the trailing alignment bytes on finalize.

The file is functionally a valid Zarr v3 shard (chunks at the offsets the index records; readers locating the index via file_size - index_total_bytes still hit it because of the leading-zero pad inside the index buffer). It is just not tightly packed, so fs::file_size(shard) == data + idx + crc no longer holds — and that is the assertion baseline acquire-zarr integration tests carry.
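To make the layout concrete, the following sketch models the padded and tight file sizes from a list of run sizes. It is a model only, not the code in shard_delivery.c: the function names are hypothetical, and it assumes index_total_bytes already includes the 4-byte CRC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round n up to a multiple of sa; sa must be a power of two. */
static uint64_t align_up(uint64_t n, uint64_t sa)
{
    assert(sa != 0 && (sa & (sa - 1)) == 0);
    return (n + sa - 1) & ~(sa - 1);
}

/* Padded layout: each run advances data_cursor by align_up(run_bytes, sa),
 * then an index buffer of align_up(index_total_bytes, sa) is appended. */
static uint64_t padded_shard_size(const uint64_t *run_bytes, size_t nruns,
                                  uint64_t index_total_bytes, uint64_t sa)
{
    uint64_t data_cursor = 0;
    for (size_t i = 0; i < nruns; ++i)
        data_cursor += align_up(run_bytes[i], sa);
    return data_cursor + align_up(index_total_bytes, sa);
}

/* Tight layout: logical bytes only, which is what the tests assert. */
static uint64_t tight_shard_size(const uint64_t *run_bytes, size_t nruns,
                                 uint64_t index_total_bytes)
{
    uint64_t data = 0;
    for (size_t i = 0; i < nruns; ++i)
        data += run_bytes[i];
    return data + index_total_bytes;
}

/* Offset at which a reader seeking from the end expects [index][crc]. */
static uint64_t index_offset(uint64_t file_size, uint64_t index_total_bytes)
{
    return file_size - index_total_bytes;
}
```

For two runs of 100 and 5000 bytes, a 20-byte index and sa = 4096, the padded file is 16384 bytes against a tight size of 5120. A reader seeking from the end lands at offset 16364, which is the start of the index buffer (12288) plus its 4076 leading zero bytes, so the index is still found.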

Suggested direction

The aggregator should keep producing page-aligned aggregate buffers per shard (required_shard_alignment stays — upstream sizing depends on it). What needs to change is persistence-side: the FS slot owns the alignment, the producer hands it logical bytes.

Sketch:

  1. Add platform_ftruncate(fd, size):
    • POSIX/Darwin: ftruncate(fd, size)
    • Win32: SetFileInformationByHandle(fd, FileEndOfFileInfo, ...) — works on FILE_FLAG_NO_BUFFERING handles since it doesn't go through the file pointer
  2. In shard_pool_fs unbuffered slots, add per-slot state: tail_buf (page-aligned, one page), tail_len, logical_size, aligned_pos. fs_slot_write / fs_slot_write_direct coalesce:
    • drain tail_buf when it fills a page → aligned pwrite
    • middle full-page span → zero-copy pwrite_ref if the source is page-aligned, else copy+pwrite
    • remainder → into tail_buf
    • assumption: writes are monotonic per slot (already true — shard_delivery only advances data_cursor)
  3. fs_slot_finalize (unbuffered): zero-fill tail_buf[tail_len..page], pwrite the final page, queue an ftruncate(fd, logical_size) job, then close. All on the io queue so they serialize behind outstanding pwrites.
  4. shard_delivery: drop the align_up(run_bytes, sa) calls, drop the index_total_bytes padding/leading-zero indirection — just hand the slot logical sizes/offsets.
  5. Drop the per-entry zero-fill loop (lines 158–163) once persistence-side alignment is gone.
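Steps 1 to 3 could look roughly like the POSIX-only sketch below. It is deliberately simplified: every byte is copied through tail_buf (the zero-copy middle-span path is omitted), the calls run synchronously instead of on the io queue, and the fd is an ordinary descriptor rather than one opened for unbuffered I/O. The struct and function names (tail_slot, tail_slot_write, and so on) are hypothetical, and PAGE = 4096 is an assumed alignment.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

enum { PAGE = 4096 }; /* assumed sector/page alignment */

/* Hypothetical per-slot state; field names follow the sketch in step 2. */
struct tail_slot {
    int fd;
    uint8_t *tail_buf;     /* page-aligned, one page */
    size_t tail_len;       /* bytes currently buffered in tail_buf */
    uint64_t logical_size; /* total bytes handed over by the producer */
    uint64_t aligned_pos;  /* file offset of the next aligned pwrite */
};

static int tail_slot_init(struct tail_slot *s, int fd)
{
    void *p = NULL;
    if (posix_memalign(&p, PAGE, PAGE) != 0)
        return -1;
    s->fd = fd;
    s->tail_buf = p;
    s->tail_len = 0;
    s->logical_size = 0;
    s->aligned_pos = 0;
    return 0;
}

/* Write the whole tail page at aligned_pos and advance by one page. */
static int flush_page(struct tail_slot *s)
{
    if (pwrite(s->fd, s->tail_buf, PAGE, (off_t)s->aligned_pos) != PAGE)
        return -1;
    s->aligned_pos += PAGE;
    s->tail_len = 0;
    return 0;
}

/* Append-only write (assumes monotonic writes per slot): fill the tail
 * page, issue an aligned pwrite whenever it becomes full, and leave the
 * remainder buffered. */
static int tail_slot_write(struct tail_slot *s, const void *src, size_t len)
{
    const uint8_t *p = src;
    while (len > 0) {
        size_t room = PAGE - s->tail_len;
        size_t n = len < room ? len : room;
        memcpy(s->tail_buf + s->tail_len, p, n);
        s->tail_len += n;
        s->logical_size += n;
        p += n;
        len -= n;
        if (s->tail_len == PAGE && flush_page(s) != 0)
            return -1;
    }
    return 0;
}

/* Zero-fill and write the final page, trim to the logical size, close. */
static int tail_slot_finalize(struct tail_slot *s)
{
    int rc = 0;
    if (s->tail_len > 0) {
        memset(s->tail_buf + s->tail_len, 0, PAGE - s->tail_len);
        if (flush_page(s) != 0)
            rc = -1;
    }
    if (rc == 0 && ftruncate(s->fd, (off_t)s->logical_size) != 0)
        rc = -1;
    free(s->tail_buf);
    s->tail_buf = NULL;
    if (close(s->fd) != 0)
        rc = -1;
    return rc;
}
```

Writing 100 bytes and then 5000 bytes through this slot issues one aligned pwrite during the writes, a second one at finalize for the zero-filled tail page, and then truncates the file from 8192 to 5100 bytes. Every pwrite has a page-aligned offset, length and source buffer, which is the property an O_DIRECT or FILE_FLAG_NO_BUFFERING handle would need.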

required_shard_alignment stays — pad_shard_sizes and aggregate buffer sizing are still upstream of the slot.

Open question — is incremental shard readability worth preserving?

The original layout (index at end with leading zero pad, padded data) was chosen so a reader could open a partially-written shard mid-stream and find the latest index. With the proposed coalescing the file is always tightly packed but only complete shards are readable (the index isn't on disk until finalize). Worth checking whether this is a regression for any consumer — my read is that array shape metadata in zarr.json is already gated to shard completion, so no reader currently benefits from the mid-stream index, but worth confirming.

Test gaps

  • tests/test_store_fs.c::test_shard_pool_unbuffered only checks file existence — should assert file_size == bytes_written and exercise a multi-write coalescing case (e.g. write 100 bytes then 5000 bytes; expect 5100, not 8192).
  • A sharded zarr_array test under unbuffered=1 should round-trip and assert tight file size.
