Unbuffered FS shards leave sector-pad bytes in the file (breaks tight file-size invariant) #122

## Symptom

When `store_fs_create(root, /*unbuffered=*/1)` is used with the streaming pipeline, the resulting Zarr v3 shard files are larger than the tightly packed size `data_bytes + index_bytes + 4` (the trailing 4 bytes being the index CRC) by up to one page, or more if the shard receives multiple aggregator runs.

Concrete failure from the acquire-zarr shim integration tests after flipping FS-store to unbuffered:

| Test | Expected | Linux | macOS |
|---|---|---|---|
| `stream-raw-to-filesystem` shard `c/0/0/0/0/0` | 163972 | 167936 | 180224 |
| `stream-with-ragged-final-shard` | 31250260 | 31256576 | 31277056 |
| `stream-append-nullptr` (fs leg) | 12308 | 16384 | 32768 |

Every observed size is a multiple of 4096, so all deltas are explained by `align_up(_, 4096)` padding. (macOS sees more padding because `F_NOCACHE` is best-effort and the runtime still goes through chucky's aligned-write path; the gap count varies with batching.)
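
To make the arithmetic concrete, here is a minimal sketch of the rounding. The two-argument `align_up` below is an assumed implementation (power-of-two alignment only) and `pad_bytes` is a hypothetical helper; the project's own helper may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next multiple of a; a must be a nonzero power of two. */
static uint64_t align_up(uint64_t n, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (n + a - 1) & ~(a - 1);
}

/* Pad bytes left behind when a file of tight size n is rounded up to a. */
static uint64_t pad_bytes(uint64_t n, uint64_t a)
{
    return align_up(n, a) - n;
}
```

For the first and third rows, the Linux size equals `align_up(expected, 4096)` exactly (167936 and 16384). The remaining sizes are larger multiples of 4096, which fits the explanation of more than one padded run.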

## Root cause

Two coupled places in the unbuffered path pad the on-disk layout so `O_DIRECT` / `FILE_FLAG_NO_BUFFERING` constraints are met:

- `src/zarr/shard_delivery.c` — per run, `write_bytes = align_up(run_bytes, sa)` and `data_cursor += align_up(run_bytes, sa)`. The final index buffer is `align_up(index_total_bytes, sa)` with **leading zeros** so the actual `[index][crc]` lives at the very end of the file.
- `src/zarr/shard_pool_fs.c` — the slot pwrites whatever the producer hands it; nothing trims the trailing alignment bytes on `finalize`.

The file is functionally a valid Zarr v3 shard: chunks sit at the offsets the index records, and readers that locate the index via `file_size - index_total_bytes` still hit it because of the leading-zero pad inside the index buffer. It is just not tightly packed, so `fs::file_size(shard) == data + idx + crc` no longer holds, and that equality is the baseline the acquire-zarr integration tests assert.
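
A simplified model of the padded layout may help here. It assumes that `index_total_bytes` already includes the 4-byte CRC, that `sa` is a power of two, and that each run is padded independently; `struct shard_layout` and `model_padded_layout` are hypothetical names, not code from the repository.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed helper: round n up to a multiple of a (a is a power of two). */
static uint64_t align_up(uint64_t n, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (n + a - 1) & ~(a - 1);
}

struct shard_layout {
    uint64_t tight_size;   /* sum of runs + index_total_bytes */
    uint64_t padded_size;  /* what the padded layout leaves on disk */
    uint64_t index_offset; /* start of [index][crc] in the padded file */
};

static struct shard_layout model_padded_layout(const uint64_t *run_bytes,
                                               size_t n_runs,
                                               uint64_t index_total_bytes,
                                               uint64_t sa)
{
    uint64_t data_cursor = 0, data_bytes = 0;
    for (size_t i = 0; i < n_runs; ++i) {
        data_cursor += align_up(run_bytes[i], sa); /* per-run padding */
        data_bytes += run_bytes[i];
    }
    struct shard_layout out;
    out.tight_size = data_bytes + index_total_bytes;
    out.padded_size = data_cursor + align_up(index_total_bytes, sa);
    /* Leading zeros push the real index to the end of the file, so a reader
       computing file_size - index_total_bytes still lands on it. */
    out.index_offset = out.padded_size - index_total_bytes;
    return out;
}
```

With runs of 100 and 5000 bytes, a 20-byte index and `sa = 4096`, the model gives a padded size of 16384 against a tight size of 5120, with the index starting at offset 16364.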

## Suggested direction

The aggregator should keep producing page-aligned aggregate buffers per shard (`required_shard_alignment` stays, since upstream sizing depends on it). What needs to change is **persistence-side**: the FS slot should own the alignment, and the producer should hand it logical bytes.

Sketch:

1. Add `platform_ftruncate(fd, size)`:
   - POSIX/Darwin: `ftruncate(fd, size)`
   - Win32: `SetFileInformationByHandle(fd, FileEndOfFileInfo, ...)` — works on `FILE_FLAG_NO_BUFFERING` handles since it doesn't go through the file pointer
2. In `shard_pool_fs` unbuffered slots, add per-slot state: `tail_buf` (page-aligned, one page), `tail_len`, `logical_size`, `aligned_pos`. `fs_slot_write` / `fs_slot_write_direct` coalesce:
   - drain `tail_buf` when it fills a page → aligned `pwrite`
   - middle full-page span → zero-copy `pwrite_ref` if the source is page-aligned, else copy+pwrite
   - remainder → into `tail_buf`
   - assumption: writes are monotonic per slot (already true — `shard_delivery` only advances `data_cursor`)
3. `fs_slot_finalize` (unbuffered): zero-fill `tail_buf[tail_len..page]`, pwrite the final page, queue an `ftruncate(fd, logical_size)` job, then close. All on the io queue so they serialize behind outstanding pwrites.
4. `shard_delivery`: drop the `align_up(run_bytes, sa)` calls, drop the `index_total_bytes` padding/leading-zero indirection — just hand the slot logical sizes/offsets.
5. Drop the per-entry zero-fill loop (lines 158–163) once the padding removed in step 4 is gone.
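
Step 1 might look roughly like the sketch below. The return convention (0 on success, -1 on failure) and the `platform_file` typedef are assumptions, and `trimmed_size_demo` is a hypothetical POSIX-only check that is not part of the proposal.

```c
#ifndef _WIN32
#define _POSIX_C_SOURCE 200809L /* ftruncate, pwrite, fileno */
#endif
#include <assert.h>
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
typedef HANDLE platform_file; /* hypothetical handle typedef */

/* Set end-of-file directly on the handle; the file pointer is not used. */
static int platform_ftruncate(platform_file f, uint64_t size)
{
    FILE_END_OF_FILE_INFO info;
    info.EndOfFile.QuadPart = (LONGLONG)size;
    return SetFileInformationByHandle(f, FileEndOfFileInfo, &info,
                                      sizeof info) ? 0 : -1;
}
#else
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
typedef int platform_file;

static int platform_ftruncate(platform_file f, uint64_t size)
{
    return ftruncate(f, (off_t)size);
}

/* Write `padded` bytes of zeros in 4096-byte pages to a temp file, trim it
   to `logical`, and return the resulting file size (-1 on error). */
static int64_t trimmed_size_demo(uint64_t padded, uint64_t logical)
{
    static const char page[4096];
    struct stat st;
    int64_t result = -1;
    assert(padded % sizeof page == 0 && logical <= padded);
    FILE *tmp = tmpfile();
    if (!tmp) return -1;
    int fd = fileno(tmp);
    for (uint64_t off = 0; off < padded; off += sizeof page)
        if (pwrite(fd, page, sizeof page, (off_t)off) != (ssize_t)sizeof page)
            goto done;
    if (platform_ftruncate(fd, logical) == 0 && fstat(fd, &st) == 0)
        result = (int64_t)st.st_size;
done:
    fclose(tmp);
    return result;
}
#endif
```

On POSIX, `trimmed_size_demo(8192, 5100)` writes two full pages and then trims the file back to 5100 bytes.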

`required_shard_alignment` stays — `pad_shard_sizes` and aggregate buffer sizing are still upstream of the slot.
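
Steps 2 and 3 can be modelled roughly as follows. This is a single-threaded POSIX sketch under stated assumptions: the file is opened without `O_DIRECT`, writes run synchronously instead of on the io queue, a fixed 4096-byte page is used, and `struct fs_slot`, `slot_init`, `slot_write`, `slot_finalize`, `write_pages` and `coalesce_demo` are hypothetical names rather than the existing `fs_slot_write` / `fs_slot_finalize` code. Closing the descriptor is left to the caller.

```c
#define _POSIX_C_SOURCE 200809L /* pwrite, ftruncate, posix_memalign, fileno */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

enum { PAGE = 4096 }; /* assumed alignment; the real value is per-platform */

struct fs_slot {           /* hypothetical per-slot state */
    int fd;
    uint8_t *tail_buf;     /* page-aligned, one page */
    size_t tail_len;       /* valid bytes currently in tail_buf */
    uint64_t logical_size; /* bytes handed over by the producer */
    uint64_t aligned_pos;  /* next page-aligned file offset */
};

/* Write len bytes (a multiple of PAGE) at aligned_pos; short write = error. */
static int write_pages(struct fs_slot *s, const uint8_t *p, size_t len)
{
    assert(len % PAGE == 0 && s->aligned_pos % PAGE == 0);
    if (pwrite(s->fd, p, len, (off_t)s->aligned_pos) != (ssize_t)len) return -1;
    s->aligned_pos += len;
    return 0;
}

static int slot_init(struct fs_slot *s, int fd)
{
    void *p = NULL;
    if (posix_memalign(&p, PAGE, PAGE) != 0) return -1;
    *s = (struct fs_slot){ .fd = fd, .tail_buf = p };
    return 0;
}

/* Append n logical bytes; only whole pages reach the file. */
static int slot_write(struct fs_slot *s, const void *src, size_t n)
{
    const uint8_t *p = src;
    s->logical_size += n;
    if (s->tail_len > 0) { /* top up the pending tail, drain it when full */
        size_t take = PAGE - s->tail_len < n ? PAGE - s->tail_len : n;
        memcpy(s->tail_buf + s->tail_len, p, take);
        s->tail_len += take; p += take; n -= take;
        if (s->tail_len < PAGE) return 0; /* everything fit in the tail */
        if (write_pages(s, s->tail_buf, PAGE)) return -1;
        s->tail_len = 0;
    }
    if (n >= PAGE && (uintptr_t)p % PAGE == 0) { /* aligned source: no copy */
        size_t full = n - n % PAGE;
        if (write_pages(s, p, full)) return -1;
        p += full; n -= full;
    }
    for (; n >= PAGE; p += PAGE, n -= PAGE) { /* unaligned source: bounce */
        memcpy(s->tail_buf, p, PAGE);
        if (write_pages(s, s->tail_buf, PAGE)) return -1;
    }
    if (n > 0) memcpy(s->tail_buf, p, n); /* remainder waits in tail_buf */
    s->tail_len = n;
    return 0;
}

/* Zero-fill and write the last page, then trim to the logical size. */
static int slot_finalize(struct fs_slot *s)
{
    int rc = 0;
    if (s->tail_len > 0) {
        memset(s->tail_buf + s->tail_len, 0, PAGE - s->tail_len);
        rc = write_pages(s, s->tail_buf, PAGE);
    }
    if (rc == 0) rc = ftruncate(s->fd, (off_t)s->logical_size);
    free(s->tail_buf);
    s->tail_buf = NULL;
    return rc;
}

/* The 100-byte then 5000-byte case: returns the final size, -1 on error. */
static int64_t coalesce_demo(void)
{
    static const uint8_t data[5000];
    struct fs_slot s;
    struct stat st;
    int64_t size = -1;
    FILE *f = tmpfile();
    if (!f) return -1;
    if (slot_init(&s, fileno(f)) == 0) {
        int ok = slot_write(&s, data, 100) == 0 && slot_write(&s, data, 5000) == 0;
        if (slot_finalize(&s) == 0 && ok && fstat(fileno(f), &st) == 0)
            size = (int64_t)st.st_size;
    }
    fclose(f);
    return size;
}
```

`coalesce_demo()` writes 100 bytes and then 5000 bytes. Two aligned pages (8192 bytes) reach the file, and the final truncate leaves 5100.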

## Open question — is incremental shard readability worth preserving?

The original layout (index at end with leading zero pad, padded data) was chosen so a reader could open a partially-written shard mid-stream and find the latest index. With the proposed coalescing the file is always tightly packed but only complete shards are readable (the index isn't on disk until finalize). Worth checking whether this is a regression for any consumer — my read is that array shape metadata in `zarr.json` is already gated to shard completion, so no reader currently benefits from the mid-stream index, but worth confirming.

## Test gaps

- `tests/test_store_fs.c::test_shard_pool_unbuffered` only checks file existence — should assert `file_size == bytes_written` and exercise a multi-write coalescing case (e.g. write 100 bytes then 5000 bytes; expect 5100, not 8192).
- A sharded zarr_array test under `unbuffered=1` should round-trip and assert tight file size.
