Goal
A reader/visualization tool opening the zarr mid-stream should always see a valid view: every byte advertised by metadata is in a finalized, parseable shard, and metadata itself is never observed mid-write.
Today, two gaps break this contract.
Gap 1 — metadata advertises chunks in unfinalized shards
The periodic `update_append` callback writes `zarr.json` with append-dim sizes derived from:
`flat_append_chunks = shard_epoch * chunks_per_shard_append + epoch_in_shard`
- GPU path: src/gpu/flush.d2h_deliver.c:350-379
- CPU path: src/cpu/stream.body.c:201-218
`epoch_in_shard` counts chunks already delivered into the currently open shard, which has not been finalized: its index block isn't on disk until `finalize_shards` runs (src/zarr/shard_delivery.c:35-102).
Concretely, a reader that:
- Reads the latest `zarr.json` and computes the implied chunk grid,
- Opens the corresponding shard file,
- Locates the index at `file_size - index_total_bytes`,
will land inside the data region, not on a valid index, for any shard not yet finalized. Result: garbage chunk offsets/sizes, corrupted reads.
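To make the failure concrete, here is a minimal sketch of the offset arithmetic such a reader performs. `shard_index_offset` is a hypothetical helper (not from this repo), and the byte sizes in the example below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reader-side helper: where the reader expects the shard index
 * to start, assuming the index is stored at the end of the shard file.
 * Returns -1 if the file is too small to contain an index at all. */
static int64_t shard_index_offset(int64_t file_size, int64_t index_total_bytes)
{
    assert(index_total_bytes > 0);
    if (file_size < index_total_bytes)
        return -1;
    return file_size - index_total_bytes;
}
```

For a finalized shard holding 4096 data bytes followed by a 64-byte index, `shard_index_offset(4160, 64)` gives 4096, the true start of the index. For an open shard with only 3072 data bytes written and no index yet, the same arithmetic gives 3008, which points at chunk data that the reader would misinterpret as offset/size pairs.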
This is also the open question raised in #122.
Gap 2 — `zarr.json` writes are non-atomic
`fs_put` (src/zarr/store_fs.c:24-41) opens the metadata file with `O_WRONLY | O_CREAT | O_TRUNC`, writes, then closes. A concurrent reader can observe:
- Zero-length file (between `O_TRUNC` and the first write completing)
- Partial JSON (mid-write)
- Truncated JSON left on disk (if the writer fails mid-write)
Any of these breaks a streaming reader's parse step before it even gets to the shard files.
Proposed fixes
Fix 1: clamp `update_append` to the last completed shard boundary
Advertise only sizes derived from `shard_epoch * chunks_per_shard_append` — i.e., drop the `epoch_in_shard` term. Metadata then only ever describes data that is already finalized on disk.
Cost: append-dim shape lags the actual write cursor by up to one shard's worth of chunks. For typical `chunks_per_shard_append` values this is acceptable; the reader sees stair-step growth instead of smooth growth.
If smoother growth matters, an alternative is to also flush the in-progress shard's index (and re-open / re-extend) on each metadata update — but that defeats the streaming-write design and isn't recommended.
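A minimal sketch of the clamp next to the current formula. It assumes `shard_epoch` counts shards that `finalize_shards` has already completed by the time the callback runs; the issue implies this, but it should be checked against the code. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Current behaviour per the formula in Gap 1: also counts chunks that sit
 * in the open, unfinalized shard. */
static uint64_t append_chunks_unclamped(uint64_t shard_epoch,
                                        uint64_t chunks_per_shard_append,
                                        uint64_t epoch_in_shard)
{
    assert(chunks_per_shard_append > 0);
    return shard_epoch * chunks_per_shard_append + epoch_in_shard;
}

/* Fix 1: advertise whole shards only (assumes shard_epoch shards are
 * finalized on disk). */
static uint64_t append_chunks_clamped(uint64_t shard_epoch,
                                      uint64_t chunks_per_shard_append)
{
    assert(chunks_per_shard_append > 0);
    return shard_epoch * chunks_per_shard_append;
}
```

With `shard_epoch = 3`, `chunks_per_shard_append = 8` and `epoch_in_shard = 5`, the current formula advertises 29 chunks while the clamped one advertises 24. The lag equals `epoch_in_shard`, here 5 chunks.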
Fix 2: atomic-rename `zarr.json` writes
In `fs_put` (and the equivalent S3 path if applicable), write to `.tmp.`, `fsync`, then `rename` over the final path. POSIX guarantees `rename` is atomic within the same filesystem. Win32: `MoveFileEx(MOVEFILE_REPLACE_EXISTING | MOVEFILE_WRITE_THROUGH)`.
This is a small, contained change to one function plus a platform helper.
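A rough POSIX-only sketch of the write-temp, `fsync`, `rename` sequence. `write_file_atomic` is a hypothetical helper, not the project's `fs_put`, and the `<path>.tmp.<pid>` temp name is my assumption for illustration. It does not fsync the parent directory, and the Win32 branch is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `data` so that a concurrent
 * reader sees either the old file or the complete new one.
 * Returns 0 on success, -1 on failure. */
static int write_file_atomic(const char *path, const void *data, size_t len)
{
    assert(path != NULL);
    assert(data != NULL || len == 0);

    /* Temp file lives next to the target so rename stays on one filesystem.
     * The naming scheme here is an assumption. */
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp.%ld", path, (long)getpid());
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = data;
    size_t left = len;
    int ok = 1;
    while (ok && left > 0) {
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            if (errno != EINTR)
                ok = 0;            /* real error: give up */
        } else {
            p += w;                /* short write: continue with the rest */
            left -= (size_t)w;
        }
    }
    if (ok && fsync(fd) != 0)      /* flush contents before publishing */
        ok = 0;
    if (close(fd) != 0)
        ok = 0;
    if (ok && rename(tmp, path) != 0)  /* atomically replace the target */
        ok = 0;
    if (!ok) {
        unlink(tmp);               /* never leave a partial temp file behind */
        return -1;
    }
    return 0;
}
```

Readers that open the final path never see the temp file's intermediate states. A failed write leaves the previous `zarr.json` untouched.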
Out of scope
Test gaps
- Add a stress test that reads `zarr.json` continuously while a stream writes, asserting it always parses.
- Add a test that opens an in-progress shard at the advertised size and verifies the index is parseable (gates the Fix 1 invariant).
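For the first test, the reader loop needs some per-snapshot check. The sketch below is a cheap structural stand-in for a real JSON parser: it only verifies that a buffer is a single, fully closed top-level object. It does not verify that bracket types match or that values are well formed. `snapshot_looks_complete` is a hypothetical name, and a real test would more likely call whatever JSON parser the project already uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical check: returns 1 if buf holds exactly one top-level {...}
 * whose brackets are all closed, 0 otherwise. Catches zero-length and
 * truncated snapshots; it is not a JSON validator. */
static int snapshot_looks_complete(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    int depth = 0, in_string = 0, escaped = 0, seen_root = 0;

    for (size_t i = 0; i < len; ++i) {
        char c = buf[i];
        if (in_string) {                      /* skip string contents */
            if (escaped) escaped = 0;
            else if (c == '\\') escaped = 1;
            else if (c == '"') in_string = 0;
            continue;
        }
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
            continue;
        if (seen_root && depth == 0)
            return 0;                         /* junk after the root object */
        if (c == '"') {
            if (depth == 0) return 0;         /* bare string at top level */
            in_string = 1;
        } else if (c == '{' || c == '[') {
            if (!seen_root && c != '{') return 0;
            seen_root = 1;
            depth++;
        } else if (c == '}' || c == ']') {
            if (--depth < 0) return 0;        /* close without open */
        } else if (depth == 0) {
            return 0;                         /* scalar at top level */
        }
    }
    return seen_root && depth == 0 && !in_string;
}
```

A stress test could read `zarr.json` in a tight loop while a stream writes and assert this returns 1 for every snapshot. Before Fix 2, a zero-length or partial snapshot makes it return 0.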