Streaming readers can see invalid state: unfinalized shards advertised + torn zarr.json #123

@nclack

Goal

A reader/visualization tool opening the zarr mid-stream should always see a valid view: every byte advertised by metadata is in a finalized, parseable shard, and metadata itself is never observed mid-write.

Today, two gaps break this contract.

Gap 1 — metadata advertises chunks in unfinalized shards

The periodic `update_append` callback writes `zarr.json` with append-dim sizes derived from:

`flat_append_chunks = shard_epoch * chunks_per_shard_append + epoch_in_shard`
  • GPU path: src/gpu/flush.d2h_deliver.c:350-379
  • CPU path: src/cpu/stream.body.c:201-218

epoch_in_shard counts chunks already delivered into the currently open shard, which has not been finalized — its index block isn't on disk until finalize_shards runs (src/zarr/shard_delivery.c:35-102).

Concretely, a reader that:

  1. Reads the latest zarr.json and computes the implied chunk grid,
  2. Opens the corresponding shard file,
  3. Locates the index at `file_size - index_total_bytes`,

will land inside the data region, not on a valid index, for any shard not yet finalized. Result: garbage chunk offsets/sizes, corrupted reads.
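The failure can be sketched from the reader's side. This is a minimal illustration, not code from this repository: it assumes the Zarr v3 sharding index layout (one little-endian `uint64` offset/nbytes pair per chunk, index at the end of the file, a 4-byte crc32c trailer, and `2^64 - 1` in both fields marking an empty chunk). The names `shard_index_plausible`, `ENTRY_BYTES`, and `CRC_BYTES` are hypothetical. The check is a heuristic bounds test only; it does not verify the checksum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRY_BYTES 16u         /* offset + nbytes, each a little-endian uint64 */
#define CRC_BYTES 4u            /* assumed crc32c trailer after the index */
#define EMPTY_CHUNK UINT64_MAX  /* assumed marker for a chunk with no data */

/* Decode a little-endian uint64 regardless of host byte order. */
static uint64_t load_le64(const uint8_t *p) {
  uint64_t v = 0;
  for (int i = 7; i >= 0; --i) v = (v << 8) | p[i];
  return v;
}

/* Returns 1 if the trailing bytes of `shard` look like a plausible index
 * for `n_chunks` chunks, 0 otherwise. `shard` holds the whole file. */
static int shard_index_plausible(const uint8_t *shard, uint64_t file_size,
                                 uint64_t n_chunks) {
  uint64_t index_total_bytes = n_chunks * ENTRY_BYTES + CRC_BYTES;
  if (file_size < index_total_bytes) return 0;

  /* Same location the reader in the steps above would compute. */
  uint64_t index_offset = file_size - index_total_bytes;

  for (uint64_t i = 0; i < n_chunks; ++i) {
    const uint8_t *e = shard + index_offset + i * ENTRY_BYTES;
    uint64_t off = load_le64(e);
    uint64_t nbytes = load_le64(e + 8);
    if (off == EMPTY_CHUNK && nbytes == EMPTY_CHUNK) continue;
    /* Every chunk must lie entirely inside the data region. */
    if (off > index_offset || nbytes > index_offset - off) return 0;
  }
  return 1;
}
```

On an unfinalized shard the bytes at `index_offset` are chunk payload, so the decoded entries are arbitrary and will usually fail the bounds test. Payload that happens to pass would still be misread, which is why the metadata should not advertise such shards at all.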

This is also the open question raised in #122.

Gap 2 — `zarr.json` writes are non-atomic

fs_put (src/zarr/store_fs.c:24-41) opens the metadata file with `O_WRONLY | O_CREAT | O_TRUNC`, writes, then closes. A concurrent reader can observe:

  • Zero-length file (between `O_TRUNC` and the first write completing)
  • Partial JSON (mid-write)
  • Truncated JSON that persists (if the writer fails mid-write and leaves a partial file behind)

Any of these breaks a streaming reader's parse step before it even gets to the shard files.

Proposed fixes

Fix 1: clamp `update_append` to the last completed shard boundary

Advertise only sizes derived from `shard_epoch * chunks_per_shard_append` — i.e., drop the `epoch_in_shard` term. Metadata then only ever describes data that is already finalized on disk.

Cost: append-dim shape lags the actual write cursor by up to one shard's worth of chunks. For typical `chunks_per_shard_append` values this is acceptable; the reader sees stair-step growth instead of smooth growth.
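A minimal sketch of the clamp, for comparison with the current formula. The struct and function names are hypothetical; only the three field names come from the formula in this issue. It also assumes `shard_epoch` equals the number of shards already finalized.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container; the real state lives elsewhere in the codebase. */
struct append_cursor {
  uint64_t shard_epoch;             /* assumed: count of finalized shards */
  uint64_t chunks_per_shard_append; /* chunks per shard along the append dim */
  uint64_t epoch_in_shard;          /* chunks delivered into the open shard */
};

/* Current behaviour: also counts chunks in the unfinalized shard. */
static uint64_t advertised_chunks_unclamped(const struct append_cursor *c) {
  assert(c != NULL);
  return c->shard_epoch * c->chunks_per_shard_append + c->epoch_in_shard;
}

/* Fix 1: advertise whole, finalized shards only. */
static uint64_t advertised_chunks_clamped(const struct append_cursor *c) {
  assert(c != NULL);
  return c->shard_epoch * c->chunks_per_shard_append;
}
```

With `shard_epoch = 3`, `chunks_per_shard_append = 8`, and `epoch_in_shard = 5`, the unclamped value is 29 and the clamped value is 24. The difference is always `epoch_in_shard`, which is the lag described above.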

If smoother growth matters, an alternative is to also flush the in-progress shard's index (and re-open / re-extend) on each metadata update — but that defeats the streaming-write design and isn't recommended.

Fix 2: atomic-rename `zarr.json` writes

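In `fs_put` (and the equivalent S3 path if applicable), write to `.tmp.` on the same filesystem as the final path, `fsync`, then `rename` over the final path. POSIX guarantees `rename` is atomic within the same filesystem, so a reader sees either the complete old file or the complete new one, never a partial write. Win32: `MoveFileEx(MOVEFILE_REPLACE_EXISTING | MOVEFILE_WRITE_THROUGH)`.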

This is a small, contained change to one function plus a platform helper.
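A minimal POSIX-only sketch of the write-then-rename pattern. `put_atomic` is a hypothetical stand-in and does not reflect the real `fs_put` signature; the temporary name `<path>.tmp.<pid>` is likewise an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for fs_put. Returns 0 on success, -1 on error. */
static int put_atomic(const char *path, const void *buf, size_t len) {
  char tmp[4096];
  /* Temp file sits next to the target so rename stays on one filesystem. */
  int n = snprintf(tmp, sizeof tmp, "%s.tmp.%ld", path, (long)getpid());
  if (n < 0 || (size_t)n >= sizeof tmp) return -1;

  int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return -1;

  const char *p = buf;
  size_t left = len;
  while (left > 0) {
    ssize_t w = write(fd, p, left);
    if (w < 0) {
      if (errno == EINTR) continue; /* interrupted: retry the write */
      goto fail;
    }
    p += w;
    left -= (size_t)w;
  }

  /* Flush file contents before the new name becomes visible. */
  if (fsync(fd) != 0) goto fail;
  if (close(fd) != 0) { fd = -1; goto fail; }
  fd = -1;

  /* Atomically replace the old file; readers see old or new, never a mix. */
  if (rename(tmp, path) != 0) goto fail;
  return 0;

fail:
  if (fd >= 0) close(fd);
  unlink(tmp); /* best-effort cleanup of the temp file */
  return -1;
}
```

The sketch does not `fsync` the containing directory, so the rename itself may not survive a crash; that affects durability, not what a concurrent reader can observe. A failed write leaves the previous `zarr.json` untouched, which also removes the truncated-file case from Gap 2.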

Out of scope

Test gaps

  • Add a stress test that reads `zarr.json` continuously while a stream writes, asserting it always parses.
  • Add a test that opens an in-progress shard at the advertised size and verifies the index is parseable (gates the Fix 1 invariant).
