flush_file I/O errors on NFS are silently swallowed, producing corrupt output with exit code 0

  Problem

  When writing to an NFS/Lustre filesystem under high concurrency, flush_file in platform.cpp can fail with Input/output
   error. The error is logged but not raised so the write returns success to the caller, the process exits 0, and the
  resulting shard files are silently corrupt.

  Log output observed:
  platform.cpp:93 flush_file: Failed to flush file: Input/output error

  Reading back the written shards then raises:
  RuntimeError: error during blosc decompression: 0

  The corruption is partial and timepoint-dependent: shards written during peak NFS contention are corrupt while earlier
   ones are fine. This makes it extremely hard to detect without reading back every chunk.

  Environment: NFS/Lustre archive filesystem, 200+ parallel writers, acquire-zarr 0.8.0

  Expected behavior
  If flush_file fails, the error should propagate up as a C++ exception (and surface as a Python exception), so the
  caller knows the write failed and can exit non-zero, retry, or abort cleanly.

  Suggested fixes
  1. Propagate the error: Raise from flush_file rather than logging and continuing, so callers always know when a write
  has failed.
  2. Atomic shard writes: Write each shard to a temp file and rename it into place only on success. This prevents a
  partially-flushed file from ever appearing as a valid (but corrupt) shard, regardless of whether the exception
  propagates correctly.

  Reproduction

  The failure is not easily reproducible without an NFS filesystem under write contention, but the pattern is:

```
  import acquire_zarr as aqz
  import numpy as np
  import zarr

  settings = aqz.StreamSettings(
      store_path="/nfs/archive/test.zarr",  # NFS path under high concurrency
      arrays=[
          aqz.ArraySettings(
              output_key="0",
              shape=(1, 1, 1, 2048, 2048),
              chunks=(1, 1, 1, 2048, 2048),
              dtype=aqz.DataType.UINT16,
              version=aqz.ZarrVersion.V3,
          )
      ],
      version=aqz.ZarrVersion.V3,
  )

  stream = aqz.ZarrStream(settings)
  for t in range(141):
      block = np.random.randint(0, 65535, (1, 1, 1, 2048, 2048), dtype=np.uint16)
      stream.append(block, key="")
```

  stream.close()  # Returns without raising, even when flush failed.
                  # stderr shows: platform.cpp:93 flush_file: Failed to flush file: Input/output error

  Reading back the written data then raises:
  arr = zarr.open("/nfs/archive/test.zarr/0", mode="r")
  np.asarray(arr[70, 0, 0, :4, :4]) # RuntimeError: error during blosc decompression: 0


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

flush_file I/O errors on NFS are silently swallowed, producing corrupt output with exit code 0 #229

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

flush_file I/O errors on NFS are silently swallowed, producing corrupt output with exit code 0 #229

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions