fix(agent-runner): exit on persistent inbound.db corruption errors #2597

Merged: gavrielc merged 2 commits into nanocoai:main from kartast:fix/db-malformed-self-restart on May 23, 2026

@kartast commented on May 23, 2026

Problem

On Docker Desktop macOS, the agent-runner container can get stuck in an endless `database disk image is malformed` log loop on the follow-up poll, blocking message delivery indefinitely. Observed in production: one container ran for 25+ minutes emitting:

```
[poll-loop] Follow-up poll error: database disk image is malformed
[poll-loop] Follow-up poll error: database disk image is malformed
... (2/sec, indefinitely)
```

Host-side `PRAGMA integrity_check` on both `inbound.db` and `outbound.db` returns `ok` — the file isn't actually corrupt. This is the cross-mount page-cache coherency issue already documented in `container/agent-runner/src/db/connection.ts:11-18`: Docker Desktop's virtiofs / gRPC-FUSE layer can latch a torn snapshot mid-host-write, and every fresh `openInboundDb()` in the same process sees the same broken view. Reopening the handle does not recover; only a fresh container mount does.

The follow-up poll's catch block (poll-loop.ts:358-367) logs the error and keeps polling — which is correct for transient errors but catastrophic for this one. The user's only recourse is `docker restart `.

Fix

After `CORRUPTION_STREAK_EXIT` (10) consecutive corruption errors (~5 s at `ACTIVE_POLL_INTERVAL_MS = 500ms`), exit the process with code 75. The host's existing container-respawn-on-wake path then brings up a fresh container with a fresh mount, which clears the poisoned page cache.

  • Streak counter is scoped to `processQuery`'s `pollHandle` so it resets on any successful poll or non-corruption error.
  • Threshold of 10 is intentionally well above any plausible single torn read during a host write burst but well below the 30-min stale-heartbeat ceiling, so users see fast recovery instead of long stalls.
  • Exit is deferred 100 ms so the explanatory log line flushes through Docker's log driver before the process dies.
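
The streak logic described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `CORRUPTION_STREAK_EXIT` and `ACTIVE_POLL_INTERVAL_MS` come from the description, while `PollOutcome`, `CorruptionStreak`, and `approxMsToExit` are hypothetical names introduced only for this example.

```typescript
// Constants named in the PR description.
const CORRUPTION_STREAK_EXIT = 10;
const ACTIVE_POLL_INTERVAL_MS = 500;

// Hypothetical type for this sketch: how one follow-up poll ended.
type PollOutcome = "ok" | "corruption" | "other-error";

// Hypothetical helper: counts consecutive corruption errors.
class CorruptionStreak {
  private streak = 0;
  private readonly threshold: number;

  constructor(threshold: number = CORRUPTION_STREAK_EXIT) {
    this.threshold = threshold;
  }

  // Records one poll outcome; returns true once the streak reaches the threshold.
  record(outcome: PollOutcome): boolean {
    // A successful poll or a non-corruption error resets the streak.
    this.streak = outcome === "corruption" ? this.streak + 1 : 0;
    return this.streak >= this.threshold;
  }
}

// Approximate time from the first corruption error to the exit decision,
// assuming one poll per ACTIVE_POLL_INTERVAL_MS.
function approxMsToExit(): number {
  return CORRUPTION_STREAK_EXIT * ACTIVE_POLL_INTERVAL_MS;
}
```

In a catch block, a caller would presumably log an explanatory line when `record("corruption")` returns `true` and then schedule the exit, for example `setTimeout(() => process.exit(75), 100)`, so the log line has time to flush. With the values above, `approxMsToExit()` gives 5000 ms, matching the "~5 s" figure in the description.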

Test plan

  • `bun test src/poll-loop`: 3 new `isCorruptionError` tests pass (the failure in `formatter > should format multiple chat messages as XML block` is pre-existing and unrelated).
  • `bunx tsc --noEmit`: clean.
  • Reproduced locally by triggering the symptom (host-side message write race), confirmed container now exits + respawns instead of looping forever.

Notes

  • Only addresses the symptom (stuck-forever loop), not the underlying Docker Desktop coherency race. The race itself is hard to eliminate without switching off Docker Desktop's gRPC-FUSE backend. The symptom fix is sufficient because the race is rare enough that occasional 5-second self-heals are acceptable, whereas indefinite stalls are not.
  • Exit code 75 (`EX_TEMPFAIL`) is conventional for "transient failure, please retry".

The follow-up poll catches and logs SQLite errors but never recovers
from them. On Docker Desktop macOS, the kernel page cache for the
inbound.db bind mount can latch a torn snapshot mid-host-write (a known
virtiofs / gRPC-FUSE coherency issue), after which every fresh
openInboundDb() in the same process sees the same broken view and
emits 'database disk image is malformed' at the poll rate (2/sec).

Reopening the DB handle inside the container does not recover — only
a fresh container mount does. The fix: after CORRUPTION_STREAK_EXIT
consecutive corruption errors (~5s), log a clear message and
process.exit(75) so host-sweep respawns the container with a fresh
mount. Transient single torn reads are still tolerated.

- Add isCorruptionError() helper covering the three SQLite read-side
  corruption symptoms (disk image malformed, SQLITE_CORRUPT, file is
  not a database).
- Add streak counter scoped to processQuery's pollHandle so it resets
  on any successful poll or non-corruption error.
- Add unit tests for the matcher.

Refs the cross-mount invariants documented in db/connection.ts:11-18.
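
For illustration, a matcher covering the three symptoms listed in the commit message might look like the sketch below. The actual `isCorruptionError()` implementation is not shown in this conversation, so the details here are assumptions: in particular, checking an optional `code` property on the error object is a guess about how the SQLite binding reports errors, and the substring list is taken directly from the commit message.

```typescript
// Substrings taken from the commit message's list of read-side corruption symptoms.
const CORRUPTION_MARKERS: readonly string[] = [
  "database disk image is malformed",
  "sqlite_corrupt",
  "file is not a database",
];

// Sketch of a corruption matcher; not the implementation merged in the PR.
function isCorruptionError(err: unknown): boolean {
  // Use the message of Error instances, or stringify anything else.
  const message = err instanceof Error ? err.message : String(err);

  // Assumption: some SQLite bindings attach a string `code` such as "SQLITE_CORRUPT".
  let code = "";
  if (typeof err === "object" && err !== null && "code" in err) {
    code = String((err as { code: unknown }).code);
  }

  // Case-insensitive substring match against code and message together.
  const haystack = `${code} ${message}`.toLowerCase();
  return CORRUPTION_MARKERS.some((marker) => haystack.includes(marker));
}
```

Under this sketch, `isCorruptionError(new Error("database disk image is malformed"))` is `true`, while an unrelated error such as `new Error("database is locked")` is `false` and would reset the streak rather than extend it.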
@gavrielc gavrielc merged commit c76ecb4 into nanocoai:main May 23, 2026
@gavrielc (Collaborator) commented

@kartast Thank you for the contribution!
