fix(agent-runner): exit on persistent inbound.db corruption errors by kartast · Pull Request #2597 · nanocoai/nanoclaw

kartast · 2026-05-23T02:10:44Z

Problem

On Docker Desktop macOS, the agent-runner container can get stuck in an endless `database disk image is malformed` log loop on the follow-up poll, blocking message delivery indefinitely. Observed in production: a container ran for 25+ minutes emitting:

```
[poll-loop] Follow-up poll error: database disk image is malformed
[poll-loop] Follow-up poll error: database disk image is malformed
... (2/sec, indefinitely)
```

Host-side `PRAGMA integrity_check` on both `inbound.db` and `outbound.db` returns `ok` — the file isn't actually corrupt. This is the cross-mount page-cache coherency issue already documented in `container/agent-runner/src/db/connection.ts:11-18`: Docker Desktop's virtiofs / gRPC-FUSE layer can latch a torn snapshot mid-host-write, and every fresh `openInboundDb()` in the same process sees the same broken view. Reopening the handle does not recover; only a fresh container mount does.

The follow-up poll's catch block (poll-loop.ts:358-367) logs the error and keeps polling — which is correct for transient errors but catastrophic for this one. The user's only recourse is `docker restart `.

Fix

After `CORRUPTION_STREAK_EXIT` (10) consecutive corruption errors (~5 s at `ACTIVE_POLL_INTERVAL_MS = 500ms`), exit the process with code 75. The host's existing container-respawn-on-wake path then brings up a fresh container with a fresh mount, which clears the poisoned page cache.

- Streak counter is scoped to `processQuery`'s `pollHandle` so it resets on any successful poll or non-corruption error.
- Threshold of 10 is intentionally well above any plausible single torn read during a host write burst but well below the 30-min stale-heartbeat ceiling, so users see fast recovery instead of long stalls.
- Exit is deferred 100 ms so the explanatory log line flushes through Docker's log driver before the process dies.
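The streak logic described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `CORRUPTION_STREAK_EXIT`, `ACTIVE_POLL_INTERVAL_MS`, the exit code 75, and the 100 ms delay come from the description, while `CorruptionStreak`, `PollOutcome`, `scheduleExit`, and the log wording are hypothetical names invented for the example.

```typescript
// Values taken from the PR description.
const CORRUPTION_STREAK_EXIT = 10;
const ACTIVE_POLL_INTERVAL_MS = 500;
const EXIT_TEMPFAIL = 75; // EX_TEMPFAIL
const EXIT_FLUSH_DELAY_MS = 100;

// 10 polls x 500 ms = 5000 ms, i.e. the "~5 s" quoted above.
const APPROX_MS_TO_EXIT = CORRUPTION_STREAK_EXIT * ACTIVE_POLL_INTERVAL_MS;

// Hypothetical classification of a single poll attempt.
type PollOutcome = "ok" | "corruption" | "other-error";

// Hypothetical helper: counts consecutive corruption errors and resets
// on any successful poll or non-corruption error.
class CorruptionStreak {
  private count = 0;
  private readonly threshold: number;

  constructor(threshold: number = CORRUPTION_STREAK_EXIT) {
    this.threshold = threshold;
  }

  // Returns true once the streak has reached the threshold.
  record(outcome: PollOutcome): boolean {
    if (outcome !== "corruption") {
      this.count = 0;
      return false;
    }
    this.count += 1;
    return this.count >= this.threshold;
  }

  get current(): number {
    return this.count;
  }
}

// Hypothetical helper: log first, then exit after a short delay so the
// message can flush. The exit function is injected; a real caller would
// presumably pass (code) => process.exit(code).
function scheduleExit(
  exit: (code: number) => void,
  delayMs: number = EXIT_FLUSH_DELAY_MS,
): void {
  // Illustrative wording only, not the actual log line from the PR.
  console.error(
    `[poll-loop] ${CORRUPTION_STREAK_EXIT} consecutive corruption errors; exiting with code ${EXIT_TEMPFAIL}`,
  );
  setTimeout(() => exit(EXIT_TEMPFAIL), delayMs);
}
```

In a poll loop, the catch block would call `record("corruption")` when the error matches the corruption check and `record("other-error")` otherwise, and call `scheduleExit` the first time `record` returns true. Injecting `exit` keeps the sketch testable without terminating the test process.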

Test plan

- `bun test src/poll-loop`: 3 new `isCorruptionError` tests pass (the failure in `formatter > should format multiple chat messages as XML block` is pre-existing and unrelated).
- `bunx tsc --noEmit`: clean.
- Reproduced locally by triggering the symptom (host-side message write race); confirmed the container now exits and respawns instead of looping forever.

Notes

- Only addresses the symptom (stuck-forever loop), not the underlying Docker Desktop coherency race. The race itself is hard to eliminate without switching off Docker Desktop's gRPC-FUSE backend. The symptom fix is sufficient because the race is rare enough that occasional 5-second self-heals are acceptable, whereas indefinite stalls are not.
- Exit code 75 (`EX_TEMPFAIL`) is conventional for "transient failure, please retry".

The follow-up poll catches and logs SQLite errors but never recovers from them. On Docker Desktop macOS, the kernel page cache for the inbound.db bind mount can latch a torn snapshot mid-host-write (a known virtiofs / gRPC-FUSE coherency issue), after which every fresh openInboundDb() in the same process sees the same broken view and emits 'database disk image is malformed' at the poll rate (2/sec). Reopening the DB handle inside the container does not recover; only a fresh container mount does.

The fix: after CORRUPTION_STREAK_EXIT consecutive corruption errors (~5s), log a clear message and process.exit(75) so host-sweep respawns the container with a fresh mount. Transient single torn reads are still tolerated.

- Add isCorruptionError() helper covering the three SQLite read-side corruption symptoms (disk image malformed, SQLITE_CORRUPT, file is not a database).
- Add streak counter scoped to processQuery's pollHandle so it resets on any successful poll or non-corruption error.
- Add unit tests for the matcher.

Refs the cross-mount invariants documented in db/connection.ts:11-18.
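A matcher for the three symptoms listed in the commit message might look like the sketch below. The PR's actual `isCorruptionError()` body is not shown here, so the approach is an assumption: case-insensitive substring matching on the error message plus an optional `code` property. `CORRUPTION_MARKERS` is a hypothetical name.

```typescript
// The three read-side symptoms named in the commit message, lowercased
// for case-insensitive comparison. The constant name is hypothetical.
const CORRUPTION_MARKERS: readonly string[] = [
  "database disk image is malformed",
  "sqlite_corrupt",
  "file is not a database",
];

// Assumed approach: look for any marker in the error's message or in a
// string-like `code` property, if the thrown value carries one.
function isCorruptionError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  const code =
    typeof err === "object" && err !== null && "code" in err
      ? String((err as { code: unknown }).code)
      : "";
  const haystack = `${message} ${code}`.toLowerCase();
  return CORRUPTION_MARKERS.some((marker) => haystack.includes(marker));
}
```

Accepting `unknown` means the catch block can pass whatever was thrown without narrowing it first. Errors that match none of the markers, such as a lock contention message, fall through as non-corruption errors and reset the streak.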

gavrielc · 2026-05-23T17:06:43Z

@kartast Thank you for the contribution!

kartast requested review from gabi-simons and gavrielc as code owners May 23, 2026 02:10

Merge branch 'main' into fix/db-malformed-self-restart

9dc9efa

gavrielc merged commit c76ecb4 into nanocoai:main May 23, 2026

gavrielc merged 2 commits into nanocoai:main from kartast:fix/db-malformed-self-restart
