fix(agent-runner): exit on persistent inbound.db corruption errors #2597
Merged
The follow-up poll catches and logs SQLite errors but never recovers from them. On Docker Desktop macOS, the kernel page cache for the inbound.db bind mount can latch a torn snapshot mid-host-write (a known virtiofs / gRPC-FUSE coherency issue), after which every fresh openInboundDb() in the same process sees the same broken view and emits 'database disk image is malformed' at the poll rate (2/sec). Reopening the DB handle inside the container does not recover; only a fresh container mount does.

The fix: after CORRUPTION_STREAK_EXIT consecutive corruption errors (~5s), log a clear message and process.exit(75) so host-sweep respawns the container with a fresh mount. Transient single torn reads are still tolerated.

- Add isCorruptionError() helper covering the three SQLite read-side corruption symptoms (disk image malformed, SQLITE_CORRUPT, file is not a database).
- Add streak counter scoped to processQuery's pollHandle so it resets on any successful poll or non-corruption error.
- Add unit tests for the matcher.

Refs the cross-mount invariants documented in db/connection.ts:11-18.
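For illustration, a matcher along the lines described above might look like the following. This is a minimal sketch, not the PR's actual code: the `CORRUPTION_PATTERNS` list, the case-insensitive substring matching, and the check of an optional `code` property are all assumptions; only the three symptoms themselves come from the description.

```typescript
// Hypothetical sketch of the matcher; the real helper in the PR may differ.
// The three symptoms are taken from the commit message.
const CORRUPTION_PATTERNS: readonly string[] = [
  "database disk image is malformed",
  "sqlite_corrupt",
  "file is not a database",
];

function isCorruptionError(err: unknown): boolean {
  // Accept Error instances as well as arbitrary thrown values.
  const message = err instanceof Error ? err.message : String(err);

  // Assumption: some SQLite bindings attach a string `code` to the error.
  const code =
    typeof err === "object" && err !== null && "code" in err
      ? String((err as { code: unknown }).code)
      : "";

  // Match case-insensitively against both the code and the message.
  const haystack = `${code} ${message}`.toLowerCase();
  return CORRUPTION_PATTERNS.some((pattern) => haystack.includes(pattern));
}
```

Substring matching keeps the sketch independent of any particular SQLite binding, at the cost of relying on message text that a binding could in principle reword.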
@kartast Thank you for the contribution!
This was referenced May 24, 2026
Problem
On Docker Desktop macOS, the agent-runner container can get stuck in an endless `database disk image is malformed` log loop on the follow-up poll, blocking message delivery indefinitely. Observed in production: the container ran for 25+ minutes emitting:

```
[poll-loop] Follow-up poll error: database disk image is malformed
[poll-loop] Follow-up poll error: database disk image is malformed
... (2/sec, indefinitely)
```
Host-side `PRAGMA integrity_check` on both `inbound.db` and `outbound.db` returns `ok` — the file isn't actually corrupt. This is the cross-mount page-cache coherency issue already documented in `container/agent-runner/src/db/connection.ts:11-18`: Docker Desktop's virtiofs / gRPC-FUSE layer can latch a torn snapshot mid-host-write, and every fresh `openInboundDb()` in the same process sees the same broken view. Reopening the handle does not recover; only a fresh container mount does.
The follow-up poll's catch block (poll-loop.ts:358-367) logs the error and keeps polling — which is correct for transient errors but catastrophic for this one. The user's only recourse is `docker restart `.
Fix
After `CORRUPTION_STREAK_EXIT` (10) consecutive corruption errors (~5 s at `ACTIVE_POLL_INTERVAL_MS = 500ms`), exit the process with code 75. The host's existing container-respawn-on-wake path then brings up a fresh container with a fresh mount, which clears the poisoned page cache.
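The streak logic can be sketched as a small counter that resets on anything other than a corruption error. This is an illustrative sketch under stated assumptions, not the code from `poll-loop.ts`: the `CorruptionStreak` class, the `PollOutcome` type, the `onPollResult` function, and the `EXIT_CODE_FRESH_MOUNT` name are hypothetical, and the exit function is injected so the logic can be exercised without terminating the process. The constant values (10, 500 ms, 75) are the ones quoted above.

```typescript
// Values from the PR description.
const ACTIVE_POLL_INTERVAL_MS = 500;
const CORRUPTION_STREAK_EXIT = 10;
// Hypothetical name; the value 75 is from the PR description.
const EXIT_CODE_FRESH_MOUNT = 75;

// 10 consecutive failures at one poll per 500 ms is roughly 5 s.
const msBeforeExit = CORRUPTION_STREAK_EXIT * ACTIVE_POLL_INTERVAL_MS;

// Hypothetical classification of a single poll attempt.
type PollOutcome = "ok" | "corruption" | "other-error";

class CorruptionStreak {
  private streak = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Records one poll outcome; returns true once the limit is reached.
  record(outcome: PollOutcome): boolean {
    if (outcome === "corruption") {
      this.streak += 1;
    } else {
      // A successful poll or a non-corruption error breaks the streak.
      this.streak = 0;
    }
    return this.streak >= this.limit;
  }
}

// Hypothetical glue for a poll loop. In the runner, `exit` would be
// process.exit, so the host can respawn the container with a fresh mount.
function onPollResult(
  streak: CorruptionStreak,
  outcome: PollOutcome,
  exit: (code: number) => void,
): void {
  if (streak.record(outcome)) {
    console.error(
      `[poll-loop] ${CORRUPTION_STREAK_EXIT} consecutive corruption errors ` +
        `(~${msBeforeExit / 1000}s); exiting so the container is respawned`,
    );
    exit(EXIT_CODE_FRESH_MOUNT);
  }
}
```

Resetting on every non-corruption outcome is what keeps a single torn read from triggering an exit: only an unbroken run of corruption errors reaches the limit.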
Test plan
Notes