fix: gateway double-spawn, sync errors, and crash retry #337
Merged
andreasjansson merged 16 commits on Mar 28, 2026
When the OpenClaw gateway process starts successfully and passes the port health check, but then crashes while handling the first request, subsequent containerFetch/wsConnect calls throw 'is not listening' errors with no recovery path. The user sees HTTP 500s followed by connection failures.

This adds retry-on-crash logic to both HTTP and WebSocket proxy paths:

1. Detect 'is not listening' errors from the Sandbox SDK
2. Kill the dead gateway process
3. Restart the gateway via ensureMoltbotGateway()
4. Retry the request once

Also adds proper error handling around containerFetch (previously had no try-catch at all), returning structured JSON errors instead of unhandled exceptions.

Fixes #179
…Gateway

Complete the crash retry implementation:

- HTTP proxy: catch 'is not listening' errors from containerFetch, kill crashed gateway, restart, retry once
- WebSocket proxy: same for wsConnect
- Return structured errors (503 for crash+failed recovery, 502 for other)

Extract killGateway() into gateway/process.ts as a shared function used by both the restart handler and the crash retry logic. Removes duplicate kill code from index.ts and api.ts.

Tested on staging: kill gateway → next HTTP request returns 200 (retry worked).
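The retry flow described in these two commit messages can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `Fetcher`, `GatewayControl`, `ProxyResponse`, and `fetchWithCrashRetry` are hypothetical names, and the real containerFetch and ensureMoltbotGateway() signatures may differ. Only the 'is not listening' substring and the 503/502 split come from the text above.

```typescript
// Hypothetical shapes standing in for the real Sandbox SDK types.
interface ProxyResponse {
  status: number;
  body: string;
}

type Fetcher = (request: string) => Promise<ProxyResponse>;

interface GatewayControl {
  kill: () => Promise<void>; // stands in for killGateway()
  ensure: () => Promise<void>; // stands in for ensureMoltbotGateway()
}

// Matches the error text quoted in the commit message.
function isNotListeningError(err: unknown): boolean {
  return err instanceof Error && err.message.includes("is not listening");
}

function jsonError(status: number, error: string, err: unknown): ProxyResponse {
  return { status, body: JSON.stringify({ error, message: String(err) }) };
}

async function fetchWithCrashRetry(
  fetcher: Fetcher,
  control: GatewayControl,
  request: string,
): Promise<ProxyResponse> {
  try {
    return await fetcher(request);
  } catch (err) {
    if (!isNotListeningError(err)) {
      // Any other proxy failure: structured 502 instead of an unhandled exception.
      return jsonError(502, "proxy_error", err);
    }
    try {
      await control.kill(); // remove the dead gateway process
      await control.ensure(); // start a fresh gateway
      return await fetcher(request); // retry exactly once
    } catch (retryErr) {
      // Crash detected and recovery failed: structured 503.
      return jsonError(503, "gateway_crashed", retryErr);
    }
  }
}
```

Retrying only once keeps a gateway that crashes on every request from turning into an unbounded restart loop; the caller gets a 503 instead.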
The 4 e2e variants pull the sandbox base image in parallel, frequently hitting Docker Hub rate limits. Retry up to 5 times with 60s backoff.
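For illustration, the retry policy can be written as a small helper. The CI change itself most likely lives in a shell step or workflow file; the sketch below uses TypeScript only to stay consistent with the worker code, and `retryWithBackoff` is a hypothetical name. It assumes "up to 5 times" means 5 attempts in total with a fixed 60s wait between them.

```typescript
// Hypothetical helper: retry an async operation with a fixed delay between attempts.
async function retryWithBackoff<T>(
  attempt: () => Promise<T>,
  maxAttempts: number,
  delayMs: number,
  // Injectable sleep so the delay can be replaced in tests.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let i = 1; i <= maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      // Wait only if another attempt will follow.
      if (i < maxAttempts) await sleep(delayMs);
    }
  }
  throw lastError;
}
```

With these assumptions, the image pull would be wrapped as `retryWithBackoff(pullImage, 5, 60_000)`, where `pullImage` is a placeholder for whatever performs the pull.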
…291)

Fixes #289, closes #291

1. Gateway double-spawn (#289): findExistingMoltbotProcess() missed processes invoked as 'bash /usr/local/bin/start-openclaw.sh' (full path with shell prefix), causing a second spawn that fails with 'port already in use'. Fix: broaden command matching and add a TCP port pre-check before spawning as a safety net.

2. Clarify sandbox.start() is NOT needed (#291): Added a comment to the sandbox middleware explaining why we don't call sandbox.start(). The SDK's containerFetch() auto-starts the container, and the catch-all route uses ensureMoltbotGateway() for explicit lifecycle management. Three separate PRs (#292, #294, #315) proposed adding sandbox.start() based on a misunderstanding of the API.
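A rough sketch of the two safeguards from item 1, under stated assumptions: `ProcessInfo`, its `status` values, `isGatewayCommand`, and `shouldSpawnGateway` are hypothetical, and the actual matching rules in findExistingMoltbotProcess() may differ. The match strings are limited to names that appear in this PR.

```typescript
// Hypothetical process record; the SDK's real shape may differ.
interface ProcessInfo {
  id: string;
  command: string;
  status: string; // assumed values: "starting" | "running" | "exited"
}

// Substring matching tolerates a shell prefix and a full path, e.g.
// "bash /usr/local/bin/start-openclaw.sh".
const GATEWAY_PATTERNS = ["start-openclaw.sh", "openclaw-gateway"];

function isGatewayCommand(command: string): boolean {
  return GATEWAY_PATTERNS.some((pattern) => command.includes(pattern));
}

// Decide whether a new gateway should be spawned.
// isPortOpen is an injected probe for the gateway's TCP port.
async function shouldSpawnGateway(
  processes: ProcessInfo[],
  isPortOpen: () => Promise<boolean>,
): Promise<boolean> {
  const alive = processes.some(
    (p) =>
      (p.status === "running" || p.status === "starting") &&
      isGatewayCommand(p.command),
  );
  if (alive) return false;
  // Safety net: something is already listening even though no process matched.
  if (await isPortOpen()) return false;
  return true;
}
```

The port probe is checked second because it is the more expensive step and only matters when the process list did not already answer the question.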
When the gateway fails to start, we need to see what /api/status is returning. Added:

- Background debug loop in _setup that polls /api/status and logs to stderr
- 15s timeout around restoreIfNeeded calls (was potentially hanging)
- Logging in /api/status handler
…quest

The catch-all route and /api/status were calling restoreIfNeeded on EVERY request, including WebSocket reconnects from the browser. If a reconnect happened after a sync stored a backup handle, restoreIfNeeded would mount a FUSE overlay. The next createBackup would then reset the overlay, wiping upper-layer files (like the marker).

Fix: check if the gateway is already running FIRST. Only call restoreIfNeeded if the gateway needs to be started.
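The ordering change can be illustrated with a small sketch. The `GatewayDeps` interface and `ensureGateway` function are hypothetical stand-ins for the real handlers; the only point shown is that the restore runs on the start path and nowhere else.

```typescript
// Hypothetical dependencies, injected so the ordering is easy to see and test.
interface GatewayDeps {
  isGatewayRunning: () => Promise<boolean>;
  restoreIfNeeded: () => Promise<void>;
  startGateway: () => Promise<void>;
}

async function ensureGateway(
  deps: GatewayDeps,
): Promise<"already-running" | "started"> {
  // Check first: a WebSocket reconnect against a healthy gateway must not
  // trigger a restore, because that would mount the overlay again.
  if (await deps.isGatewayRunning()) {
    return "already-running";
  }
  // Restore only when a start is actually required.
  await deps.restoreIfNeeded();
  await deps.startGateway();
  return "started";
}
```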
On cold start, sandbox.listProcesses() can hang if the container isn't ready yet. This causes the catch-all route to block forever, returning an empty response to the browser (blank page). Adding a 10s timeout so the catch-all falls through to the loading page instead of hanging.
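One common way to bound a call that may hang is to race it against a timer. The helper below is a generic sketch, not the code from this PR; `withTimeout` is a hypothetical name, and the fallback value is what lets the caller fall through instead of blocking.

```typescript
// Resolve with `fallback` if `work` has not settled within `ms` milliseconds.
// A rejection from `work` before the deadline still propagates to the caller.
async function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  fallback: T,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Clear the timer so it does not keep the event loop alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Under these assumptions, the catch-all route could await `withTimeout(sandbox.listProcesses(), 10_000, null)` and serve the loading page when the result is `null`. The timed-out call is not cancelled; it is simply no longer awaited.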
sandcastle pushed a commit to sandcastle/openclaw-worker that referenced this pull request on Jun 12, 2026:
…y-double-spawn-and-sync-errors fix: gateway double-spawn, sync errors, and crash retry

Summary
Fixes #289, #291, #179

Gateway double-spawn prevention
- … listProcesses()), skip the spawn
- … pgrep/pkill/ss: Process.kill() only kills the tracked shell PID, but the forked openclaw-gateway child survives. Shared killGateway() function used by both restart handler and crash retry

Backup/restore reliability
- Only call restoreIfNeeded when the gateway needs to start (not on every request). The SDK's createBackup() resets the FUSE overlay, wiping upper-layer writes. WebSocket reconnects through the catch-all route were triggering unnecessary restores
- Timeout around restoreIfNeeded to prevent hanging when the container isn't ready

Crash retry (fixes #179)
- Catch 'is not listening' errors from containerFetch, kill crashed gateway, restart, retry once
- Same for wsConnect

Testing
All 4 CI variants pass 23/23: base, discord, telegram, workers-ai