fix: gateway double-spawn, sync errors, and crash retry#337

Merged
andreasjansson merged 16 commits into main from ajansson/fix/gateway-double-spawn-and-sync-errors
Mar 28, 2026

Conversation

@andreasjansson andreasjansson commented Mar 25, 2026


Summary

Fixes #289, #291, #179

Gateway double-spawn prevention

  • Port probe safety net before spawning: if port 18789 is already open (gateway running but undetected by listProcesses()), skip the spawn
  • Reliable process kill via pgrep/pkill/ss: Process.kill() only kills the tracked shell PID, so the forked openclaw-gateway child survives. A shared killGateway() function is now used by both the restart handler and crash retry
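
The spawn guard described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `GatewayDeps` interface, its three methods, and `startGatewayOnce` are hypothetical stand-ins for the real Sandbox SDK calls, and only the port number comes from the description.

```typescript
// Hypothetical dependencies; the real worker talks to the Sandbox SDK instead.
interface GatewayDeps {
  findExistingProcess(): Promise<boolean>; // stand-in for the listProcesses()-based lookup
  isPortOpen(port: number): Promise<boolean>; // stand-in for the TCP port probe
  spawn(): Promise<void>; // stand-in for starting the gateway
}

const GATEWAY_PORT = 18789; // port named in the PR description

type StartOutcome = "already-tracked" | "port-open-skip" | "spawned";

async function startGatewayOnce(deps: GatewayDeps): Promise<StartOutcome> {
  // Normal path: the process list already shows a gateway.
  if (await deps.findExistingProcess()) return "already-tracked";
  // Safety net: a gateway is listening even though the process lookup missed it.
  if (await deps.isPortOpen(GATEWAY_PORT)) return "port-open-skip";
  await deps.spawn();
  return "spawned";
}
```

Keeping the probe as a second, independent check means a miss in process detection results in a skipped spawn rather than a 'port already in use' failure.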

Backup/restore reliability

  • Only call restoreIfNeeded when the gateway needs to start (not on every request). The SDK's createBackup() resets the FUSE overlay, wiping upper-layer writes. WebSocket reconnects through the catch-all route were triggering unnecessary restores
  • Unmount stale FUSE overlays before restore to clear whiteout entries from deleted files
  • 15s timeout on restoreIfNeeded to prevent hanging when the container isn't ready
  • Deploy script retries on Docker Hub 429 rate limits (5 attempts, 60s backoff)
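
One common way to bound a call such as restoreIfNeeded is to race it against a timer. The helper below is a sketch under that assumption; `withTimeout` is a hypothetical name and the PR may implement its 15s limit differently.

```typescript
// Hypothetical helper: rejects if `work` has not settled within `ms` milliseconds.
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  try {
    // Whichever settles first wins; the underlying work is not cancelled.
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // avoid leaving a pending timer behind
  }
}
```

A caller could then write `await withTimeout(restoreIfNeeded(), 15_000, "restoreIfNeeded")`, assuming restoreIfNeeded returns a promise. Note that the race only stops waiting; it does not abort the restore itself.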

Crash retry (fixes #179)

  • HTTP proxy: catch "is not listening" errors from containerFetch, kill crashed gateway, restart, retry once
  • WebSocket proxy: same for wsConnect
  • Return structured errors (503 for crash+failed recovery, 502 for other proxy errors)
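
The retry-once flow and the 503/502 split can be sketched as below. The names `fetchWithCrashRetry`, `RecoveryHooks`, and `ProxyError` and the error body shape are assumptions for illustration; only the "is not listening" substring, the kill/restart/retry order, and the status codes come from the description.

```typescript
// Substring the PR description quotes for a crashed gateway.
const NOT_LISTENING = "is not listening";

function isGatewayCrash(err: unknown): boolean {
  return err instanceof Error && err.message.includes(NOT_LISTENING);
}

// Hypothetical error shape; the real JSON body may differ.
interface ProxyError {
  status: 502 | 503;
  body: { error: string; detail: string };
}

function toProxyError(status: 502 | 503, error: string, cause: unknown): ProxyError {
  const detail = cause instanceof Error ? cause.message : String(cause);
  return { status, body: { error, detail } };
}

// Hypothetical hooks standing in for killGateway() and the gateway restart.
interface RecoveryHooks {
  killGateway(): Promise<void>;
  ensureGateway(): Promise<void>;
}

type ProxyResult<T> = { ok: true; value: T } | { ok: false; error: ProxyError };

async function fetchWithCrashRetry<T>(
  attempt: () => Promise<T>,
  hooks: RecoveryHooks,
): Promise<ProxyResult<T>> {
  try {
    return { ok: true, value: await attempt() };
  } catch (first) {
    // Anything other than a crash is reported as a plain proxy error.
    if (!isGatewayCrash(first)) {
      return { ok: false, error: toProxyError(502, "Proxy error", first) };
    }
    try {
      await hooks.killGateway();
      await hooks.ensureGateway();
      return { ok: true, value: await attempt() }; // single retry, no loop
    } catch (second) {
      return { ok: false, error: toProxyError(503, "Gateway crashed and recovery failed", second) };
    }
  }
}
```

Passing the request as a closure lets the same wrapper serve both the HTTP and WebSocket paths, since each only differs in what `attempt` calls.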

Testing

All 4 CI variants pass 23/23: base, discord, telegram, workers-ai

@andreasjansson andreasjansson force-pushed the ajansson/fix/gateway-double-spawn-and-sync-errors branch 3 times, most recently from 6bb6430 to 119e189 Compare March 27, 2026 11:26
Metamolty and others added 11 commits March 27, 2026 17:39
When the OpenClaw gateway process starts successfully and passes the port
health check, but then crashes while handling the first request, subsequent
containerFetch/wsConnect calls throw 'is not listening' errors with no
recovery path. The user sees HTTP 500s followed by connection failures.

This adds retry-on-crash logic to both HTTP and WebSocket proxy paths:
1. Detect 'is not listening' errors from the Sandbox SDK
2. Kill the dead gateway process
3. Restart the gateway via ensureMoltbotGateway()
4. Retry the request once

Also adds proper error handling around containerFetch (previously had no
try-catch at all), returning structured JSON errors instead of unhandled
exceptions.

Fixes #179
…Gateway

Complete the crash retry implementation:
- HTTP proxy: catch 'is not listening' errors from containerFetch,
  kill crashed gateway, restart, retry once
- WebSocket proxy: same for wsConnect
- Return structured errors (503 for crash+failed recovery, 502 for other)

Extract killGateway() into gateway/process.ts as a shared function
used by both the restart handler and the crash retry logic. Removes
duplicate kill code from index.ts and api.ts.

Tested on staging: kill gateway → next HTTP request returns 200 (retry worked).

The 4 e2e variants pull the sandbox base image in parallel, frequently
hitting Docker Hub rate limits. Retry up to 5 times with 60s backoff.
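
The retry policy above (5 attempts, 60s between them) can be illustrated with a small helper. The deploy script itself may well be shell; this sketch uses TypeScript to match the rest of the codebase, and `retryOnRateLimit` with its options is a hypothetical name, not something taken from this PR.

```typescript
interface RetryOptions {
  maxAttempts?: number;
  backoffMs?: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryOnRateLimit<T>(
  run: () => Promise<T>,
  isRateLimited: (err: unknown) => boolean,
  { maxAttempts = 5, backoffMs = 60_000, sleep = defaultSleep }: RetryOptions = {},
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await run();
    } catch (err) {
      // Give up on the last attempt, or immediately for non-rate-limit errors.
      if (attempt >= maxAttempts || !isRateLimited(err)) throw err;
      await sleep(backoffMs); // fixed delay between attempts
    }
  }
}
```

Only rate-limit errors are retried, so a genuine deploy failure still surfaces on the first attempt.
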
…291)

Fixes #289, closes #291

1. Gateway double-spawn (#289): findExistingMoltbotProcess() missed
   processes invoked as 'bash /usr/local/bin/start-openclaw.sh' (full
   path with shell prefix), causing a second spawn that fails with
   'port already in use'. Fix: broaden command matching and add a TCP
   port pre-check before spawning as a safety net.

2. Clarify sandbox.start() is NOT needed (#291): Added a comment to the
   sandbox middleware explaining why we don't call sandbox.start(). The
   SDK's containerFetch() auto-starts the container, and the catch-all
   route uses ensureMoltbotGateway() for explicit lifecycle management.
   Three separate PRs (#292, #294, #315) proposed adding sandbox.start()
   based on a misunderstanding of the API.
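
The broadened command matching from item 1 might look something like the sketch below. The patterns and the `isGatewayCommand` name are assumptions for illustration; the actual matching rules in findExistingMoltbotProcess() are not shown in this PR page.

```typescript
// Hypothetical patterns: match the start script whether it is invoked bare,
// by full path, or behind a shell prefix, plus the forked gateway child.
const GATEWAY_COMMAND_PATTERNS: RegExp[] = [
  /(^|[\s/])start-openclaw\.sh(\s|$)/,
  /(^|[\s/])openclaw-gateway(\s|$)/,
];

function isGatewayCommand(command: string): boolean {
  return GATEWAY_COMMAND_PATTERNS.some((pattern) => pattern.test(command));
}
```

Anchoring on a leading slash or whitespace and a trailing whitespace or end of string keeps 'bash /usr/local/bin/start-openclaw.sh' matching while rejecting look-alikes such as 'start-openclaw.sh.bak'.
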
@andreasjansson andreasjansson force-pushed the ajansson/fix/gateway-double-spawn-and-sync-errors branch from 119e189 to fde4516 Compare March 27, 2026 23:03
When the gateway fails to start, we need to see what /api/status is
returning. Added:
- Background debug loop in _setup that polls /api/status and logs to stderr
- 15s timeout around restoreIfNeeded calls (was potentially hanging)
- Logging in /api/status handler
…quest

The catch-all route and /api/status were calling restoreIfNeeded on EVERY
request, including WebSocket reconnects from the browser. If a reconnect
happened after a sync stored a backup handle, restoreIfNeeded would mount
a FUSE overlay. The next createBackup would then reset the overlay, wiping
upper-layer files (like the marker).

Fix: check if the gateway is already running FIRST. Only call
restoreIfNeeded if the gateway needs to be started.
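
The reordering described in this commit can be sketched as follows. `LifecycleDeps` and `ensureGatewayForRequest` are hypothetical names used for illustration; the real route handlers are not shown here.

```typescript
// Hypothetical dependencies standing in for the real gateway and backup helpers.
interface LifecycleDeps {
  isGatewayRunning(): Promise<boolean>;
  restoreIfNeeded(): Promise<void>;
  startGateway(): Promise<void>;
}

async function ensureGatewayForRequest(deps: LifecycleDeps): Promise<"reused" | "started"> {
  // Check the gateway first: a running gateway means no restore, so
  // WebSocket reconnects never mount a FUSE overlay by accident.
  if (await deps.isGatewayRunning()) return "reused";
  await deps.restoreIfNeeded(); // only on the cold-start path
  await deps.startGateway();
  return "started";
}
```
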
@andreasjansson andreasjansson changed the title fix: prevent gateway double-spawn and clarify sandbox.start() fix: gateway double-spawn, sync errors, and crash retry Mar 28, 2026
On cold start, sandbox.listProcesses() can hang if the container isn't
ready yet. This causes the catch-all route to block forever, returning
an empty response to the browser (blank page). Adding a 10s timeout
so the catch-all falls through to the loading page instead of hanging.
@github-actions

E2E Test Recording (workers-ai)

✅ Tests passed

E2E Test Video

@github-actions

E2E Test Recording (discord)

✅ Tests passed

E2E Test Video

@github-actions

E2E Test Recording (telegram)

✅ Tests passed

E2E Test Video

@github-actions

E2E Test Recording (base)

✅ Tests passed

E2E Test Video

@andreasjansson andreasjansson merged commit 28de44c into main Mar 28, 2026
16 of 18 checks passed
sandcastle pushed a commit to sandcastle/openclaw-worker that referenced this pull request Jun 12, 2026
…y-double-spawn-and-sync-errors

fix: gateway double-spawn, sync errors, and crash retry

Development

Successfully merging this pull request may close these issues.

  • Gateway double-spawn when process undetected by listProcesses()
  • Gateway returns HTTP 500 errors and crashes immediately after startup