Fix #41: keep openclaw binary in sync + verify gateway is actually alive #42

Open
obaid wants to merge 1 commit into main from fix-41-openclaw-upgrade-path

obaid commented May 24, 2026

Summary

Closes #41.

  • Sync the openclaw binary to `config('provision.openclaw_version')` with `npm install -g openclaw@<pinned>` before `systemctl restart` in `AgentUpdateScriptService`, so the version-downgrade safety check in OpenClaw can't permanently fail the gateway after an upgrade run.
  • Replace the weak `openclaw health` check (which returned success even when the gateway service was in the failed state) with the same `openclaw gateway call health --timeout 5000` + `systemctl --user is-active` retry loop that `ServerSetupScriptService` already uses on first install.
  • Report `status=error` (with an explicit `error_message`) instead of `status=updated&warning=health_check_failed` when the gateway never comes back, so the existing webhook handler logs at error level and does not promote a Deploying agent to Active on a dead server.
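For reviewers who want a picture of the retry loop mentioned in the second bullet, here is a minimal sketch. It is not the code from `ServerSetupScriptService`; the attempt count, the delay, and the unit name `openclaw-gateway` are assumptions, and `gateway_probe` / `wait_for_gateway` are hypothetical helper names. Only the two commands themselves come from this PR.

```shell
# Sketch of a gateway liveness retry loop (POSIX sh).
# Assumed values: attempt count, delay, and the unit name.
GATEWAY_UNIT="${GATEWAY_UNIT:-openclaw-gateway}"
HEALTH_ATTEMPTS="${HEALTH_ATTEMPTS:-10}"
HEALTH_DELAY_SECONDS="${HEALTH_DELAY_SECONDS:-3}"

# One probe: the systemd user unit must be active AND the gateway must
# answer a health call. If either check fails, the probe fails.
gateway_probe() {
  systemctl --user is-active --quiet "$GATEWAY_UNIT" \
    && openclaw gateway call health --timeout 5000 >/dev/null 2>&1
}

# Retry the probe until it passes or the attempts run out.
# Returns 0 if the gateway came back, 1 otherwise.
wait_for_gateway() {
  _attempt=1
  while [ "$_attempt" -le "$HEALTH_ATTEMPTS" ]; do
    if gateway_probe; then
      return 0
    fi
    # Skip the pause after the final failed attempt.
    if [ "$_attempt" -lt "$HEALTH_ATTEMPTS" ]; then
      sleep "$HEALTH_DELAY_SECONDS"
    fi
    _attempt=$((_attempt + 1))
  done
  return 1
}
```

Requiring both checks in one probe is the point: a process that is up but not answering, or a CLI that answers while the unit is failed, both count as "not alive".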

Verified manually on prod: reproduced the failure mode (see the #41 comment thread), then recovered with the same `npm install -g openclaw@<pinned>` + restart that this PR now bakes into every agent update.
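The recovery sequence above boils down to "install the pinned binary, then restart". A rough sketch of that ordering follows. `PINNED_VERSION` stands in for `config('provision.openclaw_version')` and `1.2.3` is a placeholder, not the real pin. The `--user` scope on the restart, the unit name, and the choice to skip the restart when the install fails are assumptions of this sketch, not something the PR states.

```shell
# Sketch of the "sync binary, then restart" ordering (POSIX sh).
# PINNED_VERSION is a placeholder for config('provision.openclaw_version').
PINNED_VERSION="${PINNED_VERSION:-1.2.3}"
GATEWAY_UNIT="${GATEWAY_UNIT:-openclaw-gateway}"

# sync_and_restart_gateway is a hypothetical helper name.
sync_and_restart_gateway() {
  # Install the pinned version first so the binary matches the on-disk config.
  if ! npm install -g "openclaw@${PINNED_VERSION}"; then
    echo "openclaw install failed; not restarting the gateway" >&2
    return 1
  fi
  # Restart only once the pinned binary is in place.
  systemctl --user restart "$GATEWAY_UNIT"
}
```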

Test plan

  • `php artisan test --compact --filter=AgentUpdateScriptTest` (17 passed)
  • `php artisan test --compact tests/Feature/Api/` (53 passed)
  • `vendor/bin/pint --dirty --format agent` (clean)
  • Smoke-deploy to staging and trigger an agent update; confirm the gateway stays healthy and the callback path reports `status=updated`.
  • Force a mismatched binary on a sandbox server, trigger an update, and confirm the script upgrades the binary first and the gateway comes back active.

The agent update script restarted openclaw-gateway without checking that
the on-disk binary still matches the version Provision had written config
for. If a previous run advanced the on-disk config (e.g. via
`openclaw gateway install --force` from a newer CLI invocation) while the
installed binary stayed older, restart exited with EX_CONFIG (78) and the
gateway stayed dead. The script's health check used `openclaw health`,
which still reported success, so the callback reported `status=updated`
and the agent stayed labeled `active` even though no messages could be
processed for any agent on the server.

- Sync the openclaw binary to config('provision.openclaw_version') with
  `npm install -g openclaw@<pinned>` before `systemctl restart`.
- Replace the weak `openclaw health` check with the same
  `openclaw gateway call health --timeout 5000` + `systemctl is-active`
  retry loop that ServerSetupScriptService already uses on first install.
- Report `status=error` (with an explicit error_message) instead of
  `status=updated&warning=health_check_failed` when the gateway never
  comes back, so the existing webhook handler logs an error and does not
  promote a Deploying agent to Active on a dead server.
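
To illustrate the last point, here is a small sketch of how the script could pick the callback fields from the health result. The field names and the `updated` / `error` values are taken from this PR. The helper name `callback_fields` and the message text are made up for illustration, and the callback URL is left out because the PR does not show it.

```shell
# Sketch of choosing the callback payload (POSIX sh).
# The error_message text below is illustrative only.
callback_fields() {
  # $1 is "0" when the gateway passed its health check, anything else otherwise.
  if [ "$1" = "0" ]; then
    printf 'status=updated\n'
  else
    printf 'status=error\n'
    printf 'error_message=%s\n' "gateway did not become healthy after restart"
  fi
}
```

Each printed line could then be passed to curl with `--data-urlencode`, which keeps spaces in the message from breaking the request.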

Development

Successfully merging this pull request may close these issues.

In-app chat: agent shows typing indicator indefinitely, never replies
