Fix #41: keep openclaw binary in sync + verify gateway is actually alive by obaid · Pull Request #42 · provision-org/provision-core

obaid · 2026-05-24T07:02:43Z

Summary

Closes #41.

- Sync the openclaw binary to `config('provision.openclaw_version')` with `npm install -g openclaw@<pinned>` before `systemctl restart` in `AgentUpdateScriptService`, so the version-downgrade safety check in OpenClaw can't permanently fail the gateway after an upgrade run.
- Replace the weak `openclaw health` check (which returned success even when the gateway service was failed) with the same `openclaw gateway call health --timeout 5000` + `systemctl --user is-active` retry loop that `ServerSetupScriptService` already uses on first install.
- Report `status=error` (with an explicit `error_message`) instead of `status=updated&warning=health_check_failed` when the gateway never comes back, so the existing webhook handler logs at error level and does not promote a Deploying agent to Active on a dead server.

Verified manually on prod by reproducing the failure mode (#41 comment thread), then manually recovering with the same npm install -g openclaw@<pinned> + restart that this PR now bakes into every agent update.
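
For reviewers who want the intended order of operations at a glance, here is a minimal bash sketch of the sync, restart, and verify sequence described above. It is not the script that `AgentUpdateScriptService` actually generates: the function names, the `OPENCLAW_VERSION`, `MAX_ATTEMPTS`, and `SLEEP_SECONDS` variables, the retry count, the sleep interval, and the `error_message` values are all hypothetical. It also assumes that `openclaw gateway call health` exits non-zero when the gateway is unreachable.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; every name and default below is an assumption.

OPENCLAW_VERSION="${OPENCLAW_VERSION:-}"  # assumed to be rendered from config('provision.openclaw_version')
MAX_ATTEMPTS="${MAX_ATTEMPTS:-10}"        # hypothetical retry budget
SLEEP_SECONDS="${SLEEP_SECONDS:-3}"       # hypothetical pause between attempts

# Install the pinned binary so it matches the config written for it.
sync_openclaw_binary() {
  [ -n "$OPENCLAW_VERSION" ] || return 1
  npm install -g "openclaw@${OPENCLAW_VERSION}"
}

# The gateway counts as alive only if the unit is active AND it answers a health call.
gateway_is_alive() {
  systemctl --user is-active --quiet openclaw-gateway \
    && openclaw gateway call health --timeout 5000 >/dev/null 2>&1
}

# Poll until the gateway is alive or the retry budget is spent.
wait_for_gateway() {
  local attempt
  for ((attempt = 1; attempt <= MAX_ATTEMPTS; attempt++)); do
    if gateway_is_alive; then
      return 0
    fi
    sleep "$SLEEP_SECONDS"
  done
  return 1
}

# Sync first, then restart, then verify; print the callback status to report.
update_gateway() {
  if ! sync_openclaw_binary >/dev/null 2>&1; then
    echo "status=error&error_message=binary_sync_failed"
    return 1
  fi
  # A failed restart is not fatal here; the health loop below decides the outcome.
  systemctl --user restart openclaw-gateway >/dev/null 2>&1 || true
  if wait_for_gateway; then
    echo "status=updated"
  else
    echo "status=error&error_message=gateway_not_healthy"
    return 1
  fi
}
```

The point of the sketch is that `status=updated` is only reachable after both the service-manager check and the gateway health call succeed, which is the property the old `openclaw health` check did not guarantee.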

Test plan

- `php artisan test --compact --filter=AgentUpdateScriptTest` (17 passed)
- `php artisan test --compact tests/Feature/Api/` (53 passed)
- `vendor/bin/pint --dirty --format agent` (clean)
- Smoke-deploy to staging and trigger an agent update; confirm the gateway stays healthy and the callback path reports `status=updated`.
- Force a mismatched binary on a sandbox server, trigger an update, and confirm the script upgrades the binary first and the gateway comes back active.

The agent update script restarted openclaw-gateway without checking that the on-disk binary still matches the version Provision had written config for. If a previous run advanced the on-disk config (e.g. via `openclaw gateway install --force` from a newer CLI invocation) while the installed binary stayed older, restart exited with EX_CONFIG (78) and the gateway stayed dead. The script's health check used `openclaw health`, which still reported success, so the callback reported `status=updated` and the agent stayed labeled `active` even though no messages could be processed for any agent on the server.

- Sync the openclaw binary to config('provision.openclaw_version') with `npm install -g openclaw@<pinned>` before `systemctl restart`.
- Replace the weak `openclaw health` check with the same `openclaw gateway call health --timeout 5000` + `systemctl is-active` retry loop that ServerSetupScriptService already uses on first install.
- Report `status=error` (with an explicit error_message) instead of `status=updated&warning=health_check_failed` when the gateway never comes back, so the existing webhook handler logs an error and does not promote a Deploying agent to Active on a dead server.

obaid wants to merge 1 commit into `main` from `fix-41-openclaw-upgrade-path`
