fix(sell): add `obol sell resume` so offers recover after a host reboot by bussyjd · Pull Request #619 · ObolNetwork/obol-stack

bussyjd · 2026-06-10T15:04:55Z

Problem

After a host reboot, the public seller catalog stays empty even though the stack looks healthy: storefront serves 200, controller/verifier/catalog pods are all Running, but /api/services.json returns [].

Root cause (diagnosed live on the rc14 prod seller after a reboot):

1. Docker's restart policy resurrects the k3d node containers at boot, so the cluster comes back without anyone running `obol stack up`.
2. The sell-resume path (resumeSellOffers → re-apply ServiceOffer → startDetachedInferenceGateway) is only invoked from the `stack up` action (cmd/obol/main.go), so it never fires.
3. The host-side x402 gateway started by `obol sell inference` died with the host. The ServiceOffer survives in etcd but sits at:
```
UpstreamHealthy=False: Get "http://<name>.llm.svc.cluster.local:8402/health": connect: connection refused
Ready=False
```
→ excluded from the catalog → [] → buyers see nothing.
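
To make the failure mode concrete, here is a minimal sketch of a catalog that lists only Ready offers. The `Offer` type and `catalogJSON` function are hypothetical illustrations, not code from obol-stack; the real catalog service may be built differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Offer is a hypothetical, trimmed-down view of a ServiceOffer's status.
type Offer struct {
	Name            string `json:"name"`
	UpstreamHealthy bool   `json:"-"`
	Ready           bool   `json:"-"`
}

// catalogJSON keeps only Ready offers and renders them as a JSON array.
func catalogJSON(offers []Offer) string {
	// Non-nil slice, so "no offers" encodes as [] rather than null.
	listed := make([]Offer, 0, len(offers))
	for _, o := range offers {
		if o.Ready {
			listed = append(listed, o)
		}
	}
	out, err := json.Marshal(listed)
	if err != nil {
		return "[]"
	}
	return string(out)
}

func main() {
	// After the reboot the gateway is gone, so the only offer is not Ready.
	afterReboot := []Offer{{Name: "qwen", UpstreamHealthy: false, Ready: false}}
	fmt.Println(catalogJSON(afterReboot)) // prints []
}
```

Every pod can be Running while this filter still produces an empty list, which matches the "healthy stack, empty catalog" symptom.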

Fix

- `obol sell resume`: exposes the existing resume path as a first-class subcommand so a reboot no longer requires a full `stack up` to recover. Idempotent: gateways with a live PID on disk are skipped; the kubectl applies re-assert existing objects.
- `obol sell resume --install-boot-unit` (Linux/systemd, opt-in): writes and enables a systemd user unit (obol-sell-resume.service, WantedBy=default.target, pre-start sleep for the API server) so resume runs automatically at boot. Prints the `loginctl enable-linger` hint required for boot-without-login. Non-Linux returns actionable guidance instead.
- Refreshes the stale resumeSellOffers doc comment (it still claimed gateways are not restarted; startDetachedInferenceGateway has done so since the resume feature shipped).

No behavior change to obol stack up — it keeps calling the same resume path.
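
The "live PID on disk" idempotency guard can be sketched as follows. This is an assumption about how such a check might look on a Unix host, not the PR's code: `parsePID`, `processAlive`, and `shouldSkipGateway` are hypothetical names, and the PID file layout is not taken from the source.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// parsePID reads the decimal PID stored in a PID file's contents.
func parsePID(contents string) (int, error) {
	pid, err := strconv.Atoi(strings.TrimSpace(contents))
	if err != nil {
		return 0, err
	}
	if pid <= 0 {
		return 0, errors.New("pid must be positive")
	}
	return pid, nil
}

// processAlive reports whether a process with this PID exists (Unix only).
// Signal 0 performs the existence check without delivering a signal.
func processAlive(pid int) bool {
	if pid <= 0 {
		return false
	}
	proc, err := os.FindProcess(pid)
	if err != nil {
		return false
	}
	err = proc.Signal(syscall.Signal(0))
	// EPERM means the process exists but belongs to another user.
	return err == nil || errors.Is(err, syscall.EPERM)
}

// shouldSkipGateway is the idempotency guard: a gateway whose recorded PID
// is still alive is left alone; anything else gets relaunched.
func shouldSkipGateway(pidFileContents string) bool {
	pid, err := parsePID(pidFileContents)
	return err == nil && processAlive(pid)
}

func main() {
	self := strconv.Itoa(os.Getpid())
	fmt.Println(shouldSkipGateway(self))        // this process is alive
	fmt.Println(shouldSkipGateway("not-a-pid")) // unreadable PID: relaunch
}
```

A bare PID check cannot tell a surviving gateway from an unrelated process that reused the PID after a reboot, so a production guard may want an extra identity check.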

Tests

- TestSellResumeCommand_Registered: registration + --install-boot-unit guard (mirrors the existing TestStackUpAction_CallsResumeSellOffers source-guard pattern).
- TestRenderResumeBootUnit: pins ExecStart/Environment/WantedBy/pre-start sleep of the emitted unit.
- TestInstallResumeBootUnit_PlatformBehavior: non-Linux refuses with guidance; Linux writes the unit under $HOME even when systemctl is unavailable (enable steps are best-effort warnings).

go build ./... and go test ./cmd/obol -count=1 green. Branched off latest main (tree-identical to v0.10.0-rc14).

Follow-up (out of scope)

The long-term shape noted in the existing code comment stands: package the inference gateway as an in-cluster Deployment so the cluster supervises it like Traefik/LiteLLM and the host-side PID plumbing disappears.

Coverage: all ServiceOffer types (review follow-up)

Per review feedback, c7ca58f generalizes resume beyond inference+http: the sell-http/ store becomes the persisted-ServiceOffer ledger (dir name kept for already-shipped files), and every offer surface persists into it — sell agent (both creation sites), the agent-backed demo (offer only; replaying an Agent CR would mint a fresh wallet), and the legacy demo as a v1 List bundle (namespace + backend + offer) so resume restores a working demo. sell mcp has no ServiceOffer and is documented as not resumed. Lifecycle holes closed: agent delete drops the agent's ledger entries; sell update refreshes the ledger from the live CR so resume can't revert a payTo/price change. resumeSellOffers simplifies to one guard + two phases with a pure, tested loader.

After a host reboot Docker's restart policy brings the k3d cluster back without a `stack up`, so resumeSellOffers (previously only reachable from the stack-up action) never runs: persisted sell-inference offers survive in etcd but their host gateways are gone, every offer sits at UpstreamHealthy=False, and the public catalog (/api/services.json) serves []. Observed live on the rc14 prod seller after a reboot.

- expose the existing resume path as `obol sell resume` (idempotent: live-PID gateways are skipped, kubectl applies re-assert)
- `--install-boot-unit` writes + enables a systemd user unit on Linux so resume runs automatically at boot (lingering hint printed)
- refresh the stale resumeSellOffers doc comment (it claimed gateways are not restarted; startDetachedInferenceGateway has done so since the resume feature shipped)

Live reboot test on the seller box surfaced two ways the relaunched gateway dies instantly, leaving the offer at UpstreamHealthy=False:

1. Binary skew: startDetachedInferenceGateway preferred the installed BinDir obol over the binary running `sell resume`. The arg-builder encodes the running version's flag surface, so an older installed CLI (rc11 predates the --description spelling) rejects the args with "flag provided but not defined". Relaunch now spawns the running executable (resumeGatewayBinary), and the arg-builder emits --register-description (the one spelling every released CLI parses) as belt-and-braces for the BinDir fallback.
2. Validation drift: the slash-in-model rule (added after existing descriptors were persisted) rejected the replayed model name, so pre-rule offers could never resume under a new binary either. The spawned gateway now carries OBOL_SELL_RESUME_REPLAY=1 and the rule downgrades to a warning for replays; new offers still hard-fail.

Tests: TestResumeGatewayBinaryPrefersRunningExecutable, TestResumeGatewayEnviron, and TestValidateSellInferenceModelName; TestBuildResumeGatewayArgs now pins --register-description and bans the rc12+ spelling.
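
The binary-skew fix boils down to preferring the executable that is already running over an installed copy. A minimal sketch, assuming a hypothetical helper `pickGatewayBinary` (the PR's actual function is resumeGatewayBinary, whose signature is not shown here) and a placeholder install path:

```go
package main

import (
	"fmt"
	"os"
)

// pickGatewayBinary prefers the executable that is currently running and only
// falls back to the installed copy when that path cannot be determined.
// lookup is injected so the choice can be tested; os.Executable fits it.
func pickGatewayBinary(lookup func() (string, error), installed string) string {
	if path, err := lookup(); err == nil && path != "" {
		return path
	}
	return installed
}

func main() {
	// "/usr/local/bin/obol" is a placeholder for BinDir/obol; the real
	// location is not given in the PR.
	fmt.Println(pickGatewayBinary(os.Executable, "/usr/local/bin/obol"))
}
```

Because the running binary built the argument list, spawning that same binary guarantees the child understands every flag it is handed.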

bussyjd · 2026-06-10T17:17:12Z

Live reboot validation on the seller box found two bugs in the resume path as originally pushed — both fixed in 29426c0:

1. Binary skew: startDetachedInferenceGateway preferred the installed BinDir/obol (rc11) over the binary running `sell resume`. The arg-builder emits the current flag surface (--description, renamed in rc12), so the relaunched gateway died with `flag provided but not defined: -description`. Fix: spawn the running executable; also emit --register-description (parsed by every released CLI) for the BinDir fallback.
2. Validation drift: with a current binary doing the relaunch, the slash-in-model rule (newer than the persisted descriptor) rejected --model AEON-7/Qwen3.6-…, so pre-rule offers could never resume at all. Fix: spawned gateway carries OBOL_SELL_RESUME_REPLAY=1; the rule downgrades to a warning for replays, still hard-fails new offers.
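
The replay downgrade for the slash rule might look roughly like this. Only the environment variable name OBOL_SELL_RESUME_REPLAY comes from the PR; `validateModelName`, its return shape, and the error text are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// replayEnv is the variable the spawned gateway carries during a resume.
const replayEnv = "OBOL_SELL_RESUME_REPLAY"

var errSlashInModel = errors.New("model name must not contain '/'")

// validateModelName applies a slash-in-model rule. For replays of persisted
// offers the violation is returned as a warning; for new offers it is an error.
func validateModelName(model string, replay bool) (warning string, err error) {
	if !strings.Contains(model, "/") {
		return "", nil
	}
	if replay {
		return "replaying pre-rule offer: " + errSlashInModel.Error(), nil
	}
	return "", errSlashInModel
}

func main() {
	replay := os.Getenv(replayEnv) == "1"
	warning, err := validateModelName("org/model", replay)
	fmt.Println("warning:", warning, "error:", err)
}
```

Keeping the rule strict for new offers while tolerating persisted ones means a validation change never strands offers that were valid when they were created.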

Repro sequence: reboot host → cluster auto-recovers via Docker restart policy, catalog goes [], route 404s → sell resume re-applied manifests but the gateway died instantly both ways above. Re-running the full unattended reboot test with the fixed binary + boot unit next.

Two more gaps found prepping the unattended reboot test:

- installResumeBootUnit pinned BinDir/obol into ExecStart. On a box whose installed CLI predates `sell resume` (rc11 on the live seller), the unit fails on every boot. Pin the binary running the install command instead (it just proved it has the subcommand) and print the pinned path so the operator knows to re-install after moving it.
- At boot the k3d API server lags Docker by a minute or more, and resumeOneInferenceOffer's kubectl applies run BEFORE the gateway relaunch and warn-and-continue, so a too-early resume silently resumed nothing and nothing retried. `sell resume` now waits for /readyz (3min cap) before replaying offers; `stack up` is unaffected.

Tests: TestWaitForClusterAPI (nil fast-path without kubeconfig, error after deadline for unreachable cluster).
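
The readiness wait is a plain poll-until-deadline loop. The sketch below assumes a hypothetical `waitReady` helper with an injected probe; a real probe would issue a GET against the API server's /readyz endpoint, which is replaced here by a stand-in so the example runs without a cluster.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errNotReady = errors.New("cluster API not ready")

// waitReady polls probe every interval until it succeeds or limit elapses.
// It gives up early when the next attempt would land past the deadline.
func waitReady(limit, interval time.Duration, probe func() error) error {
	deadline := time.Now().Add(limit)
	for {
		err := probe()
		if err == nil {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return fmt.Errorf("gave up after %s: %w", limit, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	attempts := 0
	// Stand-in probe that becomes ready on the third attempt.
	probe := func() error {
		attempts++
		if attempts < 3 {
			return errNotReady
		}
		return nil
	}
	err := waitReady(3*time.Minute, 10*time.Millisecond, probe)
	fmt.Println(attempts, err) // prints 3 <nil>
}
```

Blocking before the first apply, rather than retrying each apply, keeps the warn-and-continue behavior of the replay steps unchanged.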

OisinKyne · 2026-06-10T17:21:43Z

will we need to make this more general than only for selling? some part of me wonders should this be like obol stack resume and/or we just have obol stack up or a version of it set to trigger for a reboot? like if you're away from your node your agent may not be able to use this to fix itself

Third live-reboot finding: the unit ran at boot, resume reported "Gateway started in background", and the gateway was dead anyway with an empty log. setsid detaches the session but NOT the cgroup — the relaunched gateway lives in the unit's cgroup, and when a plain Type=oneshot unit deactivates, systemd kills every process left in it. RemainAfterExit=yes keeps the unit active (exited) after ExecStart, which preserves the cgroup and the gateway with it. Side effect worth having: `systemctl --user stop obol-sell-resume` is now a deliberate way to take the gateways down. TestRenderResumeBootUnit pins the new directive.
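
For reference, a unit with the directives discussed so far could be rendered along these lines. This is a sketch, not the output of installResumeBootUnit: the `renderBootUnit` name, the Description text, the use of ExecStartPre with /bin/sleep for the pre-start sleep, and the example binary path are assumptions, and the Environment line the real unit pins is omitted because its contents are not given in the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// renderBootUnit sketches a systemd user unit that runs `sell resume` once at
// boot. binary is the absolute path pinned into ExecStart.
func renderBootUnit(binary string) string {
	lines := []string{
		"[Unit]",
		"Description=Resume obol sell offers after boot",
		"",
		"[Service]",
		"Type=oneshot",
		// Keep the unit active (exited) so its cgroup, and the gateway
		// spawned inside it, outlives ExecStart.
		"RemainAfterExit=yes",
		"ExecStartPre=/bin/sleep 30",
		"ExecStart=" + binary + " sell resume",
		"",
		"[Install]",
		"WantedBy=default.target",
		"",
	}
	return strings.Join(lines, "\n")
}

func main() {
	// Placeholder path; the installer pins whichever binary ran the command.
	fmt.Print(renderBootUnit("/home/seller/bin/obol"))
}
```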

bussyjd · 2026-06-10T17:52:11Z

Live unattended reboot test: PASSED at cc7ef06.

Timeline (zero intervention after shutdown -r): SSH back in 38s → boot unit fires (sleep 30 → /readyz wait passes immediately) → resume re-applies manifests + relaunches gateway → offer Ready=True at t+68s. Verified from the public internet: catalog lists the offer, unpaid probe returns 402 with x402 requirements, ERC-8004 discovery 200. Gateway is running inside the unit's cgroup with the unit active (exited).

Three additional fixes were required beyond the original two commits, all found by this live test:

| Commit | Bug | Fix |
| --- | --- | --- |
| `29426c0` | Relaunch spawned the installed (older) obol → `flag provided but not defined: -description`; new binary rejected the persisted slashed model name | Spawn the running executable; emit `--register-description` (parsed by every released CLI); `OBOL_SELL_RESUME_REPLAY=1` downgrades the slash rule to a warning for replays |
| `080a346` | Boot unit pinned `BinDir/obol` (rc11 has no `sell resume`); boot race: kubectl applies run before the cluster API is up, warn-and-continue, nothing retries | Pin the installing binary into ExecStart; `sell resume` waits up to 3min for `/readyz` |
| `cc7ef06` | systemd killed the "detached" gateway when the oneshot unit finished (setsid doesn't escape the cgroup): unit reported success, gateway dead, log empty | `RemainAfterExit=yes` keeps the cgroup (and gateway) alive; `systemctl --user stop` becomes a deliberate gateway kill |

Each has a pinned regression test (TestResumeGatewayBinaryPrefersRunningExecutable, TestResumeGatewayEnviron, TestValidateSellInferenceModelName, TestWaitForClusterAPI, TestRenderResumeBootUnit). go build ./... + go test ./cmd/obol -count=1 green.

bussyjd · 2026-06-10T18:15:49Z

> will we need to make this more general than only for selling? … should this be like obol stack resume?

Generalized in c7ca58f — sell resume now covers every ServiceOffer type, not just inference:

| Surface | Persisted | Resume behavior |
| --- | --- | --- |
| `sell inference` | own store (`inference/`) | re-apply cluster artifacts + relaunch detached host gateway |
| `sell http` | manifest ledger | re-apply |
| `sell agent` | manifest ledger (both creation sites) | re-apply; offer waits on its Agent CR with a clear controller condition if the agent is gone |
| `sell demo` (legacy) | ledger, as a `v1 List` bundle (ns + backend Deployment + Service + offer) | re-apply → working demo, not an offer with a dead upstream |
| `sell demo quant` (agent-backed) | ledger, offer only | deliberate: replaying an Agent CR mints a fresh wallet and orphans funds on the old one |
| `sell mcp` | n/a | no ServiceOffer exists (foreground server); documented as not resumed |

Two lifecycle holes the coverage matrix surfaced, both fixed: agent delete now drops the agent's ledger entries (else resume replays ghost offers forever), and sell update refreshes the ledger from the live post-patch CR (else the next resume kubectl-applies the old payTo/price back — silently reverting an intentional payment change).
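
The revert hazard behind the `sell update` fix can be reproduced with a toy model. Everything here is hypothetical (the `Stack` and `Offer` types, the in-memory maps standing in for etcd and the on-disk ledger); it only illustrates why the ledger has to be refreshed from the post-patch object.

```go
package main

import "fmt"

// Offer is a hypothetical stand-in for the payment fields of a ServiceOffer.
type Offer struct {
	PayTo string
	Price string
}

// Stack models the two places an offer lives: the cluster (live CRs) and the
// ledger that resume replays.
type Stack struct {
	Live   map[string]Offer
	Ledger map[string]Offer
}

// Update patches the live offer and, when refresh is true, copies the
// post-patch object back into the ledger.
func (s *Stack) Update(name string, patch func(*Offer), refresh bool) {
	o := s.Live[name]
	patch(&o)
	s.Live[name] = o
	if refresh {
		s.Ledger[name] = o
	}
}

// Resume re-applies every ledger entry over the live state, like an apply.
func (s *Stack) Resume() {
	for name, o := range s.Ledger {
		s.Live[name] = o
	}
}

func newStack() *Stack {
	old := Offer{PayTo: "0xOLD", Price: "1"}
	return &Stack{
		Live:   map[string]Offer{"demo": old},
		Ledger: map[string]Offer{"demo": old},
	}
}

func main() {
	reprice := func(o *Offer) { o.Price = "2" }

	stale := newStack()
	stale.Update("demo", reprice, false)
	stale.Resume()
	fmt.Println("without refresh:", stale.Live["demo"].Price) // prints without refresh: 1

	fresh := newStack()
	fresh.Update("demo", reprice, true)
	fresh.Resume()
	fmt.Println("with refresh:", fresh.Live["demo"].Price) // prints with refresh: 2
}
```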

On stack resume vs sell resume: obol stack up already runs this exact resume path, so the "recover everything" verb exists; what a reboot uniquely kills is host-side sell processes, which is what this command (plus the --install-boot-unit systemd unit, for the away-from-node case you raised) targets. If we later grow more host-side state worth resuming, promoting this to stack resume is a rename, not a redesign.

bussyjd · 2026-06-10T18:40:51Z

Adversarial review of c7ca58f (12 agents, findings reproduced against a live cluster) confirmed three majors in the generalization — all fixed in 9b1c146:

- agent delete swept ledger entries for offers still alive in etcd. Nothing actually deletes the agent's ServiceOffers (the finalizer leaves the namespace + offers), and a survivor reconciles back to Ready, paying the deleted agent's wallet, if the name is ever reused. deleteCRDAgent now deletes the namespace's offers in-cluster, and the ledger sweep only runs when the cluster was reachable.
- sell stop didn't refresh the ledger, so an etcd-wiping stack down/up replayed the pre-drain manifest and resurrected a deliberately stopped offer fully live. The drain patch now refreshes like sell update.
- Agent-offer replay failed after stack recreation (namespaces "agent-<name>" not found), so the documented "offer waits on the missing agent" never happened. Agent offers now persist as a v1 List bundling the agent namespace (canonical labels) with the offer; the Agent CR stays excluded so no fresh wallet is ever minted by a replay.
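
The ordering in the first fix (delete in-cluster first, sweep the ledger only if that succeeded) can be sketched like this. `deleteAgent`, the "namespace/name" ledger keys, and the error handling are hypothetical; the real deleteCRDAgent may report an unreachable cluster differently.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errUnreachable = errors.New("cluster unreachable")

// deleteAgent removes an agent's offers from the cluster first and only then
// sweeps the matching ledger entries (keyed "namespace/name"). If the cluster
// call fails, the ledger is left untouched because the CRs may still exist.
func deleteAgent(namespace string, deleteOffers func(ns string) error, ledger map[string]string) error {
	if err := deleteOffers(namespace); err != nil {
		return fmt.Errorf("offers in %s not deleted, ledger kept: %w", namespace, err)
	}
	for key := range ledger {
		if strings.HasPrefix(key, namespace+"/") {
			delete(ledger, key)
		}
	}
	return nil
}

func main() {
	ledger := map[string]string{"agent-a/offer": "manifest", "agent-b/offer": "manifest"}
	down := func(string) error { return errUnreachable }
	up := func(string) error { return nil }

	fmt.Println(deleteAgent("agent-a", down, ledger), len(ledger))
	fmt.Println(deleteAgent("agent-a", up, ledger), len(ledger))
}
```

With the cluster down, both ledger entries survive; once the in-cluster delete succeeds, only the other agent's entry remains.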

Also closed the inference half of delete⇒no-resume: sell delete tombstones the inference descriptor (DeletedAt, kept for list/status history, cleared by re-creating the offer) so resume can't relaunch a deleted offer's gateway. Full go test ./... green.
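
A tombstone filter of this kind is small. The sketch below uses hypothetical names (`Descriptor`, `tombstone`, `recreate`, `active`); the PR's own filter is activeInferenceDeployments, and only the DeletedAt field name is taken from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// Descriptor is a hypothetical, reduced inference descriptor.
type Descriptor struct {
	Name      string
	DeletedAt *time.Time // nil while the offer is live; set by delete
}

// tombstone marks a descriptor deleted while keeping it for history.
func tombstone(d *Descriptor) {
	now := time.Now().UTC()
	d.DeletedAt = &now
}

// recreate clears the tombstone, as re-creating the offer would.
func recreate(d *Descriptor) { d.DeletedAt = nil }

// active returns the descriptors that resume may relaunch.
func active(all []Descriptor) []Descriptor {
	var out []Descriptor
	for _, d := range all {
		if d.DeletedAt == nil {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	descs := []Descriptor{{Name: "qwen"}, {Name: "llama"}}
	tombstone(&descs[1])
	for _, d := range active(descs) {
		fmt.Println("resume:", d.Name) // prints resume: qwen
	}
}
```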

…edger

Generalizes the resume path per review feedback: the sell-http store becomes the persisted-ServiceOffer ledger (dir name kept for files written by shipped CLIs) and every offer type without a host process persists into it:

- sell http
- sell agent (both creation sites)
- the agent-backed demo (offer only: replaying an Agent CR would mint a fresh wallet and orphan funds on the old one)
- the legacy demo as a v1 List bundle (namespace + backend Deployment + Service + offer), so resume restores a working demo rather than an offer with a missing upstream

sell mcp has no ServiceOffer (foreground server) and is documented as not resumed.

resumeSellOffers simplifies to one kubeconfig guard + two phases: inference store (cluster artifacts + detached gateway relaunch), then a single ledger walk with type-aware messaging via a pure, tested loader.

Two lifecycle holes closed while building the coverage matrix:

- obol agent delete now drops the agent namespace's ledger entries; otherwise every later resume replays ghost offers for a deleted agent.
- obol sell update now refreshes the ledger from the live post-patch CR (List-bundle aware); otherwise the next resume kubectl-applies the OLD payTo/price back, silently reverting an intentional payment change. Update also adopts offers created outside the CLI.

New tests: mixed-type ledger walk, demo List-bundle parsing, namespace-scoped removal round-trip, and source-scope guards on every persist/refresh/cleanup site.

Three confirmed majors, all reproduced against a live cluster:

- agent delete left the agent's ServiceOffers ALIVE in etcd (the agent finalizer tears down children but leaves the namespace and offers) while sweeping their ledger entries, and a surviving offer reconciles back to Ready, paying the deleted agent's wallet, if the name is ever reused. deleteCRDAgent now deletes the namespace's ServiceOffer CRs (the offer finalizer handles route/registration teardown) and the ledger sweep moves into the cluster-reachable branch, since an unreachable cluster means the CRs survive and the ledger must keep covering them.
- sell stop never refreshed the ledger, so an etcd-wiping stack-down/up replayed the pre-drain manifest and resurrected a deliberately stopped offer fully live (reboot-resume was already safe: client-side apply never owned drainAt). The drain patch now refreshes the ledger like sell update does.
- agent-offer replay failed outright after a stack recreation: the bare manifest's namespace no longer exists and kubectl apply errors with 'namespaces not found', so the documented 'offer waits on the missing agent' behavior never happened. Both agent persist sites now store a v1 List bundling the agent NAMESPACE (canonical labels) with the offer; the Agent CR stays excluded so no fresh wallet is minted.

Plus the inference half of delete=>no-resume (minor): sell delete now tombstones the inference descriptor (DeletedAt; kept for list/status history, cleared by re-creating the offer) and resume filters tombstoned descriptors via activeInferenceDeployments.

Misleading sell-http wording in delete output and the false 'CRs died with the namespace' comments corrected; new tests pin every behavior (bundle round-trip incl. no-Agent-CR assertion, stop/delete scope guards, tombstone filter, branch placement of the agent-delete sweep).

bussyjd added 2 commits June 10, 2026 19:04

bussyjd requested a review from OisinKyne June 10, 2026 18:15

bussyjd added 2 commits June 11, 2026 07:45

bussyjd force-pushed the fix/sell-resume-after-reboot branch from 9b1c146 to ffab6cf Compare June 11, 2026 03:46

bussyjd mentioned this pull request Jun 11, 2026

release: v0.10.0-rc15 train — auto-repin CI + sell resume + flow gates + paid-MCP smoke #623

Closed

OisinKyne mentioned this pull request Jun 11, 2026

feat: import/export persistence clean up #624

Merged

OisinKyne closed this Jun 11, 2026

bussyjd wants to merge 6 commits into main from fix/sell-resume-after-reboot
