fix(sell): add obol sell resume so offers recover after a host reboot #619

Closed
bussyjd wants to merge 6 commits into main from fix/sell-resume-after-reboot

bussyjd commented Jun 10, 2026

Problem

After a host reboot, the public seller catalog stays empty even though the stack looks healthy: storefront serves 200, controller/verifier/catalog pods are all Running, but /api/services.json returns [].

Root cause (diagnosed live on the rc14 prod seller after a reboot):

  1. Docker's restart policy resurrects the k3d node containers at boot, so the cluster comes back without anyone running obol stack up.
  2. The sell-resume path (resumeSellOffers → re-apply ServiceOffer → startDetachedInferenceGateway) is only invoked from the stack up action (cmd/obol/main.go), so it never fires.
  3. The host-side x402 gateway started by obol sell inference died with the host. The ServiceOffer survives in etcd but sits at:
    UpstreamHealthy=False: Get "http://<name>.llm.svc.cluster.local:8402/health": connect: connection refused
    Ready=False
    
    → excluded from the catalog → [] → buyers see nothing.

Fix

  • obol sell resume — exposes the existing resume path as a first-class subcommand so a reboot no longer requires a full stack up to recover. Idempotent: gateways with a live PID on disk are skipped; the kubectl applies re-assert existing objects.
  • obol sell resume --install-boot-unit (Linux/systemd, opt-in) — writes + enables a systemd user unit (obol-sell-resume.service, WantedBy=default.target, pre-start sleep for the API server) so resume runs automatically at boot. Prints the loginctl enable-linger hint required for boot-without-login. Non-Linux returns actionable guidance instead.
  • Refreshes the stale resumeSellOffers doc comment (it still claimed gateways are not restarted; startDetachedInferenceGateway has done so since the resume feature shipped).

No behavior change to obol stack up — it keeps calling the same resume path.

Tests

  • TestSellResumeCommand_Registered — registration + --install-boot-unit guard (mirrors the existing TestStackUpAction_CallsResumeSellOffers source-guard pattern).
  • TestRenderResumeBootUnit — pins ExecStart/Environment/WantedBy/pre-start sleep of the emitted unit.
  • TestInstallResumeBootUnit_PlatformBehavior — non-Linux refuses with guidance; Linux writes the unit under $HOME even when systemctl is unavailable (enable steps are best-effort warnings).

go build ./... and go test ./cmd/obol -count=1 green. Branched off latest main (tree-identical to v0.10.0-rc14).

Follow-up (out of scope)

The long-term shape noted in the existing code comment stands: package the inference gateway as an in-cluster Deployment so the cluster supervises it like Traefik/LiteLLM and the host-side PID plumbing disappears.

Coverage: all ServiceOffer types (review follow-up)

Per review feedback, c7ca58f generalizes resume beyond inference+http: the sell-http/ store becomes the persisted-ServiceOffer ledger (dir name kept for already-shipped files), and every offer surface persists into it — sell agent (both creation sites), the agent-backed demo (offer only; replaying an Agent CR would mint a fresh wallet), and the legacy demo as a v1 List bundle (namespace + backend + offer) so resume restores a working demo. sell mcp has no ServiceOffer and is documented as not resumed. Lifecycle holes closed: agent delete drops the agent's ledger entries; sell update refreshes the ledger from the live CR so resume can't revert a payTo/price change. resumeSellOffers simplifies to one guard + two phases with a pure, tested loader.

bussyjd added 2 commits June 10, 2026 19:04
After a host reboot Docker's restart policy brings the k3d cluster back
without a `stack up`, so resumeSellOffers (previously only reachable
from the stack-up action) never runs: persisted sell-inference offers
survive in etcd but their host gateways are gone, every offer sits at
UpstreamHealthy=False, and the public catalog (/api/services.json)
serves []. Observed live on the rc14 prod seller after a reboot.

- expose the existing resume path as `obol sell resume` (idempotent:
  live-PID gateways are skipped, kubectl applies re-assert)
- `--install-boot-unit` writes + enables a systemd user unit on Linux
  so resume runs automatically at boot (lingering hint printed)
- refresh the stale resumeSellOffers doc comment (it claimed gateways
  are not restarted; startDetachedInferenceGateway has done so since
  the resume feature shipped)

Live reboot test on the seller box surfaced two ways the relaunched
gateway dies instantly, leaving the offer at UpstreamHealthy=False:

1. Binary skew: startDetachedInferenceGateway preferred the installed
   BinDir obol over the binary running `sell resume`. The arg-builder
   encodes the running version's flag surface, so an older installed
   CLI (rc11 predates the --description spelling) rejects the args
   with "flag provided but not defined". Relaunch now spawns the
   running executable (resumeGatewayBinary), and the arg-builder emits
   --register-description — the one spelling every released CLI
   parses — as belt-and-braces for the BinDir fallback.

2. Validation drift: the slash-in-model rule (added after existing
   descriptors were persisted) rejected the replayed model name, so
   pre-rule offers could never resume under a new binary either. The
   spawned gateway now carries OBOL_SELL_RESUME_REPLAY=1 and the rule
   downgrades to a warning for replays; new offers still hard-fail.

Tests: TestResumeGatewayBinaryPrefersRunningExecutable,
TestResumeGatewayEnviron, TestValidateSellInferenceModelName, and
TestBuildResumeGatewayArgs now pins --register-description and bans
the rc12+ spelling.
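
To make the replay downgrade concrete, the sketch below shows one way a validation rule can hard-fail new input while only warning for replayed input. Only the `OBOL_SELL_RESUME_REPLAY=1` variable comes from this PR; `validateModelName`, `isReplay`, and their signatures are hypothetical and may not match the real helpers.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// replayEnv is the variable the PR says the spawned gateway carries.
const replayEnv = "OBOL_SELL_RESUME_REPLAY"

// isReplay takes the env lookup as a parameter so it can be tested
// without mutating the process environment.
func isReplay(getenv func(string) string) bool {
	return getenv(replayEnv) == "1"
}

// validateModelName (hypothetical) applies the slash rule. For a replayed
// offer the violation is returned as a warning; for a new offer it is an
// error.
func validateModelName(name string, replay bool) (warning string, err error) {
	if name == "" {
		return "", errors.New("model name is empty")
	}
	if !strings.Contains(name, "/") {
		return "", nil
	}
	msg := fmt.Sprintf("model name %q contains a slash", name)
	if replay {
		return msg + " (accepted: replay of a persisted offer)", nil
	}
	return "", errors.New(msg)
}

func main() {
	replay := isReplay(os.Getenv)
	warning, err := validateModelName("org/model", replay)
	fmt.Println("replay:", replay, "warning:", warning, "err:", err)
}
```

Keeping the downgrade behind an explicit flag, rather than relaxing the rule globally, means an offer created today still gets the strict check.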

bussyjd commented Jun 10, 2026

Live reboot validation on the seller box found two bugs in the resume path as originally pushed — both fixed in 29426c0:

  1. Binary skew: startDetachedInferenceGateway preferred the installed BinDir/obol (rc11) over the binary running sell resume. The arg-builder emits the current flag surface (--description, renamed in rc12), so the relaunched gateway died with flag provided but not defined: -description. Fix: spawn the running executable; also emit --register-description (parsed by every released CLI) for the BinDir fallback.

  2. Validation drift: with a current binary doing the relaunch, the slash-in-model rule (newer than the persisted descriptor) rejected --model AEON-7/Qwen3.6-…, so pre-rule offers could never resume at all. Fix: the spawned gateway carries OBOL_SELL_RESUME_REPLAY=1; the rule downgrades to a warning for replays and still hard-fails new offers.

Repro sequence: reboot host → cluster auto-recovers via Docker restart policy, catalog goes [], route 404s → sell resume re-applied the manifests, but the gateway died instantly in both of the ways described above. Next: re-running the full unattended reboot test with the fixed binary and the boot unit.

Two more gaps found prepping the unattended reboot test:

- installResumeBootUnit pinned BinDir/obol into ExecStart. On a box
  whose installed CLI predates `sell resume` (rc11 on the live seller),
  the unit fails on every boot. Pin the binary running the install
  command instead (it just proved it has the subcommand) and print the
  pinned path so the operator knows to re-install after moving it.

- At boot the k3d API server lags Docker by a minute or more, and
  resumeOneInferenceOffer's kubectl applies run BEFORE the gateway
  relaunch and warn-and-continue — a too-early resume silently resumed
  nothing and nothing retried. `sell resume` now waits for /readyz
  (3min cap) before replaying offers; `stack up` is unaffected.

Tests: TestWaitForClusterAPI (nil fast-path without kubeconfig, error
after deadline for unreachable cluster).
OisinKyne commented

will we need to make this more general than only for selling? some part of me wonders should this be like obol stack resume and/or we just have obol stack up or a version of it set to trigger for a reboot? like if you're away from your node your agent may not be able to use this to fix itself

Third live-reboot finding: the unit ran at boot, resume reported
"Gateway started in background", and the gateway was dead anyway with
an empty log. setsid detaches the session but NOT the cgroup — the
relaunched gateway lives in the unit's cgroup, and when a plain
Type=oneshot unit deactivates, systemd kills every process left in it.

RemainAfterExit=yes keeps the unit active (exited) after ExecStart,
which preserves the cgroup and the gateway with it. Side effect worth
having: `systemctl --user stop obol-sell-resume` is now a deliberate
way to take the gateways down.

TestRenderResumeBootUnit pins the new directive.

bussyjd commented Jun 10, 2026

Live unattended reboot test: PASSED at cc7ef06.

Timeline (zero intervention after shutdown -r): SSH back in 38s → boot unit fires (sleep 30 → /readyz wait passes immediately) → resume re-applies manifests + relaunches gateway → offer Ready=True at t+68s. Verified from the public internet: catalog lists the offer, unpaid probe returns 402 with x402 requirements, ERC-8004 discovery 200. Gateway is running inside the unit's cgroup with the unit active (exited).

Three additional fixes were required beyond the original two commits, all found by this live test:

| Commit | Bug | Fix |
| --- | --- | --- |
| 29426c0 | Relaunch spawned the installed (older) obol → flag provided but not defined: -description; new binary rejected the persisted slashed model name | Spawn the running executable; emit --register-description (parsed by every released CLI); OBOL_SELL_RESUME_REPLAY=1 downgrades the slash rule to a warning for replays |
| 080a346 | Boot unit pinned BinDir/obol (rc11 has no sell resume); boot race: kubectl applies run before the cluster API is up, warn-and-continue, nothing retries | Pin the installing binary into ExecStart; sell resume waits up to 3min for /readyz |
| cc7ef06 | systemd killed the "detached" gateway when the oneshot unit finished (setsid doesn't escape the cgroup): unit reported success, gateway dead, log empty | RemainAfterExit=yes keeps the cgroup (and gateway) alive; systemctl --user stop becomes a deliberate gateway kill |

Each has a pinned regression test (TestResumeGatewayBinaryPrefersRunningExecutable, TestResumeGatewayEnviron, TestValidateSellInferenceModelName, TestWaitForClusterAPI, TestRenderResumeBootUnit). go build ./... + go test ./cmd/obol -count=1 green.

bussyjd commented Jun 10, 2026

will we need to make this more general than only for selling? … should this be like obol stack resume?

Generalized in c7ca58f: sell resume now covers every ServiceOffer type, not just inference:

| Surface | Persisted | Resume behavior |
| --- | --- | --- |
| sell inference | own store (inference/) | re-apply cluster artifacts + relaunch detached host gateway |
| sell http | manifest ledger | re-apply |
| sell agent | manifest ledger (both creation sites) | re-apply; offer waits on its Agent CR with a clear controller condition if the agent is gone |
| sell demo (legacy) | ledger, as a v1 List bundle (ns + backend Deployment + Service + offer) | re-apply → working demo, not an offer with a dead upstream |
| sell demo quant (agent-backed) | ledger, offer only | deliberate: replaying an Agent CR mints a fresh wallet and orphans funds on the old one |
| sell mcp | none | no ServiceOffer exists (foreground server); documented as not resumed |

Two lifecycle holes the coverage matrix surfaced, both fixed: agent delete now drops the agent's ledger entries (else resume replays ghost offers forever), and sell update refreshes the ledger from the live post-patch CR (else the next resume kubectl-applies the old payTo/price back — silently reverting an intentional payment change).

On stack resume vs sell resume: obol stack up already runs this exact resume path, so the "recover everything" verb exists; what a reboot uniquely kills is host-side sell processes, which is what this command (plus the --install-boot-unit systemd unit, for the away-from-node case you raised) targets. If we later grow more host-side state worth resuming, promoting this to stack resume is a rename, not a redesign.

@bussyjd bussyjd requested a review from OisinKyne June 10, 2026 18:15

bussyjd commented Jun 10, 2026

Adversarial review of c7ca58f (12 agents, findings reproduced against a live cluster) confirmed three majors in the generalization — all fixed in 9b1c146:

  1. agent delete swept ledger entries for offers still alive in etcd. Nothing actually deletes the agent's ServiceOffers (the finalizer leaves the namespace + offers) — and a survivor reconciles back to Ready, paying the deleted agent's wallet, if the name is ever reused. deleteCRDAgent now deletes the namespace's offers in-cluster, and the ledger sweep only runs when the cluster was reachable.
  2. sell stop didn't refresh the ledger — an etcd-wiping stack down/up replayed the pre-drain manifest and resurrected a deliberately stopped offer fully live. The drain patch now refreshes like sell update.
  3. Agent-offer replay failed after stack recreation (namespaces "agent-<name>" not found) — the documented "offer waits on the missing agent" never happened. Agent offers now persist as a v1 List bundling the agent namespace (canonical labels) with the offer; the Agent CR stays excluded so no fresh wallet is ever minted by a replay.

Also closed the inference half of delete⇒no-resume: sell delete tombstones the inference descriptor (DeletedAt, kept for list/status history, cleared by re-creating the offer) so resume can't relaunch a deleted offer's gateway. Full go test ./... green.

bussyjd added 2 commits June 11, 2026 07:45
…edger

Generalizes the resume path per review feedback: the sell-http store
becomes the persisted-ServiceOffer ledger (dir name kept for files
written by shipped CLIs) and every offer type without a host process
persists into it — sell http, sell agent (both creation sites), the
agent-backed demo (offer only: replaying an Agent CR would mint a fresh
wallet and orphan funds on the old one), and the legacy demo as a v1
List bundle (namespace + backend Deployment + Service + offer) so
resume restores a working demo rather than an offer with a missing
upstream. sell mcp has no ServiceOffer (foreground server) and is
documented as not resumed.

resumeSellOffers simplifies to one kubeconfig guard + two phases:
inference store (cluster artifacts + detached gateway relaunch), then a
single ledger walk with type-aware messaging via a pure, tested loader.

Two lifecycle holes closed while building the coverage matrix:
- obol agent delete now drops the agent namespace's ledger entries —
  otherwise every later resume replays ghost offers for a deleted agent.
- obol sell update now refreshes the ledger from the live post-patch CR
  (List-bundle aware) — otherwise the next resume kubectl-applies the
  OLD payTo/price back, silently reverting an intentional payment
  change. Update also adopts offers created outside the CLI.

New tests: mixed-type ledger walk, demo List-bundle parsing,
namespace-scoped removal round-trip, and source-scope guards on every
persist/refresh/cleanup site.

Three confirmed majors, all reproduced against a live cluster:

- agent delete left the agent's ServiceOffers ALIVE in etcd (the agent
  finalizer tears down children but leaves the namespace and offers)
  while sweeping their ledger entries — and a surviving offer
  reconciles back to Ready, paying the deleted agent's wallet, if the
  name is ever reused. deleteCRDAgent now deletes the namespace's
  ServiceOffer CRs (the offer finalizer handles route/registration
  teardown) and the ledger sweep moves into the cluster-reachable
  branch, since an unreachable cluster means the CRs survive and the
  ledger must keep covering them.

- sell stop never refreshed the ledger, so an etcd-wiping
  stack-down/up replayed the pre-drain manifest and resurrected a
  deliberately stopped offer fully live (reboot-resume was already
  safe: client-side apply never owned drainAt). The drain patch now
  refreshes the ledger like sell update does.

- agent-offer replay failed outright after a stack recreation: the
  bare manifest's namespace no longer exists and kubectl apply errors
  with 'namespaces not found' — the documented 'offer waits on the
  missing agent' behavior never happened. Both agent persist sites now
  store a v1 List bundling the agent NAMESPACE (canonical labels) with
  the offer; the Agent CR stays excluded so no fresh wallet is minted.

Plus the inference half of delete=>no-resume (minor): sell delete now
tombstones the inference descriptor (DeletedAt; kept for list/status
history, cleared by re-creating the offer) and resume filters
tombstoned descriptors via activeInferenceDeployments. Misleading
sell-http wording in delete output and the false 'CRs died with the
namespace' comments corrected; new tests pin every behavior (bundle
round-trip incl. no-Agent-CR assertion, stop/delete scope guards,
tombstone filter, branch placement of the agent-delete sweep).
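
As a rough picture of the agent-offer bundle described above, the sketch below builds a v1 List that carries the namespace ahead of the offer and refuses to include an Agent CR. Everything except the v1 List and Namespace kinds is hypothetical: `bundleAgentOffer`, the label key, and the `example.invalid/v1alpha1` apiVersion are placeholders, not the project's real identifiers.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// object is a loosely typed Kubernetes manifest.
type object = map[string]any

// bundleAgentOffer (hypothetical) wraps an offer manifest in a v1 List whose
// first item recreates the offer's namespace, so a replay after a stack
// recreation does not fail on a missing namespace. An Agent CR is rejected
// because replaying one would mint a fresh wallet.
func bundleAgentOffer(namespace string, labels map[string]string, offer object) (object, error) {
	if kind, _ := offer["kind"].(string); kind == "Agent" {
		return nil, errors.New("refusing to persist an Agent CR in the resume bundle")
	}
	ns := object{
		"apiVersion": "v1",
		"kind":       "Namespace",
		"metadata":   object{"name": namespace, "labels": labels},
	}
	return object{
		"apiVersion": "v1",
		"kind":       "List",
		"items":      []object{ns, offer}, // namespace first, then the offer
	}, nil
}

func main() {
	offer := object{
		"apiVersion": "example.invalid/v1alpha1", // placeholder group/version
		"kind":       "ServiceOffer",
		"metadata":   object{"name": "demo-offer", "namespace": "agent-demo"},
	}
	bundle, err := bundleAgentOffer("agent-demo", map[string]string{"example.invalid/owner": "agent"}, offer)
	if err != nil {
		panic(err)
	}
	out, _ := json.MarshalIndent(bundle, "", "  ")
	fmt.Println(string(out))
}
```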