fix(sell): add `obol sell resume` so offers recover after a host reboot #619
bussyjd wants to merge 6 commits
After a host reboot, Docker's restart policy brings the k3d cluster back without a `stack up`, so resumeSellOffers (previously only reachable from the stack-up action) never runs: persisted sell-inference offers survive in etcd but their host gateways are gone, every offer sits at UpstreamHealthy=False, and the public catalog (/api/services.json) serves []. Observed live on the rc14 prod seller after a reboot.

- expose the existing resume path as `obol sell resume` (idempotent: live-PID gateways are skipped, kubectl applies re-assert)
- `--install-boot-unit` writes + enables a systemd user unit on Linux so resume runs automatically at boot (lingering hint printed)
- refresh the stale resumeSellOffers doc comment (it claimed gateways are not restarted; startDetachedInferenceGateway has done so since the resume feature shipped)
Live reboot test on the seller box surfaced two ways the relaunched gateway dies instantly, leaving the offer at UpstreamHealthy=False:

1. Binary skew: startDetachedInferenceGateway preferred the installed BinDir obol over the binary running `sell resume`. The arg-builder encodes the running version's flag surface, so an older installed CLI (rc11 predates the --description spelling) rejects the args with "flag provided but not defined". Relaunch now spawns the running executable (resumeGatewayBinary), and the arg-builder emits --register-description, the one spelling every released CLI parses, as belt-and-braces for the BinDir fallback.
2. Validation drift: the slash-in-model rule (added after existing descriptors were persisted) rejected the replayed model name, so pre-rule offers could never resume under a new binary either. The spawned gateway now carries OBOL_SELL_RESUME_REPLAY=1 and the rule downgrades to a warning for replays; new offers still hard-fail.

Tests: TestResumeGatewayBinaryPrefersRunningExecutable, TestResumeGatewayEnviron, and TestValidateSellInferenceModelName are new; TestBuildResumeGatewayArgs now pins --register-description and bans the rc12+ spelling.
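A rough sketch of the replay downgrade described in item 2. Only the `OBOL_SELL_RESUME_REPLAY` variable name comes from the commit message; the function names and the exact rule shape (any model name containing `/` is flagged) are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// replayEnv is the marker named in the commit message.
const replayEnv = "OBOL_SELL_RESUME_REPLAY"

// isReplay reports whether the process was spawned by a resume replay.
// getenv is injected so the check can be exercised without touching the real environment.
func isReplay(getenv func(string) string) bool {
	return getenv(replayEnv) == "1"
}

// validateModelName applies an assumed slash-in-model rule: new offers
// hard-fail, replays of persisted offers only get a warning.
func validateModelName(name string, replay bool) (warning string, err error) {
	if !strings.Contains(name, "/") {
		return "", nil
	}
	msg := fmt.Sprintf("model name %q contains a slash", name)
	if replay {
		return msg + " (accepted: replay of a persisted offer)", nil
	}
	return "", errors.New(msg)
}

// resumeEnviron copies base and appends the replay marker for the spawned gateway.
func resumeEnviron(base []string) []string {
	out := make([]string, 0, len(base)+1)
	out = append(out, base...)
	return append(out, replayEnv+"=1")
}

func main() {
	warn, err := validateModelName("org/model", isReplay(os.Getenv))
	fmt.Printf("warning=%q err=%v\n", warn, err)
}
```

Keeping the downgrade behind an explicit environment marker means an operator creating a new offer cannot reach the lenient path by accident.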
Live reboot validation on the seller box found two bugs in the resume path as originally pushed — both fixed in 29426c0:
Repro sequence: reboot host → cluster auto-recovers via Docker restart policy, catalog goes |
Two more gaps found prepping the unattended reboot test:

- installResumeBootUnit pinned BinDir/obol into ExecStart. On a box whose installed CLI predates `sell resume` (rc11 on the live seller), the unit fails on every boot. Pin the binary running the install command instead (it just proved it has the subcommand) and print the pinned path so the operator knows to re-install after moving it.
- At boot the k3d API server lags Docker by a minute or more, and resumeOneInferenceOffer's kubectl applies run BEFORE the gateway relaunch and warn-and-continue, so a too-early resume silently resumed nothing and nothing retried. `sell resume` now waits for /readyz (3min cap) before replaying offers; `stack up` is unaffected.

Tests: TestWaitForClusterAPI (nil fast-path without kubeconfig, error after deadline for unreachable cluster).
will we need to make this more general than only for selling? some part of me wonders should this be like |
Third live-reboot finding: the unit ran at boot, resume reported "Gateway started in background", and the gateway was dead anyway with an empty log. setsid detaches the session but NOT the cgroup — the relaunched gateway lives in the unit's cgroup, and when a plain Type=oneshot unit deactivates, systemd kills every process left in it. RemainAfterExit=yes keeps the unit active (exited) after ExecStart, which preserves the cgroup and the gateway with it. Side effect worth having: `systemctl --user stop obol-sell-resume` is now a deliberate way to take the gateways down. TestRenderResumeBootUnit pins the new directive.
Live unattended reboot test: PASSED at cc7ef06. Timeline (zero intervention after
Three additional fixes were required beyond the original two commits, all found by this live test:
Each has a pinned regression test ( |
Generalized in
Two lifecycle holes the coverage matrix surfaced, both fixed: On |
Adversarial review of
Also closed the inference half of delete⇒no-resume: |
…edger

Generalizes the resume path per review feedback: the sell-http store becomes the persisted-ServiceOffer ledger (dir name kept for files written by shipped CLIs) and every offer type without a host process persists into it:

- sell http
- sell agent (both creation sites)
- the agent-backed demo (offer only: replaying an Agent CR would mint a fresh wallet and orphan funds on the old one)
- the legacy demo as a v1 List bundle (namespace + backend Deployment + Service + offer) so resume restores a working demo rather than an offer with a missing upstream

sell mcp has no ServiceOffer (foreground server) and is documented as not resumed.

resumeSellOffers simplifies to one kubeconfig guard + two phases: inference store (cluster artifacts + detached gateway relaunch), then a single ledger walk with type-aware messaging via a pure, tested loader.

Two lifecycle holes closed while building the coverage matrix:

- obol agent delete now drops the agent namespace's ledger entries; otherwise every later resume replays ghost offers for a deleted agent.
- obol sell update now refreshes the ledger from the live post-patch CR (List-bundle aware); otherwise the next resume kubectl-applies the OLD payTo/price back, silently reverting an intentional payment change. Update also adopts offers created outside the CLI.

New tests: mixed-type ledger walk, demo List-bundle parsing, namespace-scoped removal round-trip, and source-scope guards on every persist/refresh/cleanup site.
Three confirmed majors, all reproduced against a live cluster:

- agent delete left the agent's ServiceOffers ALIVE in etcd (the agent finalizer tears down children but leaves the namespace and offers) while sweeping their ledger entries, and a surviving offer reconciles back to Ready, paying the deleted agent's wallet, if the name is ever reused. deleteCRDAgent now deletes the namespace's ServiceOffer CRs (the offer finalizer handles route/registration teardown) and the ledger sweep moves into the cluster-reachable branch, since an unreachable cluster means the CRs survive and the ledger must keep covering them.
- sell stop never refreshed the ledger, so an etcd-wiping stack-down/up replayed the pre-drain manifest and resurrected a deliberately stopped offer fully live (reboot-resume was already safe: client-side apply never owned drainAt). The drain patch now refreshes the ledger like sell update does.
- agent-offer replay failed outright after a stack recreation: the bare manifest's namespace no longer exists and kubectl apply errors with 'namespaces not found', so the documented 'offer waits on the missing agent' behavior never happened. Both agent persist sites now store a v1 List bundling the agent NAMESPACE (canonical labels) with the offer; the Agent CR stays excluded so no fresh wallet is minted.

Plus the inference half of delete=>no-resume (minor): sell delete now tombstones the inference descriptor (DeletedAt; kept for list/status history, cleared by re-creating the offer) and resume filters tombstoned descriptors via activeInferenceDeployments.

Misleading sell-http wording in delete output and the false 'CRs died with the namespace' comments corrected; new tests pin every behavior (bundle round-trip incl. no-Agent-CR assertion, stop/delete scope guards, tombstone filter, branch placement of the agent-delete sweep).
9b1c146 to ffab6cf
Problem
After a host reboot, the public seller catalog stays empty even though the stack looks healthy: storefront serves 200, controller/verifier/catalog pods are all Running, but `/api/services.json` returns `[]`.

Root cause (diagnosed live on the rc14 prod seller after a reboot):

- Docker's restart policy brings the k3d cluster back without an `obol stack up`.
- The resume path (`resumeSellOffers` → re-apply ServiceOffer → `startDetachedInferenceGateway`) is only invoked from the `stack up` action (`cmd/obol/main.go`), so it never fires.
- The host gateway started by `obol sell inference` died with the host. The ServiceOffer survives in etcd but sits at `UpstreamHealthy=False`.
- The catalog therefore serves `[]` → buyers see nothing.

Fix
- `obol sell resume`: exposes the existing resume path as a first-class subcommand so a reboot no longer requires a full `stack up` to recover. Idempotent: gateways with a live PID on disk are skipped; the kubectl applies re-assert existing objects.
- `obol sell resume --install-boot-unit` (Linux/systemd, opt-in): writes + enables a systemd user unit (`obol-sell-resume.service`, `WantedBy=default.target`, pre-start sleep for the API server) so resume runs automatically at boot. Prints the `loginctl enable-linger` hint required for boot-without-login. Non-Linux returns actionable guidance instead.
- Refreshes the stale `resumeSellOffers` doc comment (it still claimed gateways are not restarted; `startDetachedInferenceGateway` has done so since the resume feature shipped).

No behavior change to `obol stack up`: it keeps calling the same resume path.

Tests
- `TestSellResumeCommand_Registered`: registration + `--install-boot-unit` guard (mirrors the existing `TestStackUpAction_CallsResumeSellOffers` source-guard pattern).
- `TestRenderResumeBootUnit`: pins ExecStart/Environment/WantedBy/pre-start sleep of the emitted unit.
- `TestInstallResumeBootUnit_PlatformBehavior`: non-Linux refuses with guidance; Linux writes the unit under `$HOME` even when `systemctl` is unavailable (enable steps are best-effort warnings).

`go build ./...` and `go test ./cmd/obol -count=1` green. Branched off latest `main` (tree-identical to `v0.10.0-rc14`).

Follow-up (out of scope)
The long-term shape noted in the existing code comment stands: package the inference gateway as an in-cluster Deployment so the cluster supervises it like Traefik/LiteLLM and the host-side PID plumbing disappears.
Coverage: all ServiceOffer types (review follow-up)
Per review feedback, `c7ca58f` generalizes resume beyond inference+http. The `sell-http/` store becomes the persisted-ServiceOffer ledger (dir name kept for already-shipped files), and every offer surface persists into it: `sell agent` (both creation sites), the agent-backed demo (offer only; replaying an Agent CR would mint a fresh wallet), and the legacy demo as a `v1 List` bundle (namespace + backend + offer) so resume restores a working demo. `sell mcp` has no ServiceOffer and is documented as not resumed. Lifecycle holes closed: `agent delete` drops the agent's ledger entries; `sell update` refreshes the ledger from the live CR so resume can't revert a payTo/price change. `resumeSellOffers` simplifies to one guard + two phases with a pure, tested loader.