feat: import/export persistence clean up#624
Conversation
After a host reboot Docker's restart policy brings the k3d cluster back without a `stack up`, so resumeSellOffers (previously only reachable from the stack-up action) never runs: persisted sell-inference offers survive in etcd but their host gateways are gone, every offer sits at UpstreamHealthy=False, and the public catalog (/api/services.json) serves []. Observed live on the rc14 prod seller after a reboot. - expose the existing resume path as `obol sell resume` (idempotent: live-PID gateways are skipped, kubectl applies re-assert) - `--install-boot-unit` writes + enables a systemd user unit on Linux so resume runs automatically at boot (lingering hint printed) - refresh the stale resumeSellOffers doc comment (it claimed gateways are not restarted; startDetachedInferenceGateway has done so since the resume feature shipped)
Live reboot test on the seller box surfaced two ways the relaunched gateway dies instantly, leaving the offer at UpstreamHealthy=False: 1. Binary skew: startDetachedInferenceGateway preferred the installed BinDir obol over the binary running `sell resume`. The arg-builder encodes the running version's flag surface, so an older installed CLI (rc11 predates the --description spelling) rejects the args with "flag provided but not defined". Relaunch now spawns the running executable (resumeGatewayBinary), and the arg-builder emits --register-description — the one spelling every released CLI parses — as belt-and-braces for the BinDir fallback. 2. Validation drift: the slash-in-model rule (added after existing descriptors were persisted) rejected the replayed model name, so pre-rule offers could never resume under a new binary either. The spawned gateway now carries OBOL_SELL_RESUME_REPLAY=1 and the rule downgrades to a warning for replays; new offers still hard-fail. Tests: TestResumeGatewayBinaryPrefersRunningExecutable, TestResumeGatewayEnviron, TestValidateSellInferenceModelName, and TestBuildResumeGatewayArgs now pins --register-description and bans the rc12+ spelling.
Two more gaps found prepping the unattended reboot test: - installResumeBootUnit pinned BinDir/obol into ExecStart. On a box whose installed CLI predates `sell resume` (rc11 on the live seller), the unit fails on every boot. Pin the binary running the install command instead (it just proved it has the subcommand) and print the pinned path so the operator knows to re-install after moving it. - At boot the k3d API server lags Docker by a minute or more, and resumeOneInferenceOffer's kubectl applies run BEFORE the gateway relaunch and warn-and-continue — a too-early resume silently resumed nothing and nothing retried. `sell resume` now waits for /readyz (3min cap) before replaying offers; `stack up` is unaffected. Tests: TestWaitForClusterAPI (nil fast-path without kubeconfig, error after deadline for unreachable cluster).
Third live-reboot finding: the unit ran at boot, resume reported "Gateway started in background", and the gateway was dead anyway with an empty log. setsid detaches the session but NOT the cgroup — the relaunched gateway lives in the unit's cgroup, and when a plain Type=oneshot unit deactivates, systemd kills every process left in it. RemainAfterExit=yes keeps the unit active (exited) after ExecStart, which preserves the cgroup and the gateway with it. Side effect worth having: `systemctl --user stop obol-sell-resume` is now a deliberate way to take the gateways down. TestRenderResumeBootUnit pins the new directive.
…n freshness Every docker-publish-x402 branch build now lands a pin-bump commit (repin-embedded-pins job) updating the embedded x402-verifier / serviceoffer-controller / x402-buyer references to the images just built, and the release workflow gains a verify-image-pins gate that fails the tag when any source in the binaries' live import graph (go list -deps) changed after the pinned build commit. Together they make the rc14 stale-pin trap — a release whose embedded pins predate its own payment-path changes — structurally impossible: the bump is automatic, and a tag cut before the bump lands cannot release. The bump is committed through the GraphQL createCommitOnBranch API, so the commit is signed by GitHub itself (verified, github-actions bot) — compatible with the repo ruleset rejecting unsigned commits, which a workflow git push could never satisfy. expectedHeadOid is the live remote head with one retry on race; only the two guarded template files are ever sent. The gate is fail-closed (a go-list failure refuses to pass rather than degrade to a partial path set), binds each embedded digest to what GHCR serves for the pinned tag (a fresh tag with a hand-edited digest fails), ignores _test.go/testdata churn, and hunk-filters the two pin-carrying templates so pin bumps don't self-stale while any other edit to them still counts. release/** branches get the same build+repin treatment as main. The exact-ref pin tests become invariant tests: pins must be digest- pinned, share one build commit, and descend from the named fix commits (ancestry-verified via git, skipped on shallow clones) — so the bot can bump pins without touching Go files while the carries-fix-X guarantees get stronger.
…edger Generalizes the resume path per review feedback: the sell-http store becomes the persisted-ServiceOffer ledger (dir name kept for files written by shipped CLIs) and every offer type without a host process persists into it — sell http, sell agent (both creation sites), the agent-backed demo (offer only: replaying an Agent CR would mint a fresh wallet and orphan funds on the old one), and the legacy demo as a v1 List bundle (namespace + backend Deployment + Service + offer) so resume restores a working demo rather than an offer with a missing upstream. sell mcp has no ServiceOffer (foreground server) and is documented as not resumed. resumeSellOffers simplifies to one kubeconfig guard + two phases: inference store (cluster artifacts + detached gateway relaunch), then a single ledger walk with type-aware messaging via a pure, tested loader. Two lifecycle holes closed while building the coverage matrix: - obol agent delete now drops the agent namespace's ledger entries — otherwise every later resume replays ghost offers for a deleted agent. - obol sell update now refreshes the ledger from the live post-patch CR (List-bundle aware) — otherwise the next resume kubectl-applies the OLD payTo/price back, silently reverting an intentional payment change. Update also adopts offers created outside the CLI. New tests: mixed-type ledger walk, demo List-bundle parsing, namespace-scoped removal round-trip, and source-scope guards on every persist/refresh/cleanup site.
Three confirmed majors, all reproduced against a live cluster: - agent delete left the agent's ServiceOffers ALIVE in etcd (the agent finalizer tears down children but leaves the namespace and offers) while sweeping their ledger entries — and a surviving offer reconciles back to Ready, paying the deleted agent's wallet, if the name is ever reused. deleteCRDAgent now deletes the namespace's ServiceOffer CRs (the offer finalizer handles route/registration teardown) and the ledger sweep moves into the cluster-reachable branch, since an unreachable cluster means the CRs survive and the ledger must keep covering them. - sell stop never refreshed the ledger, so an etcd-wiping stack-down/up replayed the pre-drain manifest and resurrected a deliberately stopped offer fully live (reboot-resume was already safe: client-side apply never owned drainAt). The drain patch now refreshes the ledger like sell update does. - agent-offer replay failed outright after a stack recreation: the bare manifest's namespace no longer exists and kubectl apply errors with 'namespaces not found' — the documented 'offer waits on the missing agent' behavior never happened. Both agent persist sites now store a v1 List bundling the agent NAMESPACE (canonical labels) with the offer; the Agent CR stays excluded so no fresh wallet is minted. Plus the inference half of delete=>no-resume (minor): sell delete now tombstones the inference descriptor (DeletedAt; kept for list/status history, cleared by re-creating the offer) and resume filters tombstoned descriptors via activeInferenceDeployments. Misleading sell-http wording in delete output and the false 'CRs died with the namespace' comments corrected; new tests pin every behavior (bundle round-trip incl. no-Agent-CR assertion, stop/delete scope guards, tombstone filter, branch placement of the agent-delete sweep).
…ference probe
Both release-smoke failures on the rc14 wopus run reduce to these two
flow bugs — no stack defect (reproduced live, full chain diagnosed):
- flow-16 §2.2 polled Ready=True for an offer created WITH registration
enabled and no `obol sell register` submitted, which the controller
keeps Ready=False / AwaitingExternalRegistration by design ('offer
already serves paid traffic') — the gate could never pass as written,
and only ever matched historically because 'Ready=True' substring-
matched 'PaymentGateReady=True' when the ladder converged in time.
Gate now polls the serving condition set (UpstreamHealthy +
PaymentGateReady + RoutePublished, anchored greps) over 300s, which is
exactly what §3's 402 probe exercises.
- flow-04 step 12 used `curl -sf --max-time 120`: too tight for the
FIRST inference ever routed through the Hermes agent pipeline on a
local Ollama model (the multi-thousand-token system prompt pays full
prompt processing before the KV cache warms; ~150s observed for a 27B
on an M-series host), and -f swallowed every diagnostic so the fail
message was empty. Now 300s, no -f, and the fail message carries the
HTTP status + body snippet.
Verified against a live cluster in the failing state: the new flow-16
gate passes where the old one cannot; the flow-04 call returns 200 with
correct content once warm.
Adds the only e2e coverage of `obol sell mcp` (paid MCP tool over x402 in-band _meta; shipped in rc14 with unit tests only). A reusable SDK client (flows/clients/mcp-paid-client.go, the x402-foundation MCP client) drives the full loop against a foreground `obol sell mcp` server backed by a mock upstream, reusing flow-10's anvil fork + facilitator: free ping → requirements surfaced in _meta (payTo/asset/network/amount) → unpaid call rejected → auto-paid call (EIP-3009 in _meta) settles on-chain → seller API key injected upstream + invisible to the buyer → buyer balance delta == price. Registered in release-smoke.sh after flow-16 (needs no cluster of its own; runs while anvil + facilitator are up). Two fork-specific footguns found and documented (CLAUDE.md pitfalls 17/18), both reproduced live: - EIP-7702-contaminated test accounts: anvil/hardhat accounts #1-#9 carry 0xef0100 delegation code from real-chain 7702 experiments on Base Sepolia. FiatTokenV2_2 verifies EIP-3009 via SignatureChecker, which routes any code-bearing `from` to EIP-1271 and rejects an otherwise-valid ECDSA signature ('invalid signature' → facilitator 503). The buyer MUST be a fresh EOA — flow-17 generates fresh keys and preflights `cast code == 0x`. (Same reason flow-08 uses the agent's generated wallet.) - x402 SDK signs validAfter=now with no past buffer; a long-lived anvil fork's block.timestamp lags real time → 'authorization is not yet valid'. flow-17 syncs the fork clock (evm_setNextBlockTimestamp+mine) before the paid call.
…rence window on local models
Automated by docker-publish-x402/repin-embedded-pins after the image build at 926620b. Committed via the GitHub API so the commit is verified.
note to look into.
|
|
The rc15 branch vs mine — and the rebase verdict They're mostly complementary, not competing. rc15's offer work So: rebase feat/stack-export-import onto rc15, keep their Agent creation paths: seven, and the trap is inverted There are 7 distinct creation paths but only 2 render On "is the CR work done": my record-on-write is done on the Multiple locked-down agent businesses: ~40% there What already works: N ServiceOffers can route to one
This also crystallizes the three-tier trust model you The plan doc sequences it A→E: rebase now, isolation MVP |
3238bf2 to
a3fc3aa
Compare
a3fc3aa to
c525d10
Compare
…bombs on import obol stack import sanitized tar entry NAMES (sanitizeEntryName) but wrote symlinks from the raw, unchecked Linkname. A malicious archive could ship a clean-named symlink whose target points outside the extraction root (e.g. link -> /etc, or ../../..), then a follow-up entry written THROUGH that link lands at an arbitrary path — arbitrary file write as the importing user. The only guard (name sanitization) does not help: the escape happens at OS symlink-resolution time, not lexically. - symlinkEscapesRoot rejects absolute and ..-walking symlink targets that resolve outside destRoot; in-root relative links (the common case) still work (TestArchiveRoundTrip unchanged). - Add a decompression-ratio guard (countingReader + ratioGuard between gzip and tar): aborts when uncompressed:compressed exceeds 100:1 past a 64 MiB floor, so a tiny gzip can no longer inflate to a disk-filling tar. The floor keeps legitimate multi-GB agent-data restores (low, steady ratio) passing; thresholds are vars so tests can lower them. - Tests: TestExtractRejectsSymlinkEscape (relative + absolute targets, no link created) and TestExtractRejectsDecompressionBomb. Found in review of #624.
…trip flow Fills the cheap, cluster-free coverage gaps found in review and adds the end-to-end flow the unit tests cannot reach. Unit tests (were 0%, now 100% on the targeted helpers): - Manifest.component lookup (hit/present-excluded/miss) - listInstances (absent dir -> nil; ignores non-dir entries) - walkArchive rejects non-gzip input cleanly - walletNote ok/FAILED formatting - model.mapKeys flow-18-stack-export-import.sh: seed recorded RPC + preferred model (+ optional Agent CR) -> export -> purge -> import -> stack up -> cluster-only re-apply -> assert the RPC, model_list head, and Agent CR all returned. This exercises export quiesce/harvest, host restore, and the stack-up replay ordering that only the source-level guard test touches today. Standalone + destructive by design (purges/recreates), so it is NOT wired into the release-smoke baseline; header documents the run recipe. Syntax + shellcheck clean; needs one live-cluster run to validate before use as a gate.
bussyjd
left a comment
There was a problem hiding this comment.
Approve — reviewed in depth, hardened, and coverage-filled
Read the full net-new persistence subsystem (stackbackup / agentcrd / model.record / network.record + the stack-up wiring). The design is sound — record-on-write with atomic 0600 writes, the agent-CRs-before-offers and model-reconcile-after-autoconfig ordering invariants, honest secret disclosure, and version gates on all three record formats. Test quality on the pure-logic layer is genuinely strong (real contracts asserted: 0600 perms, version rejection, merge ordering/dedup, unrecord-keeps-custom), not coverage padding.
Two things were addressed during review (pushed to this branch, both GitHub-verified):
71b5848— security: the import extractor sanitized entry names but wrote symlinks from the rawLinkname, allowing a symlink-target escape (clean-named link →/etc, then a write through it = arbitrary file write as the importing user). Fixed withsymlinkEscapesRoot(in-root links still work) + a decompression-ratio guard against zip-bombs. Tests added;TestArchiveRoundTripunchanged.975aefd— coverage: unit tests for the previously-0% pure helpers (Manifest.component,listInstances, non-gzip rejection,walletNote,mapKeys) andflow-18-stack-export-import.sh— the export→purge→import→up→verify round-trip the unit tests can't reach. The flow is standalone/destructive (not in the release-smoke baseline) and needs one live-cluster validation run before use as a gate.
Remaining items are design judgments, not blockers, left to you: RPC record fail-loud vs. model/agent best-effort; the cluster-dump-vs-recorded-models.yaml double path for LiteLLM config. Worth a rebase onto main once the rc15 train (#623) lands so the diff collapses to just the three persistence commits.
LGTM.
…bombs on import obol stack import sanitized tar entry NAMES (sanitizeEntryName) but wrote symlinks from the raw, unchecked Linkname. A malicious archive could ship a clean-named symlink whose target points outside the extraction root (e.g. link -> /etc, or ../../..), then a follow-up entry written THROUGH that link lands at an arbitrary path — arbitrary file write as the importing user. The only guard (name sanitization) does not help: the escape happens at OS symlink-resolution time, not lexically. - symlinkEscapesRoot rejects absolute and ..-walking symlink targets that resolve outside destRoot; in-root relative links (the common case) still work (TestArchiveRoundTrip unchanged). - Add a decompression-ratio guard (countingReader + ratioGuard between gzip and tar): aborts when uncompressed:compressed exceeds 100:1 past a 64 MiB floor, so a tiny gzip can no longer inflate to a disk-filling tar. The floor keeps legitimate multi-GB agent-data restores (low, steady ratio) passing; thresholds are vars so tests can lower them. - Tests: TestExtractRejectsSymlinkEscape (relative + absolute targets, no link created) and TestExtractRejectsDecompressionBomb. Found in review of #624.
Problem to be solved
Losing agent data on a purge. And other resources giving up on the persistence aspects.
This fable run was intended to add import / export, and phase 2 was to bring more persistence back, to make the cluster more declarative than imperative. Next step is to rebase (ish) off the other pr in flight #623
Summary
Phase 2 is done and staged on
feat/stack-export-import (on top of your
Phase 1 commit ba9b536) — stopping
short of committing per your yubikey
workflow.
What record-on-write now covers
Agent CRs — obol agent new/update
persist the applied manifest to
$CONFIG_DIR/agents/.yaml; agent
delete removes it; agentcrd.ResumeAll
replays them at stack up. This was the
worst gap: host seeds existed but
nothing recreated the CR, so agents (and
any offer pointing at them) silently
never returned.
sell agent offers — now persist into the
existing sell-http/ store, so the
proven replay path and sell delete
cleanup handle them with zero new
machinery. sell mcp (foreground process)
and sell demo (ephemeral) stay
unrecorded by design.
Model config — obol model
setup|prefer|remove snapshot the
resulting model_list plus referenced
provider API keys to
$CONFIG_DIR/llm/recorded-models.yaml
(0600). paid/* entries are filtered out
— verified live against your cluster,
where the real paid/AEON-7/... purchase
route was correctly excluded while the
Ollama entry was captured. The subtle
design point: recording happens only at
the command layer, never from
autoConfigureLLM — otherwise stack-up
auto-detection would overwrite your
preferences moments before
model.ReconcileRecorded (which runs
after auto-config, before default-Hermes
setup, so your recorded head model
stays the agents' default).
Remote RPCs — AddPublicRPCs/AddCustomRPC
/RemovePublicRPCs maintain
$CONFIG_DIR/rpc/recorded-upstreams.yaml
(0600, since custom URLs can carry
paid-provider keys), with ChainList
endpoints snapshotted at add-time for
deterministic offline replay. stack up
replays through the same idempotent add
functions. Local-node upstreams stay
unrecorded — network sync already
re-registers those.
Stack-up ordering is pinned by a
source-guard test
(cmd/obol/stackup_resume_guard_test.go,
mirroring your existing resumeSellOffers
guard): stack.Up → RPCs → Agent CRs →
sell offers, with Agent CRs strictly
before offers since agent-backed
ServiceOffers resolve agent.ref.
Validation
All 35 packages pass. Live smoke on your
dev cluster: model record verified with
the paid-filter; agent full cycle
verified (agent new → record written →
CR deleted from etcd → replayed from
record → agent delete → record and data
dir cleaned). Cluster left healthy
(hermes 2/2 Running). stack export needs
no changes — the config component picks
up all three new stores automatically.
One noted follow-up in the plan doc:
(agent new → record written → CR deleted from etcd → replayed
from record → agent delete → record and data dir cleaned).
Cluster left healthy (hermes 2/2 Running). stack export needs
no changes — the config component picks up all three new stores
automatically.
One noted follow-up in the plan doc: custom RPC endpoints have
no removal command (network remove only drops ChainList
selections, matching its cluster-side semantics), so a recorded
custom endpoint currently requires editing the file — a
network remove --endpoint flag would close that. Phase 3
(sellable OCI bundles + obol sell export) remains the open
phase.