fix(flows): flow-16 unsatisfiable Ready gate; flow-04 first-inference window on local models #621

Closed
bussyjd wants to merge 1 commit into main from fix/flow-gates-local-llm
@bussyjd bussyjd commented Jun 11, 2026

Contributor

Summary

Both release-gating FAILs from the rc14 release smoke on a local Ollama model (qwopus3.6-27b-v2-mtp:q5_k_m, 27B) reduce to flow bugs — the stack itself is clean. Diagnosed live against a reproducing cluster; full chain below.

flow-16 §2.2 — the Ready gate could never pass

The flow creates its offer with registration enabled and never runs obol sell register, so the controller keeps Ready=False / Registered=AwaitingExternalRegistration by design ("offer already serves paid traffic"). A Ready=True poll is unsatisfiable as written — flow-11 only polls Ready after actually registering.

It historically "passed" by accident: the gate grepped Ready=True against sell status output, which substring-matches PaymentGateReady=True whenever the condition ladder converged inside the window.
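The accidental pass can be reproduced with a minimal sketch. The status string below is copied from the Verification section of this PR; the real `sell status` layout may differ, and the variable names are illustrative only.

```shell
# Condition line as reported for the failing cluster (taken from this PR's
# Verification section; the actual `sell status` formatting is an assumption).
status='UpstreamHealthy=True PaymentGateReady=True RoutePublished=True Registered=False Ready=False'

# Unanchored grep: "Ready=True" is a suffix of "PaymentGateReady=True",
# so this matches even though the Ready condition itself is False.
if printf '%s\n' "$status" | grep -q 'Ready=True'; then
  unanchored=match
else
  unanchored=nomatch
fi

# Anchored grep: the condition name must start at a line or whitespace boundary.
if printf '%s\n' "$status" | grep -Eq '(^|[[:space:]])Ready=True([[:space:]]|$)'; then
  anchored=match
else
  anchored=nomatch
fi

echo "unanchored=$unanchored anchored=$anchored"
# prints: unanchored=match anchored=nomatch
```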

Fix: poll the serving condition set — UpstreamHealthy + PaymentGateReady + RoutePublished, anchored greps via obol kubectl jsonpath — over 300s. That set is exactly what §3's 402 probe then exercises.
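A rough sketch of what such a gate could look like follows; it is not the flow's actual code. The function names and the `OFFER_KIND`, `OFFER_NAME`, and `OFFER_NS` variables are hypothetical, and it assumes `obol kubectl` passes `get ... -o jsonpath=...` through to kubectl unchanged.

```shell
# all_conditions_true CONDITIONS TYPE...
# CONDITIONS holds one "Type=Status" pair per line. grep -x matches whole
# lines only, so "Ready" can never match "PaymentGateReady".
all_conditions_true() {
  conds=$1
  shift
  for cond_type in "$@"; do
    printf '%s\n' "$conds" | grep -qx "${cond_type}=True" || return 1
  done
}

# poll_serving FETCH_FN TIMEOUT_SECONDS INTERVAL_SECONDS
# Calls FETCH_FN until the three serving conditions are all True or the
# timeout elapses. Returns 0 on success, 1 on timeout.
poll_serving() {
  fetch=$1
  timeout=$2
  interval=$3
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    conds=$($fetch 2>/dev/null) || conds=''
    if all_conditions_true "$conds" UpstreamHealthy PaymentGateReady RoutePublished; then
      return 0
    fi
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "$interval"
  done
}

# Hypothetical fetcher: the resource kind, name, and namespace are placeholders
# that the flow would have to supply.
fetch_offer_conditions() {
  obol kubectl get "$OFFER_KIND" "$OFFER_NAME" -n "$OFFER_NS" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}
```

Under those assumptions the gate would be invoked as `poll_serving fetch_offer_conditions 300 5`. Keeping the fetcher separate from the matcher lets the matching logic be exercised with a stub instead of a live cluster.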

flow-04 step 12 — window too tight, diagnostics swallowed

curl -sf --max-time 120 failed reproducibly (twice, warm model) on the first inference ever routed through the Hermes agent pipeline: Hermes prepends a multi-thousand-token system prompt, and a local Ollama model pays full prompt processing before the KV cache warms (~150s observed; Hermes' internal client retries re-pay it until one attempt survives). The very next call answers in ~20s — which is why step 13 passed seconds later in every run. GPU-class endpoints converge in seconds and never see this.

-f also swallowed the response entirely, so the fail line was the empty Agent inference failed — (which initially sent the investigation toward cold model loads and auth). Fix: 300s window, no -f, and the fail message carries HTTP status + body snippet.
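One way to write a probe that keeps its diagnostics is sketched below, not as the flow's actual step 12. The function names, the message wording, and the 200-byte snippet length are assumptions; the URL and payload are supplied by the caller.

```shell
# fail_line STATUS BODY_FILE
# Builds a one-line failure message carrying the HTTP status and the first
# 200 bytes of the response body, with newlines flattened to spaces.
fail_line() {
  snippet=$(head -c 200 "$2" | tr '\n' ' ')
  printf 'Agent inference failed: HTTP %s, body: %s\n' "$1" "$snippet"
}

# probe_agent URL JSON_PAYLOAD
# Prints the response body on HTTP 200; otherwise prints the failure line to
# stderr and returns 1.
probe_agent() {
  body_file=$(mktemp)
  # No -f: curl still saves the body of a 4xx/5xx response so it can be
  # reported. -w prints the status code; curl reports 000 when no HTTP
  # response arrived (for example, when --max-time expired).
  http_status=$(curl -s --max-time 300 -o "$body_file" -w '%{http_code}' \
    -H 'Content-Type: application/json' -d "$2" "$1")
  if [ "$http_status" = 200 ]; then
    cat "$body_file"
    rc=0
  else
    fail_line "${http_status:-000}" "$body_file" >&2
    rc=1
  fi
  rm -f "$body_file"
  return "$rc"
}
```

With this shape, a timeout shows up as `HTTP 000` with an empty body and an auth problem as `HTTP 401` with the server's error text, which separates the two causes the original empty message conflated.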

Verification

  • Live cluster in the failing state: new flow-16 gate passes where the old one cannot (UpstreamHealthy=True PaymentGateReady=True RoutePublished=True Registered=False Ready=False); the exact flow-04 request returns 200 with correct content once the prompt cache is warm.
  • bash -n is clean on both flows; with these fixes the rc14 wopus smoke is 12 PASS / 2 SKIP (the SKIPs are waived registration-receipt sub-checks; the registrations themselves succeeded on-chain).

…ference probe

Both release-smoke failures on the rc14 wopus run reduce to these two
flow bugs — no stack defect (reproduced live, full chain diagnosed):

- flow-16 §2.2 polled Ready=True for an offer created WITH registration
  enabled and no `obol sell register` submitted, which the controller
  keeps Ready=False / AwaitingExternalRegistration by design ('offer
  already serves paid traffic') — the gate could never pass as written,
  and only ever matched historically because 'Ready=True' substring-
  matched 'PaymentGateReady=True' when the ladder converged in time.
  Gate now polls the serving condition set (UpstreamHealthy +
  PaymentGateReady + RoutePublished, anchored greps) over 300s, which is
  exactly what §3's 402 probe exercises.

- flow-04 step 12 used `curl -sf --max-time 120`: too tight for the
  FIRST inference ever routed through the Hermes agent pipeline on a
  local Ollama model (the multi-thousand-token system prompt pays full
  prompt processing before the KV cache warms; ~150s observed for a 27B
  on an M-series host), and -f swallowed every diagnostic so the fail
  message was empty. Now 300s, no -f, and the fail message carries the
  HTTP status + body snippet.

Verified against a live cluster in the failing state: the new flow-16
gate passes where the old one cannot; the flow-04 call returns 200 with
correct content once warm.