Background
We hit a Tenero failure while enriching /agents SSR with USD volume (PRs around 962cc1d → f327554), gave up server-side, and moved Tenero to browser-side with localStorage caching (92b37cd). Investigating it now revealed:
- The original failure was never observed properly. lib/competition/volume.ts defaulted its logger to createConsoleLogger, which writes to console.warn, not the worker-logs RPC binding. Confirmed by querying logs.aibtc.com for the deploy window (2026-05-11 09:08–09:19 UTC): zero tenero_* / competition.volume.* events landed, only cache.hit/cache.miss/payment.required. Every other API route in the repo uses the isLogsRPC(env.LOGS) ? createLogger(env.LOGS, ctx) : createConsoleLogger(...) switch; volume.ts skipped it. So failures fired into the wrong sink and the team's wrangler tail couldn't see the preview-URL traffic.
- Tenero's web-ui-ip rate limiter is a plausible cause of the silent nulls. Live probe shows: x-ratelimit-type: web-ui-ip, 100/min, 50k/month, server: cloudflare. From a CF Worker, the source IP is a shared CF-datacenter egress IP, so the 100/min budget is collective and easy to hit, and without the logger correctly wired we'd never have known.
- The Hiro and Tenero use cases are not symmetric. Hiro is open-ended, request-driven, and needs per-call retry + observability on the request path. Tenero is a closed-set push problem (3–30 tokens, ~5-min freshness, identical for every user), a wrong fit for the per-request wrapper pattern.
Putting periodic work on the user request path was the structural mistake. We want it precomputed by a scheduler, with prices/cursors in D1/KV.
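The sink-selection switch that volume.ts skipped can be sketched roughly as below. The real helpers live in lib/logging.ts and their signatures are not shown in this issue, so the types and function bodies here are hypothetical stand-ins that only reuse the names (in particular, createLogger here takes an app label rather than ctx).

```typescript
// Hypothetical stand-ins for the helpers in lib/logging.ts; the real signatures may differ.
interface Logger {
  warn(event: string, data?: Record<string, unknown>): void;
}

// Assumed shape of the worker-logs RPC binding (illustration only).
interface LogsRPC {
  warn(app: string, event: string, data?: Record<string, unknown>): void;
}

// Type guard: is the LOGS binding present and RPC-shaped?
function isLogsRPC(binding: unknown): binding is LogsRPC {
  return (
    typeof binding === "object" &&
    binding !== null &&
    typeof (binding as { warn?: unknown }).warn === "function"
  );
}

function createLogger(logs: LogsRPC, app: string): Logger {
  return { warn: (event, data) => logs.warn(app, event, data) };
}

function createConsoleLogger(app: string): Logger {
  return { warn: (event, data) => console.warn(`[${app}] ${event}`, data ?? {}) };
}

// Every call site goes through this one switch, so events reach worker-logs
// whenever the binding exists and fall back to the console only when it does not.
function selectLogger(env: { LOGS?: unknown }, app: string): Logger {
  return isLogsRPC(env.LOGS) ? createLogger(env.LOGS, app) : createConsoleLogger(app);
}
```

Routing volume.ts through a single selectLogger-style helper, instead of a default argument, would make it harder for a new module to end up on the console sink by accident.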
Decision
Build a single Durable Object with an alarm inside the landing-page Worker that coordinates all periodic background work. Two initial consumers:
- Tenero price refresh: every ~5 min, refresh USD prices for the active token set (static base + dynamic from swaps.token_in WHERE source='agent'), write tenero:price:{tokenId} to KV with {price, fetchedAt, rateLimit}.
- Competition Hiro sweep: every ~15 min, run the existing runCompetitionCron logic against D1's competition_state cursor. Replaces the current POST /api/competition/cron + X-Cron-Secret external-scheduler dance.
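A minimal sketch of the KV record layout for the price refresh, assuming JSON-encoded string values. The key format and the {price, fetchedAt, rateLimit} fields come from the text above; KVLike, RateLimitSnapshot, writeTokenPrice, and the inner shape of rateLimit are assumptions made for illustration. The reader uses the getCachedTokenPrice signature proposed for consumers.

```typescript
// Minimal subset of the Workers KV API used here (string get/put).
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// Assumed shape; the issue only says a rateLimit field is stored.
interface RateLimitSnapshot {
  minuteRemaining?: number;
  monthRemaining?: number;
}

interface PriceRecord {
  price: number;
  fetchedAt: number; // epoch milliseconds
  rateLimit?: RateLimitSnapshot;
}

const priceKey = (tokenId: string): string => `tenero:price:${tokenId}`;

// Called by the scheduler after a successful Tenero fetch.
async function writeTokenPrice(kv: KVLike, tokenId: string, record: PriceRecord): Promise<void> {
  await kv.put(priceKey(tokenId), JSON.stringify(record));
}

// Called by consumers; any missing or malformed entry is treated as a cache miss.
async function getCachedTokenPrice(
  kv: KVLike,
  tokenId: string,
): Promise<{ price: number; fetchedAt: number } | null> {
  const raw = await kv.get(priceKey(tokenId));
  if (raw === null) return null;
  try {
    const parsed = JSON.parse(raw) as Partial<PriceRecord>;
    if (typeof parsed.price !== "number" || typeof parsed.fetchedAt !== "number") return null;
    return { price: parsed.price, fetchedAt: parsed.fetchedAt };
  } catch {
    return null;
  }
}
```

Keeping fetchedAt in the record lets a consumer decide for itself how stale is too stale, rather than relying on KV expiry alone.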
Why DO+alarm (not Cron Triggers, not a sibling Worker, not the current pattern)
We considered four options. The constraints that actually drove the decision:
| Constraint | What it rules out |
| --- | --- |
| Hiro budget must stay pooled (one key, one backoff state in stacks-api-fetch.ts) | Sibling scheduler Worker (fragments the budget) |
| Competition is scoped to landing-page; don't fragment the product surface | Sibling scheduler Worker (two repos / deploys / log namespaces) |
| OpenNext + scheduled() compatibility unverified, potentially fiddly | Cron Triggers in the OpenNext Worker (we'd be debugging the build) |
| Phase 3 plan adds chainhook push as a third write path into swaps; single-writer guarantee on the sweep eliminates one race surface (idempotency at the data layer still handles user-vs-chainhook) | The current "POST endpoint poked externally" pattern (no single-writer guarantee) |
| Adaptive cadence on rate-limit signals (Hiro 429, Tenero quota low → slow down) is genuinely useful | Cron Triggers (declarative schedule, no easy way to back off) |
| Want a clean RPC surface for status / manual trigger / pause from elsewhere in the codebase | The current secret-authed HTTPS endpoint dance |
The bullets that made it stop being "DO is overkill":
- Single-writer guarantee becomes load-bearing once chainhook lands
- OpenNext + DO is orthogonal to the build (DOs are wrangler-config, not entry-point code) — sidesteps the cron-trigger compatibility unknown
- RPC surface replaces both the GET self-doc and POST manual-invoke routes cleanly
Naming + scope
- DO class: SchedulerDO
- Module path: lib/scheduler/scheduler-do.ts
- Wrangler binding name: SCHEDULER
- Singleton instance name: "v1", i.e. env.SCHEDULER.idFromName("v1"). Versioned so a future migration can move to "v2" without disturbing the v1 instance.
One DO instance for the whole subsystem to start. If a task ever needs genuinely independent state or its alarm cadence diverges hard from the others, split then — easier than merging back.
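Resolving the singleton from a route could look like the sketch below. idFromName and get are the standard Durable Object namespace calls; the structural interfaces and the getScheduler helper are hypothetical and exist only to keep the example self-contained.

```typescript
// Minimal structural shapes of the Durable Object namespace API used below.
interface DurableObjectIdLike {
  toString(): string;
}
interface SchedulerStub {
  status(): Promise<unknown>;
}
interface SchedulerNamespace {
  idFromName(name: string): DurableObjectIdLike;
  get(id: DurableObjectIdLike): SchedulerStub;
}

// Single source of truth for the instance name, so a later move to "v2" is a one-line change.
const SCHEDULER_INSTANCE = "v1";

// Hypothetical helper: every caller resolves the same named instance.
function getScheduler(env: { SCHEDULER: SchedulerNamespace }): SchedulerStub {
  const id = env.SCHEDULER.idFromName(SCHEDULER_INSTANCE);
  return env.SCHEDULER.get(id);
}
```

Because idFromName is deterministic, any route or module that calls getScheduler(env) reaches the same instance, which is what gives the sweep its single-writer property.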
DO surface
Storage:
- lastTeneroRunAt: number
- lastSweepRunAt: number
- lastTeneroResult: { succeeded: number, failed: number, rateLimitRemaining?: number }
- lastSweepResult: { scanned, found, inserted, alreadyKnown, pending, rejected, cursor } (mirrors today's cron summary shape)
- consecutiveFailures: { tenero: number, sweep: number } (see alarm-failure section)
- Long-lived cursors stay in D1 (competition_state), not DO storage, to keep the data-layer authority where it already is.
RPC methods:
- status() → snapshot of the above
- refreshNow(task: "tenero" | "sweep" | "all") → fire the named task now, return result
- pauseUntil(timestamp) → ops kill switch
- resume() → undo pause
alarm() handler:
- Determine which tasks are due (5-min cadence for Tenero, 15-min for sweep; use modulo on the alarm tick)
- Run each due task in its own try/catch
- Log structured events through env.LOGS (the DO gets its own context object, a DurableObjectState, rather than a request's ExecutionContext)
- Persist run results to storage
- Re-arm the next alarm
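The "modulo on the alarm tick" step could be sketched as follows, assuming the alarm fires on a fixed 5-minute base interval and the DO keeps a monotonically increasing tick counter. That counter is an assumption (it is not in the storage list above), and dueTasks / nextAlarmAt are hypothetical names.

```typescript
type TaskName = "tenero" | "sweep";

const TICK_MS = 5 * 60 * 1000; // base alarm interval: 5 minutes

// How many base ticks between runs: Tenero every tick (5 min), sweep every third tick (15 min).
const EVERY_N_TICKS: Record<TaskName, number> = { tenero: 1, sweep: 3 };

// Returns the tasks that should run on this tick; nothing is due while paused.
function dueTasks(tick: number, pausedUntil: number | undefined, now: number): TaskName[] {
  if (pausedUntil !== undefined && now < pausedUntil) return [];
  const names = Object.keys(EVERY_N_TICKS) as TaskName[];
  return names.filter((name) => tick % EVERY_N_TICKS[name] === 0);
}

// Timestamp to pass to the alarm re-arm step.
function nextAlarmAt(now: number): number {
  return now + TICK_MS;
}
```

An alternative that needs no extra counter is to compare now against lastTeneroRunAt and lastSweepRunAt, which are already in storage; the modulo form is shown because it is what the list above names.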
Alarm-failure policy — options + recommendation
CF runtime behavior baseline: when alarm() throws, the DO runtime auto-retries the alarm up to ~6 times with exponential backoff before dropping it. That retry budget applies to the whole alarm() invocation, not per-task. Whether you want to lean on this depends on which failure modes you treat as transient.
Option A: Bare runtime retry. Let alarm() throw on any failure. CF retries up to 6× automatically; if all retries fail, the alarm is dropped and the next scheduled tick recovers. That recovery assumes a next alarm is actually armed: alarms are one-shot, so if every attempt throws before the re-arm step, nothing reschedules the next tick automatically. Pros: zero code. Cons: one failing task aborts the others (Tenero 429 also kills the Hiro sweep that tick); no structured observability unless you try/catch + log + rethrow; logs from retried attempts pile up.
Option B: Per-task isolation + structured logging, transport-only throws. Each task runs inside its own try/catch inside alarm(). Task-level failures (Tenero 429, Hiro 5xx, parse error) are logged via env.LOGS and the task returns normally; the next scheduled tick handles recovery. The alarm() body only throws on transport-level failures (DO storage write fails, env binding missing, etc.) where runtime retry is appropriate. Pros: one task's failure doesn't take down the others; observable; predictable cadence. Cons: requires the discipline to classify errors correctly.
Option C: Adaptive backoff on known rate-limit signals. Layer on top of B. When a task sees Hiro 429 or Tenero x-ratelimit-minute-remaining: 0, write a nextRunAfter timestamp to DO storage; subsequent alarm() ticks skip that task until the timestamp passes. This is independent from runtime retry: it is "successful execution observed a signal that says slow down." Pros: respects upstream; reduces our chance of getting fully blocked. Cons: a little more code; need to make sure backoff state can't get stuck.
Option D: Explicit error budget + circuit breaker. Track consecutiveFailures[task] in DO storage. After N consecutive failures (e.g., 6), pause the task and emit a critical log; require an explicit resume() RPC call to restart. Pros: hard guarantee against silent failure for hours. Cons: needs ops attention; adds state to maintain; mostly redundant with good observability.
Recommended starting policy: B + C.
- B is the floor. Wrap each task, log structured failures, only throw for transport-level issues. This is the minimum to avoid the "Tenero broke the Hiro sweep" failure mode and to actually see what's happening in logs.aibtc.com.
- C is cheap and matches what we already know about both upstreams: Hiro's stacks-api-fetch.ts already surfaces 429 and monthly-quota warnings, and Tenero's headers expose x-ratelimit-minute-remaining / x-ratelimit-month-remaining. Honoring those signals at the scheduler layer means we don't fight Hiro/Tenero's own throttling.
- A is what you fall back to if B's transport-error path is reached.
- D is deferred. Track consecutiveFailures in storage from day one (it's three lines), but don't wire the circuit-breaker behavior until observability shows we need it. If a task fails for hours in production we'll see it in worker-logs before users notice; that's the bar for adding D.
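One way the recommended B + C combination could fit together is sketched below, under some assumptions: tasks signal a rate limit by throwing an error that carries a retryAfterMs field (a hypothetical convention), state is passed in as a plain object that the caller persists to DO storage, and the one-hour cap is an arbitrary illustrative value. The names SchedulerState, runTaskIsolated, and the event name format are also hypothetical.

```typescript
type TaskName = "tenero" | "sweep";

interface SchedulerState {
  consecutiveFailures: Record<TaskName, number>;
  nextRunAfter: Partial<Record<TaskName, number>>; // epoch ms; absent means no backoff
}

interface Logger {
  warn(event: string, data: Record<string, unknown>): void;
}

const MAX_BACKOFF_MS = 60 * 60 * 1000; // cap so backoff state cannot get stuck

// Hypothetical convention: a rate-limited task throws an error carrying retryAfterMs.
function retryAfterMs(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "retryAfterMs" in err) {
    const value = (err as { retryAfterMs: unknown }).retryAfterMs;
    return typeof value === "number" ? value : undefined;
  }
  return undefined;
}

// Option B: task failures are caught and logged, never rethrown.
// Option C: a rate-limit signal records nextRunAfter so later ticks skip the task.
async function runTaskIsolated(
  name: TaskName,
  task: () => Promise<void>,
  state: SchedulerState,
  logger: Logger,
  now: number,
): Promise<"ran" | "skipped" | "failed"> {
  const notBefore = state.nextRunAfter[name];
  if (notBefore !== undefined && now < notBefore) return "skipped";
  try {
    await task();
    state.consecutiveFailures[name] = 0;
    delete state.nextRunAfter[name];
    return "ran";
  } catch (err) {
    state.consecutiveFailures[name] += 1;
    const backoff = retryAfterMs(err);
    if (backoff !== undefined) {
      state.nextRunAfter[name] = now + Math.min(backoff, MAX_BACKOFF_MS);
    }
    logger.warn(`scheduler.${name}.failed`, {
      message: err instanceof Error ? err.message : String(err),
      consecutiveFailures: state.consecutiveFailures[name],
    });
    return "failed";
  }
}
```

In this sketch the only errors that would escape alarm() are those raised outside runTaskIsolated, such as a failed storage write when persisting the state, which matches the transport-only throw rule in option B.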
Tenero client (separate but lands with this)
The DO needs a typed Tenero fetch wrapper to call. Modeled on lib/stacks-api-fetch.ts (Hiro):
- lib/external/tenero-fetch.ts: typed wrapper with retry, per-attempt timeout, cf-ray + status logged on failure, x-ratelimit-* headers parsed and surfaced
- lib/external/tenero/prices.ts: fetchTokenPriceUsd(assetId, logger) on top of the wrapper
- KV reader for consumers: getCachedTokenPrice(kv, tokenId): { price, fetchedAt } | null
- Optional TENERO_API_KEY env var support (header name TBD; x-api-key is a guess, so confirm with Tenero docs before wiring)
LeaderboardClient.tsx then moves from browser-side Tenero fetch to a thin GET /api/prices?tokens=... route that reads KV. Browser-side localStorage cache can stay as a second-tier cache or come out entirely — that's a small follow-up either way.
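A rough sketch of what the wrapper in lib/external/tenero-fetch.ts might look like, not the final implementation. The header names are the ones observed in the live probe; the teneroFetch signature, the result shape, the default retry count and timeout, and the choice to retry only 5xx and thrown errors are assumptions. The fetch implementation is injected so the sketch stays independent of a particular runtime, and logging is left to the caller, which receives cfRay and status in the result.

```typescript
// Minimal structural types so the sketch does not depend on a runtime's fetch typings.
interface HeadersLike {
  get(name: string): string | null;
}
interface ResponseLike {
  ok: boolean;
  status: number;
  headers: HeadersLike;
  json(): Promise<unknown>;
}
type FetchLike = (url: string, init: { signal: AbortSignal }) => Promise<ResponseLike>;

interface RateLimitInfo {
  type: string | null;
  minuteRemaining: number | null;
  monthRemaining: number | null;
}

interface TeneroResult {
  ok: boolean;
  status: number;
  cfRay: string | null; // for correlating failures with Cloudflare-side logs
  rateLimit: RateLimitInfo;
  body: unknown; // null unless ok
  attempts: number;
}

// Parse the x-ratelimit-* headers; missing or non-numeric values become null.
function parseRateLimit(headers: HeadersLike): RateLimitInfo {
  const num = (name: string): number | null => {
    const raw = headers.get(name);
    if (raw === null || raw.trim() === "") return null;
    const n = Number(raw);
    return Number.isFinite(n) ? n : null;
  };
  return {
    type: headers.get("x-ratelimit-type"),
    minuteRemaining: num("x-ratelimit-minute-remaining"),
    monthRemaining: num("x-ratelimit-month-remaining"),
  };
}

async function teneroFetch(
  url: string,
  fetchImpl: FetchLike,
  opts: { retries: number; timeoutMs: number } = { retries: 2, timeoutMs: 5000 },
): Promise<TeneroResult> {
  for (let attempt = 1; ; attempt++) {
    // Per-attempt timeout: each try gets its own abort controller and timer.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), opts.timeoutMs);
    try {
      const res = await fetchImpl(url, { signal: controller.signal });
      const meta = {
        status: res.status,
        cfRay: res.headers.get("cf-ray"),
        rateLimit: parseRateLimit(res.headers),
        attempts: attempt,
      };
      if (res.ok) return { ok: true, body: await res.json(), ...meta };
      // 4xx (including 429) is returned immediately so the caller can back off;
      // only 5xx responses are retried.
      const retryable = res.status >= 500 && attempt <= opts.retries;
      if (!retryable) return { ok: false, body: null, ...meta };
    } catch (err) {
      // Network error or timeout: retry until the budget is spent, then rethrow.
      if (attempt > opts.retries) throw err;
    } finally {
      clearTimeout(timer);
    }
  }
}
```

Returning a 429 to the caller instead of retrying it inside the wrapper keeps the slow-down decision in the scheduler, where the backoff state lives. A production version would probably also wait between retries, which is omitted here for brevity.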
Implementation outline
- lib/external/tenero-fetch.ts + prices.ts + KV reader: wrapper modeled on stacks-api-fetch.ts. Lands first so the DO can call into stable code.
- SchedulerDO class + wrangler binding: bare skeleton with status() and refreshNow() RPC, no tasks yet. Verify it deploys and the RPC stub works from a route.
- Tenero refresh task: wire as the first consumer, 5-min cadence. Confirm KV writes and log events show up in logs.aibtc.com.
- Migrate competition cron logic: move the runCompetitionCron invocation from POST /api/competition/cron to the DO's sweep task. Keep the HTTP route as a thin RPC pass-through (or remove if nothing external calls it; check before deleting).
- LeaderboardClient → KV-backed /api/prices: drop the direct Tenero fetch from the browser.
- Wire pause/resume RPC + an admin route if/when we want manual ops control.
References
- Investigation transcript context (this issue): observability root cause + Tenero rate-limit findings
- lib/stacks-api-fetch.ts: Hiro wrapper pattern to mirror for Tenero
- lib/logging.ts: isLogsRPC(env.LOGS) ? createLogger : createConsoleLogger pattern; the bug we're avoiding
- app/api/competition/cron/route.ts + lib/competition/cron.ts + lib/competition/state.ts: current sweep implementation to migrate
- Reverted commit f327554: Tenero server-side abandonment that this work undoes properly
- Reverted/replaced commit 92b37cd: browser-side Tenero that this work replaces
Open questions for implementer
- Tenero API key — does the AIBTC org have one? If so, confirm the auth header name and update the wrapper. If not, file separately; unauthenticated 100/min/IP is fine for v1 but worth tracking.
- Cron expression for the sweep task isn't strictly needed inside the DO (alarm computes cadence) but document the chosen cadence (15 min) and the rationale so future maintainers don't have to dig.
- Decide whether to keep POST /api/competition/cron as a public-with-secret manual-trigger surface, or remove it once the RPC path lands. If anything external still pokes it, keep it as a thin pass-through.
cc @biwasxyz @secret-mars — taking it from here, this captures the full context. Happy to pair / answer questions on any piece of this.