feat(scheduler): SchedulerDO + alarm for periodic landing-page jobs (Tenero prices + competition Hiro sweep) #768

@whoabuddy

Background

We hit a Tenero failure while enriching /agents SSR with USD volume (commits around 962cc1d and f327554), gave up on the server-side approach, and moved Tenero to the browser with localStorage caching (92b37cd). A follow-up investigation found:

  1. The original failure was never observed properly. lib/competition/volume.ts defaulted its logger to createConsoleLogger, which writes to console.warn — not the worker-logs RPC binding. Confirmed by querying logs.aibtc.com for the deploy window (2026-05-11 09:08–09:19 UTC): zero tenero_* / competition.volume.* events landed, only cache.hit/cache.miss/payment.required. Every other API route in the repo uses the isLogsRPC(env.LOGS) ? createLogger(env.LOGS, ctx) : createConsoleLogger(...) switch — volume.ts skipped it. So failures fired into the wrong sink and the team's wrangler tail couldn't see the preview-URL traffic.

  2. Tenero's web-ui-ip rate limiter is a plausible cause of the silent nulls. Live probe shows: x-ratelimit-type: web-ui-ip, 100/min, 50k/month, server: cloudflare. From a CF Worker, source IP is a shared CF-datacenter egress IP — the 100/min budget is collective, easy to hit, and without the logger correctly wired we'd never have known.

  3. The Hiro and Tenero use cases are not symmetric. Hiro is open-ended, request-driven, needs per-call retry + observability on the request path. Tenero is a closed-set push problem (3–30 tokens, ~5-min freshness, identical for every user) — a wrong fit for the per-request wrapper pattern.

Putting periodic work on the user request path was the structural mistake. We want it precomputed by a scheduler, with prices/cursors in D1/KV.
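
For point 1, a minimal sketch of choosing the log sink once at the call site that owns env, then passing the logger down instead of letting a library default to console. The types and the names looksLikeLogsRPC and pickLogger are hypothetical stand-ins; the real isLogsRPC / createLogger / createConsoleLogger in lib/logging.ts may look different.

```typescript
// Stand-in shapes: the real Logger and LOGS binding types live in lib/logging.ts
// and may differ. All names below are hypothetical.
interface Logger {
  warn(event: string, data?: Record<string, unknown>): void;
}

interface LogsBindingLike {
  warn(event: string, data?: Record<string, unknown>): void;
}

// Hypothetical guard standing in for isLogsRPC.
function looksLikeLogsRPC(binding: unknown): binding is LogsBindingLike {
  return (
    typeof binding === "object" &&
    binding !== null &&
    typeof (binding as { warn?: unknown }).warn === "function"
  );
}

// Use the RPC sink when the binding is present, otherwise fall back to console.
function pickLogger(
  logs: unknown,
  scope: string,
): { sink: "rpc" | "console"; logger: Logger } {
  if (looksLikeLogsRPC(logs)) {
    const rpc = logs; // const copy keeps the narrowed type inside the closure
    return {
      sink: "rpc",
      logger: { warn: (event, data) => rpc.warn(`${scope}.${event}`, data) },
    };
  }
  return {
    sink: "console",
    logger: { warn: (event, data) => console.warn(`${scope}.${event}`, data) },
  };
}
```

Making the logger a required argument of the volume code (no default) would turn a forgotten switch into a type error rather than a silent console.warn.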

Decision

Build a single Durable Object with an alarm inside the landing-page Worker that coordinates all periodic background work. Two initial consumers:

  • Tenero price refresh — every ~5 min, refresh USD prices for the active token set (static base + dynamic from swaps.token_in WHERE source='agent'), write tenero:price:{tokenId} to KV with {price, fetchedAt, rateLimit}.
  • Competition Hiro sweep — every ~15 min, run the existing runCompetitionCron logic against D1's competition_state cursor. Replaces the current POST /api/competition/cron + X-Cron-Secret external-scheduler dance.
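
A rough sketch of the KV side of the Tenero refresh: the key format and the {price, fetchedAt, rateLimit} value come from the bullet above, while the inner shape of rateLimit and the helper names (priceKey, activeTokenSet, toKvEntry) are assumptions for illustration.

```typescript
// Assumed shape of the rateLimit field; the issue only names it.
interface RateLimitSnapshot {
  minuteRemaining?: number;
  monthRemaining?: number;
}

interface TeneroPriceRecord {
  price: number;
  fetchedAt: number; // epoch milliseconds
  rateLimit: RateLimitSnapshot | undefined;
}

function priceKey(tokenId: string): string {
  return `tenero:price:${tokenId}`;
}

// Static base plus token ids seen in agent swaps, de-duplicated, order preserved.
function activeTokenSet(staticBase: string[], fromSwaps: string[]): string[] {
  return Array.from(new Set(staticBase.concat(fromSwaps)));
}

// Build the key/value pair a refresh would hand to KV put().
function toKvEntry(
  tokenId: string,
  price: number,
  now: number,
  rateLimit?: RateLimitSnapshot,
): { key: string; value: string } {
  const record: TeneroPriceRecord = { price, fetchedAt: now, rateLimit };
  return { key: priceKey(tokenId), value: JSON.stringify(record) };
}
```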

Why DO+alarm (not Cron Triggers, not a sibling Worker, not the current pattern)

We considered four options. The constraints that actually drove the decision:

| Constraint | What it rules out |
| --- | --- |
| Hiro budget must stay pooled (one key, one backoff state in stacks-api-fetch.ts) | Sibling scheduler Worker (fragments the budget) |
| Competition is scoped to landing-page; don't fragment the product surface | Sibling scheduler Worker (two repos / deploys / log namespaces) |
| OpenNext + scheduled() compatibility unverified, potentially fiddly | Cron Triggers in the OpenNext Worker (we'd be debugging the build) |
| Phase 3 plan adds chainhook push as a third write path into swaps; a single-writer guarantee on the sweep eliminates one race surface (idempotency at the data layer still handles user-vs-chainhook) | The current "POST endpoint poked externally" pattern (no single-writer guarantee) |
| Adaptive cadence on rate-limit signals (Hiro 429, Tenero quota low → slow down) is genuinely useful | Cron Triggers (declarative schedule, no easy way to back off) |
| Want a clean RPC surface for status / manual trigger / pause from elsewhere in the codebase | The current secret-authed HTTPS endpoint dance |

The bullets that made it stop being "DO is overkill":

  • Single-writer guarantee becomes load-bearing once chainhook lands
  • OpenNext + DO is orthogonal to the build (DOs are wrangler-config, not entry-point code) — sidesteps the cron-trigger compatibility unknown
  • RPC surface replaces both the GET self-doc and POST manual-invoke routes cleanly

Naming + scope

  • DO class: SchedulerDO
  • Module path: lib/scheduler/scheduler-do.ts
  • Wrangler binding name: SCHEDULER
  • Singleton instance name: "v1", obtained via env.SCHEDULER.idFromName("v1"). Versioned so a future migration can move to "v2" without disturbing the v1 instance.

One DO instance for the whole subsystem to start. If a task ever needs genuinely independent state or its alarm cadence diverges hard from the others, split then — easier than merging back.
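
A small sketch of resolving the singleton from a route. idFromName and get are the standard Durable Object namespace calls; the structural interfaces and the getScheduler helper are hypothetical, defined here only so the snippet type-checks without the Workers type packages.

```typescript
// Minimal structural stand-ins for the Workers runtime types.
interface DurableObjectIdLike {
  toString(): string;
}

interface NamespaceLike<Stub> {
  idFromName(name: string): DurableObjectIdLike;
  get(id: DurableObjectIdLike): Stub;
}

// One constant, so a later move to "v2" is a single-line change.
const SCHEDULER_INSTANCE = "v1";

// Hypothetical helper: every caller goes through here, never idFromName directly.
function getScheduler<Stub>(ns: NamespaceLike<Stub>): Stub {
  return ns.get(ns.idFromName(SCHEDULER_INSTANCE));
}
```

In a route this would be called as getScheduler(env.SCHEDULER), assuming the binding name above.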

DO surface

Storage:

  • lastTeneroRunAt: number
  • lastSweepRunAt: number
  • lastTeneroResult: { succeeded: number, failed: number, rateLimitRemaining?: number }
  • lastSweepResult: { scanned, found, inserted, alreadyKnown, pending, rejected, cursor } — mirrors today's cron summary shape
  • consecutiveFailures: { tenero: number, sweep: number } (see alarm-failure section)
  • Long-lived cursors stay in D1 (competition_state), not DO storage — keep the data-layer authority where it already is.

RPC methods:

  • status() → snapshot of the above
  • refreshNow(task: "tenero" | "sweep" | "all") → fire the named task now, return result
  • pauseUntil(timestamp) → ops kill switch
  • resume() → undo pause
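
The storage and RPC lists above could be typed roughly as follows. Field names come from this issue; the field types inside SweepResult, the cursor type, and the pausedUntil field are guesses, and expandTasks / isPaused are hypothetical helpers.

```typescript
type TaskName = "tenero" | "sweep";

interface TeneroResult {
  succeeded: number;
  failed: number;
  rateLimitRemaining?: number;
}

// Field types assumed; the real cron summary shape may differ.
interface SweepResult {
  scanned: number;
  found: number;
  inserted: number;
  alreadyKnown: number;
  pending: number;
  rejected: number;
  cursor: string | null;
}

interface SchedulerSnapshot {
  lastTeneroRunAt?: number;
  lastSweepRunAt?: number;
  lastTeneroResult?: TeneroResult;
  lastSweepResult?: SweepResult;
  consecutiveFailures: Record<TaskName, number>;
  pausedUntil?: number; // assumption: where pauseUntil() would persist its timestamp
}

interface SchedulerRpc {
  status(): Promise<SchedulerSnapshot>;
  refreshNow(task: TaskName | "all"): Promise<unknown>;
  pauseUntil(timestamp: number): Promise<void>;
  resume(): Promise<void>;
}

// "all" fans out to every known task.
function expandTasks(task: TaskName | "all"): TaskName[] {
  return task === "all" ? ["tenero", "sweep"] : [task];
}

// The kill switch is active only while the pause timestamp lies in the future.
function isPaused(snapshot: SchedulerSnapshot, now: number): boolean {
  return snapshot.pausedUntil !== undefined && now < snapshot.pausedUntil;
}
```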

alarm() handler:

  • Determine which tasks are due: 5-min cadence for Tenero, 15-min for sweep. With a 5-min base tick, use modulo on the tick count, so Tenero runs every tick and the sweep every third.
  • Run each due task in its own try/catch
  • Log structured events through env.LOGS (the DO has its own ctx, a DurableObjectState rather than a Worker ExecutionContext; check that createLogger accepts it)
  • Persist run results to storage
  • Re-arm next alarm
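
The due-task step can be sketched as pure functions. This assumes the tick is derived from wall-clock time instead of a stored counter, which is one possible reading of "modulo on alarm tick"; TICK_MS, tickOf, dueTasks and nextAlarmAt are illustrative names.

```typescript
type TaskName = "tenero" | "sweep";

const TICK_MS = 5 * 60 * 1000; // base alarm cadence: 5 minutes

// Base ticks between runs: Tenero every tick, sweep every third tick (15 min).
const EVERY_N_TICKS: Record<TaskName, number> = { tenero: 1, sweep: 3 };

// Assumption: the tick index comes from the clock, so no counter has to be stored.
function tickOf(now: number): number {
  return Math.floor(now / TICK_MS);
}

function dueTasks(tick: number): TaskName[] {
  const names: TaskName[] = ["tenero", "sweep"];
  return names.filter((t) => tick % EVERY_N_TICKS[t] === 0);
}

// The next alarm lands on the following tick boundary.
function nextAlarmAt(now: number): number {
  return (tickOf(now) + 1) * TICK_MS;
}
```

One trade-off of the modulo approach: if an alarm fires a full tick late, a sweep tick can be skipped. Comparing now against lastSweepRunAt is an alternative that tolerates late alarms.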

Alarm-failure policy — options + recommendation

CF runtime behavior baseline: when alarm() throws, the DO runtime auto-retries the alarm up to ~6 times with exponential backoff before dropping it. That retry budget applies to the whole alarm() invocation, not per-task. Whether you want to lean on this depends on which failure modes you treat as transient.

Option A — Bare runtime retry. Let alarm() throw on any failure. CF retries up to 6× automatically; if all fail the alarm is dropped and the next scheduled tick recovers. Pros: zero code. Cons: one failing task aborts the others (Tenero 429 also kills the Hiro sweep that tick); no structured observability unless you try/catch + log + rethrow; logs from retried attempts pile up.

Option B — Per-task isolation + structured logging, transport-only throws. Each task runs inside its own try/catch inside alarm(). Task-level failures (Tenero 429, Hiro 5xx, parse error) are logged via env.LOGS and the task returns normally — the next scheduled tick handles recovery. The alarm() body only throws on transport-level failures (DO storage write fails, env binding missing, etc.) where runtime retry is appropriate. Pros: one task's failure doesn't take down the others; observable; predictable cadence. Cons: requires the discipline to classify errors correctly.
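
A minimal sketch of Option B's isolation wrapper. It assumes transport-level failures are tagged explicitly by the code that detects them; transportError, isTransportError and runIsolated are hypothetical names, and the log callback stands in for the env.LOGS logger.

```typescript
type TaskOutcome =
  | { task: string; ok: true }
  | { task: string; ok: false; error: string };

// Hypothetical marker for failures the runtime should retry
// (DO storage write failed, env binding missing).
function transportError(message: string): Error {
  return Object.assign(new Error(message), { transport: true });
}

function isTransportError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { transport?: unknown }).transport === true
  );
}

async function runIsolated(
  task: string,
  fn: () => Promise<void>,
  log: (event: string, data: Record<string, unknown>) => void,
): Promise<TaskOutcome> {
  try {
    await fn();
    return { task, ok: true };
  } catch (err) {
    // Transport failures propagate so alarm() throws and the runtime retries.
    if (isTransportError(err)) throw err;
    // Task-level failures are logged and swallowed; the next tick retries.
    const error = err instanceof Error ? err.message : String(err);
    log(`scheduler.${task}.failed`, { error });
    return { task, ok: false, error };
  }
}
```

Inside alarm() each due task would be awaited through runIsolated in turn, so a Tenero 429 is recorded without stopping the Hiro sweep.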

Option C — Adaptive backoff on known rate-limit signals. Layer on top of B. When a task sees Hiro 429 or Tenero x-ratelimit-minute-remaining: 0, write a nextRunAfter timestamp to DO storage; subsequent alarm() ticks skip that task until the timestamp passes. Independent from runtime retry — this is "successful execution observed a signal that says slow down." Pros: respects upstream; reduces our chance of getting fully blocked. Cons: a little more code; need to make sure backoff state can't get stuck.
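
Option C's bookkeeping could look like the sketch below. The cap exists so backoff state cannot get stuck far in the future. The constants, the RateSignal shape, and the use of a Retry-After value are assumptions; whether Hiro or Tenero actually send Retry-After is not confirmed here.

```typescript
interface RateSignal {
  status: number;
  minuteRemaining?: number; // from x-ratelimit-minute-remaining, when present
  retryAfterSec?: number; // assumption: parsed from a Retry-After header if one is sent
}

const DEFAULT_BACKOFF_MS = 60 * 1000; // assumed default: one minute
const MAX_BACKOFF_MS = 30 * 60 * 1000; // assumed cap so state cannot get stuck

// Returns the timestamp before which the task should be skipped,
// or undefined when the response carried no slow-down signal.
function nextRunAfter(signal: RateSignal, now: number): number | undefined {
  const limited = signal.status === 429 || signal.minuteRemaining === 0;
  if (!limited) return undefined;
  const wanted =
    signal.retryAfterSec !== undefined
      ? signal.retryAfterSec * 1000
      : DEFAULT_BACKOFF_MS;
  return now + Math.min(Math.max(wanted, 0), MAX_BACKOFF_MS);
}

// alarm() would call this per task before running it.
function shouldSkip(nextRunAfterTs: number | undefined, now: number): boolean {
  return nextRunAfterTs !== undefined && now < nextRunAfterTs;
}
```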

Option D — Explicit error budget + circuit breaker. Track consecutiveFailures[task] in DO storage. After N consecutive failures (e.g., 6), pause the task and emit a critical log; require explicit resume() RPC call to restart. Pros: hard guarantee against silent failure for hours. Cons: needs ops attention; adds state to maintain; mostly redundant with good observability.
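
For Option D, the counter and trip logic fit in a pure update function. The tripped flag and the helper names are assumptions; only consecutiveFailures appears in the storage list, and the threshold of 6 is the example figure from the option text.

```typescript
type TaskName = "tenero" | "sweep";

interface BreakerState {
  consecutiveFailures: Record<TaskName, number>;
  tripped: Record<TaskName, boolean>; // assumption: not in the storage list
}

const FAILURE_THRESHOLD = 6; // the "e.g., 6" from Option D

function initialBreaker(): BreakerState {
  return {
    consecutiveFailures: { tenero: 0, sweep: 0 },
    tripped: { tenero: false, sweep: false },
  };
}

// Success resets the counter; failure increments it and may trip the breaker.
// A tripped breaker stays tripped until it is reset explicitly.
function recordOutcome(state: BreakerState, task: TaskName, ok: boolean): BreakerState {
  const count = ok ? 0 : state.consecutiveFailures[task] + 1;
  return {
    consecutiveFailures: { ...state.consecutiveFailures, [task]: count },
    tripped: {
      ...state.tripped,
      [task]: state.tripped[task] || count >= FAILURE_THRESHOLD,
    },
  };
}

// What an explicit resume() call would do for one task.
function resetBreaker(state: BreakerState, task: TaskName): BreakerState {
  return {
    consecutiveFailures: { ...state.consecutiveFailures, [task]: 0 },
    tripped: { ...state.tripped, [task]: false },
  };
}
```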

Recommended starting policy: B + C.

  • B is the floor. Wrap each task, log structured failures, only throw for transport-level issues. This is the minimum to avoid the "Tenero broke the Hiro sweep" failure mode and to actually see what's happening in logs.aibtc.com.
  • C is cheap and matches what we already know about both upstreams: Hiro's stacks-api-fetch.ts already surfaces 429 and monthly-quota warnings, and Tenero's headers expose x-ratelimit-minute-remaining / x-ratelimit-month-remaining. Honoring those signals at the scheduler layer means we don't fight Hiro/Tenero's own throttling.
  • A is what you fall back to if B's transport-error path is reached.
  • D is deferred. Track consecutiveFailures in storage from day one (it's three lines), but don't wire the circuit-breaker behavior until observability shows we need it. If a task fails for hours in production we'll see it in worker-logs before users notice — that's the bar for adding D.

Tenero client (separate but lands with this)

The DO needs a typed Tenero fetch wrapper to call. Modeled on lib/stacks-api-fetch.ts (Hiro):

  • lib/external/tenero-fetch.ts — typed wrapper with retry, per-attempt timeout, cf-ray + status logged on failure, x-ratelimit-* headers parsed and surfaced
  • lib/external/tenero/prices.ts: fetchTokenPriceUsd(assetId, logger) on top of the wrapper
  • KV reader for consumers: getCachedTokenPrice(kv, tokenId): { price, fetchedAt } | null
  • Optional TENERO_API_KEY env var support (header name TBD — x-api-key is a guess; confirm with Tenero docs before wiring)
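
One piece of the wrapper, sketched: parsing the x-ratelimit-* headers named in this issue into a typed value. HeaderReader mirrors the get() method of the standard Headers class; the TeneroRateLimit shape and function names are hypothetical, and no other Tenero header is assumed to exist.

```typescript
// Structural subset of the standard Headers interface.
interface HeaderReader {
  get(name: string): string | null;
}

interface TeneroRateLimit {
  type: string | undefined; // e.g. "web-ui-ip"
  minuteRemaining: number | undefined;
  monthRemaining: number | undefined;
}

// Missing, empty, or non-numeric header values all become undefined.
function numericHeader(h: HeaderReader, name: string): number | undefined {
  const raw = h.get(name);
  if (raw === null || raw.trim() === "") return undefined;
  const n = Number(raw);
  return Number.isFinite(n) ? n : undefined;
}

function parseTeneroRateLimit(h: HeaderReader): TeneroRateLimit {
  return {
    type: h.get("x-ratelimit-type") ?? undefined,
    minuteRemaining: numericHeader(h, "x-ratelimit-minute-remaining"),
    monthRemaining: numericHeader(h, "x-ratelimit-month-remaining"),
  };
}
```

Because a fetch Response exposes headers with the same get() method, the wrapper could pass response.headers straight in.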

LeaderboardClient.tsx then moves from browser-side Tenero fetch to a thin GET /api/prices?tokens=... route that reads KV. Browser-side localStorage cache can stay as a second-tier cache or come out entirely — that's a small follow-up either way.
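
A sketch of the read path behind GET /api/prices. KvLike matches the string form of KV get(); since KV reads are asynchronous, the reader returns a Promise. The comma-separated format for tokens and the parseTokensParam helper are assumptions.

```typescript
// Structural subset of a KV namespace: get() resolves to the stored string or null.
interface KvLike {
  get(key: string): Promise<string | null>;
}

interface CachedPrice {
  price: number;
  fetchedAt: number;
}

async function getCachedTokenPrice(
  kv: KvLike,
  tokenId: string,
): Promise<CachedPrice | null> {
  const raw = await kv.get(`tenero:price:${tokenId}`);
  if (raw === null) return null;
  try {
    const parsed = JSON.parse(raw) as Partial<CachedPrice>;
    if (typeof parsed.price !== "number" || typeof parsed.fetchedAt !== "number") {
      return null;
    }
    return { price: parsed.price, fetchedAt: parsed.fetchedAt };
  } catch {
    return null; // malformed entry: treat as a cache miss
  }
}

// Assumed format: ?tokens=a,b,c. Trims, drops empties, de-duplicates.
function parseTokensParam(raw: string | null): string[] {
  if (!raw) return [];
  const ids = raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  return Array.from(new Set(ids));
}
```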

Implementation outline

  1. lib/external/tenero-fetch.ts + prices.ts + KV reader — wrapper modeled on stacks-api-fetch.ts. Lands first so the DO can call into stable code.
  2. SchedulerDO class + wrangler binding — bare skeleton with status() and refreshNow() RPC, no tasks yet. Verify it deploys and the RPC stub works from a route.
  3. Tenero refresh task — wire as the first consumer, 5-min cadence. Confirm KV writes and log events show up in logs.aibtc.com.
  4. Migrate competition cron logic — move runCompetitionCron invocation from POST /api/competition/cron to the DO's sweep task. Keep the HTTP route as a thin RPC pass-through (or remove if nothing external calls it; check before deleting).
  5. LeaderboardClient → KV-backed /api/prices — drop the direct Tenero fetch from the browser.
  6. Wire pause/resume RPC + an admin route if/when we want manual ops control.

References

  • Investigation transcript context (this issue) — observability root cause + Tenero rate-limit findings
  • lib/stacks-api-fetch.ts — Hiro wrapper pattern to mirror for Tenero
  • lib/logging.ts: the isLogsRPC(env.LOGS) ? createLogger : createConsoleLogger pattern; skipping it is the bug we're avoiding
  • app/api/competition/cron/route.ts + lib/competition/cron.ts + lib/competition/state.ts — current sweep implementation to migrate
  • Reverted commit f327554 — Tenero server-side abandonment that this work undoes properly
  • Reverted/replaced commit 92b37cd — browser-side Tenero that this work replaces

Open questions for implementer

  • Tenero API key — does the AIBTC org have one? If so, confirm the auth header name and update the wrapper. If not, file separately; unauthenticated 100/min/IP is fine for v1 but worth tracking.
  • Cron expression for the sweep task isn't strictly needed inside the DO (alarm computes cadence) but document the chosen cadence (15 min) and the rationale so future maintainers don't have to dig.
  • Decide whether to keep POST /api/competition/cron as a public-with-secret manual-trigger surface, or remove it once the RPC path lands. If anything external still pokes it, keep it as a thin pass-through.

cc @biwasxyz @secret-mars — taking it from here, this captures the full context. Happy to pair / answer questions on any piece of this.
