feat(scheduler): SchedulerDO + alarm for periodic landing-page jobs (Tenero prices + competition Hiro sweep)

## Background

We hit a Tenero failure while enriching `/agents` SSR with USD volume (PRs around `962cc1d` → `f327554`), gave up server-side, and moved Tenero to browser-side with localStorage caching (`92b37cd`). Investigating it now revealed:

1. **The original failure was never observed properly.** `lib/competition/volume.ts` defaulted its logger to `createConsoleLogger`, which writes to `console.warn` — not the worker-logs RPC binding. Confirmed by querying `logs.aibtc.com` for the deploy window (2026-05-11 09:08–09:19 UTC): zero `tenero_*` / `competition.volume.*` events landed, only `cache.hit`/`cache.miss`/`payment.required`. Every other API route in the repo uses the `isLogsRPC(env.LOGS) ? createLogger(env.LOGS, ctx) : createConsoleLogger(...)` switch — `volume.ts` skipped it. So failures fired into the wrong sink and the team's `wrangler tail` couldn't see the preview-URL traffic.

2. **Tenero's web-ui-ip rate limiter is a plausible cause of the silent nulls.** Live probe shows: `x-ratelimit-type: web-ui-ip`, 100/min, 50k/month, `server: cloudflare`. From a CF Worker, source IP is a shared CF-datacenter egress IP — the 100/min budget is collective, easy to hit, and without the logger correctly wired we'd never have known.

3. **The Hiro and Tenero use cases are not symmetric.** Hiro is open-ended, request-driven, needs per-call retry + observability on the request path. Tenero is a closed-set push problem (3–30 tokens, ~5-min freshness, identical for every user) — a wrong fit for the per-request wrapper pattern.
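
The logger switch from item 1 can be sketched roughly as follows. Only the `isLogsRPC(env.LOGS) ? ... : createConsoleLogger(...)` shape comes from the repo; the `Logger` and `LogsRPC` interfaces, the `createRpcLogger` name, and every signature here are hypothetical stand-ins for whatever `lib/logging.ts` actually exports.

```typescript
// Hypothetical stand-ins: the real helpers live in lib/logging.ts and their
// signatures may differ from what is shown here.
interface Logger {
  warn(event: string, data?: Record<string, unknown>): void;
}
interface LogsRPC {
  warn(app: string, event: string, data?: Record<string, unknown>): Promise<void>;
}

// Runtime check that the binding looks like the worker-logs RPC service.
function isLogsRPC(binding: unknown): binding is LogsRPC {
  return (
    typeof binding === "object" &&
    binding !== null &&
    typeof (binding as { warn?: unknown }).warn === "function"
  );
}

function createConsoleLogger(scope: string): Logger {
  return { warn: (event, data) => console.warn(scope, event, data ?? {}) };
}

function createRpcLogger(logs: LogsRPC, scope: string): Logger {
  return {
    warn: (event, data) => {
      // Fire and forget: a logging failure must not break the caller.
      void logs.warn(scope, event, data).catch(() => {});
    },
  };
}

// The switch volume.ts skipped: prefer the RPC sink, fall back to console.
function pickLogger(env: { LOGS?: unknown }, scope: string): Logger {
  return isLogsRPC(env.LOGS)
    ? createRpcLogger(env.LOGS, scope)
    : createConsoleLogger(scope);
}
```

With this shape, a missing or malformed `LOGS` binding degrades to `console.warn` instead of failing, which is the behavior the other API routes already rely on.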

Putting periodic work on the user request path was the structural mistake. We want it precomputed by a scheduler, with prices/cursors in D1/KV.

## Decision

Build a **single Durable Object with an alarm** inside the landing-page Worker that coordinates all periodic background work. Two initial consumers:

- **Tenero price refresh** — every ~5 min, refresh USD prices for the active token set (static base + dynamic from `swaps.token_in WHERE source='agent'`), write `tenero:price:{tokenId}` to KV with `{price, fetchedAt, rateLimit}`.
- **Competition Hiro sweep** — every ~15 min, run the existing `runCompetitionCron` logic against D1's `competition_state` cursor. Replaces the current `POST /api/competition/cron` + `X-Cron-Secret` external-scheduler dance.

## Why DO+alarm (not Cron Triggers, not a sibling Worker, not the current pattern)

We considered four options. The constraints that actually drove the decision:

| Constraint | What it rules out |
|---|---|
| Hiro budget must stay pooled (one key, one backoff state in `stacks-api-fetch.ts`) | Sibling scheduler Worker (fragments the budget) |
| Competition is scoped to landing-page; don't fragment the product surface | Sibling scheduler Worker (two repos / deploys / log namespaces) |
| OpenNext + `scheduled()` compatibility unverified, potentially fiddly | Cron Triggers in the OpenNext Worker (we'd be debugging the build) |
| Phase 3 plan adds chainhook push as a third write path into `swaps` — single-writer guarantee on the sweep eliminates one race surface (idempotency at the data layer still handles user-vs-chainhook) | The current "POST endpoint poked externally" pattern (no single-writer guarantee) |
| Adaptive cadence on rate-limit signals (Hiro 429, Tenero quota low → slow down) is genuinely useful | Cron Triggers (declarative schedule, no easy way to back off) |
| Want a clean RPC surface for status / manual trigger / pause from elsewhere in the codebase | The current secret-authed HTTPS endpoint dance |

The bullets that made it stop being "DO is overkill":
- Single-writer guarantee becomes load-bearing once chainhook lands
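- OpenNext + DO is mostly orthogonal to the build: the binding is wrangler config, and the only entry-point change is exporting the `SchedulerDO` class from the Worker's main module (verify this against the OpenNext entry in step 2). That sidesteps the cron-trigger compatibility unknown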
- RPC surface replaces both the GET self-doc and POST manual-invoke routes cleanly

## Naming + scope

- **DO class:** `SchedulerDO`
- **Module path:** `lib/scheduler/scheduler-do.ts`
- **Wrangler binding name:** `SCHEDULER`
- **Singleton instance name:** `"v1"` — `env.SCHEDULER.idFromName("v1")`. Versioned so a future migration can move to `"v2"` without disturbing the v1 instance.

One DO instance for the whole subsystem to start. If a task ever needs genuinely independent state or its alarm cadence diverges hard from the others, split then — easier than merging back.
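
A minimal sketch of how callers might resolve the singleton, assuming the `SCHEDULER` binding and the `"v1"` instance name above. The `getScheduler` helper and the `NamespaceLike` type are hypothetical; in the Worker the binding would be typed as a `DurableObjectNamespace` instead.

```typescript
// Minimal structural types so the sketch stands alone. In the Worker these
// would come from the Cloudflare runtime types (DurableObjectNamespace etc.).
interface DurableObjectIdLike {
  toString(): string;
}
interface NamespaceLike<Stub> {
  idFromName(name: string): DurableObjectIdLike;
  get(id: DurableObjectIdLike): Stub;
}

// Bump to "v2" for a future migration; the "v1" instance stays untouched.
const SCHEDULER_INSTANCE = "v1";

// Hypothetical helper: one place that knows the singleton's name.
function getScheduler<Stub>(env: { SCHEDULER: NamespaceLike<Stub> }): Stub {
  return env.SCHEDULER.get(env.SCHEDULER.idFromName(SCHEDULER_INSTANCE));
}
```

Keeping the instance name in one constant means routes never hard-code `"v1"`, so a later move to `"v2"` is a one-line change.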

## DO surface

**Storage:**
- `lastTeneroRunAt: number`
- `lastSweepRunAt: number`
- `lastTeneroResult: { succeeded: number, failed: number, rateLimitRemaining?: number }`
- `lastSweepResult: { scanned, found, inserted, alreadyKnown, pending, rejected, cursor }` — mirrors today's cron summary shape
- `consecutiveFailures: { tenero: number, sweep: number }` (see alarm-failure section)
- Long-lived cursors stay in **D1** (`competition_state`), not DO storage — keep the data-layer authority where it already is.

**RPC methods:**
- `status()` → snapshot of the above
- `refreshNow(task: "tenero" | "sweep" | "all")` → fire the named task now, return result
- `pauseUntil(timestamp)` → ops kill switch
- `resume()` → undo pause
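
The pause/resume pair could be backed by a single stored timestamp, sketched below. The `pausedUntil` storage key and the `isPaused` helper are assumptions, not part of the design above; `StorageLike` is a structural subset of the DO storage API (`get`, `put`, `delete`) so the sketch runs outside Workers.

```typescript
// Structural subset of DurableObjectStorage, enough for this sketch.
interface StorageLike {
  get<T>(key: string): Promise<T | undefined>;
  put<T>(key: string, value: T): Promise<void>;
  delete(key: string): Promise<boolean>;
}

const PAUSED_UNTIL_KEY = "pausedUntil"; // hypothetical storage key

// Ops kill switch: tasks are skipped until the given epoch-ms timestamp.
async function pauseUntil(storage: StorageLike, timestamp: number): Promise<void> {
  await storage.put(PAUSED_UNTIL_KEY, timestamp);
}

// Undo a pause immediately.
async function resume(storage: StorageLike): Promise<void> {
  await storage.delete(PAUSED_UNTIL_KEY);
}

// Checked at the top of alarm(): a pause expires on its own once `now`
// passes the stored timestamp, so it cannot stay stuck forever.
async function isPaused(storage: StorageLike, now: number): Promise<boolean> {
  const until = await storage.get<number>(PAUSED_UNTIL_KEY);
  return until !== undefined && now < until;
}
```

Storing a deadline rather than a boolean means a forgotten pause heals itself, while `resume()` stays available for the explicit case.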

**`alarm()` handler:**
- Determine which tasks are due (5-min cadence for Tenero, 15-min for sweep); use modulo on a persisted alarm tick counter, or compare `Date.now()` against the `last*RunAt` timestamps already in storage
- Run each due task in its own `try/catch`
- Log structured events through `env.LOGS` (the DO has no Worker `ExecutionContext`; its `DurableObjectState` exposes `waitUntil()` if the logger needs one)
- Persist run results to storage
- Re-arm next alarm
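
One way to decide which tasks are due and when to re-arm is sketched below. It compares against the last-run timestamps from storage rather than taking a modulo on a tick counter; that swap is an assumption made here because timestamps survive retried or skipped ticks. The helper names and the 1-second minimum delay are hypothetical.

```typescript
type TaskName = "tenero" | "sweep";

const INTERVAL_MS: Record<TaskName, number> = {
  tenero: 5 * 60_000, // ~5 min price refresh
  sweep: 15 * 60_000, // ~15 min Hiro sweep
};
const MIN_ALARM_DELAY_MS = 1_000; // assumption: never arm an alarm in the past

interface RunTimes {
  lastTeneroRunAt?: number;
  lastSweepRunAt?: number;
}

function lastRunAt(task: TaskName, state: RunTimes): number | undefined {
  return task === "tenero" ? state.lastTeneroRunAt : state.lastSweepRunAt;
}

// A task is due if it has never run or its interval has elapsed.
function dueTasks(now: number, state: RunTimes): TaskName[] {
  const tasks: TaskName[] = ["tenero", "sweep"];
  return tasks.filter((task) => {
    const last = lastRunAt(task, state);
    return last === undefined || now - last >= INTERVAL_MS[task];
  });
}

// Next alarm time: the earliest upcoming due time, clamped to the near future.
function nextAlarmAt(now: number, state: RunTimes): number {
  const tasks: TaskName[] = ["tenero", "sweep"];
  const nextDue = tasks.map((task) => {
    const last = lastRunAt(task, state);
    return last === undefined ? now : last + INTERVAL_MS[task];
  });
  return Math.max(now + MIN_ALARM_DELAY_MS, Math.min(...nextDue));
}
```

Inside `alarm()`, the handler would run `dueTasks(Date.now(), state)`, persist the updated timestamps, then pass `nextAlarmAt(...)` to the DO storage's `setAlarm()`.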

## Alarm-failure policy — options + recommendation

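CF runtime behavior baseline: when `alarm()` throws, the DO runtime retries the alarm automatically, up to 6 times with exponential backoff (starting at a 2-second delay), before dropping it. That retry budget applies to the *whole* `alarm()` invocation, not per-task. Alarms are also one-shot: once a dropped alarm has exhausted its retries, the schedule only resumes if something calls `setAlarm()` again. Whether you want to lean on runtime retry depends on which failure modes you treat as transient.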

**Option A — Bare runtime retry.** Let `alarm()` throw on any failure. CF retries up to 6× automatically; if all fail the alarm is dropped and the next scheduled tick recovers. Pros: zero code. Cons: one failing task aborts the others (Tenero 429 also kills the Hiro sweep that tick); no structured observability unless you `try/catch + log + rethrow`; logs from retried attempts pile up.

**Option B — Per-task isolation + structured logging, transport-only throws.** Each task runs inside its own `try/catch` inside `alarm()`. Task-level failures (Tenero 429, Hiro 5xx, parse error) are logged via `env.LOGS` and the task returns normally — the next scheduled tick handles recovery. The `alarm()` body only throws on transport-level failures (DO storage write fails, env binding missing, etc.) where runtime retry is appropriate. Pros: one task's failure doesn't take down the others; observable; predictable cadence. Cons: requires the discipline to classify errors correctly.
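
Option B's isolation could look roughly like this. `TransportError`, `runIsolated`, the `scheduler.task_failed` event name, and the `log` callback shape are all hypothetical; the point is only that task-level errors are logged and swallowed while transport-level errors propagate so the runtime retries the alarm.

```typescript
// Marker class for failures where runtime retry is appropriate
// (storage write failed, binding missing, and similar).
class TransportError extends Error {
  readonly transport = true;
}

function isTransportError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { transport?: unknown }).transport === true
  );
}

interface TaskOutcome {
  task: string;
  ok: boolean;
  error?: string;
}

type LogFn = (event: string, data: Record<string, unknown>) => void;

// Runs one task in its own try/catch so a Tenero 429 cannot abort the sweep.
async function runIsolated(
  task: string,
  fn: () => Promise<void>,
  log: LogFn,
): Promise<TaskOutcome> {
  try {
    await fn();
    return { task, ok: true };
  } catch (err) {
    // Transport-level: rethrow so alarm() fails and the runtime retries it.
    if (isTransportError(err)) throw err;
    // Task-level: log, report failure, let the next tick recover.
    const message = err instanceof Error ? err.message : String(err);
    log("scheduler.task_failed", { task, error: message });
    return { task, ok: false, error: message };
  }
}
```

The classification discipline mentioned in the cons lives entirely in which call sites throw `TransportError`; everything else is treated as a task-level failure by default.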

**Option C — Adaptive backoff on known rate-limit signals.** Layer on top of B. When a task sees `Hiro 429` or `Tenero x-ratelimit-minute-remaining: 0`, write a `nextRunAfter` timestamp to DO storage; subsequent `alarm()` ticks skip that task until the timestamp passes. Independent from runtime retry — this is "successful execution observed a signal that says slow down." Pros: respects upstream; reduces our chance of getting fully blocked. Cons: a little more code; need to make sure backoff state can't get stuck.
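
A sketch of Option C's backoff calculation, under stated assumptions: the 60-second default, the 30-minute cap, and the `UpstreamSignal` shape are illustrative choices, not values taken from Hiro or Tenero documentation. The cap is what keeps backoff state from getting stuck.

```typescript
interface UpstreamSignal {
  status: number; // HTTP status of the upstream response
  minuteRemaining?: number; // e.g. parsed from x-ratelimit-minute-remaining
  retryAfterSec?: number; // optional hint from the upstream, if any
}

const DEFAULT_BACKOFF_MS = 60_000; // assumption: one per-minute window
const MAX_BACKOFF_MS = 30 * 60_000; // assumption: hard cap on any backoff

// Returns the timestamp before which the task should be skipped,
// or null when the response carried no slow-down signal.
function computeNextRunAfter(now: number, signal: UpstreamSignal): number | null {
  const limited = signal.status === 429 || signal.minuteRemaining === 0;
  if (!limited) return null;
  const requested =
    signal.retryAfterSec !== undefined
      ? signal.retryAfterSec * 1000
      : DEFAULT_BACKOFF_MS;
  return now + Math.min(Math.max(requested, 0), MAX_BACKOFF_MS);
}

// Checked per task on each alarm tick.
function shouldSkip(now: number, nextRunAfter: number | null | undefined): boolean {
  return typeof nextRunAfter === "number" && now < nextRunAfter;
}
```

Because `nextRunAfter` is a timestamp rather than a flag, it expires without any cleanup step, and a bad upstream hint can delay a task by at most `MAX_BACKOFF_MS`.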

**Option D — Explicit error budget + circuit breaker.** Track `consecutiveFailures[task]` in DO storage. After N consecutive failures (e.g., 6), pause the task and emit a critical log; require explicit `resume()` RPC call to restart. Pros: hard guarantee against silent failure for hours. Cons: needs ops attention; adds state to maintain; mostly redundant with good observability.

**Recommended starting policy: B + C.**

- **B** is the floor. Wrap each task, log structured failures, only throw for transport-level issues. This is the minimum to avoid the "Tenero broke the Hiro sweep" failure mode and to actually see what's happening in `logs.aibtc.com`.
- **C** is cheap and matches what we already know about both upstreams: Hiro's `stacks-api-fetch.ts` already surfaces 429 and monthly-quota warnings, and Tenero's headers expose `x-ratelimit-minute-remaining` / `x-ratelimit-month-remaining`. Honoring those signals at the scheduler layer means we don't fight Hiro/Tenero's own throttling.
- **A** is what you fall back to if B's transport-error path is reached.
- **D** is deferred. Track `consecutiveFailures` in storage from day one (it's three lines), but don't wire the circuit-breaker behavior until observability shows we need it. If a task fails for hours in production we'll see it in worker-logs before users notice — that's the bar for adding D.
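
The day-one `consecutiveFailures` bookkeeping might be as small as the sketch below. `recordOutcome` and `breakerWouldTrip` are hypothetical names, and the threshold of 6 mirrors the example figure in Option D; the second function is only what a later circuit breaker could check, not something wired up now.

```typescript
type TaskName = "tenero" | "sweep";
type ConsecutiveFailures = Record<TaskName, number>;

// Reset the counter on success, increment it on failure.
function recordOutcome(
  current: ConsecutiveFailures,
  task: TaskName,
  ok: boolean,
): ConsecutiveFailures {
  return { ...current, [task]: ok ? 0 : current[task] + 1 };
}

// Not wired in the starting policy: the check Option D would add later.
function breakerWouldTrip(
  current: ConsecutiveFailures,
  task: TaskName,
  threshold = 6,
): boolean {
  return current[task] >= threshold;
}
```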

## Tenero client (separate but lands with this)

The DO needs a typed Tenero fetch wrapper to call. Modeled on `lib/stacks-api-fetch.ts` (Hiro):

- `lib/external/tenero-fetch.ts` — typed wrapper with retry, per-attempt timeout, `cf-ray` + status logged on failure, `x-ratelimit-*` headers parsed and surfaced
- `lib/external/tenero/prices.ts` — `fetchTokenPriceUsd(assetId, logger)` on top of the wrapper
- KV reader for consumers: `getCachedTokenPrice(kv, tokenId): Promise<{ price, fetchedAt } | null>` (async, since KV reads return promises)
- Optional `TENERO_API_KEY` env var support (header name TBD — `x-api-key` is a guess; confirm with Tenero docs before wiring)
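
Two pieces of the list above lend themselves to a short sketch: parsing the `x-ratelimit-*` headers and the KV reader. The header names and the `tenero:price:{tokenId}` key come from this issue; the `HeadersLike` and `KvLike` types, the `RateLimitInfo` shape, and the choice to treat malformed entries as a cache miss are assumptions.

```typescript
// Structural subsets of Headers and KVNamespace, enough for this sketch.
interface HeadersLike {
  get(name: string): string | null;
}
interface KvLike {
  get(key: string): Promise<string | null>;
}

interface RateLimitInfo {
  type?: string;
  minuteRemaining?: number;
  monthRemaining?: number;
}
interface CachedPrice {
  price: number;
  fetchedAt: number;
}

function toInt(value: string | null): number | undefined {
  if (value === null) return undefined;
  const n = parseInt(value, 10);
  return isNaN(n) ? undefined : n;
}

// Surfaces the rate-limit headers observed in the live probe.
function parseRateLimit(headers: HeadersLike): RateLimitInfo {
  return {
    type: headers.get("x-ratelimit-type") ?? undefined,
    minuteRemaining: toInt(headers.get("x-ratelimit-minute-remaining")),
    monthRemaining: toInt(headers.get("x-ratelimit-month-remaining")),
  };
}

// Reads the record the scheduler wrote; malformed entries count as a miss.
async function getCachedTokenPrice(
  kv: KvLike,
  tokenId: string,
): Promise<CachedPrice | null> {
  const raw = await kv.get(`tenero:price:${tokenId}`);
  if (raw === null) return null;
  try {
    const parsed = JSON.parse(raw) as Partial<CachedPrice>;
    if (typeof parsed.price !== "number" || typeof parsed.fetchedAt !== "number") {
      return null;
    }
    return { price: parsed.price, fetchedAt: parsed.fetchedAt };
  } catch {
    return null;
  }
}
```

Returning `fetchedAt` alongside the price lets consumers decide for themselves whether a value is too stale to display.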

`LeaderboardClient.tsx` then moves from browser-side Tenero fetch to a thin `GET /api/prices?tokens=...` route that reads KV. Browser-side localStorage cache can stay as a second-tier cache or come out entirely — that's a small follow-up either way.
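
The core of that route might look like the sketch below, kept framework-free so it does not assume anything about the Next.js handler signature. `parseTokensParam`, `readPrices`, and the cap of 30 tokens (taken from the "3–30 tokens" range in the background) are assumptions; `lookup` stands in for whatever KV reader the route ends up using.

```typescript
const MAX_TOKENS = 30; // assumption: upper end of the expected token set

// Parses ?tokens=a,b,c into a de-duplicated, bounded list of token ids.
function parseTokensParam(requestUrl: string): string[] {
  const raw = new URL(requestUrl).searchParams.get("tokens") ?? "";
  const ids: string[] = [];
  for (const part of raw.split(",")) {
    const id = part.trim();
    if (id !== "" && ids.indexOf(id) === -1 && ids.length < MAX_TOKENS) {
      ids.push(id);
    }
  }
  return ids;
}

// Looks up every token in parallel; misses are returned as null so the
// client can tell "no price yet" apart from "token not requested".
async function readPrices<T>(
  tokens: string[],
  lookup: (tokenId: string) => Promise<T | null>,
): Promise<Record<string, T | null>> {
  const entries = await Promise.all(
    tokens.map(async (id) => [id, await lookup(id)] as const),
  );
  const out: Record<string, T | null> = {};
  for (const [id, value] of entries) out[id] = value;
  return out;
}
```

Bounding the token list keeps a single request from fanning out into an arbitrary number of KV reads.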

## Implementation outline

1. **`lib/external/tenero-fetch.ts` + `prices.ts` + KV reader** — wrapper modeled on `stacks-api-fetch.ts`. Lands first so the DO can call into stable code.
2. **`SchedulerDO` class + wrangler binding** — bare skeleton with `status()` and `refreshNow()` RPC, no tasks yet. Verify it deploys and the RPC stub works from a route.
3. **Tenero refresh task** — wire as the first consumer, 5-min cadence. Confirm KV writes and log events show up in `logs.aibtc.com`.
4. **Migrate competition cron logic** — move `runCompetitionCron` invocation from `POST /api/competition/cron` to the DO's sweep task. Keep the HTTP route as a thin RPC pass-through (or remove if nothing external calls it; check before deleting).
5. **`LeaderboardClient` → KV-backed `/api/prices`** — drop the direct Tenero fetch from the browser.
6. **Wire pause/resume RPC + an admin route** if/when we want manual ops control.

## References

- Investigation transcript context (this issue) — observability root cause + Tenero rate-limit findings
- `lib/stacks-api-fetch.ts` — Hiro wrapper pattern to mirror for Tenero
- `lib/logging.ts` — `isLogsRPC(env.LOGS) ? createLogger : createConsoleLogger` pattern; the bug we're avoiding
- `app/api/competition/cron/route.ts` + `lib/competition/cron.ts` + `lib/competition/state.ts` — current sweep implementation to migrate
- Reverted commit `f327554` — Tenero server-side abandonment that this work undoes properly
- Reverted/replaced commit `92b37cd` — browser-side Tenero that this work replaces

## Open questions for implementer

- Tenero API key — does the AIBTC org have one? If so, confirm the auth header name and update the wrapper. If not, file separately; unauthenticated 100/min/IP is fine for v1 but worth tracking.
- Cron expression for the sweep task isn't strictly needed inside the DO (alarm computes cadence) but document the chosen cadence (15 min) and the rationale so future maintainers don't have to dig.
- Decide whether to keep `POST /api/competition/cron` as a public-with-secret manual-trigger surface, or remove it once the RPC path lands. If anything external still pokes it, keep it as a thin pass-through.

cc @biwasxyz @secret-mars — taking it from here, this captures the full context. Happy to pair / answer questions on any piece of this.

