feat(dashboard): rework trading-comp dashboard to D1 balances + cron chunking (blocks #651) #764

@whoabuddy

Summary

#651's dashboard rebuilder fans out to ~2 subrequests per agent (Hiro /balances + mempool.space) inside a single Worker invocation. At ~500 agents this hits the 1000-subrequest-per-invocation cap; at ~600 agents it hits the 30s Worker CPU limit. This is the same architectural wall Phase 1.4 hit during inbox reconciliation last week (see CHECKPOINT-2026-05-10T0203Z-inbox-subrequest-limit.md in the D1-migration quest planning folder).

Rework #651 to use a Durable Object alarm as the chunked rebuilder, writing per-agent rows to the existing D1 balances table.

Failure thresholds (current PR design)

| Agents | Subrequests per rebuild | Worker CPU per rebuild | Status |
|---|---|---|---|
| 430 (current) | ~860 | ~21s | Under both limits, barely |
| 500 | ~1000 | ~25s | At subrequest cap; first failures |
| 600 | ~1200 | ~30s | Both walls breached |
| 1000 | ~2000 | ~50s | Architecturally impossible in one invocation |

Neither limit is raisable via wrangler config or plan upgrade.
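
The arithmetic behind the table can be sketched as below, using the issue's own figures: ~2 upstream calls per agent, a 1000-subrequest cap, 100-agent chunks, and a ~2 min active alarm interval. PRICE_FETCHES = 2 is an assumption (the upper end of the "1–2 for prices" estimate in the chunking plan), and all function names are hypothetical. The point it shows: a one-shot rebuild charges every agent to one invocation, while chunking bounds each invocation by the chunk size no matter how large the agent set grows.

```typescript
// Figures taken from this issue; PRICE_FETCHES is an assumed upper bound.
const SUBREQUEST_CAP = 1000; // per Worker invocation
const CALLS_PER_AGENT = 2; // Hiro /balances + mempool.space
const PRICE_FETCHES = 2; // "1–2 for prices", rounded up
const CHUNK_SIZE = 100; // agents per alarm tick
const ALARM_INTERVAL_ACTIVE_MIN = 2; // ~2 min between healthy ticks

// Subrequests if `agents` agents are processed in a single invocation.
export function subrequestsPerInvocation(agents: number): number {
  return agents * CALLS_PER_AGENT + PRICE_FETCHES;
}

// Whether a single invocation handling `agents` agents stays under the cap.
export function fitsUnderCap(agents: number): boolean {
  return subrequestsPerInvocation(agents) <= SUBREQUEST_CAP;
}

// Number of alarm ticks needed to cover every agent once.
export function chunksPerRotation(agents: number): number {
  return Math.ceil(agents / CHUNK_SIZE);
}

// Approximate minutes for one full rotation at the active interval.
export function rotationMinutes(agents: number): number {
  return chunksPerRotation(agents) * ALARM_INTERVAL_ACTIVE_MIN;
}
```

For example, `subrequestsPerInvocation(430)` is 862 and `fitsUnderCap(500)` is false, whereas one 100-agent chunk costs 202 subrequests regardless of total size, and 430 agents need 5 chunks, or about 10 minutes per rotation.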

Architecture

Scheduler: Durable Object alarm. Chosen over Cron Triggers because we need cursor state, a single non-overlapping execution per tick, and an adaptive interval (back off on Hiro 429s). Established org pattern; see Reference patterns below.

  • DashboardRebuildDO (new): single global instance, holds rebuild cursor in DO storage. alarm() handler does one chunk per tick.
  • Chunking: CHUNK_SIZE = 100 agents per alarm. Per-tick subrequests: ~200 (100 Hiro + 100 mempool) + 1–2 for prices. Well under the 1000 cap with margin.
  • Pre-fetch outside the lock: per the NonceDO precedent, do all Hiro/mempool I/O before entering blockConcurrencyWhile, then apply writes inside the lock. Sequential fetches inside the lock can exceed blockConcurrencyWhile's 30s limit, at which point the runtime resets the DO.
  • Adaptive interval: ALARM_INTERVAL_ACTIVE_MS (~2 min) while recent ticks succeed; ALARM_INTERVAL_IDLE_MS (~10 min) when most recent ticks saw upstream failures. Always reschedule from the alarm handler, including the catch path.
  • Storage: writes per-agent rows to the D1 balances table (migration 006_balances.sql; the schema currently has no writer, and this is its intended writer). One row per (agent, token); the schema is already defined in RFC §balances.
  • Reader: app/api/dashboard/route.ts + /dashboard page read from D1, ranked + paginated server-side. Edge cache via Cache-Control: s-maxage=60, stale-while-revalidate=300 (same as #651) so most reads never reach D1.
  • Reusable from #651 unchanged: lib/balances/fetch.ts, lib/balances/prices.ts, lib/balances/types.ts, /dashboard page UI, all discovery wiring (llms.txt, openapi.json, agent.json).
  • Removed: lib/balances/snapshot.ts (KV snapshot + SWR + building sentinel — replaced by DO alarm + D1).
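
One alarm tick of the plan above could look roughly like the sketch below. It is framework-free so it can run outside Workers: CursorStorage is a hypothetical subset of DurableObjectStorage (get/put/setAlarm), and runAlarmTick, RebuildDeps, BalanceRow, the "cursor" storage key, and the injected listAgents/fetchAgentBalances/writeRows helpers are all illustrative names, not existing code. The health rule (idle interval when more than half of the chunk failed, or when the tick threw) is an assumption and only looks at the current tick rather than a window of recent ticks.

```typescript
// Hypothetical subset of DurableObjectStorage used by this sketch.
interface CursorStorage {
  get<T>(key: string): Promise<T | undefined>;
  put<T>(key: string, value: T): Promise<void>;
  setAlarm(scheduledTime: number): Promise<void>;
}

// Hypothetical row shape: one row per (agent, token).
interface BalanceRow {
  agent: string;
  token: string;
  amount: string;
}

interface RebuildDeps {
  storage: CursorStorage;
  blockConcurrencyWhile<T>(fn: () => Promise<T>): Promise<T>; // ctx.blockConcurrencyWhile
  listAgents(): Promise<string[]>; // e.g. backed by cache:agent-list
  fetchAgentBalances(agent: string): Promise<BalanceRow[]>; // Hiro + mempool I/O
  writeRows(rows: BalanceRow[]): Promise<void>; // D1 writes
  now(): number;
}

const CHUNK_SIZE = 100;
const ALARM_INTERVAL_ACTIVE_MS = 2 * 60_000;
const ALARM_INTERVAL_IDLE_MS = 10 * 60_000;

export interface TickResult {
  processed: number;
  failures: number;
  nextCursor: number | null; // null when the cursor was not persisted
  nextAlarmAt: number;
}

export async function runAlarmTick(deps: RebuildDeps): Promise<TickResult> {
  let processed = 0;
  let failures = 0;
  let nextCursor: number | null = null;
  let unhealthy = true; // pessimistic until the fetch phase completes

  try {
    const agents = await deps.listAgents();
    const stored = (await deps.storage.get<number>("cursor")) ?? 0;
    const start = stored < agents.length ? stored : 0; // list may have shrunk
    const chunk = agents.slice(start, start + CHUNK_SIZE);
    const end = start + chunk.length;
    const candidate = end >= agents.length ? 0 : end; // wrap after last chunk

    // Phase 1: all upstream I/O runs OUTSIDE the lock, in parallel.
    const settled = await Promise.allSettled(chunk.map((a) => deps.fetchAgentBalances(a)));
    const rows: BalanceRow[] = [];
    for (const r of settled) {
      if (r.status === "fulfilled") {
        rows.push(...r.value);
        processed++;
      } else {
        failures++; // partial failure: retried on the next rotation
      }
    }
    unhealthy = failures * 2 > chunk.length;

    // Phase 2: short critical section with writes only, no upstream fetches.
    await deps.blockConcurrencyWhile(async () => {
      await deps.writeRows(rows);
      await deps.storage.put("cursor", candidate);
    });
    nextCursor = candidate;
  } catch (err) {
    unhealthy = true;
    console.error("dashboard rebuild tick failed", err);
  }

  // Reached on both the success path and the catch path, so the alarm
  // is always rescheduled; back off to the idle interval when unhealthy.
  const nextAlarmAt =
    deps.now() + (unhealthy ? ALARM_INTERVAL_IDLE_MS : ALARM_INTERVAL_ACTIVE_MS);
  await deps.storage.setAlarm(nextAlarmAt);
  return { processed, failures, nextCursor, nextAlarmAt };
}
```

Inside a DashboardRebuildDO, alarm() would pass `this.ctx.storage` and an arrow such as `(fn) => this.ctx.blockConcurrencyWhile(fn)` so `this` stays bound. Advancing the cursor even when some agents fail keeps one bad upstream response from stalling the rotation; those agents are simply picked up on the next pass.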

Reference patterns

Org has established DO patterns to copy from — this isn't a new architectural primitive for the team:

  • aibtcdev/x402-sponsor-relay/src/durable-objects/nonce-do.ts — the closest analog. Cursor-rotated wallet chunks per alarm, parallel Hiro pre-fetch outside the lock, adaptive ALARM_INTERVAL_ACTIVE_MS / ALARM_INTERVAL_IDLE_MS, self-rescheduling via state.storage.setAlarm() including the catch path. Encodes the exact "don't fan out inside blockConcurrencyWhile" lesson.
  • aibtcdev/x402-sponsor-relay/src/durable-objects/stats-do.ts — DO + DurableObjectStorage["sql"] for aggregate state with periodic refresh.
  • aibtcdev/agent-news/src/objects/news-do.ts: DurableObject<Env> + SQL storage shape; confirms the pattern is established across the org.
  • wrangler.jsonc binding shape (from x402-sponsor-relay): durable_objects.bindings: [{ name: "DASHBOARD_REBUILD_DO", class_name: "DashboardRebuildDO" }] + migrations: [{ tag: "v1", new_sqlite_classes: ["DashboardRebuildDO"] }]. Add to top-level + each env block per the #666 pattern.

Acceptance

  • DO alarm processes 100-agent chunks; full agent set covered every ~10 min at current scale (~4–5 chunks)
  • All Hiro/mempool I/O happens outside blockConcurrencyWhile; DO write phase inside the lock only does SQL + alarm reschedule
  • D1 balances rows present for every agent in cache:agent-list within one full rotation
  • /api/dashboard reads from D1, returns identical JSON shape to current PR
  • Subrequest count per alarm invocation never exceeds ~250 (chunk + price fetch + headroom)
  • Alarm always reschedules (success path + catch path), with adaptive interval based on observed upstream health
  • Vitest covers: cursor advance, chunk pre-fetch outside lock, partial-failure handling, D1 read pagination, edge-cache headers preserved, alarm reschedule on success + on throw
  • Smoke window post-merge: 60 min, confirm D1 row counts grow monotonically + dashboard renders + alarm log shows rescheduling + no Worker errors
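
The reader side (ranked, paginated D1 read plus preserved edge-cache headers) could be sketched as below. The column names agent_id, token, and usd_value are hypothetical stand-ins for whatever RFC §balances and 006_balances.sql actually define, as are the page/pageSize query parameters, the default of 50, and the cap of 100. The Cache-Control value is the one this issue specifies.

```typescript
const CACHE_CONTROL = "s-maxage=60, stale-while-revalidate=300";
const DEFAULT_PAGE_SIZE = 50; // hypothetical default
const MAX_PAGE_SIZE = 100; // hypothetical cap on client-requested page size

export interface Pagination {
  page: number;
  pageSize: number;
}

// Parse ?page=&pageSize=, falling back to defaults on missing or bad input.
export function parsePagination(url: URL): Pagination {
  const rawPage = Number(url.searchParams.get("page") ?? "1");
  const rawSize = Number(url.searchParams.get("pageSize") ?? String(DEFAULT_PAGE_SIZE));
  const page = Number.isInteger(rawPage) && rawPage >= 1 ? rawPage : 1;
  const pageSize =
    Number.isInteger(rawSize) && rawSize >= 1 ? Math.min(rawSize, MAX_PAGE_SIZE) : DEFAULT_PAGE_SIZE;
  return { page, pageSize };
}

// Rank agents by summed USD value across their (agent, token) rows.
// agent_id breaks ties so page boundaries stay stable between requests.
export function buildLeaderboardQuery({ page, pageSize }: Pagination): { sql: string; params: number[] } {
  const sql = `SELECT agent_id, SUM(usd_value) AS total_usd
FROM balances
GROUP BY agent_id
ORDER BY total_usd DESC, agent_id ASC
LIMIT ? OFFSET ?`;
  return { sql, params: [pageSize, (page - 1) * pageSize] };
}

// JSON response carrying the same edge-cache policy as #651.
export function jsonWithEdgeCache(body: unknown): Response {
  return new Response(JSON.stringify(body), {
    headers: { "content-type": "application/json", "cache-control": CACHE_CONTROL },
  });
}
```

With a D1 binding (named `DB` here purely for illustration), the route would run `env.DB.prepare(sql).bind(...params).all()` and wrap the results in `jsonWithEdgeCache`, keeping the JSON shape identical to the current PR.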

Out of scope

Context

🤖 Generated with Claude Code
