fix(newsdo): self-heal cost-critical signals indexes + schema-health diagnostic #827
…diagnostic

NewsDO's hot-path `signals` composite indexes (leaderboard / correspondents-bundle / report) lived ONLY in version-gated cold-start migrations (#16, #21, #27, #28). The cold-start runner advances `migration_version` even when a statement throws (errors are caught and logged), and the code comments record this has happened before ("migration 10 failed silently on production"). A silently-failed index migration is never retried, so the index can go permanently missing in production while the schema looks complete. This is the same class of bug that dropped landing-page's `inbox_messages` indexes and drove a multi-billion-row/day full-scan D1 bill. NewsDO is currently the account's top rows-read surface (~1.8B/day).

This moves the at-risk signals indexes into the always-re-applied base `SCHEMA_SQL` so they self-heal on every cold start (idempotent `CREATE INDEX IF NOT EXISTS`; all referenced columns exist in the base signals table).

Adds `GET /api/config/schema-health` (public, read-only): diffs the live `sqlite_master` against `EXPECTED_SIGNALS_INDEXES` and reports any missing index + signals row count. DO-embedded SQLite has no external `wrangler d1 insights` equivalent, and DO console output does not reach worker-logs (that's *why* the failures were silent), so an on-demand endpoint is the reliable drift detector.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
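The failure mode in this commit message can be reproduced in a few lines. The following is a minimal sketch, not the repository's runner: `FakeDb`, `makeDb`, `runMigrations`, and `applyBaseSchema` are hypothetical names, and the in-memory "database" only tracks index names. It assumes, as the commit message states, that the runner swallows errors and still advances the version.

```typescript
// Toy in-memory stand-in for the DO's SQLite handle (hypothetical, not the real API).
interface FakeDb {
  indexes: Set<string>;
  exec(sql: string): void;
}

// `failOnce` lists statements that throw the first time they are executed.
function makeDb(failOnce: string[]): FakeDb {
  const indexes = new Set<string>();
  const pending = new Set<string>(failOnce);
  return {
    indexes,
    exec(sql: string): void {
      if (pending.delete(sql)) throw new Error(`simulated failure: ${sql}`);
      const name = /^CREATE INDEX IF NOT EXISTS (\w+)/.exec(sql)?.[1];
      if (name !== undefined) indexes.add(name); // Set.add is idempotent
    },
  };
}

interface Migration {
  version: number;
  sql: string;
}

// Version-gated runner with the described flaw: errors are caught and the
// version counter advances anyway, so a failed statement is never retried.
function runMigrations(db: FakeDb, current: number, migrations: Migration[]): number {
  let version = current;
  for (const m of migrations) {
    if (m.version <= version) continue; // considered applied, skipped forever
    try {
      db.exec(m.sql);
    } catch {
      // swallowed; a real runner would only log here
    }
    version = m.version;
  }
  return version;
}

// Base schema: re-applied unconditionally on every cold start.
function applyBaseSchema(db: FakeDb, statements: string[]): void {
  for (const sql of statements) db.exec(sql);
}
```

With a statement that fails once, `runMigrations` reports the new version while `db.indexes` stays empty, and a second cold start skips the statement entirely. Running the same statement through `applyBaseSchema` restores the index, which is the self-heal property the commit relies on.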
Deploying with
| Status | Name | Latest Commit | Updated (UTC) |
|---|---|---|---|
| ✅ Deployment successful! View logs | agent-news | 70689a4 | May 27 2026, 06:52 PM |

Preview deployed: https://agent-news-staging.hosting-962.workers.dev
This preview uses sample data: beats, signals, and streaks are seeded automatically.
arc0btc left a comment
Applies the landing-page inbox-index fix pattern to NewsDO's signals table — direct, surgical, and exactly the right call given we know this bug class is real.
What works well:
- Moving the at-risk indexes into `SCHEMA_SQL` with `CREATE INDEX IF NOT EXISTS` is the correct fix. DO cold-starts always re-run the base schema, so these indexes now self-heal regardless of migration history. Idempotent and safe on the live DB.
- `EXPECTED_SIGNALS_INDEXES` as an exported typed array is the right abstraction; trying to derive the set by parsing SQL strings would be fragile. The JSDoc comment linking it to the health endpoint is clear.
- `/api/config/schema-health` returning 503 on unhealthy is good protocol: monitoring tools and agents can detect drift without parsing the body.
- Additive-only: no existing query, route, or migration changed. The CI/staging deploy confirms a clean build.
- The comment block in `schema.ts` explaining why the indexes are being duplicated into the base schema is the right kind of comment: it captures non-obvious intent that would otherwise require reading 3 migrations to reconstruct.
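The expected-set diff and the 503 convention praised in this list can be sketched as follows. This is an illustration under assumptions, not the code in `news-do.ts`: `diffIndexes` and `statusFor` are hypothetical helper names, the response field names are taken from the PR description, and the live index names are assumed to come from a query such as `SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'signals'`.

```typescript
// Response shape, using the field names listed in the PR description.
interface SchemaHealth {
  healthy: boolean;
  missing_signals_indexes: string[];
  signals_row_count: number | null;
  live_index_count: number;
  live_indexes: string[];
}

// Compare the expected index names against the names found in the live schema.
function diffIndexes(expected: readonly string[], live: readonly string[]): SchemaHealth {
  const liveSet = new Set(live);
  const missing = expected.filter((name) => !liveSet.has(name));
  return {
    healthy: missing.length === 0,
    missing_signals_indexes: missing,
    signals_row_count: null, // the row count is a separate, optional step
    live_index_count: live.length,
    live_indexes: [...live],
  };
}

// 503 on drift lets monitors detect a problem from the status code alone.
function statusFor(health: SchemaHealth): number {
  return health.healthy ? 200 : 503;
}
```

Keeping the expected set as an explicit array means the check is a plain set difference; extra live indexes do not affect `healthy`, only missing ones do.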
[question] COUNT(*) on signals in a hot diagnostic path (news-do.ts:3778)
`SELECT COUNT(*) as count FROM signals` has to walk an entire b-tree (the table or its smallest index) on every call: SQLite keeps no cached row count, and the `sqlite_stat1` statistics gathered by `ANALYZE` feed only the query planner, not `COUNT(*)` results. Since this is an on-demand diagnostic endpoint it's probably fine, but if you ever add a monitoring loop that calls this repeatedly, the row count will become the expensive part. Consider making it opt-in via a query param (`?include_count=true`) if the endpoint evolves toward periodic polling.
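A minimal sketch of the opt-in suggested here, assuming the handler has the request URL as a string. `wantsRowCount` and `rowCountFor` are hypothetical names, and `countRows` stands in for whatever actually runs the `COUNT(*)` query.

```typescript
// True only when the caller explicitly passes ?include_count=true.
function wantsRowCount(requestUrl: string): boolean {
  return new URL(requestUrl).searchParams.get("include_count") === "true";
}

// Runs the (potentially expensive) count only on request; otherwise returns null.
function rowCountFor(requestUrl: string, countRows: () => number): number | null {
  return wantsRowCount(requestUrl) ? countRows() : null;
}
```

Passing the count as a callback keeps the default path free of any table access, so an unthrottled poller that omits the parameter never triggers the scan.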
[nit] `(countRows[0] as { count: number }).count` works fine given the DO SQLite API, but a typed intermediate (`const row = countRows[0] as { count: number }`) reads slightly cleaner. Not worth a change request.
Code quality notes:
The `EXPECTED_SIGNALS_INDEXES` pattern is reusable; the PR description notes landing-page needs the same guardrail. Worth a follow-up issue to extend this to claims/earnings tables once the diagnosis confirms whether their version-gated indexes are also at risk. A brief `// TODO: extend to claims/earnings when confirmed missing` in the constant would make that intent discoverable without a separate doc.
Operational context:
We've been watching NewsDO as the top rows-read surface since the landing-page D1 fix landed (~1.8B rows/day). The silent-migration failure mode is confirmed real (migration 10, v12/v13 per the code comments). The schema-health endpoint fills the exact gap that made the landing-page incident hard to detect — DO's console.* not reaching worker-logs means this kind of drift was invisible until cost reports showed anomalies. Good fix, good guardrail.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e00b9d20e6
Pull request overview
This PR adds self-healing and diagnostic coverage for cost-critical signals indexes in NewsDO SQLite, aiming to prevent silently skipped migrations from causing expensive full-table scans.
Changes:
- Reasserts additional `signals` indexes in the always-applied base schema.
- Adds `EXPECTED_SIGNALS_INDEXES` and a DO `/schema-health` diagnostic route.
- Exposes the diagnostic publicly at `GET /api/config/schema-health`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/objects/schema.ts` | Adds hot-path index assertions and the expected signals index list. |
| `src/objects/news-do.ts` | Adds schema-health introspection over live SQLite indexes. |
| `src/lib/do-client.ts` | Adds the schema-health response type and DO client helper. |
| `src/routes/config.ts` | Exposes schema-health through the public config router. |
…ount, tests

- P1 (Codex/Copilot): remove `idx_signals_quality_score` from base `SCHEMA_SQL`. `SCHEMA_SQL` runs before versioned migrations; `quality_score` is added by migration #24's ALTER. On a DO where that ALTER silently failed, a base-schema index on the column would throw no-such-column and brick DO construction. The index stays in migration #24 where the column is guaranteed to exist; removed from `EXPECTED_SIGNALS_INDEXES`. The other 4 indexes reference columns already used by pre-existing base indexes, so they provably exist live.
- `COUNT(*)` (Copilot/arc0btc): `signals_row_count` is now opt-in via `?include_count=true`. The default public/unthrottled health check does no full-table scan; `signals_row_count` is null unless requested.
- Tests (Copilot): add `/api/config/schema-health` coverage asserting healthy + every `EXPECTED_SIGNALS_INDEXES` present (guards against EXPECTED drift) and the default-vs-include_count behavior.
- arc0btc: typed intermediate for the count row; TODO to extend self-heal + EXPECTED set to claims/earnings once `/schema-health` confirms their risk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
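The P1 in this commit message comes down to one rule: a base-schema index may only reference columns that the base `CREATE TABLE` itself defines. The sketch below shows one hypothetical way to check that rule mechanically; it is not something this PR adds. `IndexDef`, `splitByBaseColumns`, and `idx_signals_status_created` are made up for illustration, while `idx_signals_quality_score` and `quality_score` come from the commit message.

```typescript
interface IndexDef {
  name: string;
  columns: string[];
}

// Partition index definitions: those whose columns all exist in the base table
// are safe to re-assert in the base schema; the rest must stay in the migration
// that adds their column.
function splitByBaseColumns(
  defs: readonly IndexDef[],
  baseColumns: ReadonlySet<string>,
): { safeForBase: IndexDef[]; migrationOnly: IndexDef[] } {
  const safeForBase: IndexDef[] = [];
  const migrationOnly: IndexDef[] = [];
  for (const def of defs) {
    const allPresent = def.columns.every((col) => baseColumns.has(col));
    (allPresent ? safeForBase : migrationOnly).push(def);
  }
  return { safeForBase, migrationOnly };
}
```

Given a base column set without `quality_score`, an index on that column lands in `migrationOnly`, matching the decision to keep it in migration #24.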
Thanks @arc0btc, all three points addressed in
Separately, Codex/Copilot flagged a real P1 I'd missed:
Why
NewsDO is now the account's #1 Cloudflare rows-read surface (~1.8B rows/day) after the landing-page D1 fix. This applies the lesson from that fix directly.
NewsDO's hot-path `signals` composite indexes (the ones serving the leaderboard / correspondents-bundle / report queries) lived only in version-gated cold-start migrations (#16 approval-cap, #21 leaderboard, #27 hot-path, #28 correspondents-bundle). The cold-start runner advances `migration_version` even when a statement throws (errors are caught + logged), and the code comments already record this happening: "migration 10 failed silently on production", "v12/v13 may have failed silently on staging".

A silently-failed index migration is never retried (the version counter has moved past it), so the index can be permanently missing in production while the schema looks complete. That's the exact class of bug that dropped landing-page's `inbox_messages` indexes and turned every inbox read into a full-table scan (~5.9B → ~14M rows/day once restored, −99.8%).

What

- Self-heal the indexes. Move the at-risk `signals` composite indexes into the always-re-applied base `SCHEMA_SQL` (idempotent `CREATE INDEX IF NOT EXISTS`). They now re-assert on every cold start and can't be silently version-skipped. All referenced columns (`status`, `reviewed_at`, `correction_of`, `btc_address`, `beat_slug`, `created_at`, `quality_score`) exist in the base `signals` table, so this is safe on the live DB and on fresh ones. (Migration #27's indexes were already duplicated into the base schema by an earlier fix; this completes that pattern for #16/#21/#28.)
- `GET /api/config/schema-health` (public, read-only). Diffs the live `sqlite_master` against `EXPECTED_SIGNALS_INDEXES` and returns `{ healthy, missing_signals_indexes, signals_row_count, live_index_count, live_indexes }` (503 when unhealthy). DO-embedded SQLite has no external `wrangler d1 insights`, and DO `console.*` does not reach worker-logs (that's why the migration failures were silent), so an on-demand endpoint is the reliable drift detector. This is the reusable guardrail pattern.

Diagnosis confirmation + verification plan

- `curl https://aibtc.news/api/config/schema-health` (or the worker URL). If `healthy:false` with names in `missing_signals_indexes`, that's the live confirmation of which indexes had silently dropped; they self-heal on the same cold start, so a second call returns `healthy:true`.
- Watch `rowsRead` via Cloudflare GraphQL / the billing dashboard over the next 24h. Expect a drop from ~1.8B/day toward the free-tier floor if a hot-path index was missing. If it does not drop, the indexes were present and the cost is a genuine unmaterialized aggregate (the old "B5" leaderboard-30d materialization), and the endpoint will have proven that, so we pivot with data instead of guessing.

Testing

`npm run typecheck` clean; `npm run test` → 40 files, 418 tests pass.

Follow-ups (separate)

- If `schema-health` shows other tables' indexes also dropped (claims/earnings/etc. have the same version-gated exposure), extend the self-heal + expected-set to them.
- The `schema-health`-style drift guardrail belongs in landing-page too; both will be written up in the `whoabuddy/claude-knowledge` Cloudflare KV/D1 runbook.

cc @arc0btc: sibling-repo review (not fast-merging; this is agent-news).
🤖 Generated with Claude Code
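The verification plan in the PR description reduces to two small calculations: classifying a pair of successive health responses, and computing the rows-read reduction. The sketch below is illustrative only; `interpretHealth`, `reductionPercent`, and the verdict labels are hypothetical, and the sample figures are the landing-page numbers quoted in the description.

```typescript
interface HealthSample {
  healthy: boolean;
  missing_signals_indexes: string[];
}

type Verdict = "indexes-were-present" | "self-healed" | "still-missing";

// Classify the outcome of two successive schema-health calls.
function interpretHealth(first: HealthSample, second: HealthSample): Verdict {
  if (first.healthy) return "indexes-were-present"; // cost has another cause
  return second.healthy ? "self-healed" : "still-missing";
}

// Percentage drop in rows read between two daily totals.
function reductionPercent(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

// Landing-page figures from the description: ~5.9B -> ~14M rows/day.
const landingPageDrop = reductionPercent(5.9e9, 14e6);
```

`landingPageDrop` evaluates to roughly 99.76, consistent with the −99.8% quoted for the landing-page fix. A `still-missing` verdict would mean the base schema is not re-asserting the index, which the plan does not expect.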