fix(newsdo): self-heal cost-critical signals indexes + schema-health diagnostic#827

Merged
whoabuddy merged 2 commits into main from fix/newsdo-self-healing-indexes
May 27, 2026

@whoabuddy
Contributor

Why

NewsDO is now the account's #1 Cloudflare rows-read surface (~1.8B rows/day) after the landing-page D1 fix. This applies the lesson from that fix directly.

NewsDO's hot-path signals composite indexes — the ones serving the leaderboard / correspondents-bundle / report queries — lived only in version-gated cold-start migrations (#16 approval-cap, #21 leaderboard, #27 hot-path, #28 correspondents-bundle). The cold-start runner advances migration_version even when a statement throws (errors are caught + logged), and the code comments already record this happening: "migration 10 failed silently on production", "v12/v13 may have failed silently on staging".

A silently-failed index migration is never retried (the version counter has moved past it), so the index can be permanently missing in production while the schema looks complete. That's the exact class of bug that dropped landing-page's inbox_messages indexes and turned every inbox read into a full-table scan (~5.9B → ~14M rows/day once restored, −99.8%).
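The failure mode described above can be modelled in a few lines. This is an illustrative sketch only: `runVersionGated`, `Migration`, and `RunResult` are hypothetical names, not NewsDO's actual runner, and the sketch assumes migrations are listed in ascending version order.

```typescript
// Illustrative model only: names and shapes are hypothetical, not NewsDO's code.
interface Migration {
  version: number;
  sql: string;
}

interface RunResult {
  version: number;    // where the version counter ends up
  executed: number[]; // versions whose SQL was attempted in this run
  failed: number[];   // versions whose SQL threw
}

// Assumes `migrations` is sorted by ascending version.
function runVersionGated(
  currentVersion: number,
  migrations: readonly Migration[],
  exec: (sql: string) => void,
): RunResult {
  let version = currentVersion;
  const executed: number[] = [];
  const failed: number[] = [];
  for (const m of migrations) {
    if (m.version <= version) continue; // version gate: treated as already applied
    executed.push(m.version);
    try {
      exec(m.sql);
    } catch {
      failed.push(m.version); // the error is swallowed (the real runner also logs it)
    }
    version = m.version; // the counter advances whether or not the statement succeeded
  }
  return { version, executed, failed };
}
```

In this model, once a run has moved the counter past a failed migration, a later run starting from the returned `version` attempts nothing, which is why the missing index is never recreated.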

What

  1. Self-heal the indexes. Move the at-risk signals composite indexes into the always-re-applied base SCHEMA_SQL (idempotent CREATE INDEX IF NOT EXISTS). They now re-assert on every cold start and can't be silently version-skipped. All referenced columns (status, reviewed_at, correction_of, btc_address, beat_slug, created_at, quality_score) exist in the base signals table, so this is safe on the live DB and on fresh ones. (Migration #27's indexes were already duplicated into the base schema by an earlier fix; this completes that pattern for #16/#21/#28.)

  2. GET /api/config/schema-health (public, read-only). Diffs the live sqlite_master against EXPECTED_SIGNALS_INDEXES and returns { healthy, missing_signals_indexes, signals_row_count, live_index_count, live_indexes } (503 when unhealthy). DO-embedded SQLite has no external wrangler d1 insights, and DO console.* does not reach worker-logs (that's why the migration failures were silent) — so an on-demand endpoint is the reliable drift detector. This is the reusable guardrail pattern.
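The core of the drift check is a set difference between the expected index names and the names found in sqlite_master. A minimal sketch, assuming the live names have already been read from the database; the index names, `diffSignalsIndexes`, `httpStatusFor`, and the `SchemaHealthSketch` type are placeholders, and `signals_row_count` is left out for brevity:

```typescript
// Placeholder names: the PR does not list the real contents of EXPECTED_SIGNALS_INDEXES.
const EXPECTED_SKETCH: readonly string[] = [
  "idx_signals_example_leaderboard",
  "idx_signals_example_bundle",
];

interface SchemaHealthSketch {
  healthy: boolean;
  missing_signals_indexes: string[];
  live_index_count: number;
  live_indexes: string[];
}

// `liveIndexNames` stands in for the names read from sqlite_master, e.g. with
//   SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'signals'
function diffSignalsIndexes(
  expected: readonly string[],
  liveIndexNames: readonly string[],
): SchemaHealthSketch {
  const live = new Set(liveIndexNames);
  const missing = expected.filter((name) => !live.has(name));
  return {
    healthy: missing.length === 0,
    missing_signals_indexes: missing,
    live_index_count: liveIndexNames.length,
    live_indexes: [...liveIndexNames].sort(),
  };
}

// Mirrors the described contract: 503 when any expected index is missing.
function httpStatusFor(report: SchemaHealthSketch): number {
  return report.healthy ? 200 : 503;
}
```

Extra live indexes that are not in the expected list do not affect `healthy`; only a missing expected index does.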

Diagnosis confirmation + verification plan

  • After deploy: curl https://aibtc.news/api/config/schema-health (or the worker URL). If healthy:false with names in missing_signals_indexes, that's the live confirmation of which indexes had silently dropped — they self-heal on the same cold start, so a second call returns healthy:true.
  • Cost: watch NewsDO rowsRead via Cloudflare GraphQL / the billing dashboard over the next 24h. Expect a drop from ~1.8B/day toward the free-tier floor if a hot-path index was missing. If it does not drop, the indexes were present and the cost is a genuine unmaterialized aggregate (the old "B5" leaderboard-30d materialization) — and the endpoint will have proven that, so we pivot with data instead of guessing.

Testing

  • npm run typecheck clean; npm run test → 40 files, 418 tests pass.
  • Additive only: no existing query, route, or migration changed.

Follow-ups (separate)

  • If schema-health shows other tables' indexes also dropped (claims/earnings/etc. have the same version-gated exposure), extend the self-heal + expected-set to them.
  • Cross-repo: same schema-health-style drift guardrail belongs in landing-page too; both will be written up in the whoabuddy/claude-knowledge Cloudflare KV/D1 runbook.

cc @arc0btc — sibling-repo review (not fast-merging; this is agent-news).

🤖 Generated with Claude Code

…diagnostic

NewsDO's hot-path `signals` composite indexes (leaderboard / correspondents-
bundle / report) lived ONLY in version-gated cold-start migrations (#16, #21,
#27, #28). The cold-start runner advances `migration_version` even when a
statement throws (errors are caught and logged), and the code comments record
this has happened before ("migration 10 failed silently on production"). A
silently-failed index migration is never retried, so the index can go
permanently missing in production while the schema looks complete — the same
class of bug that dropped landing-page's inbox_messages indexes and drove a
multi-billion-row/day full-scan D1 bill.

NewsDO is currently the account's top rows-read surface (~1.8B/day). This moves
the at-risk signals indexes into the always-re-applied base SCHEMA_SQL so they
self-heal on every cold start (idempotent CREATE INDEX IF NOT EXISTS; all
referenced columns exist in the base signals table).

Adds GET /api/config/schema-health (public, read-only): diffs the live
sqlite_master against EXPECTED_SIGNALS_INDEXES and reports any missing index +
signals row count. DO-embedded SQLite has no external `wrangler d1 insights`
equivalent, and DO console output does not reach worker-logs (that's *why* the
failures were silent), so an on-demand endpoint is the reliable drift detector.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 27, 2026 16:58
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented May 27, 2026

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Updated (UTC) |
| --- | --- | --- | --- |
| ✅ Deployment successful! | agent-news | 70689a4 | May 27 2026, 06:52 PM |

@github-actions
Contributor

github-actions Bot commented May 27, 2026

Preview deployed: https://agent-news-staging.hosting-962.workers.dev

This preview uses sample data — beats, signals, and streaks are seeded automatically.

Contributor

@arc0btc arc0btc left a comment

Applies the landing-page inbox-index fix pattern to NewsDO's signals table — direct, surgical, and exactly the right call given we know this bug class is real.

What works well:

  • Moving the at-risk indexes into SCHEMA_SQL with CREATE INDEX IF NOT EXISTS is the correct fix. DO cold-starts always re-run the base schema, so these indexes now self-heal regardless of migration history. Idempotent and safe on the live DB.
  • EXPECTED_SIGNALS_INDEXES as an exported typed array is the right abstraction — trying to derive the set by parsing SQL strings would be fragile. The JSDoc comment linking it to the health endpoint is clear.
  • /api/config/schema-health returning 503 on unhealthy is good protocol — monitoring tools and agents can detect drift without parsing the body.
  • Additive-only: no existing query, route, or migration changed. The CI/staging deploy confirms a clean build.
  • The comment block in schema.ts explaining why the indexes are being duplicated into the base schema is the right kind of comment — it captures non-obvious intent that would otherwise require reading 3 migrations to reconstruct.

[question] COUNT(*) on signals in a hot diagnostic path (news-do.ts:3778)
SELECT COUNT(*) as count FROM signals walks every row: SQLite keeps no stored row count, so an unfiltered COUNT(*) has to traverse the table (or its smallest index), and the sqlite_stat tables written by ANALYZE only feed the query planner. Since this is an on-demand diagnostic endpoint it's probably fine, but if you ever add a monitoring loop that calls this repeatedly, the row count will become the expensive part. Consider making it opt-in via a query param (?include_count=true) if the endpoint evolves toward periodic polling.
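The opt-in could look roughly like the sketch below. `rowCountIfRequested` is a hypothetical helper, and `countRows` stands in for whatever actually runs the COUNT(*) query; only the `include_count` parameter name comes from the suggestion above.

```typescript
// Hypothetical helper: returns the row count only when the caller opted in.
function rowCountIfRequested(
  includeCountParam: string | null, // raw value of ?include_count, or null if absent
  countRows: () => number,          // stand-in for the function that runs COUNT(*)
): number | null {
  // Only the literal "true" opts in, so a bare health probe never triggers the scan.
  return includeCountParam === "true" ? countRows() : null;
}
```

Passing the count as a callback keeps the scan lazy: when the parameter is absent, the query is never issued at all.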

[nit] (countRows[0] as { count: number }).count works fine given the DO SQLite API, but a typed intermediate (const row = countRows[0] as { count: number }) reads slightly cleaner. Not worth a change request.

Code quality notes:
The EXPECTED_SIGNALS_INDEXES pattern is reusable; the PR description notes landing-page needs the same guardrail. Worth a follow-up issue to extend this to claims/earnings tables once the diagnosis confirms whether their version-gated indexes are also at risk. A brief `// TODO: extend to claims/earnings when confirmed missing` in the constant would make that intent discoverable without a separate doc.

Operational context:
We've been watching NewsDO as the top rows-read surface since the landing-page D1 fix landed (~1.8B rows/day). The silent-migration failure mode is confirmed real (migration 10, v12/v13 per the code comments). The schema-health endpoint fills the exact gap that made the landing-page incident hard to detect — DO's console.* not reaching worker-logs means this kind of drift was invisible until cost reports showed anomalies. Good fix, good guardrail.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e00b9d20e6

Comment thread src/objects/schema.ts Outdated

Copilot AI left a comment

Pull request overview

This PR adds self-healing and diagnostic coverage for cost-critical signals indexes in NewsDO SQLite, aiming to prevent silently skipped migrations from causing expensive full-table scans.

Changes:

  • Reasserts additional signals indexes in the always-applied base schema.
  • Adds EXPECTED_SIGNALS_INDEXES and a DO /schema-health diagnostic route.
  • Exposes the diagnostic publicly at GET /api/config/schema-health.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/objects/schema.ts | Adds hot-path index assertions and the expected signals index list. |
| src/objects/news-do.ts | Adds schema-health introspection over live SQLite indexes. |
| src/lib/do-client.ts | Adds the schema-health response type and DO client helper. |
| src/routes/config.ts | Exposes schema-health through the public config router. |

Comment thread src/objects/schema.ts Outdated
Comment thread src/objects/news-do.ts Outdated
Comment thread src/routes/config.ts
…ount, tests

- P1 (Codex/Copilot): remove idx_signals_quality_score from base SCHEMA_SQL.
  SCHEMA_SQL runs before versioned migrations; quality_score is added by
  migration #24's ALTER. On a DO where that ALTER silently failed, a base-schema
  index on the column would throw no-such-column and brick DO construction. The
  index stays in migration #24 where the column is guaranteed to exist; removed
  from EXPECTED_SIGNALS_INDEXES. The other 4 indexes reference columns already
  used by pre-existing base indexes, so they provably exist live.
- COUNT(*) (Copilot/arc0btc): signals_row_count is now opt-in via
  ?include_count=true. The default public/unthrottled health check does no
  full-table scan; signals_row_count is null unless requested.
- Tests (Copilot): add /api/config/schema-health coverage asserting healthy +
  every EXPECTED_SIGNALS_INDEXES present (guards against EXPECTED drift) and the
  default-vs-include_count behavior.
- arc0btc: typed intermediate for the count row; TODO to extend self-heal +
  EXPECTED set to claims/earnings once /schema-health confirms their risk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@whoabuddy
Contributor Author

Thanks @arc0btc — all three points addressed in 70689a4:

  • [question] COUNT(*) on the hot diagnostic path — agreed, and since the route is public + unthrottled I made the count opt-in rather than just guarding for a future polling loop. signals_row_count is now null by default (no scan); callers pass ?include_count=true to run the COUNT(*). So routine/probing health checks add zero rows-read. (This also resolved Copilot's identical inline note.)
  • [nit] typed intermediate — done; the count row is now const countRow = …toArray()[0] as { count: number } before reading .count.
  • [code quality] extend to claims/earnings: added a `// TODO: extend this self-heal + EXPECTED set to claims/earnings indexes once /schema-health confirms their version-gated indexes are also at risk` comment right above EXPECTED_SIGNALS_INDEXES, so the intent is discoverable. I'll file the follow-up issue once /schema-health is live and we can confirm from production whether their indexes actually dropped (rather than speculatively widening scope now).

Separately, Codex/Copilot flagged a real P1 I'd missed: idx_signals_quality_score can't live in base SCHEMA_SQL because that runs before migration #24's quality_score ALTER — on a DO where that ALTER silently failed it would throw no such column and brick construction. Removed it from base + the expected set (it stays in migration #24); the other four indexes only reference columns already used by pre-existing base indexes, so they're safe pre-migration. Added /api/config/schema-health test coverage too. Full suite green (420 tests).
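The rule behind the P1 fix (a base-schema index may only reference columns that are certain to exist before any migration runs) can be expressed as a small check. This is a sketch, not code from the PR: `partitionBaseSafe`, `IndexSpec`, and the contents of `provenColumns` are hypothetical, and of the index names used in the usage note only idx_signals_quality_score appears in the thread.

```typescript
// Hypothetical guard: splits candidate indexes by whether every referenced column
// is already known to exist before versioned migrations run.
interface IndexSpec {
  name: string;
  columns: string[];
}

function partitionBaseSafe(
  candidates: readonly IndexSpec[],
  provenColumns: ReadonlySet<string>, // e.g. columns already used by pre-existing base indexes
): { safe: IndexSpec[]; unsafe: IndexSpec[] } {
  const safe: IndexSpec[] = [];
  const unsafe: IndexSpec[] = [];
  for (const idx of candidates) {
    const allProven = idx.columns.every((col) => provenColumns.has(col));
    (allProven ? safe : unsafe).push(idx);
  }
  return { safe, unsafe };
}
```

With an assumed `provenColumns` of status, created_at, and beat_slug, an index on (status, created_at) lands in `safe`, while idx_signals_quality_score on (quality_score) lands in `unsafe` and would stay in its versioned migration.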

@whoabuddy whoabuddy merged commit c3fc995 into main May 27, 2026
7 checks passed
@whoabuddy whoabuddy deleted the fix/newsdo-self-healing-indexes branch May 27, 2026 19:41