fix(reviewer): verdict severity tier + diagnostic write + UI banner + SSR lineage #96

Merged
chorus-codes merged 1 commit into main from fix/verdict-severity-and-diagnostics
May 28, 2026

@chorus-codes
Owner

Summary

Five connected defects in the chorus reviewer chain, surfaced by live bowerbird chat 019E6E3318873C26DCA60409B84F90E9:

  • Gemini-3.1-pro-preview wrote 29 lines of ### CRITICAL / ### HIGH findings, billed $0.05, was falsely marked verdict_ambiguous → fallback chain fired → collided with kimi's prior claim on claude-sonnet-4-6 → chain exhausted. UI rendered DONE + gemini's real content + three contradictory amber banners (LINEAGE_FALLBACK + FALLBACK_COLLISION + "actually ran" badge on a voice that didn't run).
  • Kimi-k2.6 same chat: 0-byte answer.md, empty _attempts.jsonl, daemon log shows only fallback fired — no record of why kimi failed.
  • Phantom Reviewer · CLAUDE card next to the real ANTIGRAVITY · gemini-3.5-flash placeholder on initial page load (appears on refresh, vanishes after seconds when client-side /api/run-artifacts poll lands).

Fixes

| # | Fix | Files |
|---|-----|-------|
| 1 | Severity-style reviews (`### CRITICAL` / `**HIGH**` with >=200-char body) → implicit `request_changes`. Trailing lookahead requires a heading terminator (`:`, `\n`, `**`, EOL) so `### High-level review` etc. don't false-positive | `src/daemon/runner/verdict.ts` |
| 2 | Diagnostic write helper called from every null-return path (errored, new `empty_no_error`, `verdict_ambiguous`) and thrown-exception catch. Closes the silent-failure gap: zero rows in `_attempts.jsonl` is now a true bug signal | `reviewer.ts`, `reviewer-driver.ts` |
| 3 | API derives per-swap `actuallyRan: boolean` from the slot's `_stats.json` lineage+model stamp. UI suppresses banner block when primary produced the answer and no swap actually ran. Per-entry `actuallyRan` (not `isLast`) drives the strikethrough + badge | `route.ts`, `participant-card.tsx`, `types.ts` |
| 4 | SSR `readChatRounds` had its own inline `AGENT_TO_LINEAGE` missing `antigravity-cli` / `grok-cli`, defaulting unknown agents to literal `"claude"`. Now uses the shared `AGENT_TO_UI_LINEAGE` map and passes raw agent name through on miss | `src/app/runs/[runId]/page.tsx` |
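
For reviewers who want a concrete picture of fix 3, here is a minimal sketch of how a per-swap `actuallyRan` flag could be derived and used to gate the banner. The `Swap`, `StatsStamp`, and `AnnotatedSwap` shapes and both function names are hypothetical, and matching `to` against either the lineage or the model is an assumption; the real code in `route.ts` and `participant-card.tsx` may differ.

```typescript
// Hypothetical shapes: only `to` and `actuallyRan` are named in this PR.
interface Swap {
  from: string;
  to: string;
}

// Assumed shape of the lineage+model stamp written into _stats.json.
interface StatsStamp {
  lineage: string;
  model: string;
}

interface AnnotatedSwap extends Swap {
  actuallyRan: boolean;
}

// A swap "actually ran" only if its target matches what the slot's final
// stats stamp says produced the answer. A missing stamp means no swap ran.
function annotateSwaps(
  swaps: readonly Swap[],
  stamp: StatsStamp | null,
): AnnotatedSwap[] {
  return swaps.map((swap) => ({
    ...swap,
    actuallyRan:
      stamp !== null &&
      (swap.to === stamp.lineage || swap.to === stamp.model),
  }));
}

// Suppress the banner block only when the primary produced the displayed
// answer AND no recorded swap actually ran.
function shouldShowSwapBanner(
  primaryProducedAnswer: boolean,
  swaps: readonly AnnotatedSwap[],
): boolean {
  const anySwapRan = swaps.some((swap) => swap.actuallyRan);
  return !(primaryProducedAnswer && !anySwapRan);
}
```

Under these assumptions, the bowerbird case (a recorded swap to `claude-sonnet-4-6` while the stamp still names the gemini primary) yields `actuallyRan: false` and no banner.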

Chorus self-review

Fired on this exact diff (chat 019E6E7A5D1DF943AD275D5595460D9A). 5/7 reviewers approve. Codex and gemini convergently flagged the severity-regex hyphen-suffix issue (### High-level review would match) — applied trailing-lookahead fix and added 4 false-positive test cases.
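
The hyphen-suffix issue is easiest to see with a small regex. The following is a sketch of the trailing-lookahead idea, not the pattern in `verdict.ts`: the names `SEVERITY_HEADER`, `MIN_BODY_CHARS`, and `impliesRequestChanges` are hypothetical, the severity list is limited to the two levels named in this PR, and treating the 200-character threshold as the length of the whole trimmed review is an assumption.

```typescript
// Assumed threshold: the PR says ">=200-char body"; here that is taken to
// mean the trimmed review text as a whole.
const MIN_BODY_CHARS = 200;

// Matches `### CRITICAL` or `**HIGH**` style headers at the start of a line.
// The trailing lookahead requires a heading terminator (":", "**", a newline,
// or end of input), so "### High-level review" does not match.
const SEVERITY_HEADER =
  /^[ \t]*(?:#{1,6}[ \t]*|\*\*)(?:critical|high)(?=[ \t]*(?::|\*\*|\r?\n|$))/im;

// Returns true when a review with no explicit verdict should be treated as
// an implicit request_changes under the assumptions above.
function impliesRequestChanges(review: string): boolean {
  return (
    review.trim().length >= MIN_BODY_CHARS && SEVERITY_HEADER.test(review)
  );
}
```

Without the lookahead, the `high` alternative would match the first four letters of `High-level`, which is the false positive codex and gemini flagged.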

Test plan

  • pnpm typecheck clean
  • pnpm test — 993 pass / 2 skipped (was 989; +4 hyphen-suffix false-positive cases including verbatim replay of the real bowerbird gemini review)
  • pnpm build && pnpm build:server clean
  • Daemon restarted on the new build; cockpit + daemon both HTTP 200
  • Existing bowerbird chat API now returns actuallyRan: false on the gemini swap → UI suppresses banner on reload
  • New diagnostic write verified live: [reviewer] attempt failed kind=verdict_ambiguous for qwen3.7-max in the chorus self-review chat
  • SSR antigravity participant now serialised with lineage: "antigravity" instead of falling through to "claude"
  • Verify in browser: hard-reload /runs/<chat> and confirm no phantom CLAUDE card on initial load
  • Verify in browser: gemini severity-style review classifies as request_changes end-to-end on a fresh chat
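
The diagnostic-write item in the plan above depends on every failure path appending one row and emitting one log line. A minimal sketch of such a helper follows; the signature, the `AttemptRow` fields, the `"exception"` kind, and the injectable `append` parameter are all assumptions made for illustration, and only the file name `_attempts.jsonl`, the helper name `writeAttemptRow`, the three named kinds, and the log line format come from this PR.

```typescript
import { appendFileSync } from "node:fs";
import { join } from "node:path";

// "exception" is an assumed label for the thrown-exception path.
type FailureKind =
  | "errored"
  | "empty_no_error"
  | "verdict_ambiguous"
  | "exception";

// Hypothetical row shape; the real columns are not shown in the PR.
interface AttemptRow {
  ts: string;
  model: string;
  kind: FailureKind;
  detail?: string;
}

type Append = (path: string, data: string) => void;

// Appends one JSON line to <slotDir>/_attempts.jsonl and logs a [reviewer]
// line. `append` is injectable so tests can avoid touching the filesystem.
function writeAttemptRow(
  slotDir: string,
  model: string,
  kind: FailureKind,
  detail?: string,
  append: Append = appendFileSync,
): AttemptRow {
  const row: AttemptRow = { ts: new Date().toISOString(), model, kind };
  if (detail !== undefined) row.detail = detail;
  append(join(slotDir, "_attempts.jsonl"), JSON.stringify(row) + "\n");
  console.error(`[reviewer] attempt failed kind=${kind} for ${model}`);
  return row;
}
```

Calling this from each null-return branch and from the catch block is what makes an empty `_attempts.jsonl` meaningful: if nothing was written, no failure path ran.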

… SSR lineage

Five connected defects surfaced by live bowerbird chat
019E6E3318873C26DCA60409B84F90E9 (gemini wrote 29 lines of severity findings,
billed $0.05, was falsely marked verdict_ambiguous → fallback chain fired →
collision → contradictory UI banners; kimi failed silently with no
diagnostic trace; SSR phantom CLAUDE card for antigravity slots).

1. verdict.ts — recognise `### CRITICAL` / `**HIGH**` headers (with >=200-
   char body) as implicit request_changes. Trailing lookahead requires a
   heading terminator (`:`, `\n`, `**`, EOL) so prose like
   `### High-level review` or `**High-traffic endpoint**` doesn't false-
   positive. Codex + gemini both caught the hyphen-suffix issue in the
   chorus self-review 019E6E7A5D1DF943AD275D5595460D9A.

2. reviewer.ts + reviewer-driver.ts — factored `writeAttemptRow()` helper.
   Every null return (errored, new `empty_no_error`, `verdict_ambiguous`)
   AND every thrown exception now writes a row to `_attempts.jsonl` plus a
   `[reviewer]` daemon log line. Zero rows in a slot's _attempts.jsonl is
   now a true bug signal, not a diagnostic gap. Also stamps
   lineage+model into `_stats.json` on every successful completion.

3. run-artifacts route + participant-card — API derives per-swap
   `actuallyRan: boolean` by matching the swap's `to` against the slot's
   final _stats.json lineage stamp. UI suppresses the swap banner block
   entirely when primary produced the displayed answer AND no swap
   actually ran. Per-entry `actuallyRan` (not `isLast`) drives the
   strikethrough + "actually ran" badge.

4. runs/[runId]/page.tsx — SSR readChatRounds had its own inline
   AGENT_TO_LINEAGE missing `antigravity-cli` / `grok-cli` and defaulting
   unknown agents to literal "claude". Initial server-rendered HTML
   classified antigravity participants as claude → phantom CLAUDE card
   alongside the synthesised ANTIGRAVITY placeholder, vanishing on the
   first client-side /api/run-artifacts poll. Now uses the shared
   AGENT_TO_UI_LINEAGE map and passes raw agent name through on miss.
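
The pass-through-on-miss behaviour in item 4 can be sketched as below. This is an illustration only: the map here is a local stand-in for the shared `AGENT_TO_UI_LINEAGE`, its `Map` representation and the `grok-cli` entry are assumptions, and `uiLineageFor` is a hypothetical helper name.

```typescript
// Stand-in for the shared map. Only the antigravity lineage value is
// confirmed by this PR; the grok entry is an assumed illustration.
const AGENT_TO_UI_LINEAGE: ReadonlyMap<string, string> = new Map([
  ["antigravity-cli", "antigravity"],
  ["grok-cli", "grok"],
]);

// On a miss, return the raw agent name instead of a hard-coded default.
// The old inline map fell back to the literal "claude", which is what
// produced the phantom CLAUDE card in server-rendered HTML.
function uiLineageFor(agent: string): string {
  return AGENT_TO_UI_LINEAGE.get(agent) ?? agent;
}
```

Passing the raw name through means an unmapped agent shows up under its own label rather than being silently merged into another lineage.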

Tests: 993 pass (was 989; added 4 hyphen-suffix false-positive cases to
verdict tests, including a verbatim replay of the real bowerbird gemini
review classifying as request_changes).
@chorus-codes chorus-codes merged commit 8fbe6c3 into main May 28, 2026
2 checks passed
@chorus-codes chorus-codes deleted the fix/verdict-severity-and-diagnostics branch May 28, 2026 12:37