Skip to content

Diagnose: durable history, radar diagnose CLI, instant context#1081

Open
nadaverell wants to merge 11 commits into
feature/radar-opensre-diagnosefrom
feature/radar-ai-run-persistence
Open

Diagnose: durable history, radar diagnose CLI, instant context#1081
nadaverell wants to merge 11 commits into
feature/radar-opensre-diagnosefrom
feature/radar-ai-run-persistence

Conversation

@nadaverell

@nadaverell nadaverell commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Stacked on #947 (Diagnose with AI) — merge that first, then retarget/rebase this to main. Together they ship as one feature.

Three pieces:

1. Runs persist to SQLite (~/.radar/ai-runs.db)

Runs were RAM-only: a restart lost every transcript, verdict, and the sessionId behind "continue in your agent" — even though the agent CLIs' own sessions survive on disk. Default-on, dir 0700 / file 0600, --ai-history / aiHistory config to disable.

  • Two tables (runs + run_events (run_id, seq)); async single-writer enqueued under the run mutex; terminal events commit with their status in one transaction so crash recovery trusts the status column; loads drain the queue (read-your-writes).
  • Run rows hydrate eagerly (seeds nextID; interrupted running rows — initial turn or follow-up — flip to error with a restart marker; Cursor sessions cleared: workspace-scoped resume can't survive the temp workdir). Event logs hydrate lazily; a failed load never marks hydrated (appending against an unknown prefix would overwrite history — follow-ups refuse, Subscribe returns a closed stream for EventSource retry).
  • Cross-context history loads view-only (lazy staleness sweep, store-assigned terminal markers). Graceful shutdown appends a terminal marker before cancelling agents. Retention 100 runs / 30 days; eviction deletes rows. Any persistence failure (open/load/write) degrades loudly: log + historyDegraded on the list response + panel notice — and a broken DB's files are removed by Clear (that's also the recovery).
  • Consent card discloses local retention (keys bumped — existing installs re-acknowledge once); Settings gains a confirmed Clear history (POST /api/diagnose/history/clear, one keep-aware transaction; live runs survive).

2. radar diagnose <kind>/<name> — the terminal front-end

A thin client for the running instance (discovered via ~/.radar/mcp-port, or --server) that starts the same durable server-side run the panel uses:

$ radar diagnose pod/checkout-6f4d -n prod
◉ Investigating Pod prod/checkout-6f4d
run run-3 · via Claude Code · watch: http://localhost:9280/?ai-run=run-3

<streamed reasoning + ✓ tool calls with timings>

▲ Root cause · confidence high
...
Remediation
  ★1. ...

--json for CI (clean stdout, progress on stderr) · --open to watch in the UI (new ?ai-run=<id> deep link, consumed once + stripped) · kubectl-style kind aliases (po, deploy, sts, …) · interleaved flags · one-time consent prompt on real TTYs (term.IsTerminal; non-interactive callers get the disclosure and proceed) · actionable errors when no Radar is running or AI is disabled.

3. Instant context card

The transcript opens with "Radar's read at start" — active issues + top reason, audit findings, managedBy — rendered from the health frame the server captured at run start. Fills the agent's boot time (~10-20s of CLI startup + first-token latency) with real substance and anchors the verdict against Radar's own signal.

Verification

  • go test ./internal/ai ./internal/diagnosecli (incl. -race): store roundtrip/auto-seq/perms, restart replay parity, interrupted initial turn and follow-up, graceful-shutdown roundtrip, hydration-failure refusal, foreign-context sweep, Cursor non-resumable, eviction deletes rows, clear-keeps-running, degraded-visibility (open + load failure), kind aliases, interleaved flag parsing, all verdict shapes render.
  • Live end-to-end (stubbed agent CLI, temp HOME): run → kill radar → restart → list + full transcript replay + follow-up seq continuity + clear; radar diagnose pod/web -n default streaming and --json runs; ?ai-run deep link opens the panel focused (screenshot-verified); context card renders issues/audit/managedBy on a real unschedulable deployment.
  • Review trail: plan settled via 2 codex cross-review cycles; implementation passed 1 codex adversarial round (5/5 findings fixed) + 3 Bugbot rounds (5/5 findings fixed) — all pinned by regression tests.

Note

Medium Risk
New local SQLite persistence and transcript hydration paths affect restart recovery and follow-up sequencing; failures are designed to degrade safely but warrant careful review of edge cases (broken DB, context switches, shutdown).

Overview
Adds durable AI investigation history (default-on SQLite at ~/.radar/ai-runs.db, --ai-history / config to disable). RunManager persists summaries and event logs with lazy transcript hydration, crash/shutdown recovery, cross-context staleness, retention/eviction, and loud degradation when the DB fails; the UI shows historyDegraded, updated consent, Clear history in Settings, and POST /api/diagnose/history/clear.

Introduces radar diagnose <kind>/<name> as a terminal client against a running instance (or ephemeral boot): streams the same server-side run as the panel, with --json, --open + ?ai-run= deep link, and consent on first use.

Health signals now carry capped issue/audit rows for prompts and a “Radar's read at start” context card in the UI/CLI so investigations open with concrete Radar findings instead of only aggregate counts.

Reviewed by Cursor Bugbot for commit b412dc4. Bugbot is set up for automated code reviews on this repo. Configure here.

Runs were RAM-only: a Radar restart lost every transcript, verdict, and the
sessionId that makes "continue in your agent" work — even though the agent
CLIs' own sessions survive on disk. Persist them to ~/.radar/ai-runs.db
(modernc sqlite, 0700/0600, default-on, --ai-history / aiHistory config to
disable).

Design (settled via plan-loop, 2 codex cross-review cycles):
- Two tables: runs (summary_json) + run_events keyed (run_id, seq). Writes go
  through an async single-writer goroutine, enqueued under the run mutex so
  per-run order matches seq; terminal events commit WITH their status in one
  transaction, so crash recovery can trust that a "running" row has no
  terminal marker. Loads drain the queue first (read-your-writes).
- Hydration: run rows load eagerly at startup (seeds nextID past persisted
  ids; interrupted "running" rows flip to error + a restart marker; Cursor
  sessions are cleared — workspace-scoped resume can't survive the temp
  workdir). Event logs hydrate lazily on first Subscribe/turn/Stop.
- History from another kube-context loads view-only: a lazy sweep (get/List)
  marks it stale with store-assigned terminal markers, and beginTurn refuses
  cross-context turns.
- Shutdown marks in-flight runs stopped BEFORE cancelling so goroutines don't
  persist spurious "context canceled" errors, then drains and closes the store.
- Retention: 100 runs / 30 days; eviction deletes rows. Store failure degrades
  loudly to memory-only (log + historyDegraded on the list response + panel
  notice).
- UX: consent card now discloses local retention; Settings gains a confirmed
  "Clear history" action (POST /api/diagnose/history/clear — wipes finished
  runs, live ones survive and re-persist); stale copy updated.
- Persist the running transition at beginTurn, so a crash mid-FOLLOW-UP (on a
  run whose row was already terminal) still recovers with the restart marker
  instead of replaying an unterminated turn under a done status.
- A failed transcript load no longer marks the run hydrated: appending against
  an unknown prefix would re-sequence from 1 and overwrite stored history.
  Follow-ups refuse with ErrHistoryUnavailable; Subscribe returns an
  immediately-closed stream so the client's EventSource retry re-attempts
  hydration.
- Graceful Shutdown appends a terminal "Radar was shutting down" marker (one
  transaction with the stopped status) before cancelling agents — replay can
  never end mid-turn and spin the UI.
- ClearHistory is now one keep-aware DELETE transaction instead of
  wipe-then-replay, so a crash mid-clear can't lose a live investigation.
- Bump consent keys (v3 / cursor-v2): transcripts persisting to local history
  is a material disclosure change, so prior consent doesn't carry over.

Each fix is pinned by a new regression test (interrupted follow-up, graceful
shutdown roundtrip, hydration-failure refusal, clear-keeps-running).
@nadaverell nadaverell requested a review from hisco as a code owner July 1, 2026 23:47
Comment thread internal/ai/runs.go
Comment thread internal/ai/runs.go Outdated
…irst clear

- A second context switch no longer re-terminalizes an already-stale run
  (its persisted log must keep ending in the closed sentinel; appending after
  it broke the replay contract durably).
- ClearHistory deletes from the store FIRST and drops memory only on success —
  the old order could show an empty list while an intact DB resurrected
  everything on restart. Dropped runs detach from the store before finalize so
  the closed sentinel can't re-create their just-deleted rows.
Comment thread internal/ai/runs.go
Comment thread internal/ai/runs.go
…dable DB

Bugbot round 2:
- OpenRunStore failure left HistoryDegraded() false (it required a non-nil
  store) — the loudest failure mode produced no UI notice. The server now marks
  the manager unavailable so the panel says history won't survive a restart.
- A LoadRuns failure showed empty history over an intact DB — and worse, new
  runs would mint colliding run-N ids and INSERT OR REPLACE the stored
  transcripts. The manager now closes and detaches the store on load failure
  (memory-only, degraded surfaced) rather than writing against unknown contents.
Comment thread internal/ai/runs.go
Bugbot round 3: when the history DB failed to open or load, ClearHistory only
touched memory and reported success — a later healthy startup would resurrect
investigations the user had "cleared". The manager now remembers the broken
DB's path and Clear removes the file (+wal/+shm); deleting IS the recovery for
an unusable DB. Errors surface instead of a false success.
Comment thread internal/ai/runs.go
…link

Two additions that round out the feature as one deliverable:

radar diagnose <kind>/<name> [-n ns] — a terminal client for the running
Radar instance (discovered via ~/.radar/mcp-port, or --server). It starts the
SAME durable server-side run the web panel uses: streamed transcript with tool
timings, verdict with ★-recommended remediation, --json for CI (clean stdout,
progress on stderr), --open to watch in the UI, kubectl-style kind aliases,
interleaved flags (flag parsing doesn't stop at the positional), one-time
consent prompt on real TTYs (term.IsTerminal — ModeCharDevice misreads
/dev/null), and actionable no-server/disabled errors.

Instant context card — the transcript now opens with "Radar's read at start"
(active issues + top reason, audit findings, managedBy) rendered from the
health frame the server captured at run start. The agent's boot time reads as
context-then-deepening instead of dead air, and the verdict stays anchored
against Radar's own signal.

?ai-run=<id> deep link — opens the panel focused on a run (what
`radar diagnose --open` opens); consumed once and stripped from the URL.
@nadaverell nadaverell changed the title Diagnose: persist investigations to SQLite so history survives restarts Diagnose: durable history, radar diagnose CLI, instant context Jul 2, 2026
Comment thread internal/ai/runs.go
…nner

The "Radar's read at start" card rendered aggregates ("2 active issues —
Unschedulable") — vague, when the issues engine had the full sentences all
along. The health frame now carries the top actual rows (3 issues + 2 audit
findings, engine messages capped at 220 chars), so:
- the context card shows "Unschedulable — 2 node(s) insufficient cpu
  (0/2 nodes available)" with per-row severity dots and +N-more lines;
- the first-turn prompt frames the agent with the same specifics;
- `radar diagnose` prints them under its header before the agent boots.

The CLI also gains a thinking indicator: on real TTYs, >1s of silence shows a
live "⠙ · thinking… 12s" line (erased before any real output, suppressed
mid-reasoning-line and under NO_COLOR/pipes) — the model's long quiet gaps
read as work, not a hang. Renderer writes are now mutex-serialized against
the spinner goroutine.
Comment thread internal/ai/runs.go
The wait line now speaks the web panel's vocabulary instead of a flat
"thinking" — while a tool runs the spinner narrates it ("reading logs… 4s",
"tracing dependencies…", from the step-running events), "starting
investigation…" covers the agent boot, and "thinking…" only the genuine
model gaps. Spinner glyph turned accent-amber.

Output pass, element by element:
- tool args render as terse k=v pairs (kind/namespace/name first) instead of
  raw JSON braces and quotes;
- target in the header is bold-cyan; the meta line drops the "run run-4"
  doubling;
- confidence band is color-coded (green/amber/red), not just dim text;
- non-recommended remediation numbers dim so the ★ pick carries the row;
  "★ recommended —" phrasing;
- rewrote render.go with the ANSI escapes as escape TEXT — a previous edit
  had embedded raw CR/ESC control bytes into the source string literals
  (compiled, but invisible-character source rot).
Comment thread web/src/components/diagnose/DiagnoseSurface.tsx
…of raw 404

Multiple radar instances share ~/.radar/mcp-port (last writer wins), so
discovery can land on a stale pre-Diagnose instance — which answers 404 on
/api/agents. Explain that and point at --server rather than printing the
bare status.
Comment thread web/src/components/diagnose/parts.tsx
--standalone (and automatic fallback when discovery finds nothing listening)
boots a temporary in-process Radar for the one investigation: headless,
random port, timeline in memory, --kubeconfig to pick a cluster. It skips
the ~/.radar/mcp-port claim entirely (new app.DisableMCPPortFile guards both
write AND the shutdown-time removal, so an ephemeral run can't clobber or
delete a real instance's discovery slot) — but SHARES ai-runs.db, so a cold
run's transcript, verdict, and session land in the same history the UI shows
(run ids continue the persisted sequence).

Boot UX: consent prompt BEFORE the cluster connect, a single "starting a
temporary Radar — connecting to cluster… 12s" spinner during it, and all of
Radar's boot + request logging (stdlib AND chi's own request logger, which
binds os.Stdout at init) captured to a 64KB tail buffer that only surfaces
on failure. Readiness = the diagnose endpoints stopping 503 (requireConnected),
with a 2-minute ceiling; an explicit --server that isn't answering errors
instead of falling back.

Verified cold end-to-end against EKS with a stubbed agent: fallback note →
connect → clean transcript → verdict → exit 0, no port file written.
Comment thread internal/ai/diagnoser.go
client-go logs (apiserver deprecation warnings etc.) go through klog, which
writes directly to stderr and bypasses the stdlib-log redirect — in
standalone mode they stomped the streaming transcript. Route klog into the
same tail buffer, mirroring main()'s server-path setup (the subcommand exits
before reaching it).

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit b412dc4. Configure here.

Comment thread internal/ai/runs.go
m.mu.Unlock()
if m.store != nil {
m.store.SaveRun(r.Summary())
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shared DB duplicate run IDs

High Severity

With --standalone, ephemeral Radar persists to the same ~/.radar/ai-runs.db as a long-running instance while minting run-N IDs only in each process’s memory. Two processes can assign the same ID and upsert the same row, overwriting an existing investigation’s summary and transcript.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b412dc4. Configure here.

Comment thread internal/ai/runs.go
if r.status == "stale" || r.inFlight {
r.mu.Unlock()
return
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Foreign sweep skips in-flight

Medium Severity

When the foreign-context sweep runs, markStale returns immediately if the run is in flight, so a follow-up started before the run was marked stale can keep executing. beginTurn also never checks that the run’s stored Context matches the current kube-context label.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b412dc4. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant