Diagnose: durable history, radar diagnose CLI, instant context#1081
Diagnose: durable history, radar diagnose CLI, instant context#1081nadaverell wants to merge 11 commits into
Conversation
Runs were RAM-only: a Radar restart lost every transcript, verdict, and the sessionId that makes "continue in your agent" work — even though the agent CLIs' own sessions survive on disk. Persist them to ~/.radar/ai-runs.db (modernc sqlite, 0700/0600, default-on, --ai-history / aiHistory config to disable). Design (settled via plan-loop, 2 codex cross-review cycles): - Two tables: runs (summary_json) + run_events keyed (run_id, seq). Writes go through an async single-writer goroutine, enqueued under the run mutex so per-run order matches seq; terminal events commit WITH their status in one transaction, so crash recovery can trust that a "running" row has no terminal marker. Loads drain the queue first (read-your-writes). - Hydration: run rows load eagerly at startup (seeds nextID past persisted ids; interrupted "running" rows flip to error + a restart marker; Cursor sessions are cleared — workspace-scoped resume can't survive the temp workdir). Event logs hydrate lazily on first Subscribe/turn/Stop. - History from another kube-context loads view-only: a lazy sweep (get/List) marks it stale with store-assigned terminal markers, and beginTurn refuses cross-context turns. - Shutdown marks in-flight runs stopped BEFORE cancelling so goroutines don't persist spurious "context canceled" errors, then drains and closes the store. - Retention: 100 runs / 30 days; eviction deletes rows. Store failure degrades loudly to memory-only (log + historyDegraded on the list response + panel notice). - UX: consent card now discloses local retention; Settings gains a confirmed "Clear history" action (POST /api/diagnose/history/clear — wipes finished runs, live ones survive and re-persist); stale copy updated.
- Persist the running transition at beginTurn, so a crash mid-FOLLOW-UP (on a run whose row was already terminal) still recovers with the restart marker instead of replaying an unterminated turn under a done status. - A failed transcript load no longer marks the run hydrated: appending against an unknown prefix would re-sequence from 1 and overwrite stored history. Follow-ups refuse with ErrHistoryUnavailable; Subscribe returns an immediately-closed stream so the client's EventSource retry re-attempts hydration. - Graceful Shutdown appends a terminal "Radar was shutting down" marker (one transaction with the stopped status) before cancelling agents — replay can never end mid-turn and spin the UI. - ClearHistory is now one keep-aware DELETE transaction instead of wipe-then-replay, so a crash mid-clear can't lose a live investigation. - Bump consent keys (v3 / cursor-v2): transcripts persisting to local history is a material disclosure change, so prior consent doesn't carry over. Each fix is pinned by a new regression test (interrupted follow-up, graceful shutdown roundtrip, hydration-failure refusal, clear-keeps-running).
…irst clear - A second context switch no longer re-terminalizes an already-stale run (its persisted log must keep ending in the closed sentinel; appending after it broke the replay contract durably). - ClearHistory deletes from the store FIRST and drops memory only on success — the old order could show an empty list while an intact DB resurrected everything on restart. Dropped runs detach from the store before finalize so the closed sentinel can't re-create their just-deleted rows.
…dable DB Bugbot round 2: - OpenRunStore failure left HistoryDegraded() false (it required a non-nil store) — the loudest failure mode produced no UI notice. The server now marks the manager unavailable so the panel says history won't survive a restart. - A LoadRuns failure showed empty history over an intact DB — and worse, new runs would mint colliding run-N ids and INSERT OR REPLACE the stored transcripts. The manager now closes and detaches the store on load failure (memory-only, degraded surfaced) rather than writing against unknown contents.
Bugbot round 3: when the history DB failed to open or load, ClearHistory only touched memory and reported success — a later healthy startup would resurrect investigations the user had "cleared". The manager now remembers the broken DB's path and Clear removes the file (+wal/+shm); deleting IS the recovery for an unusable DB. Errors surface instead of a false success.
…link Two additions that round out the feature as one deliverable: radar diagnose <kind>/<name> [-n ns] — a terminal client for the running Radar instance (discovered via ~/.radar/mcp-port, or --server). It starts the SAME durable server-side run the web panel uses: streamed transcript with tool timings, verdict with ★-recommended remediation, --json for CI (clean stdout, progress on stderr), --open to watch in the UI, kubectl-style kind aliases, interleaved flags (flag parsing doesn't stop at the positional), one-time consent prompt on real TTYs (term.IsTerminal — ModeCharDevice misreads /dev/null), and actionable no-server/disabled errors. Instant context card — the transcript now opens with "Radar's read at start" (active issues + top reason, audit findings, managedBy) rendered from the health frame the server captured at run start. The agent's boot time reads as context-then-deepening instead of dead air, and the verdict stays anchored against Radar's own signal. ?ai-run=<id> deep link — opens the panel focused on a run (what `radar diagnose --open` opens); consumed once and stripped from the URL.
…nner
The "Radar's read at start" card rendered aggregates ("2 active issues —
Unschedulable") — vague, when the issues engine had the full sentences all
along. The health frame now carries the top actual rows (3 issues + 2 audit
findings, engine messages capped at 220 chars), so:
- the context card shows "Unschedulable — 2 node(s) insufficient cpu
(0/2 nodes available)" with per-row severity dots and +N-more lines;
- the first-turn prompt frames the agent with the same specifics;
- `radar diagnose` prints them under its header before the agent boots.
The CLI also gains a thinking indicator: on real TTYs, >1s of silence shows a
live "⠙ · thinking… 12s" line (erased before any real output, suppressed
mid-reasoning-line and under NO_COLOR/pipes) — the model's long quiet gaps
read as work, not a hang. Renderer writes are now mutex-serialized against
the spinner goroutine.
The wait line now speaks the web panel's vocabulary instead of a flat
"thinking" — while a tool runs the spinner narrates it ("reading logs… 4s",
"tracing dependencies…", from the step-running events), "starting
investigation…" covers the agent boot, and "thinking…" only the genuine
model gaps. Spinner glyph turned accent-amber.
Output pass, element by element:
- tool args render as terse k=v pairs (kind/namespace/name first) instead of
raw JSON braces and quotes;
- target in the header is bold-cyan; the meta line drops the "run run-4"
doubling;
- confidence band is color-coded (green/amber/red), not just dim text;
- non-recommended remediation numbers dim so the ★ pick carries the row;
"★ recommended —" phrasing;
- rewrote render.go with the ANSI escapes as escape TEXT — a previous edit
had embedded raw CR/ESC control bytes into the source string literals
(compiled, but invisible-character source rot).
…of raw 404 Multiple radar instances share ~/.radar/mcp-port (last writer wins), so discovery can land on a stale pre-Diagnose instance — which answers 404 on /api/agents. Explain that and point at --server rather than printing the bare status.
--standalone (and automatic fallback when discovery finds nothing listening) boots a temporary in-process Radar for the one investigation: headless, random port, timeline in memory, --kubeconfig to pick a cluster. It skips the ~/.radar/mcp-port claim entirely (new app.DisableMCPPortFile guards both write AND the shutdown-time removal, so an ephemeral run can't clobber or delete a real instance's discovery slot) — but SHARES ai-runs.db, so a cold run's transcript, verdict, and session land in the same history the UI shows (run ids continue the persisted sequence). Boot UX: consent prompt BEFORE the cluster connect, a single "starting a temporary Radar — connecting to cluster… 12s" spinner during it, and all of Radar's boot + request logging (stdlib AND chi's own request logger, which binds os.Stdout at init) captured to a 64KB tail buffer that only surfaces on failure. Readiness = the diagnose endpoints stopping 503 (requireConnected), with a 2-minute ceiling; an explicit --server that isn't answering errors instead of falling back. Verified cold end-to-end against EKS with a stubbed agent: fallback note → connect → clean transcript → verdict → exit 0, no port file written.
client-go logs (apiserver deprecation warnings etc.) go through klog, which writes directly to stderr and bypasses the stdlib-log redirect — in standalone mode they stomped the streaming transcript. Route klog into the same tail buffer, mirroring main()'s server-path setup (the subcommand exits before reaching it).
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit b412dc4. Configure here.
| m.mu.Unlock() | ||
| if m.store != nil { | ||
| m.store.SaveRun(r.Summary()) | ||
| } |
There was a problem hiding this comment.
Shared DB duplicate run IDs
High Severity
With --standalone, ephemeral Radar persists to the same ~/.radar/ai-runs.db as a long-running instance while minting run-N IDs only in each process’s memory. Two processes can assign the same ID and upsert the same row, overwriting an existing investigation’s summary and transcript.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit b412dc4. Configure here.
| if r.status == "stale" || r.inFlight { | ||
| r.mu.Unlock() | ||
| return | ||
| } |
There was a problem hiding this comment.
Foreign sweep skips in-flight
Medium Severity
When the foreign-context sweep runs, markStale returns immediately if the run is in flight, so a follow-up started before the run was marked stale can keep executing. beginTurn also never checks that the run’s stored Context matches the current kube-context label.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit b412dc4. Configure here.


Stacked on #947 (Diagnose with AI) — merge that first, then retarget/rebase this to main. Together they ship as one feature.
Three pieces:
1. Runs persist to SQLite (
~/.radar/ai-runs.db)Runs were RAM-only: a restart lost every transcript, verdict, and the
sessionIdbehind "continue in your agent" — even though the agent CLIs' own sessions survive on disk. Default-on, dir 0700 / file 0600,--ai-history/aiHistoryconfig to disable.runs+run_events (run_id, seq)); async single-writer enqueued under the run mutex; terminal events commit with their status in one transaction so crash recovery trusts the status column; loads drain the queue (read-your-writes).nextID; interruptedrunningrows — initial turn or follow-up — flip to error with a restart marker; Cursor sessions cleared: workspace-scoped resume can't survive the temp workdir). Event logs hydrate lazily; a failed load never marks hydrated (appending against an unknown prefix would overwrite history — follow-ups refuse, Subscribe returns a closed stream for EventSource retry).historyDegradedon the list response + panel notice — and a broken DB's files are removed by Clear (that's also the recovery).POST /api/diagnose/history/clear, one keep-aware transaction; live runs survive).2.
radar diagnose <kind>/<name>— the terminal front-endA thin client for the running instance (discovered via
~/.radar/mcp-port, or--server) that starts the same durable server-side run the panel uses:--jsonfor CI (clean stdout, progress on stderr) ·--opento watch in the UI (new?ai-run=<id>deep link, consumed once + stripped) · kubectl-style kind aliases (po,deploy,sts, …) · interleaved flags · one-time consent prompt on real TTYs (term.IsTerminal; non-interactive callers get the disclosure and proceed) · actionable errors when no Radar is running or AI is disabled.3. Instant context card
The transcript opens with "Radar's read at start" — active issues + top reason, audit findings, managedBy — rendered from the health frame the server captured at run start. Fills the agent's boot time (~10-20s of CLI startup + first-token latency) with real substance and anchors the verdict against Radar's own signal.
Verification
go test ./internal/ai ./internal/diagnosecli(incl.-race): store roundtrip/auto-seq/perms, restart replay parity, interrupted initial turn and follow-up, graceful-shutdown roundtrip, hydration-failure refusal, foreign-context sweep, Cursor non-resumable, eviction deletes rows, clear-keeps-running, degraded-visibility (open + load failure), kind aliases, interleaved flag parsing, all verdict shapes render.radar diagnose pod/web -n defaultstreaming and--jsonruns;?ai-rundeep link opens the panel focused (screenshot-verified); context card renders issues/audit/managedBy on a real unschedulable deployment.Note
Medium Risk
New local SQLite persistence and transcript hydration paths affect restart recovery and follow-up sequencing; failures are designed to degrade safely but warrant careful review of edge cases (broken DB, context switches, shutdown).
Overview
Adds durable AI investigation history (default-on SQLite at
~/.radar/ai-runs.db,--ai-history/ config to disable).RunManagerpersists summaries and event logs with lazy transcript hydration, crash/shutdown recovery, cross-context staleness, retention/eviction, and loud degradation when the DB fails; the UI showshistoryDegraded, updated consent, Clear history in Settings, andPOST /api/diagnose/history/clear.Introduces
radar diagnose <kind>/<name>as a terminal client against a running instance (or ephemeral boot): streams the same server-side run as the panel, with--json,--open+?ai-run=deep link, and consent on first use.Health signals now carry capped issue/audit rows for prompts and a “Radar's read at start” context card in the UI/CLI so investigations open with concrete Radar findings instead of only aggregate counts.
Reviewed by Cursor Bugbot for commit b412dc4. Bugbot is set up for automated code reviews on this repo. Configure here.