Diagnose: durable history, radar diagnose CLI, instant context by nadaverell · Pull Request #1081 · skyhook-io/radar

nadaverell · 2026-07-01T23:47:29Z

Stacked on #947 (Diagnose with AI) — merge that first, then retarget/rebase this to main. Together they ship as one feature.

Three pieces:

1. Runs persist to SQLite (`~/.radar/ai-runs.db`)

Runs were RAM-only: a restart lost every transcript, verdict, and the sessionId behind "continue in your agent" — even though the agent CLIs' own sessions survive on disk. Default-on, dir 0700 / file 0600, --ai-history / aiHistory config to disable.

Two tables (runs + run_events (run_id, seq)); async single-writer enqueued under the run mutex; terminal events commit with their status in one transaction so crash recovery trusts the status column; loads drain the queue (read-your-writes).
Run rows hydrate eagerly (seeds nextID; interrupted running rows — initial turn or follow-up — flip to error with a restart marker; Cursor sessions cleared: workspace-scoped resume can't survive the temp workdir). Event logs hydrate lazily; a failed load never marks hydrated (appending against an unknown prefix would overwrite history — follow-ups refuse, Subscribe returns a closed stream for EventSource retry).
Cross-context history loads view-only (lazy staleness sweep, store-assigned terminal markers). Graceful shutdown appends a terminal marker before cancelling agents. Retention 100 runs / 30 days; eviction deletes rows. Any persistence failure (open/load/write) degrades loudly: log + historyDegraded on the list response + panel notice — and a broken DB's files are removed by Clear (that's also the recovery).
Consent card discloses local retention (keys bumped — existing installs re-acknowledge once); Settings gains a confirmed Clear history (POST /api/diagnose/history/clear, one keep-aware transaction; live runs survive).

2. `radar diagnose <kind>/<name>` — the terminal front-end

A thin client for the running instance (discovered via ~/.radar/mcp-port, or --server) that starts the same durable server-side run the panel uses:

$ radar diagnose pod/checkout-6f4d -n prod
◉ Investigating Pod prod/checkout-6f4d
run run-3 · via Claude Code · watch: http://localhost:9280/?ai-run=run-3

<streamed reasoning + ✓ tool calls with timings>

▲ Root cause · confidence high
...
Remediation
  ★1. ...

--json for CI (clean stdout, progress on stderr) · --open to watch in the UI (new ?ai-run=<id> deep link, consumed once + stripped) · kubectl-style kind aliases (po, deploy, sts, …) · interleaved flags · one-time consent prompt on real TTYs (term.IsTerminal; non-interactive callers get the disclosure and proceed) · actionable errors when no Radar is running or AI is disabled.

3. Instant context card

The transcript opens with "Radar's read at start" — active issues + top reason, audit findings, managedBy — rendered from the health frame the server captured at run start. Fills the agent's boot time (~10-20s of CLI startup + first-token latency) with real substance and anchors the verdict against Radar's own signal.

Verification

go test ./internal/ai ./internal/diagnosecli (incl. -race): store roundtrip/auto-seq/perms, restart replay parity, interrupted initial turn and follow-up, graceful-shutdown roundtrip, hydration-failure refusal, foreign-context sweep, Cursor non-resumable, eviction deletes rows, clear-keeps-running, degraded-visibility (open + load failure), kind aliases, interleaved flag parsing, all verdict shapes render.
Live end-to-end (stubbed agent CLI, temp HOME): run → kill radar → restart → list + full transcript replay + follow-up seq continuity + clear; radar diagnose pod/web -n default streaming and --json runs; ?ai-run deep link opens the panel focused (screenshot-verified); context card renders issues/audit/managedBy on a real unschedulable deployment.
Review trail: plan settled via 2 codex cross-review cycles; implementation passed 1 codex adversarial round (5/5 findings fixed) + 3 Bugbot rounds (5/5 findings fixed) — all pinned by regression tests.

Note

Medium Risk
New local SQLite persistence and transcript hydration paths affect restart recovery and follow-up sequencing; failures are designed to degrade safely but warrant careful review of edge cases (broken DB, context switches, shutdown).

Overview
Adds durable AI investigation history (default-on SQLite at ~/.radar/ai-runs.db, --ai-history / config to disable). RunManager persists summaries and event logs with lazy transcript hydration, crash/shutdown recovery, cross-context staleness, retention/eviction, and loud degradation when the DB fails; the UI shows historyDegraded, updated consent, Clear history in Settings, and POST /api/diagnose/history/clear.

Introduces radar diagnose <kind>/<name> as a terminal client against a running instance (or ephemeral boot): streams the same server-side run as the panel, with --json, --open + ?ai-run= deep link, and consent on first use.

Health signals now carry capped issue/audit rows for prompts and a “Radar's read at start” context card in the UI/CLI so investigations open with concrete Radar findings instead of only aggregate counts.

^{Reviewed by Cursor Bugbot for commit b412dc4. Bugbot is set up for automated code reviews on this repo. Configure here.}

Runs were RAM-only: a Radar restart lost every transcript, verdict, and the sessionId that makes "continue in your agent" work — even though the agent CLIs' own sessions survive on disk. Persist them to ~/.radar/ai-runs.db (modernc sqlite, 0700/0600, default-on, --ai-history / aiHistory config to disable). Design (settled via plan-loop, 2 codex cross-review cycles): - Two tables: runs (summary_json) + run_events keyed (run_id, seq). Writes go through an async single-writer goroutine, enqueued under the run mutex so per-run order matches seq; terminal events commit WITH their status in one transaction, so crash recovery can trust that a "running" row has no terminal marker. Loads drain the queue first (read-your-writes). - Hydration: run rows load eagerly at startup (seeds nextID past persisted ids; interrupted "running" rows flip to error + a restart marker; Cursor sessions are cleared — workspace-scoped resume can't survive the temp workdir). Event logs hydrate lazily on first Subscribe/turn/Stop. - History from another kube-context loads view-only: a lazy sweep (get/List) marks it stale with store-assigned terminal markers, and beginTurn refuses cross-context turns. - Shutdown marks in-flight runs stopped BEFORE cancelling so goroutines don't persist spurious "context canceled" errors, then drains and closes the store. - Retention: 100 runs / 30 days; eviction deletes rows. Store failure degrades loudly to memory-only (log + historyDegraded on the list response + panel notice). - UX: consent card now discloses local retention; Settings gains a confirmed "Clear history" action (POST /api/diagnose/history/clear — wipes finished runs, live ones survive and re-persist); stale copy updated.

- Persist the running transition at beginTurn, so a crash mid-FOLLOW-UP (on a run whose row was already terminal) still recovers with the restart marker instead of replaying an unterminated turn under a done status. - A failed transcript load no longer marks the run hydrated: appending against an unknown prefix would re-sequence from 1 and overwrite stored history. Follow-ups refuse with ErrHistoryUnavailable; Subscribe returns an immediately-closed stream so the client's EventSource retry re-attempts hydration. - Graceful Shutdown appends a terminal "Radar was shutting down" marker (one transaction with the stopped status) before cancelling agents — replay can never end mid-turn and spin the UI. - ClearHistory is now one keep-aware DELETE transaction instead of wipe-then-replay, so a crash mid-clear can't lose a live investigation. - Bump consent keys (v3 / cursor-v2): transcripts persisting to local history is a material disclosure change, so prior consent doesn't carry over. Each fix is pinned by a new regression test (interrupted follow-up, graceful shutdown roundtrip, hydration-failure refusal, clear-keeps-running).

…irst clear - A second context switch no longer re-terminalizes an already-stale run (its persisted log must keep ending in the closed sentinel; appending after it broke the replay contract durably). - ClearHistory deletes from the store FIRST and drops memory only on success — the old order could show an empty list while an intact DB resurrected everything on restart. Dropped runs detach from the store before finalize so the closed sentinel can't re-create their just-deleted rows.

…dable DB Bugbot round 2: - OpenRunStore failure left HistoryDegraded() false (it required a non-nil store) — the loudest failure mode produced no UI notice. The server now marks the manager unavailable so the panel says history won't survive a restart. - A LoadRuns failure showed empty history over an intact DB — and worse, new runs would mint colliding run-N ids and INSERT OR REPLACE the stored transcripts. The manager now closes and detaches the store on load failure (memory-only, degraded surfaced) rather than writing against unknown contents.

Bugbot round 3: when the history DB failed to open or load, ClearHistory only touched memory and reported success — a later healthy startup would resurrect investigations the user had "cleared". The manager now remembers the broken DB's path and Clear removes the file (+wal/+shm); deleting IS the recovery for an unusable DB. Errors surface instead of a false success.

…link Two additions that round out the feature as one deliverable: radar diagnose <kind>/<name> [-n ns] — a terminal client for the running Radar instance (discovered via ~/.radar/mcp-port, or --server). It starts the SAME durable server-side run the web panel uses: streamed transcript with tool timings, verdict with ★-recommended remediation, --json for CI (clean stdout, progress on stderr), --open to watch in the UI, kubectl-style kind aliases, interleaved flags (flag parsing doesn't stop at the positional), one-time consent prompt on real TTYs (term.IsTerminal — ModeCharDevice misreads /dev/null), and actionable no-server/disabled errors. Instant context card — the transcript now opens with "Radar's read at start" (active issues + top reason, audit findings, managedBy) rendered from the health frame the server captured at run start. The agent's boot time reads as context-then-deepening instead of dead air, and the verdict stays anchored against Radar's own signal. ?ai-run=<id> deep link — opens the panel focused on a run (what `radar diagnose --open` opens); consumed once and stripped from the URL.

…nner The "Radar's read at start" card rendered aggregates ("2 active issues — Unschedulable") — vague, when the issues engine had the full sentences all along. The health frame now carries the top actual rows (3 issues + 2 audit findings, engine messages capped at 220 chars), so: - the context card shows "Unschedulable — 2 node(s) insufficient cpu (0/2 nodes available)" with per-row severity dots and +N-more lines; - the first-turn prompt frames the agent with the same specifics; - `radar diagnose` prints them under its header before the agent boots. The CLI also gains a thinking indicator: on real TTYs, >1s of silence shows a live "⠙ · thinking… 12s" line (erased before any real output, suppressed mid-reasoning-line and under NO_COLOR/pipes) — the model's long quiet gaps read as work, not a hang. Renderer writes are now mutex-serialized against the spinner goroutine.

The wait line now speaks the web panel's vocabulary instead of a flat "thinking" — while a tool runs the spinner narrates it ("reading logs… 4s", "tracing dependencies…", from the step-running events), "starting investigation…" covers the agent boot, and "thinking…" only the genuine model gaps. Spinner glyph turned accent-amber. Output pass, element by element: - tool args render as terse k=v pairs (kind/namespace/name first) instead of raw JSON braces and quotes; - target in the header is bold-cyan; the meta line drops the "run run-4" doubling; - confidence band is color-coded (green/amber/red), not just dim text; - non-recommended remediation numbers dim so the ★ pick carries the row; "★ recommended —" phrasing; - rewrote render.go with the ANSI escapes as escape TEXT — a previous edit had embedded raw CR/ESC control bytes into the source string literals (compiled, but invisible-character source rot).

…of raw 404 Multiple radar instances share ~/.radar/mcp-port (last writer wins), so discovery can land on a stale pre-Diagnose instance — which answers 404 on /api/agents. Explain that and point at --server rather than printing the bare status.

--standalone (and automatic fallback when discovery finds nothing listening) boots a temporary in-process Radar for the one investigation: headless, random port, timeline in memory, --kubeconfig to pick a cluster. It skips the ~/.radar/mcp-port claim entirely (new app.DisableMCPPortFile guards both write AND the shutdown-time removal, so an ephemeral run can't clobber or delete a real instance's discovery slot) — but SHARES ai-runs.db, so a cold run's transcript, verdict, and session land in the same history the UI shows (run ids continue the persisted sequence). Boot UX: consent prompt BEFORE the cluster connect, a single "starting a temporary Radar — connecting to cluster… 12s" spinner during it, and all of Radar's boot + request logging (stdlib AND chi's own request logger, which binds os.Stdout at init) captured to a 64KB tail buffer that only surfaces on failure. Readiness = the diagnose endpoints stopping 503 (requireConnected), with a 2-minute ceiling; an explicit --server that isn't answering errors instead of falling back. Verified cold end-to-end against EKS with a stubbed agent: fallback note → connect → clean transcript → verdict → exit 0, no port file written.

client-go logs (apiserver deprecation warnings etc.) go through klog, which writes directly to stderr and bypasses the stdlib-log redirect — in standalone mode they stomped the streaming transcript. Route klog into the same tail buffer, mirroring main()'s server-path setup (the subcommand exits before reaching it).

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit b412dc4. Configure here.}

cursor · 2026-07-02T12:30:11Z

 	m.mu.Unlock()
+	if m.store != nil {
+		m.store.SaveRun(r.Summary())
+	}


Shared DB duplicate run IDs

High Severity

With --standalone, ephemeral Radar persists to the same ~/.radar/ai-runs.db as a long-running instance while minting run-N IDs only in each process’s memory. Two processes can assign the same ID and upsert the same row, overwriting an existing investigation’s summary and transcript.

Additional Locations (1)

internal/diagnosecli/standalone.go#L53-L54

^{Reviewed by Cursor Bugbot for commit b412dc4. Configure here.}

cursor · 2026-07-02T12:30:11Z

+	if r.status == "stale" || r.inFlight {
+		r.mu.Unlock()
+		return
+	}


Foreign sweep skips in-flight

Medium Severity

When the foreign-context sweep runs, markStale returns immediately if the run is in flight, so a follow-up started before the run was marked stale can keep executing. beginTurn also never checks that the run’s stored Context matches the current kube-context label.

Additional Locations (1)

internal/ai/runs.go#L624-L634

^{Reviewed by Cursor Bugbot for commit b412dc4. Configure here.}

nadaverell added 2 commits July 2, 2026 02:40

nadaverell requested a review from hisco as a code owner July 1, 2026 23:47

cursor Bot reviewed Jul 1, 2026

View reviewed changes

Comment thread internal/ai/runs.go

Comment thread internal/ai/runs.go Outdated

cursor Bot reviewed Jul 1, 2026

View reviewed changes

Comment thread internal/ai/runs.go

Comment thread internal/ai/runs.go

cursor Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread internal/ai/runs.go

cursor Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread internal/ai/runs.go

nadaverell changed the title ~~Diagnose: persist investigations to SQLite so history survives restarts~~ Diagnose: durable history, radar diagnose CLI, instant context Jul 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Diagnose: durable history, radar diagnose CLI, instant context#1081

Diagnose: durable history, radar diagnose CLI, instant context#1081
nadaverell wants to merge 11 commits into
feature/radar-opensre-diagnosefrom
feature/radar-ai-run-persistence

nadaverell commented Jul 1, 2026 •

edited by cursor Bot

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Jul 2, 2026

Uh oh!

cursor Bot Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

nadaverell commented Jul 1, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Runs persist to SQLite (~/.radar/ai-runs.db)

2. radar diagnose <kind>/<name> — the terminal front-end

3. Instant context card

Verification

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jul 2, 2026

Choose a reason for hiding this comment

Shared DB duplicate run IDs

Uh oh!

cursor Bot Jul 2, 2026

Choose a reason for hiding this comment

Foreign sweep skips in-flight

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

nadaverell commented Jul 1, 2026 •

edited by cursor Bot

Loading

1. Runs persist to SQLite (`~/.radar/ai-runs.db`)

2. `radar diagnose <kind>/<name>` — the terminal front-end