Improve connection error recovery by nadaverell · Pull Request #1062 · skyhook-io/radar

nadaverell · 2026-06-29T23:54:35Z

Why

Exec-credential providers can fail in ways that look like generic Kubernetes timeouts, especially when AWS SSO or another plugin hangs while refreshing an expired token. This PR makes Radar point operators at the credential problem instead of network/load advice, and lets the disconnected UI recover automatically once credentials are fixed.

What changed

- Classifies exec-auth `context deadline exceeded` failures as authentication problems, including the `cluster unreachable: context deadline exceeded` wrapper reported by client-go.
- Keeps explicit network timeouts, such as `cluster unreachable: i/o timeout` and read-side `i/o timeout`, in the timeout bucket so offline clusters are not mislabeled as credential failures.
- Adds explicit AWS SSO expired-token strings to the auth classifier.
- Sizes the cluster HTTP probe from the parent probe deadline and returns an auth-specific timeout when an exec-auth probe expires.
- Retries `/api/connection/retry` while the UI is disconnected, avoids overlapping automatic/manual retries, and refreshes query data after recovery.
- Uses capped backoff for automatic reconnect attempts so an unrecoverable auth/config error does not run a full reconnect every 10 seconds forever.
- Clears the automatic-retry spinner even when an SSE `connected` event arrives before the retry promise settles.
- Returns a classified `errorType` from failed retry responses so manual and automatic retry failures can update stale auth/timeout/network guidance when the server learns a more precise cause.
- Preserves the current auth/timeout/network guidance when a retry fails without a classified error, rather than falling back to generic unknown-error copy.
- Drives the header status dot and reconnect action from cluster connection state, while keeping a separate "Live updates disconnected" state for SSE-only drops.
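
The classification rules listed above can be sketched roughly as follows. This is a minimal illustration, not the code in internal/k8s/connection_state.go: the names `ErrorType`, `classifyMessage`, `classify`, and the `usesExecAuth` flag are hypothetical. The sketch also assumes that a deadline on a context without exec auth stays in the timeout bucket. The AWS SSO expired-token strings are omitted because the PR description does not list them.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrorType is a hypothetical label set mirroring the auth/timeout/network
// guidance buckets named in the PR description.
type ErrorType string

const (
	ErrAuth    ErrorType = "auth"
	ErrTimeout ErrorType = "timeout"
	ErrNetwork ErrorType = "network"
	ErrUnknown ErrorType = "unknown"
)

// classifyMessage maps an error message to a guidance bucket.
// usesExecAuth is an assumed input saying whether the kubeconfig context
// relies on an exec credential plugin.
func classifyMessage(msg string, usesExecAuth bool) ErrorType {
	m := strings.ToLower(msg)
	switch {
	// Explicit network timeouts are checked first so they never become auth.
	case strings.Contains(m, "i/o timeout"):
		return ErrTimeout
	case strings.Contains(m, "connection refused"), strings.Contains(m, "dial tcp"):
		return ErrNetwork
	// A bare deadline is ambiguous; with exec auth it most likely means the
	// credential plugin hung, so it is treated as an auth problem.
	case strings.Contains(m, "context deadline exceeded"):
		if usesExecAuth {
			return ErrAuth
		}
		return ErrTimeout
	}
	return ErrUnknown
}

// classify is a thin wrapper for callers that hold an error value.
func classify(err error, usesExecAuth bool) ErrorType {
	if err == nil {
		return ErrUnknown
	}
	return classifyMessage(err.Error(), usesExecAuth)
}

func main() {
	samples := []error{
		errors.New("cluster unreachable: context deadline exceeded"),
		errors.New("cluster unreachable: i/o timeout"),
		errors.New("dial tcp 127.0.0.1:6443: connect: connection refused"),
	}
	for _, err := range samples {
		fmt.Printf("%-55s -> %s\n", err.Error(), classify(err, true))
	}
}
```

Checking the explicit `i/o timeout` and dial cases before the deadline case is what keeps an offline cluster from being reported as a credential failure in this sketch.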

Testing

go test ./internal/k8s -run 'TestClassifyError|TestConnectionProbeHTTPTimeout'
go test ./internal/k8s
make tsc
make test
make build
Live API smoke on kind-radar-gitops-demo: /api/cluster-info, /api/connection, POST /api/connection/retry, /api/resource-counts after cache warmup, and /api/events/stream all returned healthy connected state.

Visual-test skipped: this is connection-state/error-classification behavior, and the meaningful failure UI requires forcing credential expiry. Manual auth-expiry verification is still needed by expiring or invalidating AWS SSO credentials, confirming auth/re-login copy, running aws sso login, and confirming recovery without reload.

Notes

Risk/blast radius: this touches cluster connection classification and disconnected-state retry behavior for all local Radar sessions. The main false-positive risk is labeling a slow exec-auth plugin deadline as auth; explicit i/o timeout, connection-refused, and dial paths remain network/timeout, and the behavior is pinned by unit tests plus a healthy live-cluster smoke. Automatic retry is bounded to the disconnected state, skips overlapping retries, clears its in-flight UI state on completion, updates retry guidance from classified server responses, and backs off to reduce repeated exec-plugin churn while credentials remain broken.
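
The retry bounds described above (capped backoff plus a guard against overlapping retries) can be illustrated with a small sketch. It is written in Go for consistency with the classifier example, although the PR's retry logic lives in the web client. The names `nextDelay`, `retryGuard`, `defaultBase`, and `defaultLimit` are hypothetical. The 10-second base echoes the interval mentioned in the description, while the 2-minute cap and the doubling schedule are assumptions.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Assumed values: the PR does not state the actual base or cap.
const (
	defaultBase  = 10 * time.Second
	defaultLimit = 2 * time.Minute
)

// nextDelay returns the wait before the given 0-based retry attempt,
// doubling from base and never exceeding limit.
func nextDelay(attempt int, base, limit time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < limit; i++ {
		d *= 2
	}
	if d > limit {
		d = limit
	}
	return d
}

// retryGuard lets only one retry (automatic or manual) run at a time.
type retryGuard struct {
	inFlight atomic.Bool
}

// tryStart reports whether the caller may begin a retry.
func (g *retryGuard) tryStart() bool {
	return g.inFlight.CompareAndSwap(false, true)
}

// finish clears the in-flight flag; call it when the retry settles,
// whether it succeeded or failed.
func (g *retryGuard) finish() {
	g.inFlight.Store(false)
}

func main() {
	var g retryGuard
	for attempt := 0; attempt < 6; attempt++ {
		if !g.tryStart() {
			continue // another retry is already running
		}
		wait := nextDelay(attempt, defaultBase, defaultLimit)
		fmt.Printf("attempt %d: wait %s before retrying\n", attempt, wait)
		g.finish()
	}
}
```

With these assumed values the delay grows from 10 seconds up to the 2-minute cap, so a broken credential setup triggers the exec plugin far less often than a fixed 10-second loop would.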

This branch currently conflicts with main; conflict resolution is intentionally left for a separate merge/rebase pass before merge. The substantive overlap is in App.tsx, where main has parallel connection-liveness/header work on the same subsystem. There is also an add/add conflict in internal/k8s/connection_state_test.go; Claude verified that the test functions are disjoint, so they should be reconciled manually during the rebase/merge.

Refs SKY-1095.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit bdb19f1.

nadaverell requested a review from hisco as a code owner June 29, 2026 23:54

nadaverell force-pushed the nadav/sky-1095-connection-error branch 3 times, most recently from 9ccafa7 to 8471cdf Compare June 30, 2026 21:34

cursor Bot reviewed Jun 30, 2026

Comment thread internal/k8s/connection_state.go

nadaverell force-pushed the nadav/sky-1095-connection-error branch 6 times, most recently from b51314c to bdb19f1 Compare July 1, 2026 17:13

cursor Bot reviewed Jul 1, 2026

Comment thread web/src/context/ConnectionContext.tsx

Improve connection error recovery

c07b198

nadaverell force-pushed the nadav/sky-1095-connection-error branch from bdb19f1 to c07b198 Compare July 1, 2026 17:22

nadaverell wants to merge 1 commit into main from nadav/sky-1095-connection-error

1 participant
