Improve connection error recovery #1062
Open
nadaverell wants to merge 1 commit into
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit bdb19f1.

Why
Exec-credential providers can fail in ways that look like generic Kubernetes timeouts, especially when AWS SSO or another plugin hangs while refreshing an expired token. This PR makes Radar point operators at the credential problem instead of network/load advice, and lets the disconnected UI recover automatically once credentials are fixed.
What changed
- […] `context deadline exceeded` failures as authentication problems, including the `cluster unreachable: context deadline exceeded` wrapper reported by client-go.
- […] `cluster unreachable: i/o timeout` and read-side `i/o timeout`, in the timeout bucket so offline clusters are not mislabeled as credential failures.
- […] `/api/connection/retry` while the UI is disconnected, avoids overlapping automatic/manual retries, and refreshes query data after recovery.
- […] `connected` event arrives before the retry promise settles.
- […] `errorType` from failed retry responses so manual and automatic retry failures can update stale auth/timeout/network guidance when the server learns a more precise cause.
- […] `Live updates disconnected` state for SSE-only drops.

Testing
- `go test ./internal/k8s -run 'TestClassifyError|TestConnectionProbeHTTPTimeout'`
- `go test ./internal/k8s`
- `make tsc`
- `make test`
- `make build`
- […] `kind-radar-gitops-demo`: `/api/cluster-info`, `/api/connection`, `POST /api/connection/retry`, `/api/resource-counts` after cache warmup, and `/api/events/stream` all returned healthy connected state.

Visual-test skipped: this is connection-state/error-classification behavior, and the meaningful failure UI requires forcing credential expiry. Manual auth-expiry verification is still needed by expiring or invalidating AWS SSO credentials, confirming auth/re-login copy, running `aws sso login`, and confirming recovery without reload.

Notes
Risk/blast radius: this touches cluster connection classification and disconnected-state retry behavior for all local Radar sessions. The main false-positive risk is labeling a slow exec-auth plugin deadline as auth; explicit `i/o timeout`, connection-refused, and dial paths remain network/timeout, and the behavior is pinned by unit tests plus a healthy live-cluster smoke. Automatic retry is bounded to the disconnected state, skips overlapping retries, clears its in-flight UI state on completion, updates retry guidance from classified server responses, and backs off to reduce repeated exec-plugin churn while credentials remain broken.

Before merge, this branch currently conflicts with `main`; conflict resolution is intentionally left for a separate merge/rebase pass. The substantive overlap is in `App.tsx`, where `main` has parallel connection-liveness/header work on the same subsystem. There is also an add/add conflict in `internal/k8s/connection_state_test.go`; Claude verified the test functions are disjoint, and they should be reconciled manually during the rebase/merge.

Refs SKY-1095.