Improve connection error recovery#1062

Open
nadaverell wants to merge 1 commit into main from nadav/sky-1095-connection-error

@nadaverell nadaverell commented Jun 29, 2026

Why

Exec-credential providers can fail in ways that look like generic Kubernetes timeouts, especially when AWS SSO or another plugin hangs while refreshing an expired token. This PR makes Radar point operators at the credential problem instead of network/load advice, and lets the disconnected UI recover automatically once credentials are fixed.

What changed

  • Classifies exec-auth "context deadline exceeded" failures as authentication problems, including the "cluster unreachable: context deadline exceeded" wrapper reported by client-go.
  • Keeps explicit network timeouts, such as "cluster unreachable: i/o timeout" and read-side "i/o timeout", in the timeout bucket so offline clusters are not mislabeled as credential failures.
  • Adds explicit AWS SSO expired-token strings to the auth classifier.
  • Sizes the cluster HTTP probe from the parent probe deadline and returns an auth-specific timeout when an exec-auth probe expires.
  • Automatically calls POST /api/connection/retry while the UI is disconnected, avoids overlapping automatic/manual retries, and refreshes query data after recovery.
  • Uses capped backoff for automatic reconnect attempts so an unrecoverable auth/config error does not run a full reconnect every 10 seconds forever.
  • Clears the automatic-retry spinner even when an SSE connected event arrives before the retry promise settles.
  • Returns a classified errorType from failed retry responses so manual and automatic retry failures can update stale auth/timeout/network guidance when the server learns a more precise cause.
  • Preserves the current auth/timeout/network guidance when a retry fails without a classified error, rather than falling back to generic unknown-error copy.
  • Drives the header status dot and reconnect action from cluster connection state, while keeping a separate Live updates disconnected state for SSE-only drops.
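
The classification order described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the ErrorType values, the classify function, the usesExecAuth flag, and the "token has expired" substring are all assumptions made for the example, since the description does not show the real names or the exact AWS SSO strings.

```go
package main

import (
	"fmt"
	"strings"
)

// ErrorType is a hypothetical guidance bucket; the PR's real type and
// constant names are not shown in the description.
type ErrorType string

const (
	ErrAuth    ErrorType = "auth"
	ErrTimeout ErrorType = "timeout"
	ErrNetwork ErrorType = "network"
	ErrUnknown ErrorType = "unknown"
)

// classify maps an error message to a bucket. usesExecAuth is an assumed
// input saying whether the kubeconfig user relies on an exec credential plugin.
func classify(msg string, usesExecAuth bool) ErrorType {
	m := strings.ToLower(msg)
	switch {
	// Explicit network signals are checked first so that an offline
	// cluster is never relabeled as a credential problem.
	case strings.Contains(m, "i/o timeout"):
		return ErrTimeout
	case strings.Contains(m, "connection refused"), strings.Contains(m, "dial tcp"):
		return ErrNetwork
	// Placeholder for an expired-token string (assumption, not the PR's list).
	case strings.Contains(m, "token has expired"):
		return ErrAuth
	// A bare deadline is ambiguous: treat it as auth only when an exec
	// plugin could have been the thing that hung.
	case strings.Contains(m, "context deadline exceeded"):
		if usesExecAuth {
			return ErrAuth
		}
		return ErrTimeout
	}
	return ErrUnknown
}

func main() {
	fmt.Println(classify("cluster unreachable: context deadline exceeded", true)) // auth
	fmt.Println(classify("cluster unreachable: i/o timeout", true))               // timeout
	fmt.Println(classify("dial tcp 10.0.0.1:443: connect: connection refused", true)) // network
}
```

Checking the explicit network strings before the deadline case is what keeps the false-positive risk limited to slow exec plugins in this sketch.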

Testing

  • go test ./internal/k8s -run 'TestClassifyError|TestConnectionProbeHTTPTimeout'
  • go test ./internal/k8s
  • make tsc
  • make test
  • make build
  • Live API smoke on kind-radar-gitops-demo: /api/cluster-info, /api/connection, POST /api/connection/retry, /api/resource-counts after cache warmup, and /api/events/stream all returned healthy connected state.

Visual test skipped: this change affects connection-state and error-classification behavior, and the meaningful failure UI only appears when credential expiry is forced. Manual auth-expiry verification is still needed: expire or invalidate AWS SSO credentials, confirm the auth/re-login copy appears, run aws sso login, and confirm recovery without a reload.

Notes

Risk/blast radius: this touches cluster connection classification and disconnected-state retry behavior for all local Radar sessions. The main false-positive risk is labeling a slow exec-auth plugin deadline as auth; explicit i/o timeout, connection-refused, and dial paths remain network/timeout, and the behavior is pinned by unit tests plus a healthy live-cluster smoke. Automatic retry is bounded to the disconnected state, skips overlapping retries, clears its in-flight UI state on completion, updates retry guidance from classified server responses, and backs off to reduce repeated exec-plugin churn while credentials remain broken.

This branch currently conflicts with main and must be reconciled before merge; conflict resolution is intentionally left for a separate merge/rebase pass. The substantive overlap is in App.tsx, where main has parallel connection-liveness/header work on the same subsystem. There is also an add/add conflict in internal/k8s/connection_state_test.go; Claude verified that the test functions are disjoint, so the two versions should be reconciled manually during the rebase/merge.

Refs SKY-1095.

@nadaverell nadaverell requested a review from hisco as a code owner June 29, 2026 23:54
@nadaverell nadaverell force-pushed the nadav/sky-1095-connection-error branch 3 times, most recently from 9ccafa7 to 8471cdf, June 30, 2026 21:34
Comment thread internal/k8s/connection_state.go
@nadaverell nadaverell force-pushed the nadav/sky-1095-connection-error branch 6 times, most recently from b51314c to bdb19f1, July 1, 2026 17:13

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit bdb19f1.

Comment thread web/src/context/ConnectionContext.tsx
@nadaverell nadaverell force-pushed the nadav/sky-1095-connection-error branch from bdb19f1 to c07b198, July 1, 2026 17:22