Improve audit detector precision #1059

Open
nadaverell wants to merge 1 commit into main from nadav/sky-1091-detector-precision
@nadaverell nadaverell commented Jun 29, 2026

Contributor

Summary

Improves audit detector precision for noisy real-cluster findings so Radar separates broken dependencies from benign infrastructure endpoints and scheduler/status noise. Operators should see fewer false positives while genuinely risky missing admission webhook backends and workload blockers remain visible.

Linear: SKY-1091

What changed

  • Admission webhook backend checks now account for failurePolicy: absent/default/explicit Fail reports as critical, explicit Ignore reports as warning, and multiple webhooks that share the same missing backend Service collapse into one issue using the worst severity and stable ordering/fingerprints.
  • Config/env Service reference detection now suppresses node-host endpoints, including node names, hostnames, and node addresses such as kind-style control-plane host:6443 values, while still reporting normal Service-shaped values.
  • Scheduler unschedulable diagnosis drops empty fragments and noisy DRA deallocate-only jargon, but still preserves useful clauses such as host-port conflicts and falls back to a generic scheduler message when DRA text is the only clue.
  • ReadWriteOnce volume-conflict detection now requires a Bound PVC, so pending PVCs stay classified as storage provisioning problems instead of volume-conflict alerts.
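The webhook backend rule in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `WebhookRef`, `Issue`, `severityFor`, and `collapseMissingBackends` are hypothetical names, and the only behavior taken from the description is that an absent or `Fail` policy maps to critical, `Ignore` maps to warning, and webhooks sharing one missing backend Service collapse into a single issue with the worst severity and a stable order.

```go
package main

import (
	"fmt"
	"sort"
)

// Severity is a hypothetical ordering where a higher value is worse.
type Severity int

const (
	Warning Severity = iota
	Critical
)

func (s Severity) String() string {
	if s == Critical {
		return "critical"
	}
	return "warning"
}

// WebhookRef is a hypothetical, simplified view of one webhook whose
// backend Service is missing. Backend is "namespace/name".
type WebhookRef struct {
	Name          string
	FailurePolicy string // "", "Fail", or "Ignore"
	Backend       string
}

// Issue is a hypothetical aggregated finding for one missing backend.
type Issue struct {
	Backend  string
	Severity Severity
	Webhooks []string
}

// severityFor treats an absent policy like Fail, which is the
// admissionregistration.k8s.io/v1 default.
func severityFor(policy string) Severity {
	if policy == "Ignore" {
		return Warning
	}
	return Critical
}

// collapseMissingBackends groups refs by backend, keeps the worst
// severity per group, and sorts both the issues and the webhook names
// so that output order does not depend on map iteration.
func collapseMissingBackends(refs []WebhookRef) []Issue {
	byBackend := map[string]*Issue{}
	for _, r := range refs {
		issue, ok := byBackend[r.Backend]
		if !ok {
			issue = &Issue{Backend: r.Backend, Severity: Warning}
			byBackend[r.Backend] = issue
		}
		if s := severityFor(r.FailurePolicy); s > issue.Severity {
			issue.Severity = s
		}
		issue.Webhooks = append(issue.Webhooks, r.Name)
	}
	out := make([]Issue, 0, len(byBackend))
	for _, issue := range byBackend {
		sort.Strings(issue.Webhooks)
		out = append(out, *issue)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Backend < out[j].Backend })
	return out
}

func main() {
	issues := collapseMissingBackends([]WebhookRef{
		{Name: "dd-a", FailurePolicy: "Ignore", Backend: "datadog/admission"},
		{Name: "dd-b", FailurePolicy: "Ignore", Backend: "datadog/admission"},
		{Name: "policy", FailurePolicy: "", Backend: "policy/validator"},
	})
	for _, i := range issues {
		fmt.Println(i.Backend, i.Severity, i.Webhooks)
	}
}
```

Sorting after grouping is one simple way to keep ordering deterministic across runs; how the PR derives its fingerprints is not shown here.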

Testing

  • go test ./internal/k8s -run 'TestDetectMissingWebhookRefs|TestDetectProblems_ConfigSignals|TestFindEnvServiceRefChecks_SplitHostPort|TestDescribeUnschedulable_DropsDRADeallocateJargon|TestDetectProblems_SharedRWOVolume'
  • go test ./internal/k8s
  • make tsc
  • make test
  • make build

Live/API checks:

  • kind-radar-gitops-demo: Radar connected, CRD discovery ready, and the detector-focused issue filter returned zero webhook backend, kindnet, DRA-jargon, or ReadWriteOnce false positives.
  • GCP nonprod: Radar connected to a 9-node GKE cluster with webhook resources synced. The API surfaced a critical missing backend for a Fail webhook and one aggregated warning for three Ignore Datadog webhooks sharing the same missing backend Service.
  • AWS radar-test-nonprod: Radar connected to a 2-node EKS cluster with webhook resources synced; the same detector-focused issue filter returned zero matching issues.
  • AWS us-east-1-nonprod: not live-verified because the local AWS SSO session expired before the cluster request reached Kubernetes.

Visual-test: skipped, because this is backend detector and diagnosis-copy logic with no rendered UI component changes.

Notes / risk

Blast radius is limited to audit issue generation for admission webhook backends, config/env Service reference detection, scheduler unschedulable copy, and RWO volume-conflict classification. The main residual risks are over-suppressing a genuinely missing Service whose host exactly matches a node name/address, or failing to suppress node endpoints when Nodes cannot be listed; both affect issue visibility only, not cluster mutation. Unit tests and live checks against kind, GKE, and EKS clusters cover the intended scenarios.
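To make the two residual risks concrete, here is a small sketch of node-endpoint suppression under stated assumptions. `newNodeHostSet`, `hostOf`, and `isNodeEndpoint` are hypothetical helpers, not functions from this PR, and the sketch only handles bare `host` or `host:port` values (no URL schemes). It shows that a value whose host exactly matches a known node name or address is suppressed, and that an empty node set (for example, when Nodes cannot be listed) suppresses nothing.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// newNodeHostSet builds a lowercase lookup set from node names,
// hostnames, and addresses. Hypothetical helper for illustration.
func newNodeHostSet(hosts ...string) map[string]bool {
	set := make(map[string]bool, len(hosts))
	for _, h := range hosts {
		set[strings.ToLower(h)] = true
	}
	return set
}

// hostOf strips an optional port. Values without a port are returned
// unchanged because net.SplitHostPort reports an error for them.
func hostOf(value string) string {
	if host, _, err := net.SplitHostPort(value); err == nil {
		return host
	}
	return value
}

// isNodeEndpoint reports whether value points at a known node host,
// in which case a Service-reference check would skip it.
func isNodeEndpoint(value string, nodeHosts map[string]bool) bool {
	host := hostOf(strings.TrimSpace(value))
	return nodeHosts[strings.ToLower(host)]
}

func main() {
	nodes := newNodeHostSet("kind-control-plane", "172.18.0.2")

	// Suppressed: host matches a node name or node address.
	fmt.Println(isNodeEndpoint("kind-control-plane:6443", nodes))
	fmt.Println(isNodeEndpoint("172.18.0.2:6443", nodes))

	// Still checked: an ordinary Service-shaped value.
	fmt.Println(isNodeEndpoint("payments.default.svc:8080", nodes))

	// Residual risk: with no node data, nothing is suppressed.
	fmt.Println(isNodeEndpoint("kind-control-plane:6443", newNodeHostSet()))
}
```

The exact-match lookup is what creates the first risk: a genuinely missing Service that happens to share a node's name would be hidden by the same check.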

@nadaverell nadaverell requested a review from hisco as a code owner June 29, 2026 23:03
Comment thread internal/k8s/detect_scheduling.go
@nadaverell nadaverell force-pushed the nadav/sky-1091-detector-precision branch 3 times, most recently from efda567 to 1299378 Compare July 1, 2026 20:45

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 1299378.

Comment thread internal/k8s/detect_scheduling.go
@nadaverell nadaverell force-pushed the nadav/sky-1091-detector-precision branch from 1299378 to a314176 Compare July 1, 2026 20:52
