feat(reachability): Network-path diagnosis#1037

Open
hisco wants to merge 1 commit into main from network-path-reachability

@hisco hisco commented Jun 28, 2026

Contributor

Reachability · Network Path — honest, DevOps-grade path diagnosis

TL;DR

A new Reachability view for the kinds that carry traffic — Service, Ingress, HTTPRoute, GRPCRoute, Gateway — that answers one operator question: "can traffic get to my app, and if not, exactly where and why does it break?" It reads like a DevOps engineer built it for someone who isn't a Kubernetes expert, and its defining constraint is honesty: it never says "broken" when traffic flows, and never says "healthy/reachable" for something it couldn't actually verify.

Two modes: Config check (passive, zero traffic — reads the declared path + config breaks) and Live test (on demand — DNS/TCP/TLS/HTTP probes, reported per-route and per-vantage). Surfaces: a full Reachability tab (Tree + Diagram), a passive drawer glance, and the MCP diagnose tool extended to network kinds.


How to review this (suggested reading order)

It's a large diff but it's one feature with a clear spine. Read in this order:

  1. internal/trace/coverage.go — the honesty spine. RouteResult{Outcome, Confidence}, CoverageVerdict, CoverageHeadline. Understand real vs indirect confidence here and the rest follows.
  2. pkg/probe/probe.go — the probe primitives + the two vantages (real path vs apiserver-proxy) + proxyResult / isClusterUnreachable (couldn't-reach ≠ unreachable).
  3. internal/trace/trace.go: computeVerdict / reviseVerdictWithProbes (how findings + probes become healthy/degraded/broken/unknown).
  4. internal/trace/entries.go — how each subject's static path graph (hops + findings) is built.
  5. UI: packages/k8s-ui/src/components/trace/reachVerdict.ts (operator-facing verdict), traceToSubgraph.ts (node/edge coloring), TraceSummary.tsx (drawer glance).

What to scrutinize: the honesty invariants below — each is load-bearing and each has a test that pins it. If you can find a case that false-condemns (says broken when traffic flows) or overclaims (says reachable/healthy without real verification), that's the bug that matters most here.


High-level architecture

                         ┌──────────────────────────────────────────────┐
   K8s cache (informers) │  STATIC TRACE            internal/trace        │
   Services, Endpoints,  │  build the path graph: entry → backends → pods │
   Pods, Ingress, GW,    │  attach findings (missing ref, no-ready,       │
   Routes, NetPol,       │  targetPort, scaled-to-0, ingress-controller,  │
   IngressClass          │  TLS, NetworkPolicy prediction)                │
                         └───────────────┬──────────────────────────────┘
                                         │ Hops + Findings
                    ┌────────────────────┼────────────────────┐
                    │ (optional)         │                     │
            ┌───────▼────────┐   ┌───────▼────────┐    ┌───────▼─────────┐
            │ ACTIVE PROBES  │   │ IN-CLUSTER TEST│    │  COVERAGE        │
            │ pkg/probe      │   │ internal/      │    │  PROJECTION      │
            │ DNS/TCP/TLS/   │   │ reachability   │    │  coverage.go     │
            │ HTTP           │   │ short-lived    │    │  per-route       │
            │ 2 vantages:    │   │ Jobs/pods,     │    │  outcome +       │
            │ • real path    │   │ RBAC-gated,    │    │  confidence      │
            │ • apiserver    │   │ self-deleting  │    │  (real|indirect) │
            │   proxy        │   │ (MUTATING)     │    │                  │
            └───────┬────────┘   └───────┬────────┘    └───────┬─────────┘
                    └────────────────────┴─────────────────────┘
                                         │ Routes + Coverage
                            ┌────────────▼────────────┐
                            │  VERDICT                 │
                            │  computeVerdict +        │
                            │  reviseVerdictWithProbes │
                            └────────────┬─────────────┘
                                         │  Trace (JSON contract)
              ┌──────────────────────────┼───────────────────────────┐
        ┌─────▼──────┐          ┌─────────▼─────────┐         ┌────────▼────────┐
        │ Reach tab  │          │  Drawer glance    │         │  MCP diagnose   │
        │ Tree +     │          │  TraceSummary     │         │  (agents)       │
        │ Diagram    │          │  passive, calm    │         │                 │
        └────────────┘          └───────────────────┘         └─────────────────┘

The two ideas that do the heavy lifting

  1. Vantage confidence. Every probe is tagged by how it learned a result. real = tested the way traffic flows (direct dial / in-cluster). indirect = reached only via the API-server proxy — it localizes a problem but never sets the headline and never condemns the real path (the proxy bypasses NetworkPolicy and can fail for vantage-only reasons). The one exception: an authoritative cache fact (e.g. 0 ready endpoints) is definitive regardless of vantage and is promoted to a real failure.
  2. Node = own health, edge = the path. A node's color is the resource's own health (did it answer), never how we reached it or whether a downstream route broke. Per-route reachability lives on the edges.
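
A minimal sketch of the vantage-confidence rule described above, assuming simplified types. `ProbeResult`, `Headline`, and the `Authoritative` field are illustrative names, not the PR's identifiers, and returning "degraded" for mixed real results is a choice made for this sketch only.

```go
package main

import "fmt"

// Confidence records how a result was learned. All names in this sketch
// are illustrative; they are not the PR's actual identifiers.
type Confidence string

const (
	Real     Confidence = "real"     // tested the way traffic flows
	Indirect Confidence = "indirect" // seen only through the API-server proxy
)

// ProbeResult is a simplified, hypothetical probe outcome.
type ProbeResult struct {
	Reached       bool
	Confidence    Confidence
	Authoritative bool // failure backed by a definitive cache fact (e.g. 0 ready endpoints)
}

// Headline folds probe results into one verdict. Only real-confidence
// results may set it; an authoritative failure is promoted to real.
func Headline(results []ProbeResult) string {
	realOK, realFail := false, false
	for _, r := range results {
		conf := r.Confidence
		if !r.Reached && r.Authoritative {
			conf = Real // definitive regardless of vantage
		}
		if conf != Real {
			continue // indirect results localize a problem but never set the headline
		}
		if r.Reached {
			realOK = true
		} else {
			realFail = true
		}
	}
	switch {
	case realOK && realFail:
		return "degraded"
	case realFail:
		return "broken"
	case realOK:
		return "reachable"
	default:
		return "unknown" // nothing was verified on the real path
	}
}

func main() {
	proxyOnlyFailure := []ProbeResult{{Reached: false, Confidence: Indirect}}
	fmt.Println(Headline(proxyOnlyFailure)) // unknown

	noEndpoints := []ProbeResult{{Reached: false, Confidence: Indirect, Authoritative: true}}
	fmt.Println(Headline(noEndpoints)) // broken
}
```

In this sketch a proxy-only success also yields "unknown", which mirrors the rule that nothing is called reachable unless it was verified on the real path.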

Honesty invariants (each pinned by a test)

| Invariant | Where | Pinned by |
| --- | --- | --- |
| A Service with some ready pods still serves → degraded, never broken | trace.go hopReachSeverity; traceToSubgraph.ts | TestComputeVerdict_PartialReadyBackendIsDegradedNotBroken; traceToSubgraph "partial-ready backend → amber" |
| 0 ready endpoints is definitive → red, not soft "via the proxy" amber | coverage.go upgradeDefinitiveBackendDown | TestUpgradeDefinitiveBackendDown (incl. multi-port + uncertain-warning-stays-soft) |
| Couldn't reach the cluster ≠ unreachable → skip | pkg/probe/probe.go isClusterUnreachable | TestIsClusterUnreachable |
| A failed probe never reads as "reached a server" | trace.go anyProbeReached | TestAnyProbeReached_FailedProbeIsNotReached |
| An apiserver-proxy-only failure never condemns the real path | coverage.go coverageBannerTone, singleRouteHeadline | reachVerdict / coverage suites |
| No-healthy-backend headline blocked when a sibling backend is healthy / unreadable / selectorless / external | reachVerdict.ts backendDown | reachVerdict "multi-backend", "RBAC-redacted sibling", "selectorless sibling" |
| Scaled-to-0 is deliberate dormancy, not an outage | coverage.go, TraceSummary.tsx | TraceSummary "scaled-to-0" + coverage benign-softening |
| Node color = own health, not route outcome or probe vantage | traceToSubgraph.ts | traceToSubgraph "node = own health" suite |
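
Three of these invariants (partial-ready, 0 ready endpoints, scaled-to-0) reduce to a small decision over ready counts. The following is a hedged sketch of that decision only. `BackendSeverity` and its parameters are hypothetical; the PR's hopReachSeverity may take different inputs and handle more cases.

```go
package main

import "fmt"

// Severity is an illustrative stand-in for the verdict levels named in the PR.
type Severity string

const (
	Healthy  Severity = "healthy"
	Degraded Severity = "degraded"
	Broken   Severity = "broken"
	Dormant  Severity = "dormant" // scaled-to-0: deliberate, not an outage
)

// BackendSeverity is a hypothetical helper: it maps ready/total pod counts
// and the desired replica count to a severity.
func BackendSeverity(ready, total, desiredReplicas int) Severity {
	switch {
	case desiredReplicas == 0:
		return Dormant // nobody asked for pods, so nothing is "down"
	case ready == 0:
		return Broken // 0 ready endpoints is definitive
	case ready < total:
		return Degraded // some pods still serve traffic
	default:
		return Healthy
	}
}

func main() {
	fmt.Println(BackendSeverity(1, 2, 2)) // degraded
	fmt.Println(BackendSeverity(0, 2, 2)) // broken
	fmt.Println(BackendSeverity(0, 0, 0)) // dormant
}
```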

Screenshots

Captured against a local kind cluster of fixtures (healthy, crashlooping, partially-ready, image-pull, ingress with/without controller, TLS, multi-port, scaled-to-zero, external-name).

1. Backend down — names the cause, links the culprit, doesn't blame the network. The headline leads with the real cause, credits the wiring, and the node panel links straight to the failing Pod + the exact kubectl logs command. The Service node stays neutral (wired fine); the red is on the Pod (the actual fault).

[Screenshot: 01-backend-down]

2. Partial-ready — degraded, not condemned. 1 of 2 pods ready (one crashing, one serving). It still serves, so it reads degraded / reached — not the false "broken". This is the honesty line the feature is built on.

[Screenshot: 02-partial-ready-degraded]

3. Coverage matrix — per-path, per-vantage. Each route × FROM YOUR MACHINE / VIA API PROXY / IN-CLUSTER. A name that doesn't resolve from the laptop is reported as that — never a false "unreachable".

[Screenshot: 03-coverage-matrix]

4. Ingress, controller present — full front-door path. Ingress → Service → Pods, front-door honesty ("backend reachable, but the entry point wasn't tested from here"), missing-TLS-secret finding, copyable curl --resolve.

[Screenshot: 04-ingress-controller-present]

5. Ingress, no controller — the routing tier made visible. Rules exist but nothing serves them; plain explanation + the next step (kubectl get ingressclass). The controller tier other tools skip.

[Screenshot: 05-no-controller]

6. Passive drawer glance. In the resource drawer, Reachability is a calm one-line glance with a "Config check only — not tested" honesty subtitle and one CTA into the live test. It never nags.

[Screenshot: 06-drawer-glance]

File map (80 files)

  • internal/trace/ (24) — the engine: trace.go (verdict), coverage.go (route projection), entries.go (path graph), probes.go (probe orchestration), findings.go, netpol.go (NetworkPolicy evaluator), ingress_controller.go (controller tier), egress.go. The cycleN_fixes_test.go files are test-only regression suites from review rounds.
  • pkg/probe/ (2) — probe primitives (DNS/TCP/TLS/HTTP, apiserver-proxy) + classification.
  • internal/reachability/ (4) — the in-cluster probe runner (short-lived Jobs).
  • packages/k8s-ui/src/components/trace/ (22) — the UI: ReachabilityView (Tree+Diagram), TracePanel, reachVerdict, traceToSubgraph, TraceSummary (drawer), probe-display.
  • internal/mcp/ (5), internal/server/ (4) — MCP diagnose extension + the /api/trace endpoint.
  • internal/k8s/ (6) — cache wiring for the kinds the trace reads.
  • web/, cmd/, docs/, deploy/helm, README — wiring, flags, docs.

Test plan

  • Go: go test ./... (incl. internal/trace, pkg/probe) — green.
  • Frontend: tsc clean, full vitest suite green (incl. reachVerdict / traceToSubgraph / TraceSummary / probe-display).
  • Verified end-to-end against a live kind cluster across the full fixture matrix, including cluster-down behavior (probes skip honestly instead of condemning).
  • Hardened via multi-round cross-model review; all findings resolved or skipped-with-evidence.
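
The cluster-down behavior in the test plan (probes skip instead of condemning) can be sketched as an error classification step. This is an assumption-laden illustration using only the Go standard library: `clusterUnreachable` and `classify` are hypothetical names, and the PR's isClusterUnreachable may recognize a different set of errors.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
)

// Outcome is an illustrative probe outcome.
type Outcome string

const (
	Reached Outcome = "reached"
	Failed  Outcome = "failed"
	Skipped Outcome = "skipped" // we could not test, so we make no claim
)

// Sample errors for the demo below (hypothetical values).
var (
	errDial    = &net.OpError{Op: "dial", Net: "tcp", Err: errors.New("connection refused")}
	errBackend = errors.New("backend returned 503")
)

// clusterUnreachable reports whether err looks like a failure to reach the
// cluster itself (timeout, DNS failure, dial failure) rather than an answer
// from inside it.
func clusterUnreachable(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return true
	}
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		return true
	}
	var opErr *net.OpError
	return errors.As(err, &opErr) && opErr.Op == "dial"
}

// classify separates "could not reach the cluster" from "the target failed".
// clusterErr is the error from contacting the cluster; probeErr is the
// error from the probe once the cluster answered.
func classify(clusterErr, probeErr error) Outcome {
	switch {
	case clusterUnreachable(clusterErr):
		return Skipped
	case probeErr != nil:
		return Failed
	default:
		return Reached
	}
}

func main() {
	fmt.Println(classify(errDial, nil))    // skipped
	fmt.Println(classify(nil, errBackend)) // failed
	fmt.Println(classify(nil, nil))        // reached
}
```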

Known limitations (deliberately deferred)

Surfaced by an independent critic pass and consciously left for a follow-up — each would be net-negative or out-of-scope to fix now:

  • Core-NetworkPolicy "would-deny" on CNI-CRD clusters (Cilium/Calico). The static evaluator reads core NetworkPolicy only. A path a CNI CRD policy allows could still show an amber "a rule would block traffic". Not capped: core NP is enforced on those CNIs too, so suppressing it would under-report a real deny — and the finding is hedged + clearable by the in-cluster test (subject to the real CNI).
  • Backend IsNotFound during informer sync. A just-applied or cross-scope backend can briefly read as a missing-ref break until the cache syncs. Narrow + self-healing; a scope-aware guard risks a worse false-clear.
  • Succeeded/Job pods behind a Service. The finding is already softened at the detection layer; a rare metadata-level amber remains for the Service-fronting-Jobs case. Excluding terminal pods from selection risks a broad semantic change.

Scope / risk notes

  • The live test emits real traffic against the declared path (bounded to a 3s budget; a timeout returns an honest partial).
  • The in-cluster option is mutating — it creates up to 5 short-lived, self-deleting probe pods under the caller's RBAC; the only diagnose option with a side effect, gated on create jobs + list/get pods.
  • Custom probe path supported (method fixed to GET; per-request headers deferred).
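
The bounded budget with an honest partial result can be sketched with `context.WithTimeout`. This is a minimal illustration, not the PR's probe orchestration: `Step`, `StepResult`, `runWithBudget`, and the status strings are hypothetical, and the demo uses a 50ms budget instead of the 3s mentioned above so that it finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Step is a hypothetical probe step (e.g. DNS, TCP, HTTP).
type Step struct {
	Name string
	Run  func(ctx context.Context) error
}

// StepResult reports what actually happened to a step.
type StepResult struct {
	Name   string
	Status string // "ok", "failed", "timed-out", or "not-run"
}

// runWithBudget runs steps in order under one shared deadline. Steps that
// never ran are reported as "not-run" rather than as failures, and partial
// is true whenever the budget cut the run short.
func runWithBudget(budget time.Duration, steps []Step) ([]StepResult, bool) {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	results := make([]StepResult, 0, len(steps))
	partial := false
	for _, s := range steps {
		if ctx.Err() != nil {
			results = append(results, StepResult{s.Name, "not-run"})
			partial = true
			continue
		}
		err := s.Run(ctx)
		switch {
		case err == nil:
			results = append(results, StepResult{s.Name, "ok"})
		case ctx.Err() != nil:
			results = append(results, StepResult{s.Name, "timed-out"})
			partial = true
		default:
			results = append(results, StepResult{s.Name, "failed"})
		}
	}
	return results, partial
}

// sleepStep simulates a step that takes d and honors cancellation.
func sleepStep(d time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

const demoBudget = 50 * time.Millisecond

func demoSteps() []Step {
	return []Step{
		{Name: "dns", Run: sleepStep(0)},
		{Name: "tcp", Run: sleepStep(time.Second)}, // exceeds the budget
		{Name: "http", Run: sleepStep(0)},
	}
}

func main() {
	results, partial := runWithBudget(demoBudget, demoSteps())
	fmt.Println(results, partial)
}
```

Reporting "not-run" separately from "failed" is the point of the sketch: a step the budget never reached says nothing about the path.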

Note

High Risk
Introduces mutating in-cluster probe Jobs under caller RBAC, new proxy-based active checks, and broad diagnosis surfaces (REST/MCP/UI) where false condemn or overclaim would mislead operators during incidents.

Overview
Adds network path diagnosis for Service, Ingress, HTTPRoute, GRPCRoute, and Gateway: a hop-ordered static trace from the informer cache plus optional live probes (DNS/TCP/TLS/HTTP) that respect vantage (real vs API-server indirect) so headlines do not overclaim reachability.

API & CLI: GET /api/trace/{kind}/{ns}/{name} with ?probe=true; in-cluster tests via restricted short-lived Jobs (POST …/probe-in-cluster, POST …/in-cluster) and a radar probe subcommand. --reachability-image / RADAR_IMAGE / self-read of the running pod image pick the probe container; Helm injects MY_POD_NAME and RADAR_IMAGE.
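
A small sketch of building the trace request URL from the route shape above. The base address, the example kind/namespace/name values, and the exact spelling expected for `{kind}` are assumptions; only the `/api/trace/{kind}/{ns}/{name}` path and the `?probe=true` parameter come from the description. `traceURL` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// traceURL builds the URL for GET /api/trace/{kind}/{ns}/{name}.
// With probe set, it appends ?probe=true to request the live test.
func traceURL(base, kind, ns, name string, probe bool) string {
	u := fmt.Sprintf("%s/api/trace/%s/%s/%s",
		strings.TrimRight(base, "/"),
		url.PathEscape(kind),
		url.PathEscape(ns),
		url.PathEscape(name),
	)
	if probe {
		u += "?probe=true"
	}
	return u
}

func main() {
	// Base address and resource names are placeholders for illustration.
	fmt.Println(traceURL("http://localhost:8080", "Service", "default", "web", false))
	fmt.Println(traceURL("http://localhost:8080/", "Service", "default", "web", true))
}
```

Without `probe` the request corresponds to the passive config check; with it, the server would also run the live probes.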

MCP: diagnose now returns coverage-shaped traces for network entry kinds (probe, mutating inCluster with Member+RBAC gates). Tool annotations treat diagnose as non-read-only when inCluster is used.

Detection tweaks feeding the trace: Argo Rollout-aware scale-to-zero, uncertain Rollout lookup instead of critical “no endpoints”, readiness-probe vs Service target port (svc:probe-port-mismatch), richer Gateway backend port messages, and ErrDynamicNotReady for uninitialized dynamic cache.

Docs (docs/diagnose.md, README, MCP) and make kind-load-probe support local in-cluster probe development.

Reviewed by Cursor Bugbot for commit e66ff7b.

@hisco hisco requested a review from nadaverell as a code owner June 28, 2026 08:39
Comment thread internal/reachability/incluster.go
Comment thread packages/k8s-ui/src/components/trace/ReachabilityExplainer.tsx Fixed
Comment thread packages/k8s-ui/src/components/trace/reachVerdict.ts Fixed
Comment thread internal/trace/coverage.go Fixed
Comment thread internal/trace/coverage.go Fixed
Comment thread internal/trace/coverage.go Fixed
Comment thread internal/trace/netpol.go Fixed
Comment thread pkg/probe/probe.go Dismissed
@hisco hisco force-pushed the network-path-reachability branch 2 times, most recently from 7bbc3ba to f76a8ce, June 28, 2026 09:20

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

Reviewed by Cursor Bugbot for commit f76a8ce.

Comment thread internal/server/reachability_run.go
Comment thread internal/mcp/tools_diagnose.go
@hisco hisco force-pushed the network-path-reachability branch 2 times, most recently from 0be6378 to ea9014c, June 28, 2026 12:28
@hisco hisco changed the title from "feat(reachability): honest, DevOps-grade network-path diagnosis" to "feat(reachability):rade network-path diagnosis", Jun 28, 2026
@hisco hisco changed the title from "feat(reachability):rade network-path diagnosis" to "feat(reachability): Network-path diagnosis", Jun 28, 2026
@hisco hisco force-pushed the network-path-reachability branch from ea9014c to e66ff7b, July 1, 2026 08:29
