feat(reachability): Network-path diagnosis #1037
Open
hisco wants to merge 1 commit into
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
Reviewed by Cursor Bugbot for commit f76a8ce.

Reachability · Network Path — honest, DevOps-grade path diagnosis
TL;DR
A new Reachability view for the kinds that carry traffic (`Service`, `Ingress`, `HTTPRoute`, `GRPCRoute`, `Gateway`) that answers one operator question: "can traffic get to my app, and if not, exactly where and why does it break?" It reads like a DevOps engineer built it for someone who isn't a Kubernetes expert, and its defining constraint is honesty: it never says "broken" when traffic flows, and never says "healthy/reachable" for something it couldn't actually verify.

Two modes: Config check (passive, zero traffic; reads the declared path + config breaks) and Live test (on demand; DNS/TCP/TLS/HTTP probes, reported per-route and per-vantage). Surfaces: a full Reachability tab (Tree + Diagram), a passive drawer glance, and the MCP `diagnose` tool extended to network kinds.

How to review this (suggested reading order)
It's a large diff but it's one feature with a clear spine. Read in this order:
1. `internal/trace/coverage.go`: the honesty spine. `RouteResult{Outcome, Confidence}`, `CoverageVerdict`, `CoverageHeadline`. Understand `real` vs `indirect` confidence here and the rest follows.
2. `pkg/probe/probe.go`: the probe primitives + the two vantages (real path vs apiserver-proxy) + `proxyResult` / `isClusterUnreachable` (couldn't-reach ≠ unreachable).
3. `internal/trace/trace.go`: `computeVerdict` / `reviseVerdictWithProbes` (how findings + probes become healthy/degraded/broken/unknown).
4. `internal/trace/entries.go`: how each subject's static path graph (hops + findings) is built.
5. `packages/k8s-ui/src/components/trace/reachVerdict.ts` (operator-facing verdict), `traceToSubgraph.ts` (node/edge coloring), `TraceSummary.tsx` (drawer glance).

What to scrutinize: the honesty invariants below. Each is load-bearing and each has a test that pins it. If you can find a case that false-condemns (says broken when traffic flows) or overclaims (says reachable/healthy without real verification), that's the bug that matters most here.
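
For reviewers who want the honesty rule in one place before opening `coverage.go`, here is a minimal sketch of how a confidence tag can gate the headline. It is not the PR's code: `ProbeResult`, `Headline`, the `Authoritative` field, and the outcome constants are hypothetical names, and the real `RouteResult` and `CoverageHeadline` logic may differ. It only illustrates the stated rule that `indirect` results never set the headline, while an authoritative cache fact (such as 0 ready endpoints) is treated as a real failure.

```go
package main

import "fmt"

// Confidence records how a probe learned its result. The two values mirror
// the PR text; the Go types in this sketch are assumptions.
type Confidence string

const (
	Real     Confidence = "real"     // tested the way traffic flows
	Indirect Confidence = "indirect" // reached only via the API-server proxy
)

// Outcome is a hypothetical, simplified probe outcome.
type Outcome string

const (
	Reached  Outcome = "reached"
	Failed   Outcome = "failed"
	Untested Outcome = "untested"
)

// ProbeResult is a hypothetical stand-in for the PR's RouteResult.
type ProbeResult struct {
	Outcome       Outcome
	Confidence    Confidence
	Authoritative bool // assumed flag: a definitive cache fact, e.g. 0 ready endpoints
}

// Headline returns the route-level outcome. Only real-confidence results,
// or authoritative failures, are allowed to set it.
func Headline(results []ProbeResult) Outcome {
	headline := Untested
	for _, r := range results {
		if r.Authoritative && r.Outcome == Failed {
			return Failed // definitive regardless of vantage
		}
		if r.Confidence != Real {
			continue // indirect results localize a problem but never set the headline
		}
		if r.Outcome == Failed {
			headline = Failed
		} else if r.Outcome == Reached && headline != Failed {
			headline = Reached
		}
	}
	return headline
}

func main() {
	proxyOnlyFailure := []ProbeResult{{Outcome: Failed, Confidence: Indirect}}
	noReadyEndpoints := []ProbeResult{{Outcome: Failed, Confidence: Indirect, Authoritative: true}}
	directDial := []ProbeResult{{Outcome: Reached, Confidence: Real}}

	fmt.Println(Headline(proxyOnlyFailure)) // untested
	fmt.Println(Headline(noReadyEndpoints)) // failed
	fmt.Println(Headline(directDial))       // reached
}
```

In this sketch a proxy-only failure leaves the headline at "untested" rather than "failed", which is the false-condemn case the review is asked to hunt for.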
High-level architecture
The two ideas that do the heavy lifting
Confidence. Every probe is tagged by how it learned a result. `real` = tested the way traffic flows (direct dial / in-cluster). `indirect` = reached only via the API-server proxy; it localizes a problem but never sets the headline and never condemns the real path (the proxy bypasses NetworkPolicy and can fail for vantage-only reasons). The one exception: an authoritative cache fact (e.g. 0 ready endpoints) is definitive regardless of vantage and is promoted to a real failure.

Honesty invariants (each pinned by a test)
Enforced in, followed by the pinning test where one is named:
- `trace.go` `hopReachSeverity`; `traceToSubgraph.ts`. Tests: `TestComputeVerdict_PartialReadyBackendIsDegradedNotBroken`; traceToSubgraph "partial-ready backend → amber".
- `coverage.go` `upgradeDefinitiveBackendDown`. Test: `TestUpgradeDefinitiveBackendDown` (incl. multi-port + uncertain-warning-stays-soft).
- `pkg/probe/probe.go` `isClusterUnreachable`. Test: `TestIsClusterUnreachable`.
- `trace.go` `anyProbeReached`. Test: `TestAnyProbeReached_FailedProbeIsNotReached`.
- `coverage.go` `coverageBannerTone`, `singleRouteHeadline`.
- `reachVerdict.ts` `backendDown`.
- `coverage.go`, `TraceSummary.tsx`.
- `traceToSubgraph.ts`.

Screenshots
1. Backend down: names the cause, links the culprit, doesn't blame the network. The headline leads with the real cause, credits the wiring, and the node panel links straight to the failing Pod + the exact `kubectl logs` command. The Service node stays neutral (wired fine); the red is on the Pod (the actual fault).
2. Partial-ready: degraded, not condemned. 1 of 2 pods ready (one crashing, one serving). It still serves, so it reads degraded / reached, not the false "broken". This is the honesty line the feature is built on.
3. Coverage matrix: per-path, per-vantage. Each route × `FROM YOUR MACHINE` / `VIA API PROXY` / `IN-CLUSTER`. A name that doesn't resolve from the laptop is reported as that, never a false "unreachable".
4. Ingress, controller present: full front-door path. `Ingress → Service → Pods`, front-door honesty ("backend reachable, but the entry point wasn't tested from here"), missing-TLS-secret finding, copyable `curl --resolve`.
5. Ingress, no controller: the routing tier made visible. Rules exist but nothing serves them; plain explanation + the next step (`kubectl get ingressclass`). The controller tier other tools skip.
6. Passive drawer glance. In the resource drawer, Reachability is a calm one-line glance with a "Config check only, not tested" honesty subtitle and one CTA into the live test. It never nags.
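
The difference between screenshots 1 and 2 (no pods ready versus 1 of 2 ready) can be sketched as a small classifier. This is an illustration only: `BackendVerdict` is a hypothetical helper, not the PR's `computeVerdict`, and returning "unknown" when there are no pods at all is an assumption made to keep the sketch from overclaiming.

```go
package main

import "fmt"

// Verdict uses the four states named in the PR text.
type Verdict string

const (
	Healthy  Verdict = "healthy"
	Degraded Verdict = "degraded"
	Broken   Verdict = "broken"
	Unknown  Verdict = "unknown"
)

// BackendVerdict is a hypothetical helper that maps pod readiness to a verdict.
// A backend that still serves from at least one pod is degraded, never broken.
func BackendVerdict(ready, total int) Verdict {
	switch {
	case total <= 0 || ready < 0 || ready > total:
		return Unknown // assumption: no pods or inconsistent counts are not judged
	case ready == 0:
		return Broken // nothing can serve traffic
	case ready < total:
		return Degraded // partial-ready: traffic still flows
	default:
		return Healthy
	}
}

func main() {
	fmt.Println(BackendVerdict(0, 2)) // broken
	fmt.Println(BackendVerdict(1, 2)) // degraded
	fmt.Println(BackendVerdict(2, 2)) // healthy
}
```

Keeping the partial-ready case separate from the zero-ready case is what stops the view from showing a false "broken" while one pod is still serving.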
File map (80 files)
- `internal/trace/` (24): the engine. `trace.go` (verdict), `coverage.go` (route projection), `entries.go` (path graph), `probes.go` (probe orchestration), `findings.go`, `netpol.go` (NetworkPolicy evaluator), `ingress_controller.go` (controller tier), `egress.go`. The `cycleN_fixes_test.go` files are test-only regression suites from review rounds.
- `pkg/probe/` (2): probe primitives (DNS/TCP/TLS/HTTP, apiserver-proxy) + classification.
- `internal/reachability/` (4): the in-cluster probe runner (short-lived Jobs).
- `packages/k8s-ui/src/components/trace/` (22): the UI. `ReachabilityView` (Tree+Diagram), `TracePanel`, `reachVerdict`, `traceToSubgraph`, `TraceSummary` (drawer), `probe-display`.
- `internal/mcp/` (5), `internal/server/` (4): MCP `diagnose` extension + the `/api/trace` endpoint.
- `internal/k8s/` (6): cache wiring for the kinds the trace reads.
- `web/`, `cmd/`, `docs/`, `deploy/helm`, README: wiring, flags, docs.

Test plan
- `go test ./...` (incl. `internal/trace`, `pkg/probe`): green.
- `tsc` clean, full vitest suite green (incl. `reachVerdict` / `traceToSubgraph` / `TraceSummary` / `probe-display`).

Known limitations (deliberately deferred)
Surfaced by an independent critic pass and consciously left for a follow-up — each would be net-negative or out-of-scope to fix now:
- `NetworkPolicy` only. A path a CNI CRD policy allows could still show an amber "a rule would block traffic". Not capped: core NP is enforced on those CNIs too, so suppressing it would under-report a real deny, and the finding is hedged + clearable by the in-cluster test (subject to the real CNI).
- `IsNotFound` during informer sync. A just-applied or cross-scope backend can briefly read as a missing-ref break until the cache syncs. Narrow + self-healing; a scope-aware guard risks a worse false-clear.

Scope / risk notes
- `create jobs` + `list/get pods`.
- `GET`; per-request headers deferred).

Note
High Risk
Introduces mutating in-cluster probe Jobs under caller RBAC, new proxy-based active checks, and broad diagnosis surfaces (REST/MCP/UI) where false condemn or overclaim would mislead operators during incidents.
Overview
Adds network path diagnosis for `Service`, `Ingress`, `HTTPRoute`, `GRPCRoute`, and `Gateway`: a hop-ordered static trace from the informer cache plus optional live probes (DNS/TCP/TLS/HTTP) that respect vantage (`real` vs API-server `indirect`) so headlines do not overclaim reachability.

API & CLI: `GET /api/trace/{kind}/{ns}/{name}` with `?probe=true`; in-cluster tests via restricted short-lived Jobs (`POST …/probe-in-cluster`, `POST …/in-cluster`) and a `radar probe` subcommand. `--reachability-image` / `RADAR_IMAGE` / self-read of the running pod image pick the probe container; Helm injects `MY_POD_NAME` and `RADAR_IMAGE`.

MCP: `diagnose` now returns coverage-shaped traces for network entry kinds (`probe`, mutating `inCluster` with Member+RBAC gates). Tool annotations treat `diagnose` as non-read-only when `inCluster` is used.

Detection tweaks feeding the trace: Argo Rollout-aware scale-to-zero, uncertain Rollout lookup instead of critical “no endpoints”, readiness-probe vs Service target port (`svc:probe-port-mismatch`), richer Gateway backend port messages, and `ErrDynamicNotReady` for uninitialized dynamic cache.

Docs (`docs/diagnose.md`, README, MCP) and `make kind-load-probe` support local in-cluster probe development.

Reviewed by Cursor Bugbot for commit e66ff7b. Bugbot is set up for automated code reviews on this repo.