Skip to content

fix: restore dead ports on multi-port service reconnect (#509)#510

Merged
cjimti merged 3 commits into
masterfrom
fix/509-multiport-reconnect-zombie
May 29, 2026
Merged

fix: restore dead ports on multi-port service reconnect (#509)#510
cjimti merged 3 commits into
masterfrom
fix/509-multiport-reconnect-zombie

Conversation

@cjimti
Copy link
Copy Markdown
Member

@cjimti cjimti commented May 29, 2026

Fixes #509.

Problem

A multi-port normal service could enter an unrecoverable zombie state after one of its ports had its connection reset (e.g. the TCP RST behavior of kubernetes/kubernetes#111825, which kills a single port's kubelet listener). Auto-reconnect (-a, on by default in --tui) would fire, re-run SyncPodForwards, find the pod — and then do nothing: the dead port was never re-established and the service's /etc/hosts entries were never restored. Only a full kubefwd restart recovered the service.

Root cause

The defect is in syncNormalService (pkg/fwdservice/fwdservice.go). A normal service forwards a single pod, but each of its ports is a separate entry in PortForwards (key: service.podname.localport). The sync logic tracked a single forward key to keep (keyToKeep) and skipped LoopPodsToForward entirely whenever any forward for the pod still existed. So when one port survived, the dead port's slot was skipped forever.

This was introduced by 64ee12c: before it, the keep-comparison compared a map key against pod.Name (never equal), so keyToKeep stayed empty and every resync rebuilt all ports — inefficient but correct for multi-port. After 64ee12c the comparison started matching, which also meant healthy multi-port services would lose all-but-one port on the routine 5-minute resync, not just after an RST.

Fix

Reason in terms of the pod name to keep instead of a single forward key:

  • keep all forwards belonging to the chosen pod,
  • remove forwards belonging to any other pod,
  • always re-invoke LoopPodsToForward for the kept pod.

LoopPodsToForward already skips ports that are already forwarded, so this is a no-op for a healthy pod and re-establishes any independently-torn-down port (re-adding its /etc/hosts entries) otherwise. Headless services were already unaffected (they always loop all pods).

Tests

Full suite + -race + lint + build all pass.

Note

This branch also contains a small build commit (build: write verify sentinel...) that wires a missing local pre-commit-gate sentinel into make verify. It is unrelated to the bug fix; happy to drop it if you'd prefer the PR contain only the fix.

cjimti added 2 commits May 29, 2026 02:14
The local pre-commit gate (~/.claude/hooks/review-gate.sh) requires
`make verify` to record the working-tree diff hash to
.claude/.last-verify-passed on success, but the verify target never
wrote it, so the gate could never be satisfied. Add a write-verify-sentinel
step that records the same hash the gate computes. .claude is gitignored,
so writing the sentinel does not alter the diff.
A multi-port normal service could enter an unrecoverable zombie state
after one port's connection was reset (e.g. the TCP RST behavior of
kubernetes/kubernetes#111825 that kills a single port's kubelet
listener). Auto-reconnect re-ran SyncPodForwards, found the pod, but
never recreated the dead port or restored its /etc/hosts entries.

Root cause is in syncNormalService: it tracked a single forward KEY to
keep (one port) and skipped LoopPodsToForward entirely whenever any
forward for the pod still existed. The surviving port's map entry caused
the dead port to be skipped forever. This also degraded healthy
multi-port services on the periodic resync, dropping all but one port.

Reason in terms of the pod NAME to keep instead of a single key: keep
all forwards for the chosen pod, remove forwards for other pods, and
always re-run LoopPodsToForward for the kept pod. LoopPodsToForward
already skips ports that exist, so it is a no-op when healthy and
re-establishes any torn-down port (re-adding /etc/hosts) otherwise.

Add regression tests covering dead-port restoration and that a healthy
multi-port resync retains all ports.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 29, 2026

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 81.18%. Comparing base (1f6aa95) to head (e775923).

Files with missing lines Patch % Lines
pkg/fwdservice/fwdservice.go 90.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #510      +/-   ##
==========================================
+ Coverage   81.07%   81.18%   +0.10%     
==========================================
  Files          71       71              
  Lines       12843    12848       +5     
==========================================
+ Hits        10413    10431      +18     
+ Misses       2018     2009       -9     
+ Partials      412      408       -4     
Files with missing lines Coverage Δ
pkg/fwdservice/fwdservice.go 81.57% <90.00%> (+3.17%) ⬆️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Add an end-to-end regression test that forwards a two-port service
(single nginx pod serving 80 and 8080) and forces a resync via the REST
API (POST /v1/services/:key/sync?force=true), then asserts both ports
still serve. This drives the same syncNormalService path auto-reconnect
uses; on the pre-fix code the forced sync dropped one port and removed
the shared /etc/hosts entry, breaking the service.

Reproducing the exact upstream RST trigger (kubernetes/kubernetes#111825)
is not deterministic, so the forced-sync path is used to exercise the bug
reliably.

Adds test/manifests/multiport-service.yaml (auto-deployed by
deploy-test-services.sh).
@cjimti cjimti merged commit fd564d9 into master May 29, 2026
11 checks passed
@cjimti cjimti deleted the fix/509-multiport-reconnect-zombie branch May 29, 2026 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Auto-reconnect leaves multi-port services in unrecoverable zombie state after TCP RST

1 participant