fix: harden proxy under load — TOCTOU, dial timeout, half-close#34
Merged
Conversation
Three related fixes surfaced by the 2026-05-18 multi-dim audit (Major Scalability findings 1, 2, 3). Each addresses a distinct runtime behavior that hurts the system under bursty or partial-failure load: 1) TOCTOU on MaxConnections cap. AcceptLoop previously did `GetActiveConnections() >= max` check, then `AddActiveConnection(1)` inside handleConn. Under accept storms two goroutines could both observe `cur < max` and both proceed past the limit. Fix: new `telemetry.TryClaimConnection(max)` uses an atomic CAS loop to increment ActiveConnections only when the new value would not exceed the cap. handleConn no longer manages the slot — the AcceptLoop's spawned goroutine releases with `defer telemetry.AddActiveConnection(-1)`. handleConn keeps AddTotalConnection so direct unit-test callers still observe the per-call total increment. 2) Dial timeout reuse. handleConn previously used cfg.ConnectTimeout (TS_TIMEOUT, default 30s) for the per-target dial. Combined with ReconnectDialer retries and 30s backoff cap, one stuck connection could hold a slot for ~3 minutes. Fix: new TS_DIAL_TIMEOUT env var, default 5s, parsed in config with > 0 requirement. handleConn now uses cfg.DialTimeout for the per-connection ctx. cfg.ConnectTimeout remains for tsnet init (control-plane handshake legitimately needs the larger budget). 3) Half-close not honored. proxyConnections previously used sync.Once to slam BOTH ends on the first direction's exit. A still-active direction mid-write would lose in-flight bytes — RDP keepalive teardowns triggered this in practice. Fix: each direction runs in its own goroutine and (on graceful EOF) calls CloseWrite on dst, letting the peer see EOF without tearing down the opposite direction. New halfCloseWrite() helper unwraps idleConn before type-asserting on the underlying halfCloser interface (idleConn embeds net.Conn, doesn't promote CloseWrite). Falls back to full Close when the conn doesn't implement CloseWrite (e.g. net.Pipe in tests). Both ends fully closed after both directions return (idempotent). Telemetry test coverage: - TestTryClaimConnection_HonorsLimit: basic claim/release - TestTryClaimConnection_AtomicUnderRace: 500 goroutines race for 50 slots; assert exactly 50 succeed (no overshoot) - TestTryClaimConnection_ZeroCapAlwaysFails: edge Proxy test coverage: - TestHalfCloseWrite_UnwrapsIdleConn: idleConn wrapper passes through - TestHalfCloseWrite_ReturnsFalseForNonHalfCloser: net.Pipe fallback - TestProxyConnections_HalfClosesOnEOF: end-to-end with mock halfCloseConn that records CloseWrite invocation Config test coverage: - 4 new cases for TS_DIAL_TIMEOUT (default, parsed, zero rejected, negative rejected) Also includes the gosimple S1008 fix in reconnect.go that PR #33 introduced — kept here so this PR is independently lintable against the pinned v1.62.2 even if PR #33 hasn't merged yet. Audit reference: 40-runbooks/audit-2026-05-18-multi-dim.md (Major Scalability 1, 2, 3).
…o 15 Adding TS_DIAL_TIMEOUT parsing pushed LoadConfig's cyclomatic complexity from 14 to 16, tripping golangci-lint gocyclo (threshold 15). Mirrors the earlier parseDialConfig extraction (ARCH-004): group the two per-connection timeout knobs into one helper, dropping the count back below threshold without changing observable semantics. No test changes — existing TS_IDLE_TIMEOUT / TS_DIAL_TIMEOUT cases exercise the new helper transparently.
b1ad862 to
c17ac35
Compare
mlorentedev
added a commit
that referenced
this pull request
May 18, 2026
🤖 I have created a release *beep* *boop* --- ## [1.7.2](v1.7.1...v1.7.2) (2026-05-18) ### Bug Fixes * harden proxy under load — TOCTOU, dial timeout, half-close ([#34](#34)) ([c769153](c769153)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three related fixes surfaced by the 2026-05-18 multi-dim audit (Major Scalability 1/2/3). Each addresses a distinct runtime behavior that hurts the system under bursty or partial-failure load.
What changed
Fix 1 — TOCTOU on `MaxConnections`
`AcceptLoop` previously did `GetActiveConnections() >= max` then `AddActiveConnection(1)` inside `handleConn`. Under accept storms, two goroutines could both observe `cur < max` and both proceed past the cap.
Fix: new `telemetry.TryClaimConnection(max)` uses an atomic CAS loop. `AcceptLoop`'s spawned goroutine handles release via `defer telemetry.AddActiveConnection(-1)`. `handleConn` no longer manages the slot.
Fix 2 — `TS_DIAL_TIMEOUT` separate from `TS_TIMEOUT`
`handleConn` previously used `cfg.ConnectTimeout` (TS_TIMEOUT, default 30s) for per-dial context. Combined with `ReconnectDialer` retries + 30s backoff cap, one stuck connection could hold a slot for ~3 minutes.
Fix: new env var `TS_DIAL_TIMEOUT` (default `5s`, must be > 0). `TS_TIMEOUT` retains its role for tsnet init (control-plane handshake legitimately needs the larger budget).
Fix 3 — Half-close support in `proxyConnections`
Old code used `sync.Once` to slam BOTH ends on the first direction's exit, losing in-flight bytes on the other direction (RDP keepalive teardowns triggered this).
Fix: each direction now runs in its own goroutine; on graceful EOF, `CloseWrite` is invoked on `dst`. New `halfCloseWrite` helper unwraps `*idleConn` (which embeds `net.Conn` and doesn't promote the method) before the type-assert. Falls back to full `Close` for non-`halfCloser` conns (e.g., `net.Pipe` in tests). Both ends fully closed after both directions complete.
Diff
10 files, +304 / -39 (production ~120 lines + tests + docs).
Tests added
Plus `TestConnectionLimit` rewritten to validate the new `TryClaimConnection` API instead of simulating the deprecated check-then-act pattern.
Notes
Test plan