Skip to content

fix: harden proxy under load — TOCTOU, dial timeout, half-close#34

Merged
mlorentedev merged 3 commits into
masterfrom
fix/proxy-harden-under-load
May 18, 2026
Merged

fix: harden proxy under load — TOCTOU, dial timeout, half-close#34
mlorentedev merged 3 commits into
masterfrom
fix/proxy-harden-under-load

Conversation

@mlorentedev
Copy link
Copy Markdown
Owner

Summary

Three related fixes surfaced by the 2026-05-18 multi-dim audit (Major Scalability 1/2/3). Each addresses a distinct runtime behavior that hurts the system under bursty or partial-failure load.

What changed

Fix 1 — TOCTOU on `MaxConnections`

`AcceptLoop` previously did `GetActiveConnections() >= max` then `AddActiveConnection(1)` inside `handleConn`. Under accept storms, two goroutines could both observe `cur < max` and both proceed past the cap.

Fix: new `telemetry.TryClaimConnection(max)` uses an atomic CAS loop. `AcceptLoop`'s spawned goroutine handles release via `defer telemetry.AddActiveConnection(-1)`. `handleConn` no longer manages the slot.

Fix 2 — `TS_DIAL_TIMEOUT` separate from `TS_TIMEOUT`

`handleConn` previously used `cfg.ConnectTimeout` (TS_TIMEOUT, default 30s) for per-dial context. Combined with `ReconnectDialer` retries + 30s backoff cap, one stuck connection could hold a slot for ~3 minutes.

Fix: new env var `TS_DIAL_TIMEOUT` (default `5s`, must be > 0). `TS_TIMEOUT` retains its role for tsnet init (control-plane handshake legitimately needs the larger budget).

Fix 3 — Half-close support in `proxyConnections`

Old code used `sync.Once` to slam BOTH ends on the first direction's exit, losing in-flight bytes on the other direction (RDP keepalive teardowns triggered this).

Fix: each direction now runs in its own goroutine; on graceful EOF, `CloseWrite` is invoked on `dst`. New `halfCloseWrite` helper unwraps `*idleConn` (which embeds `net.Conn` and doesn't promote the method) before the type-assert. Falls back to full `Close` for non-`halfCloser` conns (e.g., `net.Pipe` in tests). Both ends fully closed after both directions complete.

Diff

10 files, +304 / -39 (production ~120 lines + tests + docs).

Tests added

Test Verifies
`TestTryClaimConnection_HonorsLimit` Basic claim/release
`TestTryClaimConnection_AtomicUnderRace` 500 goroutines race for 50 slots; exactly 50 succeed (no overshoot)
`TestTryClaimConnection_ZeroCapAlwaysFails` Edge
`TestHalfCloseWrite_UnwrapsIdleConn` idleConn passes through
`TestHalfCloseWrite_ReturnsFalseForNonHalfCloser` net.Pipe fallback
`TestProxyConnections_HalfClosesOnEOF` End-to-end with mock `halfCloseConn` that records `CloseWrite`
`TestLoadConfig/dial_timeout_*` (×4) Default 5s, parsed, zero rejected, negative rejected

Plus `TestConnectionLimit` rewritten to validate the new `TryClaimConnection` API instead of simulating the deprecated check-then-act pattern.

Notes

Test plan

  • `go test ./...` → all packages PASS locally
  • `go vet ./...` clean
  • `go build ./...` clean
  • CI race detector + pinned golangci-lint
  • CI gosec
  • CI build-matrix across 6 platforms

Three related fixes surfaced by the 2026-05-18 multi-dim audit
(Major Scalability findings 1, 2, 3). Each addresses a distinct
runtime behavior that hurts the system under bursty or partial-failure
load:

1) TOCTOU on MaxConnections cap. AcceptLoop previously did
   `GetActiveConnections() >= max` check, then `AddActiveConnection(1)`
   inside handleConn. Under accept storms two goroutines could both
   observe `cur < max` and both proceed past the limit.

   Fix: new `telemetry.TryClaimConnection(max)` uses an atomic CAS loop
   to increment ActiveConnections only when the new value would not
   exceed the cap. handleConn no longer manages the slot — the
   AcceptLoop's spawned goroutine releases with
   `defer telemetry.AddActiveConnection(-1)`. handleConn keeps
   AddTotalConnection so direct unit-test callers still observe the
   per-call total increment.

2) Dial timeout reuse. handleConn previously used cfg.ConnectTimeout
   (TS_TIMEOUT, default 30s) for the per-target dial. Combined with
   ReconnectDialer retries and 30s backoff cap, one stuck connection
   could hold a slot for ~3 minutes.

   Fix: new TS_DIAL_TIMEOUT env var, default 5s, parsed in config
   with > 0 requirement. handleConn now uses cfg.DialTimeout for the
   per-connection ctx. cfg.ConnectTimeout remains for tsnet init
   (control-plane handshake legitimately needs the larger budget).

3) Half-close not honored. proxyConnections previously used sync.Once
   to slam BOTH ends on the first direction's exit. A still-active
   direction mid-write would lose in-flight bytes — RDP keepalive
   teardowns triggered this in practice.

   Fix: each direction runs in its own goroutine and (on graceful EOF)
   calls CloseWrite on dst, letting the peer see EOF without tearing
   down the opposite direction. New halfCloseWrite() helper unwraps
   idleConn before type-asserting on the underlying halfCloser
   interface (idleConn embeds net.Conn, doesn't promote CloseWrite).
   Falls back to full Close when the conn doesn't implement CloseWrite
   (e.g. net.Pipe in tests). Both ends fully closed after both
   directions return (idempotent).

Telemetry test coverage:
- TestTryClaimConnection_HonorsLimit: basic claim/release
- TestTryClaimConnection_AtomicUnderRace: 500 goroutines race for 50
  slots; assert exactly 50 succeed (no overshoot)
- TestTryClaimConnection_ZeroCapAlwaysFails: edge

Proxy test coverage:
- TestHalfCloseWrite_UnwrapsIdleConn: idleConn wrapper passes through
- TestHalfCloseWrite_ReturnsFalseForNonHalfCloser: net.Pipe fallback
- TestProxyConnections_HalfClosesOnEOF: end-to-end with mock
  halfCloseConn that records CloseWrite invocation

Config test coverage:
- 4 new cases for TS_DIAL_TIMEOUT (default, parsed, zero rejected,
  negative rejected)

Also includes the gosimple S1008 fix in reconnect.go that PR #33
introduced — kept here so this PR is independently lintable against
the pinned v1.62.2 even if PR #33 hasn't merged yet.

Audit reference: 40-runbooks/audit-2026-05-18-multi-dim.md
(Major Scalability 1, 2, 3).
…o 15

Adding TS_DIAL_TIMEOUT parsing pushed LoadConfig's cyclomatic complexity
from 14 to 16, tripping golangci-lint gocyclo (threshold 15).

Mirrors the earlier parseDialConfig extraction (ARCH-004): group the two
per-connection timeout knobs into one helper, dropping the count back
below threshold without changing observable semantics.

No test changes — existing TS_IDLE_TIMEOUT / TS_DIAL_TIMEOUT cases
exercise the new helper transparently.
@mlorentedev mlorentedev force-pushed the fix/proxy-harden-under-load branch from b1ad862 to c17ac35 Compare May 18, 2026 19:50
@mlorentedev mlorentedev merged commit c769153 into master May 18, 2026
12 checks passed
@mlorentedev mlorentedev deleted the fix/proxy-harden-under-load branch May 18, 2026 19:58
mlorentedev added a commit that referenced this pull request May 18, 2026
🤖 I have created a release *beep* *boop*
---


##
[1.7.2](v1.7.1...v1.7.2)
(2026-05-18)


### Bug Fixes

* harden proxy under load — TOCTOU, dial timeout, half-close
([#34](#34))
([c769153](c769153))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant