chore: refresh CLAUDE.md project structure and pin golangci-lint#33
Merged
Conversation
Two related DX papercuts surfaced by the 2026-05-18 multi-dim audit: 1. CLAUDE.md "Tech Stack" claimed `main.go ~785 lines` and "single-file" architecture. After ARCH-002 (PR #14), main.go is ~270 lines of orchestration plus four `internal/` packages (config, proxy, health, telemetry). The "Key Paths" table listed only main_test.go and main_integration_test.go, missing every internal/ package and the specs/ + scripts/tests/ trees added since. New contributors landing on the wrong file wastes onboarding time; tooling (claude-mem, vault refs, ADRs in the doc itself) was the only source of truth. Stale memory is worse than no memory. 2. .github/workflows/ci.yml pinned golangci-lint to `version: latest`, so contributors running `make lint` locally with whatever they installed could see green while CI sees red (or vice versa). Pinned to v1.62.2 in CI and documented the same install command in CLAUDE.md's Commands block so the two stay in lockstep. CLAUDE.md also gains explicit references to ADR-006 (Dialer) and ADR-007 (multi-package split) in the Architecture Decisions section, making the layout discoverable from the project memory file. Audit reference: 40-runbooks/audit-2026-05-18-multi-dim.md (Major DX-1 and DX-2 entries).
Pinning golangci-lint to v1.62.2 in this PR activated gosimple
S1008, which flagged the final 'if ... { return true }; return false'
in isPermanentDialError. Collapsed to a direct return of the bool
expression — semantically identical, lints clean.
The block survived the previous 'latest' golangci-lint runs because
that version had different rule defaults. Pinning catches the drift,
exactly as intended.
ae82c3f to
f820f3e
Compare
6 tasks
mlorentedev
added a commit
that referenced
this pull request
May 18, 2026
Three related fixes surfaced by the 2026-05-18 multi-dim audit (Major Scalability findings 1, 2, 3). Each addresses a distinct runtime behavior that hurts the system under bursty or partial-failure load: 1) TOCTOU on MaxConnections cap. AcceptLoop previously did `GetActiveConnections() >= max` check, then `AddActiveConnection(1)` inside handleConn. Under accept storms two goroutines could both observe `cur < max` and both proceed past the limit. Fix: new `telemetry.TryClaimConnection(max)` uses an atomic CAS loop to increment ActiveConnections only when the new value would not exceed the cap. handleConn no longer manages the slot — the AcceptLoop's spawned goroutine releases with `defer telemetry.AddActiveConnection(-1)`. handleConn keeps AddTotalConnection so direct unit-test callers still observe the per-call total increment. 2) Dial timeout reuse. handleConn previously used cfg.ConnectTimeout (TS_TIMEOUT, default 30s) for the per-target dial. Combined with ReconnectDialer retries and 30s backoff cap, one stuck connection could hold a slot for ~3 minutes. Fix: new TS_DIAL_TIMEOUT env var, default 5s, parsed in config with > 0 requirement. handleConn now uses cfg.DialTimeout for the per-connection ctx. cfg.ConnectTimeout remains for tsnet init (control-plane handshake legitimately needs the larger budget). 3) Half-close not honored. proxyConnections previously used sync.Once to slam BOTH ends on the first direction's exit. A still-active direction mid-write would lose in-flight bytes — RDP keepalive teardowns triggered this in practice. Fix: each direction runs in its own goroutine and (on graceful EOF) calls CloseWrite on dst, letting the peer see EOF without tearing down the opposite direction. New halfCloseWrite() helper unwraps idleConn before type-asserting on the underlying halfCloser interface (idleConn embeds net.Conn, doesn't promote CloseWrite). Falls back to full Close when the conn doesn't implement CloseWrite (e.g. net.Pipe in tests). Both ends fully closed after both directions return (idempotent). Telemetry test coverage: - TestTryClaimConnection_HonorsLimit: basic claim/release - TestTryClaimConnection_AtomicUnderRace: 500 goroutines race for 50 slots; assert exactly 50 succeed (no overshoot) - TestTryClaimConnection_ZeroCapAlwaysFails: edge Proxy test coverage: - TestHalfCloseWrite_UnwrapsIdleConn: idleConn wrapper passes through - TestHalfCloseWrite_ReturnsFalseForNonHalfCloser: net.Pipe fallback - TestProxyConnections_HalfClosesOnEOF: end-to-end with mock halfCloseConn that records CloseWrite invocation Config test coverage: - 4 new cases for TS_DIAL_TIMEOUT (default, parsed, zero rejected, negative rejected) Also includes the gosimple S1008 fix in reconnect.go that PR #33 introduced — kept here so this PR is independently lintable against the pinned v1.62.2 even if PR #33 hasn't merged yet. Audit reference: 40-runbooks/audit-2026-05-18-multi-dim.md (Major Scalability 1, 2, 3).
mlorentedev
added a commit
that referenced
this pull request
May 18, 2026
## Summary Three related fixes surfaced by the [2026-05-18 multi-dim audit](https://github.com/mlorentedev/ts-bridge/blob/master/specs/) (Major Scalability 1/2/3). Each addresses a distinct runtime behavior that hurts the system under bursty or partial-failure load. ## What changed ### Fix 1 — TOCTOU on \`MaxConnections\` \`AcceptLoop\` previously did \`GetActiveConnections() >= max\` then \`AddActiveConnection(1)\` inside \`handleConn\`. Under accept storms, two goroutines could both observe \`cur < max\` and both proceed past the cap. **Fix:** new \`telemetry.TryClaimConnection(max)\` uses an atomic CAS loop. \`AcceptLoop\`'s spawned goroutine handles release via \`defer telemetry.AddActiveConnection(-1)\`. \`handleConn\` no longer manages the slot. ### Fix 2 — \`TS_DIAL_TIMEOUT\` separate from \`TS_TIMEOUT\` \`handleConn\` previously used \`cfg.ConnectTimeout\` (TS_TIMEOUT, default 30s) for per-dial context. Combined with \`ReconnectDialer\` retries + 30s backoff cap, one stuck connection could hold a slot for ~3 minutes. **Fix:** new env var \`TS_DIAL_TIMEOUT\` (default \`5s\`, must be > 0). \`TS_TIMEOUT\` retains its role for tsnet init (control-plane handshake legitimately needs the larger budget). ### Fix 3 — Half-close support in \`proxyConnections\` Old code used \`sync.Once\` to slam BOTH ends on the first direction's exit, losing in-flight bytes on the other direction (RDP keepalive teardowns triggered this). **Fix:** each direction now runs in its own goroutine; on graceful EOF, \`CloseWrite\` is invoked on \`dst\`. New \`halfCloseWrite\` helper unwraps \`*idleConn\` (which embeds \`net.Conn\` and doesn't promote the method) before the type-assert. Falls back to full \`Close\` for non-\`halfCloser\` conns (e.g., \`net.Pipe\` in tests). Both ends fully closed after both directions complete. ## Diff 10 files, +304 / -39 (production ~120 lines + tests + docs). ## Tests added | Test | Verifies | |---|---| | \`TestTryClaimConnection_HonorsLimit\` | Basic claim/release | | \`TestTryClaimConnection_AtomicUnderRace\` | 500 goroutines race for 50 slots; exactly 50 succeed (no overshoot) | | \`TestTryClaimConnection_ZeroCapAlwaysFails\` | Edge | | \`TestHalfCloseWrite_UnwrapsIdleConn\` | idleConn passes through | | \`TestHalfCloseWrite_ReturnsFalseForNonHalfCloser\` | net.Pipe fallback | | \`TestProxyConnections_HalfClosesOnEOF\` | End-to-end with mock \`halfCloseConn\` that records \`CloseWrite\` | | \`TestLoadConfig/dial_timeout_*\` (×4) | Default 5s, parsed, zero rejected, negative rejected | Plus \`TestConnectionLimit\` rewritten to validate the new \`TryClaimConnection\` API instead of simulating the deprecated check-then-act pattern. ## Notes - Also carries the \`gosimple S1008\` fix in \`reconnect.go\` introduced by PR #33 (kept here so this PR is independently lintable against the pinned v1.62.2 if PR #33 hasn't merged yet — duplicate change drops out cleanly on rebase). - This is a \`fix:\` commit — release-please will bump v1.7.1 → v1.7.2 (patch). - All operator-visible changes are documented in the new \`TS_DIAL_TIMEOUT\` row in Starlight + \`.env.example\`. No breaking change for existing deployments (default 5s is sane, and operators with non-trivial latency can set higher explicitly). ## Test plan - [x] \`go test ./...\` → all packages PASS locally - [x] \`go vet ./...\` clean - [x] \`go build ./...\` clean - [ ] CI race detector + pinned golangci-lint - [ ] CI gosec - [ ] CI build-matrix across 6 platforms
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two DX papercuts from the 2026-05-18 multi-dim audit (Major DX-1 and DX-2).
Fix 1 — Stale CLAUDE.md
`CLAUDE.md` claimed:
Refreshed:
Fix 2 — Pin golangci-lint
`.github/workflows/ci.yml` had `version: latest`. Contributors running `make lint` locally with an older version see green while CI sees red (or the reverse), wasting investigation time.
Pinned to `v1.62.2` in CI and documented the same `go install [email protected]` in CLAUDE.md's Commands block. Bump deliberately on upgrade.
Diff
2 files changed, +22 / -11.
Test plan