Skip to content

chore: refresh CLAUDE.md project structure and pin golangci-lint#33

Merged
mlorentedev merged 2 commits into
masterfrom
chore/refresh-claude-md-and-pin-linter
May 18, 2026
Merged

chore: refresh CLAUDE.md project structure and pin golangci-lint#33
mlorentedev merged 2 commits into
masterfrom
chore/refresh-claude-md-and-pin-linter

Conversation

@mlorentedev
Copy link
Copy Markdown
Owner

Summary

Two DX papercuts from the 2026-05-18 multi-dim audit (Major DX-1 and DX-2).

Fix 1 — Stale CLAUDE.md

`CLAUDE.md` claimed:

  • "Architecture: Single-binary, single-file (`main.go` ~785 lines)" → wrong since PR feat: graceful drain and multi-package refactoring #14 (ARCH-002) split into `main.go` (~270 lines orchestration) + four `internal/` packages.
  • Key Paths table listed only `main_test.go` and `main_integration_test.go` — missing every `internal/*` package, `specs/`, `scripts/tests/`.

Refreshed:

  • Updated Tech Stack to "Single binary, multi-package".
  • Key Paths now lists each `internal/` package + their roles, `specs/` SDD trees, `scripts/tests/` BATS suite, updated CI section.
  • Architecture Decisions block adds ADR-006 (Dialer) and ADR-007 (multi-package split) — making the current layout discoverable from the project memory file alone.

Fix 2 — Pin golangci-lint

`.github/workflows/ci.yml` had `version: latest`. Contributors running `make lint` locally with an older version see green while CI sees red (or the reverse), wasting investigation time.

Pinned to `v1.62.2` in CI and documented the same `go install [email protected]` in CLAUDE.md's Commands block. Bump deliberately on upgrade.

Diff

2 files changed, +22 / -11.

Test plan

  • CLAUDE.md text reflects actual file structure as of v1.7.0 (verified `wc -l` on every cited file)
  • CI's existing lint job continues to pass with v1.62.2 (the version was already passing prior commits at `latest` resolving to this range)
  • CI green on this PR

Two related DX papercuts surfaced by the 2026-05-18 multi-dim audit:

1. CLAUDE.md "Tech Stack" claimed `main.go ~785 lines` and "single-file"
   architecture. After ARCH-002 (PR #14), main.go is ~270 lines of
   orchestration plus four `internal/` packages (config, proxy, health,
   telemetry). The "Key Paths" table listed only main_test.go and
   main_integration_test.go, missing every internal/ package and the
   specs/ + scripts/tests/ trees added since.

   New contributors landing on the wrong file wastes onboarding time;
   tooling (claude-mem, vault refs, ADRs in the doc itself) was the
   only source of truth. Stale memory is worse than no memory.

2. .github/workflows/ci.yml pinned golangci-lint to `version: latest`,
   so contributors running `make lint` locally with whatever they
   installed could see green while CI sees red (or vice versa). Pinned
   to v1.62.2 in CI and documented the same install command in
   CLAUDE.md's Commands block so the two stay in lockstep.

CLAUDE.md also gains explicit references to ADR-006 (Dialer) and
ADR-007 (multi-package split) in the Architecture Decisions section,
making the layout discoverable from the project memory file.

Audit reference: 40-runbooks/audit-2026-05-18-multi-dim.md
(Major DX-1 and DX-2 entries).
Pinning golangci-lint to v1.62.2 in this PR activated gosimple
S1008, which flagged the final 'if ... { return true }; return false'
in isPermanentDialError. Collapsed to a direct return of the bool
expression — semantically identical, lints clean.

The block survived the previous 'latest' golangci-lint runs because
that version had different rule defaults. Pinning catches the drift,
exactly as intended.
@mlorentedev mlorentedev force-pushed the chore/refresh-claude-md-and-pin-linter branch from ae82c3f to f820f3e Compare May 18, 2026 19:31
@mlorentedev mlorentedev merged commit 40254f4 into master May 18, 2026
12 checks passed
@mlorentedev mlorentedev deleted the chore/refresh-claude-md-and-pin-linter branch May 18, 2026 19:49
mlorentedev added a commit that referenced this pull request May 18, 2026
Three related fixes surfaced by the 2026-05-18 multi-dim audit
(Major Scalability findings 1, 2, 3). Each addresses a distinct
runtime behavior that hurts the system under bursty or partial-failure
load:

1) TOCTOU on MaxConnections cap. AcceptLoop previously did
   `GetActiveConnections() >= max` check, then `AddActiveConnection(1)`
   inside handleConn. Under accept storms two goroutines could both
   observe `cur < max` and both proceed past the limit.

   Fix: new `telemetry.TryClaimConnection(max)` uses an atomic CAS loop
   to increment ActiveConnections only when the new value would not
   exceed the cap. handleConn no longer manages the slot — the
   AcceptLoop's spawned goroutine releases with
   `defer telemetry.AddActiveConnection(-1)`. handleConn keeps
   AddTotalConnection so direct unit-test callers still observe the
   per-call total increment.

2) Dial timeout reuse. handleConn previously used cfg.ConnectTimeout
   (TS_TIMEOUT, default 30s) for the per-target dial. Combined with
   ReconnectDialer retries and 30s backoff cap, one stuck connection
   could hold a slot for ~3 minutes.

   Fix: new TS_DIAL_TIMEOUT env var, default 5s, parsed in config
   with > 0 requirement. handleConn now uses cfg.DialTimeout for the
   per-connection ctx. cfg.ConnectTimeout remains for tsnet init
   (control-plane handshake legitimately needs the larger budget).

3) Half-close not honored. proxyConnections previously used sync.Once
   to slam BOTH ends on the first direction's exit. A still-active
   direction mid-write would lose in-flight bytes — RDP keepalive
   teardowns triggered this in practice.

   Fix: each direction runs in its own goroutine and (on graceful EOF)
   calls CloseWrite on dst, letting the peer see EOF without tearing
   down the opposite direction. New halfCloseWrite() helper unwraps
   idleConn before type-asserting on the underlying halfCloser
   interface (idleConn embeds net.Conn, doesn't promote CloseWrite).
   Falls back to full Close when the conn doesn't implement CloseWrite
   (e.g. net.Pipe in tests). Both ends fully closed after both
   directions return (idempotent).

Telemetry test coverage:
- TestTryClaimConnection_HonorsLimit: basic claim/release
- TestTryClaimConnection_AtomicUnderRace: 500 goroutines race for 50
  slots; assert exactly 50 succeed (no overshoot)
- TestTryClaimConnection_ZeroCapAlwaysFails: edge

Proxy test coverage:
- TestHalfCloseWrite_UnwrapsIdleConn: idleConn wrapper passes through
- TestHalfCloseWrite_ReturnsFalseForNonHalfCloser: net.Pipe fallback
- TestProxyConnections_HalfClosesOnEOF: end-to-end with mock
  halfCloseConn that records CloseWrite invocation

Config test coverage:
- 4 new cases for TS_DIAL_TIMEOUT (default, parsed, zero rejected,
  negative rejected)

Also includes the gosimple S1008 fix in reconnect.go that PR #33
introduced — kept here so this PR is independently lintable against
the pinned v1.62.2 even if PR #33 hasn't merged yet.

Audit reference: 40-runbooks/audit-2026-05-18-multi-dim.md
(Major Scalability 1, 2, 3).
mlorentedev added a commit that referenced this pull request May 18, 2026
## Summary

Three related fixes surfaced by the [2026-05-18 multi-dim
audit](https://github.com/mlorentedev/ts-bridge/blob/master/specs/)
(Major Scalability 1/2/3). Each addresses a distinct runtime behavior
that hurts the system under bursty or partial-failure load.

## What changed

### Fix 1 — TOCTOU on \`MaxConnections\`

\`AcceptLoop\` previously did \`GetActiveConnections() >= max\` then
\`AddActiveConnection(1)\` inside \`handleConn\`. Under accept storms,
two goroutines could both observe \`cur < max\` and both proceed past
the cap.

**Fix:** new \`telemetry.TryClaimConnection(max)\` uses an atomic CAS
loop. \`AcceptLoop\`'s spawned goroutine handles release via \`defer
telemetry.AddActiveConnection(-1)\`. \`handleConn\` no longer manages
the slot.

### Fix 2 — \`TS_DIAL_TIMEOUT\` separate from \`TS_TIMEOUT\`

\`handleConn\` previously used \`cfg.ConnectTimeout\` (TS_TIMEOUT,
default 30s) for per-dial context. Combined with \`ReconnectDialer\`
retries + 30s backoff cap, one stuck connection could hold a slot for ~3
minutes.

**Fix:** new env var \`TS_DIAL_TIMEOUT\` (default \`5s\`, must be > 0).
\`TS_TIMEOUT\` retains its role for tsnet init (control-plane handshake
legitimately needs the larger budget).

### Fix 3 — Half-close support in \`proxyConnections\`

Old code used \`sync.Once\` to slam BOTH ends on the first direction's
exit, losing in-flight bytes on the other direction (RDP keepalive
teardowns triggered this).

**Fix:** each direction now runs in its own goroutine; on graceful EOF,
\`CloseWrite\` is invoked on \`dst\`. New \`halfCloseWrite\` helper
unwraps \`*idleConn\` (which embeds \`net.Conn\` and doesn't promote the
method) before the type-assert. Falls back to full \`Close\` for
non-\`halfCloser\` conns (e.g., \`net.Pipe\` in tests). Both ends fully
closed after both directions complete.

## Diff

10 files, +304 / -39 (production ~120 lines + tests + docs).

## Tests added

| Test | Verifies |
|---|---|
| \`TestTryClaimConnection_HonorsLimit\` | Basic claim/release |
| \`TestTryClaimConnection_AtomicUnderRace\` | 500 goroutines race for
50 slots; exactly 50 succeed (no overshoot) |
| \`TestTryClaimConnection_ZeroCapAlwaysFails\` | Edge |
| \`TestHalfCloseWrite_UnwrapsIdleConn\` | idleConn passes through |
| \`TestHalfCloseWrite_ReturnsFalseForNonHalfCloser\` | net.Pipe
fallback |
| \`TestProxyConnections_HalfClosesOnEOF\` | End-to-end with mock
\`halfCloseConn\` that records \`CloseWrite\` |
| \`TestLoadConfig/dial_timeout_*\` (×4) | Default 5s, parsed, zero
rejected, negative rejected |

Plus \`TestConnectionLimit\` rewritten to validate the new
\`TryClaimConnection\` API instead of simulating the deprecated
check-then-act pattern.

## Notes

- Also carries the \`gosimple S1008\` fix in \`reconnect.go\` introduced
by PR #33 (kept here so this PR is independently lintable against the
pinned v1.62.2 if PR #33 hasn't merged yet — duplicate change drops out
cleanly on rebase).
- This is a \`fix:\` commit — release-please will bump v1.7.1 → v1.7.2
(patch).
- All operator-visible changes are documented in the new
\`TS_DIAL_TIMEOUT\` row in Starlight + \`.env.example\`. No breaking
change for existing deployments (default 5s is sane, and operators with
non-trivial latency can set higher explicitly).

## Test plan

- [x] \`go test ./...\` → all packages PASS locally
- [x] \`go vet ./...\` clean
- [x] \`go build ./...\` clean
- [ ] CI race detector + pinned golangci-lint
- [ ] CI gosec
- [ ] CI build-matrix across 6 platforms
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant