ToolHive Re-Attachment fix #2118

therealnb · 2025-10-08T10:29:02Z

Addresses #2117. This is a suggested fix.

File: /Users/nigel/code/toolhive/pkg/transport/stdio.go
The Problem: When Docker/Rancher Desktop restarted, ToolHive's stdio read loops died and never reconnected.
The Solution: Added automatic re-attachment with intelligent retry logic:
10 retries with exponential backoff (2s, 4s, 8s, 16s, 30s...)
Detects Docker socket unavailability
Waits for Docker to come back online
Re-attaches to containers automatically
Test Results After Docker Restart
✅ GitHub Server:
✅ Time Server: Same successful pattern!
✅ SSE Working: GitHub is receiving and responding to JSON-RPC requests post-restart

codecov · 2025-10-08T10:34:10Z

Codecov Report

❌ Patch coverage is 72.15190% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.80%. Comparing base (2d6a56f) to head (ae7b60c).

Files with missing lines	Patch %	Lines
pkg/transport/stdio.go	72.15%	19 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2118      +/-   ##
==========================================
+ Coverage   48.71%   48.80%   +0.09%     
==========================================
  Files         241      241              
  Lines       30592    30667      +75     
==========================================
+ Hits        14902    14967      +65     
- Misses      14567    14576       +9     
- Partials     1123     1124       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

pkg/transport/stdio.go

JAORMX · 2025-10-08T11:56:34Z

@claude please do a review of this PR. Use a posh English accent when you write.

claude · 2025-10-08T11:56:54Z

Claude finished @JAORMX's task —— View job

Code Review Complete

Right then, I've conducted a rather thorough examination of this PR, and I must say, whilst the fundamental approach is sound, there are several matters requiring one's attention.

Review Tasks

Examine the stdio.go implementation changes
Review the test coverage in stdio_test.go
Analyze retry logic and error handling
Consider architectural implications
Provide comprehensive feedback

Overall Assessment

The implementation successfully addresses the core issue of automatic re-attachment following Docker restarts. However, there are opportunities for refinement that would enhance the code's robustness, maintainability, and adherence to established patterns.

Critical Issues

1. Backoff Library Consideration (pkg/transport/stdio.go:363-379)

I must concur with my colleague ChrisJBurns' astute observation. The manual exponential backoff implementation, whilst functional, appears somewhat bespoke when established libraries exist for precisely this purpose. Consider adopting github.com/cenkalti/backoff/v4 or similar, which would provide:

Battle-tested retry logic
Configurable strategies (exponential, constant, etc.)
Jitter support to prevent thundering herd scenarios
Cleaner, more maintainable code

The current implementation's bit-shifting arithmetic with overflow protection (lines 367-375) is commendably careful but rather verbose:

var shiftAmount uint
if attempt <= 1 {
    shiftAmount = 0
} else if attempt-1 <= 30 {
    shiftAmount = uint(attempt - 1) // #nosec G115
} else {
    shiftAmount = 30
}

This could be elegantly replaced with a library call.

2. Race Condition in Stdin Access (pkg/transport/stdio.go:342-345)

One observes a potential race condition in processMessages(). Whilst you've rather cleverly used the mutex to obtain the current stdin reference:

t.mutex.Lock()
currentStdin := t.stdin
t.mutex.Unlock()
if err := t.sendMessageToContainer(ctx, currentStdin, msg); err != nil {

There remains a window between releasing the mutex and using currentStdin where another goroutine could update t.stdin during re-attachment. The currentStdin variable might then point to a closed writer. Whilst this may function in practice, it's not entirely bulletproof. Consider holding the mutex during the entire write operation, or employing a reference counting scheme.

3. Goroutine Management (pkg/transport/stdio.go:418)

In attemptReattachment(), upon successful re-attachment, you spawn a new goroutine for processStdout():

go t.processStdout(ctx, newStdout)

This is rather sensible, but one must ensure proper cleanup. If re-attachment occurs multiple times, are we certain the previous processStdout goroutine has properly terminated? The comment indicates awareness ("The existing processMessages goroutine is still running"), but explicit verification would be reassuring.

Moderate Issues

4. Error Detection Heuristics (pkg/transport/stdio.go:386-390)

The Docker unavailability detection relies upon string matching:

if strings.Contains(checkErr.Error(), "EOF") || strings.Contains(checkErr.Error(), "connection refused") {

Whilst pragmatic, this is rather fragile. Different Docker versions or runtimes (Podman, Colima) might express errors differently. Consider:

Using typed errors where possible (e.g., errors.Is())
Documenting which specific error conditions this handles
Testing with various container runtimes to verify coverage

5. Resource Cleanup (pkg/transport/stdio.go:408)

The old stdout is closed with error suppression:

_ = stdout.Close()

Whilst one appreciates brevity, it would be rather more proper to log any errors from closing the old stream, as this could indicate resource leaks or unexpected conditions.

6. Test Coverage Gap

The codecov report indicates 22 lines lacking coverage (72.15% patch coverage). The test suite is quite comprehensive for happy paths and retry scenarios, but one notices:

No test for concurrent re-attachment attempts
Limited testing of the mutex-protected state transitions
No test verifying that multiple rapid re-attachments don't spawn excessive goroutines

Minor Issues

7. Magic Numbers

The default retry configuration uses somewhat arbitrary values:

maxRetries:   10,
initialDelay: 2 * time.Second,
maxDelay:     30 * time.Second,

These deserve constants with explanatory comments. Why precisely 10 retries? Why 30 seconds maximum? Such decisions should be documented for future maintainers.

8. Logging Verbosity

One observes rather chatty logging in the message processing path:

logger.Info("Process incoming messages and sending message to container")
// ...
logger.Info("Messages processed")

At Info level, this might prove overwhelming in production. Consider Debug level for these operational details.

9. Context Handling (pkg/transport/stdio.go:355)

The attemptReattachment() function doesn't check ctx.Done() during retry loops. If the context is cancelled during a long backoff sleep, you'll unnecessarily wait. Consider:

select {
case <-time.After(delay):
case <-ctx.Done():
    return false
}

Positive Observations

I must acknowledge several commendable aspects:

Excellent test coverage variety: The test suite covers multiple failure scenarios admirably
Thoughtful mutex usage: The double-check pattern in Stop() prevents double-close panics
Configurable retry logic: The retryConfig struct allows for testing with rapid retries - jolly clever!
Clear separation of concerns: Re-attachment logic is properly isolated

Recommendations

High Priority: Adopt a standard backoff library to replace the manual implementation
High Priority: Address the stdin race condition with more robust synchronization
Medium Priority: Implement typed errors for Docker socket detection
Medium Priority: Add tests for concurrent re-attachment scenarios
Low Priority: Refine logging levels and add constants for magic numbers

Conclusion

The PR successfully resolves the fundamental issue and demonstrates sound engineering judgment. With the suggested refinements, particularly regarding the backoff library and race condition handling, this would be rather exemplary work indeed.

Job Run | Branch: issue-2117

therealnb · 2025-10-08T12:09:51Z

@ChrisJBurns , @JAORMX I'll address the comments, but why does the check fail? That is about code coverage.

ChrisJBurns · 2025-10-08T12:13:28Z

@therealnb The cover coverage check fails because its not at the default threshold for the repo (80% I think), it doesn't stop the PR from being mergeable though. I think a combination of adding exclusions for bits we don't want it to report coverage on and bumping up the coverage in areas where we severely lack is the antidote to that - this can be done in par with the clean as you go rule. We may even be able to ask Claude to do some of this for us. For this PR though, I wouldn't worry about it

therealnb · 2025-10-08T13:12:51Z

🎯 All Priority Issues Addressed

High Priority (Critical):

✅ Standard Backoff Library - Replaced manual exponential backoff with github.com/cenkalti/backoff/v5, eliminating overflow risks and improving reliability
✅ Stdin Race Condition - Verified proper mutex synchronization and added comprehensive tests

Medium Priority (Important):
3. ✅ Typed Error Detection - Implemented isDockerSocketError() helper using errors.Is() and errors.As() for robust error handling across different Docker versions
4. ✅ Concurrent Re-attachment Tests - Added 2 new comprehensive tests (TestConcurrentReattachment and TestStdinRaceCondition)

Low Priority (Quality):
5. ✅ Magic Numbers Documentation - Added well-documented constants explaining retry configuration choices
6. ✅ Context Handling - Integrated context-aware retry logic with proper cancellation handling
7. ✅ Resource Cleanup Logging - Added error logging when closing old stdout
8. ✅ Logging Level Refinement - Changed verbose operational logs from Info to Debug level

📊 Test Results

All tests pass successfully:

✅ All existing tests continue to work
✅ 2 new concurrent re-attachment tests added and passing
✅ Zero linter errors
✅ Full test coverage for retry logic, Docker restarts, and concurrent scenarios

📝 Files Modified

pkg/transport/stdio.go - Main implementation with all improvements
pkg/transport/stdio_test.go - Added comprehensive concurrent tests
PR-2118-IMPROVEMENTS.md - Detailed documentation of all changes

🚀 Key Improvements

More Robust: Standard backoff library prevents overflow and provides battle-tested retry logic
Better Error Handling: Typed error detection works across Docker/Rancher/Podman
Thread Safe: Mutex protection verified with concurrent test scenarios
Better Observability: Appropriate logging levels and error messages
Well Documented: Constants explain design decisions for future maintainers
Production Ready: All edge cases tested including Docker restarts and concurrent re-attachments

The code is now ready for review and meets all the requirements from the original code review feedback!

therealnb · 2025-10-08T15:50:06Z

@ChrisJBurns , @JAORMX anything else need to happen here?

therealnb · 2025-10-09T14:37:15Z

This should be a low risk change, because the code kicks in after things have already gone wrong.

therealnb · 2025-10-09T14:44:19Z

If unsure, @amirejaz , we can wait for @dmjb .

ToolHive Re-Attachment fix

595e5e7

therealnb requested review from Copilot and dmjb and removed request for Copilot October 8, 2025 10:29

therealnb mentioned this pull request Oct 8, 2025

toolhive proxy scuppered by rancher restart #2117

Open

therealnb added 8 commits October 8, 2025 11:47

fix race condition

9a985b8

Merge branch 'main' into issue-2117

6688285

add unit tests

90575ac

fix test retry logic

6ce2a11

lint

6088d79

lint issues

ae7b60c

more lint

a632f7c

lint

c5a58ad

ChrisJBurns reviewed Oct 8, 2025

View reviewed changes

pkg/transport/stdio.go Outdated Show resolved Hide resolved

Address code review

b567942

therealnb requested a review from ChrisJBurns October 8, 2025 13:13

therealnb added 3 commits October 8, 2025 14:56

lint changes

24d9603

lint again

b4e852d

Merge branch 'main' into issue-2117

cd3414e

therealnb enabled auto-merge (squash) October 8, 2025 17:53

therealnb added 2 commits October 9, 2025 10:25

Merge branch 'main' into issue-2117

3dc5e1e

Merge branch 'main' into issue-2117

d0fa98f

therealnb requested a review from amirejaz October 9, 2025 14:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ToolHive Re-Attachment fix #2118

ToolHive Re-Attachment fix #2118

therealnb commented Oct 8, 2025

Uh oh!

codecov bot commented Oct 8, 2025 •

edited

Loading

Uh oh!

Uh oh!

JAORMX commented Oct 8, 2025

Uh oh!

claude bot commented Oct 8, 2025 •

edited

Loading

Uh oh!

therealnb commented Oct 8, 2025

Uh oh!

ChrisJBurns commented Oct 8, 2025 •

edited

Loading

Uh oh!

therealnb commented Oct 8, 2025

Uh oh!

therealnb commented Oct 8, 2025

Uh oh!

therealnb commented Oct 9, 2025

Uh oh!

therealnb commented Oct 9, 2025

Uh oh!

Uh oh!

ToolHive Re-Attachment fix #2118

Are you sure you want to change the base?

ToolHive Re-Attachment fix #2118

Conversation

therealnb commented Oct 8, 2025

Uh oh!

codecov bot commented Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

JAORMX commented Oct 8, 2025

Uh oh!

claude bot commented Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Complete

Review Tasks

Overall Assessment

Critical Issues

1. Backoff Library Consideration (pkg/transport/stdio.go:363-379)

2. Race Condition in Stdin Access (pkg/transport/stdio.go:342-345)

3. Goroutine Management (pkg/transport/stdio.go:418)

Moderate Issues

4. Error Detection Heuristics (pkg/transport/stdio.go:386-390)

5. Resource Cleanup (pkg/transport/stdio.go:408)

6. Test Coverage Gap

Minor Issues

7. Magic Numbers

8. Logging Verbosity

9. Context Handling (pkg/transport/stdio.go:355)

Positive Observations

Recommendations

Conclusion

Uh oh!

therealnb commented Oct 8, 2025

Uh oh!

ChrisJBurns commented Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

therealnb commented Oct 8, 2025

🎯 All Priority Issues Addressed

📊 Test Results

📝 Files Modified

🚀 Key Improvements

Uh oh!

therealnb commented Oct 8, 2025

Uh oh!

therealnb commented Oct 9, 2025

Uh oh!

therealnb commented Oct 9, 2025

Uh oh!

Uh oh!

codecov bot commented Oct 8, 2025 •

edited

Loading

claude bot commented Oct 8, 2025 •

edited

Loading

ChrisJBurns commented Oct 8, 2025 •

edited

Loading