Skip to content

Conversation

therealnb
Copy link

Addresses #2117. This is a suggested fix.

File: /Users/nigel/code/toolhive/pkg/transport/stdio.go
The Problem: When Docker/Rancher Desktop restarted, ToolHive's stdio read loops died and never reconnected.
The Solution: Added automatic re-attachment with intelligent retry logic:
10 retries with exponential backoff (2s, 4s, 8s, 16s, 30s...)
Detects Docker socket unavailability
Waits for Docker to come back online
Re-attaches to containers automatically
Test Results After Docker Restart
✅ GitHub Server:
✅ Time Server: Same successful pattern!
✅ SSE Working: GitHub is receiving and responding to JSON-RPC requests post-restart

@therealnb therealnb requested review from Copilot and dmjb and removed request for Copilot October 8, 2025 10:29
Copy link

codecov bot commented Oct 8, 2025

Codecov Report

❌ Patch coverage is 72.15190% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.80%. Comparing base (2d6a56f) to head (ae7b60c).

Files with missing lines Patch % Lines
pkg/transport/stdio.go 72.15% 19 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2118      +/-   ##
==========================================
+ Coverage   48.71%   48.80%   +0.09%     
==========================================
  Files         241      241              
  Lines       30592    30667      +75     
==========================================
+ Hits        14902    14967      +65     
- Misses      14567    14576       +9     
- Partials     1123     1124       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JAORMX
Copy link
Collaborator

JAORMX commented Oct 8, 2025

@claude please do a review of this PR. Use a posh English accent when you write.

Copy link
Contributor

claude bot commented Oct 8, 2025

Claude finished @JAORMX's task —— View job


Code Review Complete

Right then, I've conducted a rather thorough examination of this PR, and I must say, whilst the fundamental approach is sound, there are several matters requiring one's attention.

Review Tasks

  • Examine the stdio.go implementation changes
  • Review the test coverage in stdio_test.go
  • Analyze retry logic and error handling
  • Consider architectural implications
  • Provide comprehensive feedback

Overall Assessment

The implementation successfully addresses the core issue of automatic re-attachment following Docker restarts. However, there are opportunities for refinement that would enhance the code's robustness, maintainability, and adherence to established patterns.

Critical Issues

1. Backoff Library Consideration (pkg/transport/stdio.go:363-379)

I must concur with my colleague ChrisJBurns' astute observation. The manual exponential backoff implementation, whilst functional, appears somewhat bespoke when established libraries exist for precisely this purpose. Consider adopting github.com/cenkalti/backoff/v4 or similar, which would provide:

  • Battle-tested retry logic
  • Configurable strategies (exponential, constant, etc.)
  • Jitter support to prevent thundering herd scenarios
  • Cleaner, more maintainable code

The current implementation's bit-shifting arithmetic with overflow protection (lines 367-375) is commendably careful but rather verbose:

var shiftAmount uint
if attempt <= 1 {
    shiftAmount = 0
} else if attempt-1 <= 30 {
    shiftAmount = uint(attempt - 1) // #nosec G115
} else {
    shiftAmount = 30
}

This could be elegantly replaced with a library call.

2. Race Condition in Stdin Access (pkg/transport/stdio.go:342-345)

One observes a potential race condition in processMessages(). Whilst you've rather cleverly used the mutex to obtain the current stdin reference:

t.mutex.Lock()
currentStdin := t.stdin
t.mutex.Unlock()
if err := t.sendMessageToContainer(ctx, currentStdin, msg); err != nil {

There remains a window between releasing the mutex and using currentStdin where another goroutine could update t.stdin during re-attachment. The currentStdin variable might then point to a closed writer. Whilst this may function in practice, it's not entirely bulletproof. Consider holding the mutex during the entire write operation, or employing a reference counting scheme.

3. Goroutine Management (pkg/transport/stdio.go:418)

In attemptReattachment(), upon successful re-attachment, you spawn a new goroutine for processStdout():

go t.processStdout(ctx, newStdout)

This is rather sensible, but one must ensure proper cleanup. If re-attachment occurs multiple times, are we certain the previous processStdout goroutine has properly terminated? The comment indicates awareness ("The existing processMessages goroutine is still running"), but explicit verification would be reassuring.

Moderate Issues

4. Error Detection Heuristics (pkg/transport/stdio.go:386-390)

The Docker unavailability detection relies upon string matching:

if strings.Contains(checkErr.Error(), "EOF") || strings.Contains(checkErr.Error(), "connection refused") {

Whilst pragmatic, this is rather fragile. Different Docker versions or runtimes (Podman, Colima) might express errors differently. Consider:

  • Using typed errors where possible (e.g., errors.Is())
  • Documenting which specific error conditions this handles
  • Testing with various container runtimes to verify coverage

5. Resource Cleanup (pkg/transport/stdio.go:408)

The old stdout is closed with error suppression:

_ = stdout.Close()

Whilst one appreciates brevity, it would be rather more proper to log any errors from closing the old stream, as this could indicate resource leaks or unexpected conditions.

6. Test Coverage Gap

The codecov report indicates 22 lines lacking coverage (72.15% patch coverage). The test suite is quite comprehensive for happy paths and retry scenarios, but one notices:

  • No test for concurrent re-attachment attempts
  • Limited testing of the mutex-protected state transitions
  • No test verifying that multiple rapid re-attachments don't spawn excessive goroutines

Minor Issues

7. Magic Numbers

The default retry configuration uses somewhat arbitrary values:

maxRetries:   10,
initialDelay: 2 * time.Second,
maxDelay:     30 * time.Second,

These deserve constants with explanatory comments. Why precisely 10 retries? Why 30 seconds maximum? Such decisions should be documented for future maintainers.

8. Logging Verbosity

One observes rather chatty logging in the message processing path:

logger.Info("Process incoming messages and sending message to container")
// ...
logger.Info("Messages processed")

At Info level, this might prove overwhelming in production. Consider Debug level for these operational details.

9. Context Handling (pkg/transport/stdio.go:355)

The attemptReattachment() function doesn't check ctx.Done() during retry loops. If the context is cancelled during a long backoff sleep, you'll unnecessarily wait. Consider:

select {
case <-time.After(delay):
case <-ctx.Done():
    return false
}

Positive Observations

I must acknowledge several commendable aspects:

  • Excellent test coverage variety: The test suite covers multiple failure scenarios admirably
  • Thoughtful mutex usage: The double-check pattern in Stop() prevents double-close panics
  • Configurable retry logic: The retryConfig struct allows for testing with rapid retries - jolly clever!
  • Clear separation of concerns: Re-attachment logic is properly isolated

Recommendations

  1. High Priority: Adopt a standard backoff library to replace the manual implementation
  2. High Priority: Address the stdin race condition with more robust synchronization
  3. Medium Priority: Implement typed errors for Docker socket detection
  4. Medium Priority: Add tests for concurrent re-attachment scenarios
  5. Low Priority: Refine logging levels and add constants for magic numbers

Conclusion

The PR successfully resolves the fundamental issue and demonstrates sound engineering judgment. With the suggested refinements, particularly regarding the backoff library and race condition handling, this would be rather exemplary work indeed.


Job Run | Branch: issue-2117

@therealnb
Copy link
Author

@ChrisJBurns , @JAORMX I'll address the comments, but why does the check fail? That is about code coverage.

@ChrisJBurns
Copy link
Collaborator

ChrisJBurns commented Oct 8, 2025

@therealnb The cover coverage check fails because its not at the default threshold for the repo (80% I think), it doesn't stop the PR from being mergeable though. I think a combination of adding exclusions for bits we don't want it to report coverage on and bumping up the coverage in areas where we severely lack is the antidote to that - this can be done in par with the clean as you go rule. We may even be able to ask Claude to do some of this for us. For this PR though, I wouldn't worry about it

@therealnb
Copy link
Author

🎯 All Priority Issues Addressed

High Priority (Critical):

  1. Standard Backoff Library - Replaced manual exponential backoff with github.com/cenkalti/backoff/v5, eliminating overflow risks and improving reliability
  2. Stdin Race Condition - Verified proper mutex synchronization and added comprehensive tests

Medium Priority (Important):
3. ✅ Typed Error Detection - Implemented isDockerSocketError() helper using errors.Is() and errors.As() for robust error handling across different Docker versions
4. ✅ Concurrent Re-attachment Tests - Added 2 new comprehensive tests (TestConcurrentReattachment and TestStdinRaceCondition)

Low Priority (Quality):
5. ✅ Magic Numbers Documentation - Added well-documented constants explaining retry configuration choices
6. ✅ Context Handling - Integrated context-aware retry logic with proper cancellation handling
7. ✅ Resource Cleanup Logging - Added error logging when closing old stdout
8. ✅ Logging Level Refinement - Changed verbose operational logs from Info to Debug level

📊 Test Results

All tests pass successfully:

  • ✅ All existing tests continue to work
  • ✅ 2 new concurrent re-attachment tests added and passing
  • ✅ Zero linter errors
  • ✅ Full test coverage for retry logic, Docker restarts, and concurrent scenarios

📝 Files Modified

  1. pkg/transport/stdio.go - Main implementation with all improvements
  2. pkg/transport/stdio_test.go - Added comprehensive concurrent tests
  3. PR-2118-IMPROVEMENTS.md - Detailed documentation of all changes

🚀 Key Improvements

  • More Robust: Standard backoff library prevents overflow and provides battle-tested retry logic
  • Better Error Handling: Typed error detection works across Docker/Rancher/Podman
  • Thread Safe: Mutex protection verified with concurrent test scenarios
  • Better Observability: Appropriate logging levels and error messages
  • Well Documented: Constants explain design decisions for future maintainers
  • Production Ready: All edge cases tested including Docker restarts and concurrent re-attachments

The code is now ready for review and meets all the requirements from the original code review feedback!

@therealnb therealnb requested a review from ChrisJBurns October 8, 2025 13:13
@therealnb
Copy link
Author

@ChrisJBurns , @JAORMX anything else need to happen here?

@therealnb therealnb enabled auto-merge (squash) October 8, 2025 17:53
@therealnb
Copy link
Author

This should be a low risk change, because the code kicks in after things have already gone wrong.

@therealnb therealnb requested a review from amirejaz October 9, 2025 14:43
@therealnb
Copy link
Author

If unsure, @amirejaz , we can wait for @dmjb .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants