ToolHive Re-Attachment fix #2118
base: main
Codecov Report
❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2118      +/-   ##
==========================================
+ Coverage   48.71%   48.80%   +0.09%
==========================================
  Files         241      241
  Lines       30592    30667      +75
==========================================
+ Hits        14902    14967      +65
- Misses      14567    14576       +9
- Partials     1123     1124       +1
@claude please do a review of this PR. Use a posh English accent when you write.
Claude finished @JAORMX's task

Code Review Complete

Right then, I've conducted a rather thorough examination of this PR, and I must say, whilst the fundamental approach is sound, there are several matters requiring one's attention.

Review Tasks

Overall Assessment

The implementation successfully addresses the core issue of automatic re-attachment following Docker restarts. However, there are opportunities for refinement that would enhance the code's robustness, maintainability, and adherence to established patterns.

Critical Issues

1. Backoff Library Consideration (pkg/transport/stdio.go:363-379)

I must concur with my colleague ChrisJBurns' astute observation. The manual exponential backoff implementation, whilst functional, appears somewhat bespoke when established libraries exist for precisely this purpose. Consider adopting
The current implementation's bit-shifting arithmetic with overflow protection (lines 367-375) is commendably careful but rather verbose:

    var shiftAmount uint
    if attempt <= 1 {
        shiftAmount = 0
    } else if attempt-1 <= 30 {
        shiftAmount = uint(attempt - 1) // #nosec G115
    } else {
        shiftAmount = 30
    }

This could be elegantly replaced with a library call.

2. Race Condition in Stdin Access (pkg/transport/stdio.go:342-345)

One observes a potential race condition in

    t.mutex.Lock()
    currentStdin := t.stdin
    t.mutex.Unlock()
    if err := t.sendMessageToContainer(ctx, currentStdin, msg); err != nil {

There remains a window between releasing the mutex and using

3. Goroutine Management (pkg/transport/stdio.go:418)

In

    go t.processStdout(ctx, newStdout)

This is rather sensible, but one must ensure proper cleanup. If re-attachment occurs multiple times, are we certain the previous

Moderate Issues

4. Error Detection Heuristics (pkg/transport/stdio.go:386-390)

The Docker unavailability detection relies upon string matching:

    if strings.Contains(checkErr.Error(), "EOF") || strings.Contains(checkErr.Error(), "connection refused") {

Whilst pragmatic, this is rather fragile. Different Docker versions or runtimes (Podman, Colima) might express errors differently. Consider:
5. Resource Cleanup (pkg/transport/stdio.go:408)

The old stdout is closed with error suppression:

    _ = stdout.Close()

Whilst one appreciates brevity, it would be rather more proper to log any errors from closing the old stream, as this could indicate resource leaks or unexpected conditions.

6. Test Coverage Gap

The codecov report indicates 22 lines lacking coverage (72.15% patch coverage). The test suite is quite comprehensive for happy paths and retry scenarios, but one notices:
Minor Issues

7. Magic Numbers

The default retry configuration uses somewhat arbitrary values:

    maxRetries: 10,
    initialDelay: 2 * time.Second,
    maxDelay: 30 * time.Second,

These deserve constants with explanatory comments. Why precisely 10 retries? Why 30 seconds maximum? Such decisions should be documented for future maintainers.

8. Logging Verbosity

One observes rather chatty logging in the message processing path:

    logger.Info("Process incoming messages and sending message to container")
    // ...
    logger.Info("Messages processed")

At Info level, this might prove overwhelming in production. Consider Debug level for these operational details.

9. Context Handling (pkg/transport/stdio.go:355)

The

    select {
    case <-time.After(delay):
    case <-ctx.Done():
        return false
    }

Positive Observations

I must acknowledge several commendable aspects:
Recommendations
Conclusion

The PR successfully resolves the fundamental issue and demonstrates sound engineering judgment. With the suggested refinements, particularly regarding the backoff library and race condition handling, this would be rather exemplary work indeed.
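For the stdin race raised in issue 2, one possible remedy is to perform the write whilst still holding the mutex, so that a concurrent re-attachment cannot swap or close the stream between the read of the field and the send. The sketch below is illustrative only: `stdinHolder`, `write`, `swap`, and `bufCloser` are hypothetical names standing in for the transport's own field and methods, and the PR may well resolve the issue differently.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"sync"
)

// stdinHolder is a hypothetical stand-in for the transport's stdin field
// and the mutex that guards it.
type stdinHolder struct {
	mu    sync.Mutex
	stdin io.WriteCloser
}

// write sends p while holding the lock, so swap cannot replace or close
// the stream in the middle of a send.
func (h *stdinHolder) write(p []byte) error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.stdin == nil {
		return errors.New("stdin not attached")
	}
	_, err := h.stdin.Write(p)
	return err
}

// swap installs the stream obtained after re-attachment and closes the old one.
func (h *stdinHolder) swap(next io.WriteCloser) error {
	h.mu.Lock()
	defer h.mu.Unlock()
	old := h.stdin
	h.stdin = next
	if old != nil {
		return old.Close()
	}
	return nil
}

// bufCloser is an in-memory io.WriteCloser used only for this demonstration.
type bufCloser struct {
	bytes.Buffer
	closed bool
}

func (b *bufCloser) Close() error {
	b.closed = true
	return nil
}

func main() {
	first, second := &bufCloser{}, &bufCloser{}
	h := &stdinHolder{stdin: first}

	_ = h.write([]byte("before restart\n"))
	_ = h.swap(second) // simulated re-attachment
	_ = h.write([]byte("after restart\n"))

	fmt.Printf("first=%q closed=%v\n", first.String(), first.closed)
	fmt.Printf("second=%q\n", second.String())
}
```

The trade-off is that a write which blocks will also delay re-attachment, since `swap` must wait for the lock. Whether that is acceptable depends on how the container's stdin behaves when the runtime disappears, which this sketch does not attempt to model.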
@ChrisJBurns, @JAORMX I'll address the comments, but why does the check fail? It is the code coverage check.
@therealnb The code coverage check fails because the patch isn't at the default threshold for the repo (80% I think). It doesn't stop the PR from being mergeable though. I think a combination of adding exclusions for bits we don't want it to report coverage on and bumping up the coverage in areas where we severely lack it is the antidote to that. This can be done in line with the clean-as-you-go rule. We may even be able to ask Claude to do some of this for us. For this PR though, I wouldn't worry about it.
🎯 All Priority Issues Addressed

High Priority (Critical):

Medium Priority (Important):

Low Priority (Quality):

📊 Test Results

All tests pass successfully:

📝 Files Modified

🚀 Key Improvements

The code is now ready for review and meets all the requirements from the original code review feedback!
@ChrisJBurns, @JAORMX does anything else need to happen here?
This should be a low risk change, because the code kicks in after things have already gone wrong. |
Addresses #2117. This is a suggested fix.
File: /Users/nigel/code/toolhive/pkg/transport/stdio.go
The Problem: When Docker/Rancher Desktop restarted, ToolHive's stdio read loops died and never reconnected.
The Solution: Added automatic re-attachment with intelligent retry logic:
10 retries with exponential backoff (2s, 4s, 8s, 16s, 30s...)
Detects Docker socket unavailability
Waits for Docker to come back online
Re-attaches to containers automatically
Test Results After Docker Restart
✅ GitHub Server:
✅ Time Server: Same successful pattern!
✅ SSE Working: GitHub is receiving and responding to JSON-RPC requests post-restart