fix: fast-fail session creation on terminal sandbox failure #273
Aman-Cool wants to merge 3 commits into volcano-sh:main
Welcome @Aman-Cool! It looks like this is your first PR to volcano-sh/agentcube 🎉
Code Review
This pull request enhances sandbox creation by implementing terminal failure detection and improving error reporting. Key changes include updating the SandboxStatusUpdate struct to carry error information, refactoring the reconciler to notify waiters of failed states, and extending getSandboxStatus to extract failure messages from CRD conditions. Feedback suggests updating the caller of createSandbox to propagate descriptive errors to the user and replacing time.After with time.NewTimer for more efficient resource management.
    if result.Err != nil {
        klog.Warningf("sandbox %s/%s failed: %v", sandbox.Namespace, sandbox.Name, result.Err)
        return nil, result.Err
    }
The error returned here contains the descriptive failure reason (e.g., "sandbox failed: ErrImagePull"). However, the caller in handleSandboxCreate (line 136) is currently hardcoded to return a generic "internal server error" to the client. To fulfill the PR's objective of providing descriptive errors to the user, you should update the caller to use err.Error() instead of a static string.
- getSandboxStatus now returns (status, failMsg) with three states: running, failed, unknown
- SandboxReconciler dispatches failure notifications immediately on ConditionFalse+Reason
- createSandbox select gains ctx.Done() arm to release goroutine on client disconnect
- Add test coverage for the new failed state

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR improves session creation responsiveness by notifying waiters when a sandbox transitions into a terminal failure state and by stopping waits when the client request context is canceled, avoiding unnecessary blocking until the router timeout.
Changes:
- Detect terminal sandbox failures from Ready=False + non-empty Reason and propagate the failure message to session creation callers.
- Extend watcher notifications to include terminal failure errors (not just success).
- Add ctx.Done() handling during sandbox creation waits to prevent goroutine retention after client disconnect.
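The three-state classification described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `condition` struct is a hypothetical, simplified stand-in for the CRD condition type, `classify` is an invented name standing in for getSandboxStatus, and the "Reason: Message" format of the failure message is an assumption.

```go
package main

import "fmt"

// condition is a hypothetical, simplified stand-in for a Kubernetes-style
// status condition. The real CRD type used by the PR is not shown here.
type condition struct {
	Type    string // e.g. "Ready"
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// classify returns "running", "failed", or "unknown", plus a failure message
// that is non-empty only for the "failed" state.
func classify(conds []condition) (status, failMsg string) {
	for _, c := range conds {
		if c.Type != "Ready" {
			continue
		}
		switch {
		case c.Status == "True":
			return "running", ""
		case c.Status == "False" && c.Reason != "":
			// Ready=False with a Reason is treated as a terminal failure.
			msg := c.Reason
			if c.Message != "" {
				msg = c.Reason + ": " + c.Message // assumed message format
			}
			return "failed", msg
		}
		// Ready=False without a Reason (or Unknown): still pending.
		return "unknown", ""
	}
	// No Ready condition reported yet.
	return "unknown", ""
}

func main() {
	status, msg := classify([]condition{
		{Type: "Ready", Status: "False", Reason: "ErrImagePull"},
	})
	fmt.Println(status, msg)
}
```

Keeping Ready=False without a Reason in the "unknown" bucket is what prevents a fast-fail during ordinary startup, when the condition is false simply because the pod is not up yet.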
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| pkg/workloadmanager/sandbox_helper_test.go | Expands status tests to distinguish Ready=False with/without a Reason and adapts to new return signature. |
| pkg/workloadmanager/sandbox_helper.go | Changes getSandboxStatus to return (status, failureMessage) and adds a string-only helper for metadata. |
| pkg/workloadmanager/sandbox_controller.go | Notifies watchers on both running and terminal failure; includes an error in the update payload. |
| pkg/workloadmanager/handlers.go | Handles terminal failure updates and exits early on request cancellation instead of waiting for timeout. |
92cd7e7 to 93547da
@hzxuzhonghu, happy to update anything here; the main call worth double-checking is the
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #273 +/- ##
==========================================
+ Coverage 35.60% 43.33% +7.72%
==========================================
Files 29 30 +1
Lines 2533 2647 +114
==========================================
+ Hits 902 1147 +245
+ Misses 1505 1375 -130
+ Partials 126 125 -1
- propagate descriptive sandbox error to HTTP response instead of generic 'internal server error'
- replace time.After with time.NewTimer; stop timer explicitly in each winning select arm
- include sandbox namespace/name in reconciler error for log correlation
- use value type for SandboxStatusUpdate in reconciler instead of pointer indirection
- assert failure message in getSandboxStatus tests to catch silent regressions

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
Addressed all the copilot suggestions: Propagated the actual error message to the HTTP response so callers get something useful instead of a generic 500. Swapped |
… creation
- Return ctx.Err() directly from createSandbox so errors.Is checks work
- Sanitize internal errors in handleSandboxCreate using apierrors.IsInternalError
- Skip writing HTTP response when client has already disconnected
- Replace non-blocking channel send with blocking send in reconciler; safe because the channel buffer is always empty at the point of send
- Add test cases covering both sanitized and exposed error paths

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
What type of PR is this?
/kind bug
What this PR does / why we need it:
When a sandbox pod dies (bad image, OOM, evicted; anything), the session creation request would just sit there silently for the full 2 minutes before returning a useless "timed out" error. The reconciler only ever notified waiters on success, never on failure. On top of that, if the client disconnected mid-wait, the server goroutine kept blocking anyway.
This makes the reconciler detect terminal pod failures immediately and push the actual failure reason back to the caller. Also adds a ctx.Done() arm so we stop holding goroutines when the client is already gone.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
ConditionFalse without a Reason is still treated as unknown (transient/pending); only ConditionFalse + non-empty Reason is considered a terminal failure. This avoids false-positive fast-fails during normal pod startup churn.
Does this PR introduce a user-facing change?
Session creation now fails fast with a descriptive error when the sandbox pod enters a terminal failure state, instead of blocking for 2 minutes before returning a generic timeout.