Skip to content

fix: fast-fail session creation on terminal sandbox failure#273

Open
Aman-Cool wants to merge 3 commits intovolcano-sh:mainfrom
Aman-Cool:fix/sandbox-terminal-failure-notification
Open

fix: fast-fail session creation on terminal sandbox failure#273
Aman-Cool wants to merge 3 commits intovolcano-sh:mainfrom
Aman-Cool:fix/sandbox-terminal-failure-notification

Conversation

@Aman-Cool
Copy link
Copy Markdown

What type of PR is this?
/kind bug

What this PR does / why we need it:

When a sandbox pod dies (bad image, OOM, evicted; anything), the session creation request would just sit there silently for the full 2 minutes before returning a useless "timed out" error. The reconciler only ever notified waiters on success, never on failure. On top of that, if the client disconnected mid-wait, the server goroutine kept blocking anyway.

This makes the reconciler detect terminal pod failures immediately and push the actual failure reason back to the caller. Also adds a ctx.Done() arm so we stop holding goroutines when the client is already gone.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

ConditionFalse without a Reason is still treated as unknown (transient/pending) ; only ConditionFalse + non-empty Reason is considered a terminal failure. This avoids false-positive fast-fails during normal pod startup churn.

Does this PR introduce a user-facing change?

Session creation now fails fast with a descriptive error when the sandbox pod enters a terminal failure state, instead of blocking for 2 minutes before returning a generic timeout.

Copilot AI review requested due to automatic review settings April 13, 2026 10:02
@volcano-sh-bot volcano-sh-bot requested a review from acsoto April 13, 2026 10:02
@volcano-sh-bot volcano-sh-bot added the kind/bug Something isn't working label Apr 13, 2026
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot
Copy link
Copy Markdown
Contributor

Welcome @Aman-Cool! It looks like this is your first PR to volcano-sh/agentcube 🎉

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances sandbox creation by implementing terminal failure detection and improving error reporting. Key changes include updating the SandboxStatusUpdate struct to carry error information, refactoring the reconciler to notify waiters of failed states, and extending getSandboxStatus to extract failure messages from CRD conditions. Feedback suggests updating the caller of createSandbox to propagate descriptive errors to the user and replacing time.After with time.NewTimer for more efficient resource management.

Comment on lines +203 to +206
if result.Err != nil {
klog.Warningf("sandbox %s/%s failed: %v", sandbox.Namespace, sandbox.Name, result.Err)
return nil, result.Err
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The error returned here contains the descriptive failure reason (e.g., "sandbox failed: ErrImagePull"). However, the caller in handleSandboxCreate (line 136) is currently hardcoded to return a generic "internal server error" to the client. To fulfill the PR's objective of providing descriptive errors to the user, you should update the caller to use err.Error() instead of a static string.

- getSandboxStatus now returns (status, failMsg) with three states: running, failed, unknown
- SandboxReconciler dispatches failure notifications immediately on ConditionFalse+Reason
- createSandbox select gains ctx.Done() arm to release goroutine on client disconnect
- Add test coverage for the new failed state

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR improves session creation responsiveness by notifying waiters when a sandbox transitions into a terminal failure state and by stopping waits when the client request context is canceled, avoiding unnecessary blocking until the router timeout.

Changes:

  • Detect terminal sandbox failures from Ready=False + non-empty Reason and propagate the failure message to session creation callers.
  • Extend watcher notifications to include terminal failure errors (not just success).
  • Add ctx.Done() handling during sandbox creation waits to prevent goroutine retention after client disconnect.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
pkg/workloadmanager/sandbox_helper_test.go Expands status tests to distinguish Ready=False with/without a Reason and adapts to new return signature.
pkg/workloadmanager/sandbox_helper.go Changes getSandboxStatus to return (status, failureMessage) and adds a string-only helper for metadata.
pkg/workloadmanager/sandbox_controller.go Notifies watchers on both running and terminal failure; includes an error in the update payload.
pkg/workloadmanager/handlers.go Handles terminal failure updates and exits early on request cancellation instead of waiting for timeout.

@Aman-Cool Aman-Cool force-pushed the fix/sandbox-terminal-failure-notification branch from 92cd7e7 to 93547da Compare April 13, 2026 10:05
@Aman-Cool
Copy link
Copy Markdown
Author

@hzxuzhonghu, happy to update anything here ; the main call worth double-checking is the ConditionFalse + non-empty Reason heuristic for "terminal failure". That's based on how agent-sandbox sets conditions today, but if there are cases where a non-empty Reason appears during normal startup I'd want to know. Open to tightening that check if needed.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 13, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 36.20690% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.33%. Comparing base (845b798) to head (76b594b).
⚠️ Report is 158 commits behind head on main.

Files with missing lines Patch % Lines
pkg/workloadmanager/sandbox_controller.go 0.00% 26 Missing ⚠️
pkg/workloadmanager/handlers.go 35.29% 9 Missing and 2 partials ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #273      +/-   ##
==========================================
+ Coverage   35.60%   43.33%   +7.72%     
==========================================
  Files          29       30       +1     
  Lines        2533     2647     +114     
==========================================
+ Hits          902     1147     +245     
+ Misses       1505     1375     -130     
+ Partials      126      125       -1     
Flag Coverage Δ
unittests 43.33% <36.20%> (+7.72%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- propagate descriptive sandbox error to HTTP response instead of generic 'internal server error'
- replace time.After with time.NewTimer; stop timer explicitly in each winning select arm
- include sandbox namespace/name in reconciler error for log correlation
- use value type for SandboxStatusUpdate in reconciler instead of pointer indirection
- assert failure message in getSandboxStatus tests to catch silent regressions

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
Copilot AI review requested due to automatic review settings April 13, 2026 10:36
@Aman-Cool
Copy link
Copy Markdown
Author

Addressed all the copilot suggestions:

Propagated the actual error message to the HTTP response so callers get something useful instead of a generic 500. Swapped time.After for time.NewTimer with explicit stops in each branch (no defer, avoids the ordering issue). Dropped the pointer indirection on SandboxStatusUpdate in favor of a value + bool flag. Included namespace/name in the reconciler error so it's easier to correlate across concurrent creates. Also extended the status tests to assert the failure message, added an extra case for when Message is empty and we fall back to Reason.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

… creation

- Return ctx.Err() directly from createSandbox so errors.Is checks work
- Sanitize internal errors in handleSandboxCreate using apierrors.IsInternalError
- Skip writing HTTP response when client has already disconnected
- Replace non-blocking channel send with blocking send in reconciler;
  safe because the channel buffer is always empty at the point of send
- Add test cases covering both sanitized and exposed error paths

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants