fix: move sandbox rollback defer before timeout to prevent resource leak#258
Conversation
Signed-off-by: Sanchit2662 <[email protected]>
Welcome @Sanchit2662! It looks like this is your first PR to volcano-sh/agentcube 🎉
Code Review
This pull request refactors the sandbox creation logic to improve resource cleanup. It moves the rollback registration before the sandbox readiness check and updates the rollback function to include store placeholder cleanup and ensure execution continues even if individual deletion steps fail. A review comment points out that the store placeholder could still be leaked if the initial Kubernetes resource creation fails before the rollback is deferred, suggesting that the rollback registration should be moved even earlier in the process.
    // Register rollback BEFORE waiting for the sandbox to become ready.
    // This ensures the K8s resource and store placeholder are cleaned up on
    // timeout, pod-IP failure, or store-update failure — not just on post-creation errors.
    needRollbackSandbox := true
The rollback registration is still performed after the Kubernetes resource creation calls (createSandboxClaim or createSandbox). If these calls fail, the function returns early before the defer is registered, which means the store placeholder created at line 147 will not be cleaned up. To ensure all resources are properly rolled back on any failure, move the sandboxRollbackFunc definition and its defer registration to immediately follow the successful StoreSandbox call (around line 151).
This makes sense in general, but there is a GC that will reclaim the expired sandbox, so it does no harm.
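A minimal sketch of the ordering the reviewer suggests (function and variable names here are illustrative, not the repo's actual API): register the rollback defer immediately after the store placeholder is created, so that a failure in any later step, including the K8s resource creation itself, still triggers cleanup.

```go
package main

import (
	"errors"
	"fmt"
)

// createSandbox is a hypothetical sketch of the suggested ordering:
// store the placeholder, register the rollback defer immediately, and
// only then attempt the fallible steps.
func createSandbox(failK8s bool) error {
	store := map[string]bool{"session-1": true} // store placeholder created

	needRollback := true
	defer func() {
		if needRollback {
			delete(store, "session-1") // cleanup runs on every failure path below
			fmt.Println("store placeholder cleaned up")
		}
	}()

	if failK8s {
		// Even this early failure is now covered by the deferred rollback.
		return errors.New("k8s resource creation failed")
	}
	// ... wait for readiness, update the store with the pod IP ...
	needRollback = false // success: keep the sandbox
	return nil
}

func main() {
	fmt.Println(createSandbox(true))
}
```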
There was a problem hiding this comment.
Pull request overview
This PR fixes a resource leak in sandbox creation where the cleanup defer statement was registered after the timeout return path, causing it to never execute on timeout. The fix reorders the defer registration to occur before the timeout select, and also adds cleanup of the store placeholder entry during rollback.
Changes:
- Move the defer/rollback registration before the timeout select statement to ensure cleanup on timeout
- Add store placeholder cleanup (DeleteSandboxBySessionID) to the rollback function
- Remove early returns from rollback logic to ensure both K8s deletion and store cleanup attempt to run
    klog.Infof("sandbox %s/%s rollback succeeded", sandbox.Namespace, sandbox.Name)
    }
    // Clean up the store placeholder so it does not pollute GC queries
    if delErr := s.storeClient.DeleteSandboxBySessionID(ctxTimeout, sandboxEntry.SessionID); delErr != nil {
The new store cleanup behavior in the rollback function (DeleteSandboxBySessionID) is not covered by the existing tests. Consider adding test cases that verify store cleanup is called when rollback occurs (e.g., during timeout or pod-IP failure scenarios).
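A hedged sketch of such a test, using a hand-rolled fake store; the interface, method signature, and names here are assumptions, and the repo's real types may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// fakeStore records DeleteSandboxBySessionID calls so a test can assert
// that rollback performed store cleanup. Names are illustrative only.
type fakeStore struct {
	deleted []string
	fail    bool
}

func (f *fakeStore) DeleteSandboxBySessionID(_ context.Context, id string) error {
	if f.fail {
		return errors.New("store unavailable")
	}
	f.deleted = append(f.deleted, id)
	return nil
}

// rollbackStore mirrors the new cleanup step: delete the placeholder,
// logging (not returning) on failure so later steps would still run.
func rollbackStore(store *fakeStore, sessionID string) {
	if err := store.DeleteSandboxBySessionID(context.Background(), sessionID); err != nil {
		fmt.Println("store cleanup failed:", err)
	}
}

func main() {
	s := &fakeStore{}
	rollbackStore(s, "sess-42")
	fmt.Println("deleted sessions:", s.deleted)
}
```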
Codecov Report ❌ Patch coverage is
Additional details and impacted files:

    @@            Coverage Diff             @@
    ##             main     #258      +/-   ##
    ==========================================
    + Coverage   35.60%   43.32%    +7.71%
    ==========================================
      Files          29       30        +1
      Lines        2533     2613       +80
    ==========================================
    + Hits          902     1132      +230
    + Misses       1505     1358      -147
    + Partials      126      123        -3

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: hzxuzhonghu.
What type of PR is this?
/kind bug
What this PR does / why we need it:
I found a resource leak in sandbox creation that becomes painful once the cluster is under pressure. When a sandbox takes too long to start and hits the 2-minute timeout, the cleanup code never runs, because the defer was registered after the timeout return statement. The result is leftover Kubernetes resources and store entries sitting around consuming CPU and memory.
I moved the defer earlier in the function so it actually covers the timeout case. I also noticed the rollback never cleaned up the store placeholder, which was another gap. Finally, I fixed the rollback logic not to bail out early, so both the K8s deletion and the store cleanup are always attempted even if one of them fails.
Special notes for your reviewer:
This is a classic Go bug: the defer sat too far down in the function, so the timeout path returned before the defer was ever registered. The fix is straightforward, moving the defer up so it wraps the timeout case.
I also made the rollback function more robust by removing the early returns, so it always attempts both cleanup steps.
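The bug class described above can be reproduced in a few lines (illustrative names, not the repo's code): a `defer` statement registered after an early `return` simply never executes on that path.

```go
package main

import (
	"errors"
	"fmt"
)

// buggy shows why the order matters: the early return on the timeout
// path fires before the defer below is ever registered.
func buggy(timeout bool) error {
	if timeout {
		return errors.New("timeout") // returns before the defer exists
	}
	defer fmt.Println("cleanup") // only registered on the success path
	return nil
}

func main() {
	// On the timeout path, "cleanup" is never printed.
	fmt.Println(buggy(true))
}
```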
Does this PR introduce a user-facing change?:
No, this is an internal fix that prevents resource leaks. Users may see improved cluster stability under burst load, with no API or behavior changes.
NONE