Skip to content

fix: move sandbox rollback defer before timeout to prevent resource leak#258

Merged
volcano-sh-bot merged 1 commit intovolcano-sh:mainfrom
Sanchit2662:fix/sandbox-creation-timeout-resource-leak
Apr 8, 2026
Merged

fix: move sandbox rollback defer before timeout to prevent resource leak#258
volcano-sh-bot merged 1 commit intovolcano-sh:mainfrom
Sanchit2662:fix/sandbox-creation-timeout-resource-leak

Conversation

@Sanchit2662
Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

I found a resource leak in sandbox creation that really starts hurting when the cluster gets under pressure. basically when a sandbox takes too long to start and hits the 2 minute timeout, the cleanup code never runs because i had the defer registered after the timeout return statement. so you end up with leftover kubernetes resources and store entries just sitting there consuming cpu and memory.

I moved the defer earlier in the function so it actually catches the timeout case. I also realized the rollback was never cleaning up the store placeholder which was another gap. And I fixed the rollback logic to not bail out early so both the K8s deletion and store cleanup always try to run even if one of them fails.

Special notes for your reviewer:

This is a classic Go bug where I had the defer too far down in the function. The timeout case would return before the defer ever got registered. The fix is straightforward, just move the defer up earlier so it wraps the timeout case.
I also made the rollback function a bit more robust by removing the early returns so it always attempts both cleanup steps.

Does this PR introduce a user-facing change?:

No, this is an internal fix that prevents resource leaks. Users will see improved cluster stability under burst load scenarios, but no API or behavior changes.
NONE

Copilot AI review requested due to automatic review settings April 7, 2026 16:07
@volcano-sh-bot volcano-sh-bot added the kind/bug Something isn't working label Apr 7, 2026
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

Welcome @Sanchit2662! It looks like this is your first PR to volcano-sh/agentcube 🎉

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the sandbox creation logic to improve resource cleanup. It moves the rollback registration before the sandbox readiness check and updates the rollback function to include store placeholder cleanup and ensure execution continues even if individual deletion steps fail. A review comment points out that the store placeholder could still be leaked if the initial Kubernetes resource creation fails before the rollback is deferred, suggesting that the rollback registration should be moved even earlier in the process.

Comment on lines +163 to 166
// Register rollback BEFORE waiting for the sandbox to become ready.
// This ensures the K8s resource and store placeholder are cleaned up on
// timeout, pod-IP failure, or store-update failure — not just on post-creation errors.
needRollbackSandbox := true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The rollback registration is still performed after the Kubernetes resource creation calls (createSandboxClaim or createSandbox). If these calls fail, the function returns early before the defer is registered, which means the store placeholder created at line 147 will not be cleaned up. To ensure all resources are properly rolled back on any failure, move the sandboxRollbackFunc definition and its defer registration to immediately follow the successful StoreSandbox call (around line 151).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense in general. But there is gc that will reclaim the expired sandbox, it is no harm.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a resource leak in sandbox creation where the cleanup defer statement was registered after the timeout return path, causing it to never execute on timeout. The fix reorders the defer registration to occur before the timeout select, and also adds cleanup of the store placeholder entry during rollback.

Changes:

  • Move the defer/rollback registration before the timeout select statement to ensure cleanup on timeout
  • Add store placeholder cleanup (DeleteSandboxBySessionID) to the rollback function
  • Remove early returns from rollback logic to ensure both K8s deletion and store cleanup attempt to run

klog.Infof("sandbox %s/%s rollback succeeded", sandbox.Namespace, sandbox.Name)
}
// Clean up the store placeholder so it does not pollute GC queries
if delErr := s.storeClient.DeleteSandboxBySessionID(ctxTimeout, sandboxEntry.SessionID); delErr != nil {
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new store cleanup behavior in the rollback function (DeleteSandboxBySessionID) is not covered by the existing tests. Consider adding test cases that verify store cleanup is called when rollback occurs (e.g., during timeout or pod-IP failure scenarios).

Copilot uses AI. Check for mistakes.
@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 42.85714% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.32%. Comparing base (845b798) to head (27274ee).
⚠️ Report is 151 commits behind head on main.

Files with missing lines Patch % Lines
pkg/workloadmanager/handlers.go 42.85% 6 Missing and 2 partials ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #258      +/-   ##
==========================================
+ Coverage   35.60%   43.32%   +7.71%     
==========================================
  Files          29       30       +1     
  Lines        2533     2613      +80     
==========================================
+ Hits          902     1132     +230     
+ Misses       1505     1358     -147     
+ Partials      126      123       -3     
Flag Coverage Δ
unittests 43.32% <42.85%> (+7.71%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot merged commit 2a7101d into volcano-sh:main Apr 8, 2026
18 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants