fix: move sandbox rollback defer before timeout to prevent resource leak#258
Conversation
Signed-off-by: Sanchit2662 <[email protected]>
Welcome @Sanchit2662! It looks like this is your first PR to volcano-sh/agentcube 🎉
Code Review
This pull request refactors the sandbox creation logic to improve resource cleanup. It moves the rollback registration before the sandbox readiness check and updates the rollback function to include store placeholder cleanup and ensure execution continues even if individual deletion steps fail. A review comment points out that the store placeholder could still be leaked if the initial Kubernetes resource creation fails before the rollback is deferred, suggesting that the rollback registration should be moved even earlier in the process.
    // Register rollback BEFORE waiting for the sandbox to become ready.
    // This ensures the K8s resource and store placeholder are cleaned up on
    // timeout, pod-IP failure, or store-update failure — not just on post-creation errors.
    needRollbackSandbox := true
The rollback registration is still performed after the Kubernetes resource creation calls (createSandboxClaim or createSandbox). If these calls fail, the function returns early before the defer is registered, which means the store placeholder created at line 147 will not be cleaned up. To ensure all resources are properly rolled back on any failure, move the sandboxRollbackFunc definition and its defer registration to immediately follow the successful StoreSandbox call (around line 151).
This makes sense in general, but there is a GC that will reclaim the expired sandbox, so it does no harm.
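A minimal sketch of the ordering the reviewer suggests (function and variable names here are illustrative, not the repo's actual API): register the rollback defer immediately after the store placeholder is created, so that a failure in any later step, including the K8s resource creation itself, still triggers cleanup.

```go
package main

import (
	"errors"
	"fmt"
)

// createSandbox is a hypothetical sketch of the suggested ordering:
// store the placeholder, register the rollback defer immediately, and
// only then attempt the fallible steps.
func createSandbox(failK8s bool) error {
	store := map[string]bool{"session-1": true} // store placeholder created

	needRollback := true
	defer func() {
		if needRollback {
			delete(store, "session-1") // cleanup runs on every failure path below
			fmt.Println("store placeholder cleaned up")
		}
	}()

	if failK8s {
		// Even this early failure is now covered by the deferred rollback.
		return errors.New("k8s resource creation failed")
	}
	// ... wait for readiness, update the store with the pod IP ...
	needRollback = false // success: keep the sandbox
	return nil
}

func main() {
	fmt.Println(createSandbox(true))
}
```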
There was a problem hiding this comment.
Pull request overview
This PR fixes a resource leak in sandbox creation where the cleanup defer statement was registered after the timeout return path, causing it to never execute on timeout. The fix reorders the defer registration to occur before the timeout select, and also adds cleanup of the store placeholder entry during rollback.
Changes:
- Move the defer/rollback registration before the timeout select statement to ensure cleanup on timeout
- Add store placeholder cleanup (DeleteSandboxBySessionID) to the rollback function
- Remove early returns from rollback logic to ensure both K8s deletion and store cleanup attempt to run
    klog.Infof("sandbox %s/%s rollback succeeded", sandbox.Namespace, sandbox.Name)
    }
    // Clean up the store placeholder so it does not pollute GC queries
    if delErr := s.storeClient.DeleteSandboxBySessionID(ctxTimeout, sandboxEntry.SessionID); delErr != nil {
The new store cleanup behavior in the rollback function (DeleteSandboxBySessionID) is not covered by the existing tests. Consider adding test cases that verify store cleanup is called when rollback occurs (e.g., during timeout or pod-IP failure scenarios).
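A hedged sketch of such a test, using a hand-rolled fake store; the interface, method signature, and names here are assumptions, and the repo's real types may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// fakeStore records DeleteSandboxBySessionID calls so a test can assert
// that rollback performed store cleanup. Names are illustrative only.
type fakeStore struct {
	deleted []string
	fail    bool
}

func (f *fakeStore) DeleteSandboxBySessionID(_ context.Context, id string) error {
	if f.fail {
		return errors.New("store unavailable")
	}
	f.deleted = append(f.deleted, id)
	return nil
}

// rollbackStore mirrors the new cleanup step: delete the placeholder,
// logging (not returning) on failure so later steps would still run.
func rollbackStore(store *fakeStore, sessionID string) {
	if err := store.DeleteSandboxBySessionID(context.Background(), sessionID); err != nil {
		fmt.Println("store cleanup failed:", err)
	}
}

func main() {
	s := &fakeStore{}
	rollbackStore(s, "sess-42")
	fmt.Println("deleted sessions:", s.deleted)
}
```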
Codecov Report ❌ Patch coverage is
Additional details and impacted files:

    @@            Coverage Diff             @@
    ##             main     #258      +/-   ##
    ==========================================
    + Coverage   35.60%   43.32%    +7.71%
    ==========================================
      Files          29       30        +1
      Lines        2533     2613       +80
    ==========================================
    + Hits          902     1132      +230
    + Misses       1505     1358      -147
    + Partials      126      123        -3

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: hzxuzhonghu.
What type of PR is this?
/kind bug
What this PR does / why we need it:
I found a resource leak in sandbox creation that becomes painful once the cluster is under pressure. When a sandbox takes too long to start and hits the 2-minute timeout, the cleanup code never runs, because the defer was registered after the timeout return statement. The result is leftover Kubernetes resources and store entries sitting around consuming CPU and memory.
I moved the defer earlier in the function so it actually covers the timeout case. I also noticed the rollback never cleaned up the store placeholder, which was another gap. Finally, I fixed the rollback logic not to bail out early, so both the K8s deletion and the store cleanup are always attempted even if one of them fails.
Special notes for your reviewer:
This is a classic Go bug: the defer sat too far down in the function, so the timeout path returned before the defer was ever registered. The fix is straightforward, moving the defer up so it wraps the timeout case.
I also made the rollback function more robust by removing the early returns, so it always attempts both cleanup steps.
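The bug class described above can be reproduced in a few lines (illustrative names, not the repo's code): a `defer` statement registered after an early `return` simply never executes on that path.

```go
package main

import (
	"errors"
	"fmt"
)

// buggy shows why the order matters: the early return on the timeout
// path fires before the defer below is ever registered.
func buggy(timeout bool) error {
	if timeout {
		return errors.New("timeout") // returns before the defer exists
	}
	defer fmt.Println("cleanup") // only registered on the success path
	return nil
}

func main() {
	// On the timeout path, "cleanup" is never printed.
	fmt.Println(buggy(true))
}
```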
Does this PR introduce a user-facing change?:
No, this is an internal fix that prevents resource leaks. Users may see improved cluster stability under burst load, with no API or behavior changes.
NONE