Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 21 additions & 14 deletions pkg/workloadmanager/handlers.go
Original file line number Diff line number Diff line change
Expand Up @@ -160,16 +160,9 @@ func (s *Server) createSandbox(ctx context.Context, dynamicClient dynamic.Interf
}
}

var createdSandbox *sandboxv1alpha1.Sandbox
select {
case result := <-resultChan:
createdSandbox = result.Sandbox
klog.V(2).Infof("sandbox %s/%s running", createdSandbox.Namespace, createdSandbox.Name)
case <-time.After(2 * time.Minute): // consistent with router settings
klog.Warningf("sandbox %s/%s create timed out", sandbox.Namespace, sandbox.Name)
return nil, fmt.Errorf("sandbox creation timed out")
}

// Register rollback BEFORE waiting for the sandbox to become ready.
// This ensures the K8s resource and store placeholder are cleaned up on
// timeout, pod-IP failure, or store-update failure — not just on post-creation errors.
needRollbackSandbox := true
Comment on lines +163 to 166
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The rollback registration is still performed after the Kubernetes resource creation calls (createSandboxClaim or createSandbox). If these calls fail, the function returns early before the defer is registered, which means the store placeholder created at line 147 will not be cleaned up. To ensure all resources are properly rolled back on any failure, move the sandboxRollbackFunc definition and its defer registration to immediately follow the successful StoreSandbox call (around line 151).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense in general. But there is gc that will reclaim the expired sandbox, it is no harm.

sandboxRollbackFunc := func() {
ctxTimeout, cancel := context.WithTimeout(context.Background(), 30*time.Second)
Expand All @@ -180,17 +173,21 @@ func (s *Server) createSandbox(ctx context.Context, dynamicClient dynamic.Interf
err = deleteSandboxClaim(ctxTimeout, dynamicClient, sandboxClaim.Namespace, sandboxClaim.Name)
if err != nil {
klog.Infof("sandbox claim %s/%s rollback failed: %v", sandboxClaim.Namespace, sandboxClaim.Name, err)
return
} else {
klog.Infof("sandbox claim %s/%s rollback succeeded", sandboxClaim.Namespace, sandboxClaim.Name)
}
klog.Infof("sandbox claim %s/%s rollback succeeded", sandboxClaim.Namespace, sandboxClaim.Name)
} else {
// Rollback Sandbox
err = deleteSandbox(ctxTimeout, dynamicClient, sandbox.Namespace, sandbox.Name)
if err != nil {
klog.Infof("sandbox %s/%s rollback failed: %v", sandbox.Namespace, sandbox.Name, err)
return
} else {
klog.Infof("sandbox %s/%s rollback succeeded", sandbox.Namespace, sandbox.Name)
}
klog.Infof("sandbox %s/%s rollback succeeded", sandbox.Namespace, sandbox.Name)
}
// Clean up the store placeholder so it does not pollute GC queries
if delErr := s.storeClient.DeleteSandboxBySessionID(ctxTimeout, sandboxEntry.SessionID); delErr != nil {
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new store cleanup behavior in the rollback function (DeleteSandboxBySessionID) is not covered by the existing tests. Consider adding test cases that verify store cleanup is called when rollback occurs (e.g., during timeout or pod-IP failure scenarios).

Copilot uses AI. Check for mistakes.
klog.Infof("sandbox %s/%s store placeholder cleanup failed: %v", sandbox.Namespace, sandbox.Name, delErr)
}
}
defer func() {
Expand All @@ -200,6 +197,16 @@ func (s *Server) createSandbox(ctx context.Context, dynamicClient dynamic.Interf
sandboxRollbackFunc()
}()

var createdSandbox *sandboxv1alpha1.Sandbox
select {
case result := <-resultChan:
createdSandbox = result.Sandbox
klog.V(2).Infof("sandbox %s/%s running", createdSandbox.Namespace, createdSandbox.Name)
case <-time.After(2 * time.Minute): // consistent with router settings
klog.Warningf("sandbox %s/%s create timed out", sandbox.Namespace, sandbox.Name)
return nil, fmt.Errorf("sandbox creation timed out")
}

// agent-sandbox create pod with same name as sandbox if no warmpool is used
// so here we try to get pod IP by sandbox name first
// if warmpool is used, the pod name is stored in sandbox's annotation `agents.x-k8s.io/sandbox-pod-name`
Expand Down
Loading