fix: recover from warm pool pod deletion instead of permanent error loop #5

Merged
vamsi-resolve merged 1 commit into main from fix/warm-pool-pod-recovery on Apr 8, 2026

@vamsi-resolve
Collaborator

Summary

  • Cherry-picks upstream kubernetes-sigs/agent-sandbox#521: clears the stale pod-name annotation when an adopted warm pool pod is deleted, so the controller creates a replacement pod instead of entering a permanent reconcile error loop
  • Cherry-picks upstream kubernetes-sigs/agent-sandbox#469: ensures the pod-name annotation is correct during warm pool adoption, before the sandbox can be observed as Ready (race condition fix)
  • Updates test to match new recovery behavior

Context

Investigation of a reclaw supervisor timeout revealed that when a warm pool pod dies (node failure, drain, eviction), the sandbox controller returned a hard error because the agents.x-k8s.io/pod-name annotation pointed to a non-existent pod. The Sandbox CR then entered a permanent reconcile error loop, and no replacement pod was ever created.

Fix 1 (kubernetes-sigs#521): Clear stale annotation

When the annotated pod is not found (404), the controller now clears the stale annotation and falls through to the pod creation path. The existing PVC (owned by the Sandbox CR, not the pod) is remounted by the replacement pod, preserving workspace data.
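
The control flow described above can be sketched in isolation. This is a minimal stdlib-only illustration, not the upstream code: `Sandbox`, `PodStore`, `errNotFound`, and `reconcilePod` are hypothetical stand-ins for the Sandbox CR, the API server, a 404 response, and the controller's pod reconcile step. Naming the replacement pod after the sandbox is also an assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// podNameAnnotation is the annotation key named in the PR description.
const podNameAnnotation = "agents.x-k8s.io/pod-name"

// errNotFound stands in for a Kubernetes 404 (hypothetical; a real
// controller would typically check apierrors.IsNotFound instead).
var errNotFound = errors.New("pod not found")

// Sandbox is a hypothetical, stripped-down stand-in for the Sandbox CR.
type Sandbox struct {
	Name        string
	Annotations map[string]string
}

// PodStore is a hypothetical in-memory substitute for the API server.
// It must be non-nil, because reconcilePod may write to it.
type PodStore map[string]bool

// Get returns errNotFound when no pod with that name exists.
func (s PodStore) Get(name string) error {
	if !s[name] {
		return errNotFound
	}
	return nil
}

// reconcilePod returns the name of the pod backing the sandbox.
// If the annotated pod is gone, it clears the stale annotation and
// falls through to the creation path instead of returning an error.
func reconcilePod(sb *Sandbox, pods PodStore) (string, error) {
	if name, ok := sb.Annotations[podNameAnnotation]; ok {
		err := pods.Get(name)
		switch {
		case err == nil:
			// The adopted pod still exists: keep using it.
			return name, nil
		case errors.Is(err, errNotFound):
			// Stale annotation: drop it and continue to creation.
			delete(sb.Annotations, podNameAnnotation)
		default:
			// Any other error is still surfaced to the caller.
			return "", fmt.Errorf("get pod %q: %w", name, err)
		}
	}
	// Creation path (assumption: the replacement is named after the sandbox).
	pods[sb.Name] = true
	return sb.Name, nil
}
```

Calling `reconcilePod` on a sandbox whose annotation names a missing pod removes the annotation and records a new pod, rather than returning the not-found error on every reconcile.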

Fix 2 (kubernetes-sigs#469): Correct annotation during adoption

During warm pool adoption, the claim controller now ensures the pod-name annotation matches adopted.Name before the atomic r.Update call that transfers ownership. This prevents stale annotations from being set in the first place.
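
The ordering described above (annotation first, then the single update) can be sketched as follows. This is an illustration under stated assumptions, not the upstream implementation: `Adopted`, its `Owner` field, and `adopt` are hypothetical, the `update` callback stands in for the `r.Update` call, and the sketch assumes the annotation lives on the same object that the update writes.

```go
package main

// podNameAnnotation is the annotation key named in the PR description.
const podNameAnnotation = "agents.x-k8s.io/pod-name"

// Adopted is a hypothetical stand-in for the object taken from the warm pool.
type Adopted struct {
	Name        string
	Owner       string // hypothetical simplification of owner references
	Annotations map[string]string
}

// adopt corrects the pod-name annotation and sets the new owner in memory,
// then persists both with one update call, so no observer can see the new
// owner together with a stale annotation.
func adopt(adopted *Adopted, newOwner string, update func(*Adopted) error) error {
	if adopted.Annotations == nil {
		adopted.Annotations = map[string]string{}
	}
	// Ensure the annotation matches adopted.Name before anything is written.
	if adopted.Annotations[podNameAnnotation] != adopted.Name {
		adopted.Annotations[podNameAnnotation] = adopted.Name
	}
	adopted.Owner = newOwner
	// Single write that carries both the ownership change and the annotation.
	return update(adopted)
}
```

Because both changes travel in the same write, a failed update leaves the stored object unchanged; the in-memory copy would simply be rebuilt on the next reconcile.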

Test plan

  • TestReconcilePod — all 13 cases pass, including new "clears stale annotation and creates replacement pod" case
  • extensions/controllers — all tests pass
  • Deploy to dev0 and verify warm pool adoption works correctly
  • Simulate pod deletion on an adopted sandbox and confirm replacement pod is created

🤖 Generated with Claude Code

Cherry-picks two upstream fixes:

1. kubernetes-sigs#521 — When an adopted warm pool pod is
   deleted (node failure, drain, eviction), the controller returned a hard
   error because the agents.x-k8s.io/pod-name annotation pointed to a
   non-existent pod. This left the Sandbox stuck in a permanent reconcile
   error loop. Now the controller clears the stale annotation and falls
   through to create a replacement pod (which remounts the existing PVC).

2. kubernetes-sigs#469 — During warm pool adoption, ensure
   the pod-name annotation is correct before the sandbox can be observed
   as Ready. Prevents stale annotations from being set in the first place.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vamsi-resolve merged commit 299c350 into main on Apr 8, 2026
1 check passed