fix: recover from warm pool pod deletion instead of permanent error loop #5
Merged
vamsi-resolve merged 1 commit into main on Apr 8, 2026
Cherry-picks two upstream fixes:

1. kubernetes-sigs#521: When an adopted warm pool pod is deleted (node failure, drain, eviction), the controller returned a hard error because the agents.x-k8s.io/pod-name annotation pointed to a non-existent pod. This left the Sandbox stuck in a permanent reconcile error loop. Now the controller clears the stale annotation and falls through to create a replacement pod (which remounts the existing PVC).
2. kubernetes-sigs#469: During warm pool adoption, ensure the pod-name annotation is correct before the sandbox can be observed as Ready. Prevents stale annotations from being set in the first place.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Clear the stale pod-name annotation when an adopted warm pool pod is deleted, allowing the controller to create a replacement pod instead of entering a permanent reconcile error loop
- Ensure the pod-name annotation is correct during warm pool adoption, before the sandbox can be observed as Ready (race condition fix)

Context
Investigation of a reclaw supervisor timeout revealed that when a warm pool pod dies (node failure, drain, eviction), the sandbox controller returned a hard error because the agents.x-k8s.io/pod-name annotation pointed to a non-existent pod. The Sandbox CR entered a permanent reconcile error loop, and no replacement pod was ever created.

Fix 1 (kubernetes-sigs#521): Clear stale annotation
When the annotated pod is not found (404), the controller now clears the stale annotation and falls through to the pod creation path. The existing PVC (owned by Sandbox CR, not the pod) is remounted by the replacement pod, preserving workspace data.
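
The control flow described above can be illustrated with a minimal, standard-library-only sketch. It is not the controller's actual code: the Sandbox and cluster types, the errNotFound sentinel, and the choice to name the replacement pod after the sandbox are all assumptions made for illustration. A real controller would more likely detect the 404 with apierrors.IsNotFound from k8s.io/apimachinery.

```go
package main

import (
	"errors"
	"fmt"
)

// podNameAnnotation is the annotation key named in this PR.
const podNameAnnotation = "agents.x-k8s.io/pod-name"

// errNotFound stands in for the API server's 404 response (assumption).
var errNotFound = errors.New("pod not found")

// Sandbox is a hypothetical, stripped-down stand-in for the Sandbox CR.
type Sandbox struct {
	Name        string
	Annotations map[string]string
}

// cluster is a hypothetical in-memory substitute for the API server.
type cluster struct {
	pods map[string]bool // pod name -> exists
}

// getPod returns an error wrapping errNotFound when the pod does not exist.
func (c *cluster) getPod(name string) error {
	if !c.pods[name] {
		return fmt.Errorf("get pod %q: %w", name, errNotFound)
	}
	return nil
}

// createPod records a new pod in the in-memory cluster.
func (c *cluster) createPod(name string) {
	c.pods[name] = true
}

// reconcilePod returns the name of the pod that backs the sandbox.
// If the annotated pod is gone, it clears the stale annotation and falls
// through to the creation path instead of returning a hard error.
func reconcilePod(c *cluster, sb *Sandbox) (string, error) {
	if name, ok := sb.Annotations[podNameAnnotation]; ok {
		err := c.getPod(name)
		switch {
		case err == nil:
			return name, nil // the adopted pod still exists
		case errors.Is(err, errNotFound):
			// Stale annotation: clear it and continue to pod creation.
			delete(sb.Annotations, podNameAnnotation)
		default:
			return "", err // any other error is still surfaced
		}
	}
	// Creation path. Naming the pod after the sandbox is an assumption.
	c.createPod(sb.Name)
	return sb.Name, nil
}

func main() {
	c := &cluster{pods: map[string]bool{}}
	sb := &Sandbox{
		Name:        "sandbox-a",
		Annotations: map[string]string{podNameAnnotation: "warm-pool-pod-1"},
	}
	name, err := reconcilePod(c, sb)
	fmt.Println(name, err)
}
```

The point of the switch is that only the not-found case is treated as recoverable; other errors still abort the reconcile, so transient API failures do not trigger a spurious replacement pod.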
Fix 2 (kubernetes-sigs#469): Correct annotation during adoption
During warm pool adoption, the claim controller now ensures the pod-name annotation matches adopted.Name before the atomic r.Update call that transfers ownership. This prevents stale annotations from being set in the first place.

Test plan
- TestReconcilePod: all 13 cases pass, including the new "clears stale annotation and creates replacement pod" case
- extensions/controllers: all tests pass

🤖 Generated with Claude Code
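
For reference, the ordering described under Fix 2 can be sketched as follows. This is one reading of that description, not the claim controller's code: it assumes the annotation and the ownership field live on the same object, so that a single write carries both. The Object and store types and the adopt function are hypothetical, and store.update merely stands in for the r.Update call.

```go
package main

import "fmt"

// podNameAnnotation is the annotation key named in this PR.
const podNameAnnotation = "agents.x-k8s.io/pod-name"

// Object is a hypothetical stand-in for the adopted resource.
type Object struct {
	Name        string
	Owner       string
	Annotations map[string]string
}

// store is a hypothetical in-memory API server that counts writes.
type store struct {
	objects map[string]Object
	writes  int
}

// update stands in for the single r.Update call.
func (s *store) update(o Object) {
	s.objects[o.Name] = o
	s.writes++
}

// adopt corrects the annotation in memory first, then transfers ownership,
// and only then performs one write, so no observer can see the new owner
// paired with a stale annotation.
func adopt(s *store, adopted Object, newOwner string) Object {
	// Copy the annotations so the caller's map is not mutated.
	annotations := make(map[string]string, len(adopted.Annotations)+1)
	for k, v := range adopted.Annotations {
		annotations[k] = v
	}
	annotations[podNameAnnotation] = adopted.Name
	adopted.Annotations = annotations
	adopted.Owner = newOwner
	s.update(adopted) // one write carries both changes
	return adopted
}

func main() {
	s := &store{objects: map[string]Object{}}
	warm := Object{
		Name:        "warm-1",
		Owner:       "warm-pool",
		Annotations: map[string]string{podNameAnnotation: "stale-name"},
	}
	got := adopt(s, warm, "claim-a")
	fmt.Println(got.Owner, got.Annotations[podNameAnnotation], s.writes)
}
```

Folding both changes into one write is what removes the window in which a reader could see the transferred object with an incorrect pod-name annotation.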