fix: skip not-ready sandboxes during warm pool adoption #519
noeljackson wants to merge 1 commit into kubernetes-sigs:main
During warm pool rotation (template spec change triggers pod cycling), the claim controller could adopt a Sandbox whose backing pod doesn't exist yet. The adoption succeeds (ownership transfer on the Sandbox CR), but reconcilePod fails with "Pod not found", leaving the claim stuck in ReconcilerError. Root cause: adoptSandboxFromCandidates sorts Ready sandboxes first, but if none are Ready (all pods being recreated during rotation), it still adopts a not-Ready sandbox. Fix: skip candidates without Ready=True in the adoption loop. If no Ready candidates exist, return nil to fall through to cold creation. The claim takes longer to start but doesn't hang.
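The guard described above can be sketched roughly as follows. This is a simplified stand-in, not the controller's actual code: the `Sandbox` struct, the body of `isSandboxReady`, and the function name `adoptFirstReady` are assumptions made for illustration, and the real `adoptSandboxFromCandidates` also performs the ownership transfer on the Sandbox CR.

```go
package main

import "fmt"

// Sandbox is a hypothetical, stripped-down stand-in for the Sandbox CR.
// The real type carries Kubernetes metadata and a list of conditions.
type Sandbox struct {
	Name  string
	Ready bool // stands in for the Ready=True condition
}

// isSandboxReady shares its name with the helper used in the PR;
// this body is an assumption for the sketch.
func isSandboxReady(s *Sandbox) bool {
	return s != nil && s.Ready
}

// adoptFirstReady (hypothetical name) walks the candidates round-robin,
// starting at a non-negative startIndex, and returns the first Ready one.
// It returns nil when no candidate is Ready.
func adoptFirstReady(candidates []*Sandbox, startIndex int) *Sandbox {
	n := len(candidates)
	for i := 0; i < n; i++ {
		currIndex := (startIndex + i) % n
		adopted := candidates[currIndex]
		if !isSandboxReady(adopted) {
			continue // skip: the backing pod may not exist yet
		}
		return adopted
	}
	return nil // the caller falls through to cold creation
}

func main() {
	// During rotation every candidate is not-Ready, so nothing is adopted.
	rotating := []*Sandbox{{Name: "a"}, {Name: "b"}}
	fmt.Println(adoptFirstReady(rotating, 0) == nil)

	// With a mixed pool, the Ready candidate is picked.
	mixed := []*Sandbox{{Name: "a"}, {Name: "b", Ready: true}}
	fmt.Println(adoptFirstReady(mixed, 0).Name)
}
```

Returning nil instead of a not-Ready candidate trades a slower cold start for a claim that cannot get stuck on a missing pod.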
currIndex := (startIndex + i) % n
adopted := candidates[currIndex]

if !isSandboxReady(adopted) {
This seems reasonable, but if there is no ready sandbox, do we want to fall back to a non-ready Sandbox? I'm imagining the (pathological) case where a Sandbox takes 2 minutes to start up; is it better to use a Sandbox that is 1 minute into startup in that case?
Good question. In practice, adopting a not-ready sandbox doesn't save time; it makes things worse.
The pathological case you're describing (sandbox exists, pod exists, 1 minute into a 2-minute startup) could theoretically benefit from adoption. But during rotation specifically, the not-ready sandboxes have no backing pod at all. The sort already puts Ready sandboxes first, so in the non-rotation case where some are ready and some aren't, claims always grab a ready one.
If we wanted to handle the "pod exists but isn't ready yet" case in the future, the right approach would be to check for pod existence (not just sandbox readiness) before adopting. But that's a separate optimization; this fix addresses the immediate hang during rotation.
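One possible shape for that future check, purely as a sketch and not part of this PR: the `Sandbox` struct, the `podExistsFunc` type, and the `adoptable` function below are all hypothetical names. A real controller would resolve pod existence through the API server or an informer cache rather than a plain callback.

```go
package main

import "fmt"

// Sandbox is a hypothetical, stripped-down stand-in for the Sandbox CR.
type Sandbox struct {
	Name  string
	Ready bool // stands in for the Ready=True condition
}

// podExistsFunc is a hypothetical lookup that reports whether the
// backing pod for the named sandbox has been created.
type podExistsFunc func(sandboxName string) bool

// adoptable (hypothetical) reports whether a candidate is worth adopting:
// either it is already Ready, or it is still starting but its backing
// pod exists, so adopting it can still save startup time.
func adoptable(s Sandbox, podExists podExistsFunc) bool {
	if s.Ready {
		return true
	}
	return podExists(s.Name)
}

func main() {
	// Pretend only the pod for "starting" has been created so far.
	pods := map[string]bool{"starting": true}
	lookup := func(name string) bool { return pods[name] }

	fmt.Println(adoptable(Sandbox{Name: "starting"}, lookup)) // pod exists, not Ready yet
	fmt.Println(adoptable(Sandbox{Name: "rotating"}, lookup)) // no backing pod
}
```

With this rule, a sandbox that is partway through startup stays eligible, while a sandbox whose pod is being recreated during rotation is skipped.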
/ok-to-test
Summary
Prevent the claim controller from adopting not-Ready sandboxes during warm pool rotation, which causes claims to hang with ReconcilerError.
Problem
During warm pool rotation (e.g. template spec change triggers pod cycling), adoptSandboxFromCandidates sorts Ready sandboxes first but doesn't filter: if no Ready candidates are available (all pods being recreated), it adopts a not-Ready sandbox. The adoption succeeds (ownership transfer on the Sandbox CR), but the backing pod doesn't exist yet, causing reconcilePod to fail with "Pod not found". The claim gets stuck with Ready=False, Reason=ReconcilerError.
The error does trigger requeue with backoff, but the user sees a hung CLI.
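To illustrate why sorting alone is not enough, here is a minimal sketch of a Ready-first ordering. The `Sandbox` struct and `sortReadyFirst` are hypothetical names for this example, and the comparator in the actual controller may use additional criteria. The point is only that a sort reorders candidates without removing any, so when nothing is Ready the head of the list is still a not-Ready sandbox.

```go
package main

import (
	"fmt"
	"sort"
)

// Sandbox is a hypothetical, stripped-down stand-in for the Sandbox CR.
type Sandbox struct {
	Name  string
	Ready bool // stands in for the Ready=True condition
}

// sortReadyFirst (hypothetical) moves Ready candidates to the front while
// keeping the relative order within each group. It never drops a candidate.
func sortReadyFirst(candidates []Sandbox) {
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].Ready && !candidates[j].Ready
	})
}

func main() {
	// During rotation no candidate is Ready.
	pool := []Sandbox{{Name: "a"}, {Name: "b"}}
	sortReadyFirst(pool)

	// The first candidate is still not Ready, so a loop that simply takes
	// the head of the sorted list would adopt it.
	fmt.Println(pool[0].Ready) // prints false
}
```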
Fix
Add a readiness guard in the adoption loop using the existing isSandboxReady() helper (already defined at line 480, previously only used for metrics). Candidates without Ready=True are skipped. If no Ready candidates exist, adoptSandboxFromCandidates returns nil, and getOrCreateSandbox falls through to cold creation.
Behavior change
Test plan
TestSandboxClaimSkipsNotReadyAdoptionCandidates: only not-Ready candidates, verifies none are adopted
skips not-ready sandboxes and falls through to cold creation (was adopts oldest non-ready sandbox)