Skip to content

fix: skip not-ready sandboxes during warm pool adoption#519

Open
noeljackson wants to merge 1 commit intokubernetes-sigs:mainfrom
noeljackson:pr/claim-skip-not-ready
Open

fix: skip not-ready sandboxes during warm pool adoption#519
noeljackson wants to merge 1 commit intokubernetes-sigs:mainfrom
noeljackson:pr/claim-skip-not-ready

Conversation

@noeljackson
Copy link
Copy Markdown
Contributor

Summary

Prevent the claim controller from adopting not-Ready sandboxes during warm pool rotation, which causes claims to hang with ReconcilerError.

Problem

During warm pool rotation (e.g. template spec change triggers pod cycling), adoptSandboxFromCandidates sorts Ready sandboxes first but doesn't filter — if no Ready candidates are available (all pods being recreated), it adopts a not-Ready sandbox. The adoption succeeds (ownership transfer on the Sandbox CR), but the backing pod doesn't exist yet, causing reconcilePod to fail with "Pod not found". The claim gets stuck with Ready=False, Reason=ReconcilerError.

The error does trigger requeue with backoff, but the user sees a hung CLI.

Fix

Add a readiness guard in the adoption loop using the existing isSandboxReady() helper (already defined at line 480, previously only used for metrics). Candidates without Ready=True are skipped. If no Ready candidates exist, adoptSandboxFromCandidates returns nil, and getOrCreateSandbox falls through to cold creation.

if !isSandboxReady(adopted) {
    logger.V(1).Info("skipping not-ready adoption candidate",
        "sandbox", adopted.Name, "claim", claim.Name)
    continue
}

Behavior change

Scenario Before After
During rotation (no Ready pods) Adopts not-Ready sandbox, claim hangs Falls through to cold creation, slower but works
After rotation (Ready pods available) Adopts Ready sandbox Unchanged
Normal operation Adopts Ready sandbox Unchanged

Test plan

  • New test: TestSandboxClaimSkipsNotReadyAdoptionCandidates — only not-Ready candidates, verifies none are adopted
  • Updated existing test: skips not-ready sandboxes and falls through to cold creation (was adopts oldest non-ready sandbox)
  • All existing adoption tests pass (Ready sandbox adoption unchanged)

During warm pool rotation (template spec change triggers pod cycling),
the claim controller could adopt a Sandbox whose backing pod doesn't
exist yet. The adoption succeeds (ownership transfer on the Sandbox
CR), but reconcilePod fails with "Pod not found", leaving the claim
stuck in ReconcilerError.

Root cause: adoptSandboxFromCandidates sorts Ready sandboxes first,
but if none are Ready (all pods being recreated during rotation), it
still adopts a not-Ready sandbox.

Fix: skip candidates without Ready=True in the adoption loop. If no
Ready candidates exist, return nil to fall through to cold creation.
The claim takes longer to start but doesn't hang.
@netlify
Copy link
Copy Markdown

netlify bot commented Apr 3, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit 62e4adb
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/69cfcce2576bfb000887986b

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: noeljackson
Once this PR has been reviewed and has the lgtm label, please assign barney-s for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 3, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @noeljackson. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 3, 2026
noeljackson added a commit to noeljackson/agent-sandbox that referenced this pull request Apr 3, 2026
currIndex := (startIndex + i) % n
adopted := candidates[currIndex]

if !isSandboxReady(adopted) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable, but if there is no ready sandbox, do we want to fall-back to a non-ready Sandbox? I'm imagining the (pathological) case that a Sandbox takes 2 minutes to start up; is it better to use a Sandbox that is 1 minute into startup in that case?

@noeljackson
Copy link
Copy Markdown
Contributor Author

Good question. In practice, adopting a not-ready sandbox doesn't save time — it makes things worse:

  1. The claim controller doesn't wait on the adopted sandbox's pod. It transfers ownership on the Sandbox CR, then immediately tries to reconcile the pod. If the pod doesn't exist yet (which is the case during rotation — pods are being deleted and recreated), it fails with "Pod not found" and the claim enters ReconcilerError.

  2. Even if the pod exists but isn't ready, the claim enters a requeue loop waiting for Ready=True. A cold-created sandbox enters the same requeue loop. No time is saved either way.

  3. The warm pool controller loses track of the adopted sandbox (ownership was transferred to the claim), so it creates a replacement. Now you have two sandboxes starting up for one claim.

The pathological case you're describing — sandbox exists, pod exists, 1 minute into a 2-minute startup — could theoretically benefit from adoption. But during rotation specifically, the not-ready sandboxes have no backing pod at all. The sort already puts Ready sandboxes first, so in the non-rotation case where some are ready and some aren't, claims always grab a ready one.

If we wanted to handle the "pod exists but isn't ready yet" case in the future, the right approach would be to check for pod existence (not just sandbox readiness) before adopting. But that's a separate optimization — this fix addresses the immediate hang during rotation.

@justinsb
Copy link
Copy Markdown
Contributor

justinsb commented Apr 3, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:extensions cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants