fix: clear stale pod-name annotation instead of hard error #521

Open
noeljackson wants to merge 1 commit into kubernetes-sigs:main from noeljackson:pr/fix-stale-pod-annotation

@noeljackson
Contributor

Summary

When the pod tracked by the agents.x-k8s.io/pod-name annotation doesn't exist, clear the stale annotation and fall through to pod creation instead of returning a hard error.

Problem

The ensurePodNameAnnotation function (commit 32cddd3) records the backing pod's name on the Sandbox CR. This is used for stable pod tracking across reconciliations. However, when the annotated pod is deleted (warm pool rotation, eviction, image pull failure), reconcilePod returns a hard error:

if podNameAnnotationExists {
    log.Error(err, "Pod not found")
    return nil, fmt.Errorf("pod in annotation get failed: %w", err)
}

The controller never reaches PATH 3 (create pod). The Sandbox is stuck in a reconcile error loop and the warm pool never becomes ready.

Fix

When the annotated pod isn't found, clear the stale annotation and let pod = nil fall through to pod creation:

if podNameAnnotationExists {
    log.Info("Tracked pod not found, clearing stale annotation", "podName", podName)
    patch := client.MergeFrom(sandbox.DeepCopy())
    delete(sandbox.Annotations, sandboxv1alpha1.SandboxPodNameAnnotation)
    if patchErr := r.Patch(ctx, sandbox, patch); patchErr != nil {
        return nil, fmt.Errorf("failed to clear stale pod name annotation: %w", patchErr)
    }
}

The subsequent ensurePodNameAnnotation call after pod creation re-sets the annotation to track the new pod.
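
The flow described above can be modeled without a cluster. The following is a minimal sketch, not the controller's actual code: the `sandbox` and `cluster` types, the `errNotFound` sentinel, and the choice to name the replacement pod after the Sandbox are all assumptions made for illustration. It shows the stale annotation being cleared, a replacement pod being created, and the annotation being set again to track the new pod.

```go
package main

import (
	"errors"
	"fmt"
)

// podNameAnnotation is the annotation key named in the PR description.
const podNameAnnotation = "agents.x-k8s.io/pod-name"

// errNotFound is a hypothetical stand-in for a Kubernetes NotFound API error.
var errNotFound = errors.New("pod not found")

// sandbox is a hypothetical, stripped-down stand-in for the Sandbox CR.
type sandbox struct {
	Name        string
	Annotations map[string]string
}

// cluster is a hypothetical in-memory pod store that replaces the API server.
type cluster struct {
	pods map[string]bool
}

// getPod reports errNotFound when no pod with the given name exists.
func (c *cluster) getPod(name string) error {
	if !c.pods[name] {
		return errNotFound
	}
	return nil
}

// reconcilePod returns the name of the pod backing sb, creating one if needed.
func reconcilePod(c *cluster, sb *sandbox) (string, error) {
	podName, annotated := sb.Annotations[podNameAnnotation]
	if !annotated {
		// Assumption for this sketch: an untracked Sandbox uses its own name.
		podName = sb.Name
	}

	err := c.getPod(podName)
	if err != nil && !errors.Is(err, errNotFound) {
		return "", fmt.Errorf("pod get failed: %w", err)
	}
	if err != nil {
		// Pod is missing. The old behavior returned an error here whenever
		// the annotation existed; the fix clears it and keeps going.
		if annotated {
			delete(sb.Annotations, podNameAnnotation)
		}
		podName = sb.Name
		c.pods[podName] = true // "create" the replacement pod
	}

	// Mirrors the role of ensurePodNameAnnotation: track the backing pod.
	if sb.Annotations == nil {
		sb.Annotations = map[string]string{}
	}
	sb.Annotations[podNameAnnotation] = podName
	return podName, nil
}

func main() {
	c := &cluster{pods: map[string]bool{}}
	sb := &sandbox{
		Name:        "demo",
		Annotations: map[string]string{podNameAnnotation: "warm-pod-gone"},
	}
	name, err := reconcilePod(c, sb)
	fmt.Println("pod:", name, "err:", err, "annotation:", sb.Annotations[podNameAnnotation])
}
```

In the real controller the clear step is a `Patch` call against the API server, so it can fail independently of pod creation; the sketch omits that failure path.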

Test plan

  • TestReconcilePodClearsStaleAnnotation — sandbox with stale annotation pointing to non-existent pod creates a new pod and updates the annotation
  • Updated table test to remove the old hard-error expectation
  • All existing reconcilePod tests pass (no behavior change for valid annotations)

When the pod tracked by agents.x-k8s.io/pod-name doesn't exist
(deleted during warm pool rotation, eviction, or image pull failure),
the controller returned a hard error, leaving the Sandbox stuck in a
reconcile loop unable to create a replacement pod.

Now the controller clears the stale annotation and falls through to
pod creation. The new pod gets tracked via ensurePodNameAnnotation.
@netlify

netlify bot commented Apr 3, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit 18afe68
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/69cff02a312f860008cb9c95

k8s-ci-robot requested review from barney-s and soltysh April 3, 2026 16:51
k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) Apr 3, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: noeljackson
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Hi @noeljackson. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) and size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) labels Apr 3, 2026
noeljackson added a commit to noeljackson/agent-sandbox that referenced this pull request Apr 3, 2026