Skip to content

fix: ensure warm pool pod-name annotation before Sandbox Ready#469

Open
Pepper-rice wants to merge 2 commits intokubernetes-sigs:mainfrom
Pepper-rice:feat/168-Controller-e2e-Test
Open

fix: ensure warm pool pod-name annotation before Sandbox Ready#469
Pepper-rice wants to merge 2 commits intokubernetes-sigs:mainfrom
Pepper-rice:feat/168-Controller-e2e-Test

Conversation

@Pepper-rice
Copy link
Copy Markdown

fixes #168

Summary:
This PR ensures an adopted warm-pool Sandbox has the agents.x-k8s.io/pod-name annotation before it can be observed as Ready=True.

What Changed:
• Set agents.x-k8s.io/pod-name during warm-pool sandbox adoption when the annotation is missing.
• Added TestWarmPoolPodNameAnnotationBeforeReady, a watch-based controller e2e test that fails immediately if an adopted sandbox is ever observed as Ready=True before the annotation is set.
• Added a unit-test assertion covering the warm-pool adoption path.

@netlify
Copy link
Copy Markdown

netlify bot commented Mar 24, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit 69059a4
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/69c39e0eed277800080dbd22

@k8s-ci-robot k8s-ci-robot requested review from igooch and justinsb March 24, 2026 07:10
@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla bot commented Mar 24, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Welcome @Pepper-rice!

It looks like this is your first PR to kubernetes-sigs/agent-sandbox 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/agent-sandbox has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 24, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @Pepper-rice. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 24, 2026
@aditya-shantanu
Copy link
Copy Markdown
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 24, 2026
}, 15*time.Second, 500*time.Millisecond, "sandbox should become not-ready after pod deletion")
}

func TestWarmPoolPodNameAnnotationBeforeReady(t *testing.T) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR. This PR also resolves:

  1. Metric calculation:
    // Existence of the SandboxPodNameAnnotation implies the pod was adopted from a warm pool.
  2. Also Client side dependency for pod name annotation.

Can we simplify this test? It's long and hard to read. Can you add some comments and add helper methods wherever needed?

adopted.Annotations = make(map[string]string)
}
// Map name before Ready to prevent Pod mismatch
if adopted.Annotations[v1alpha1.SandboxPodNameAnnotation] == "" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering if we should force set it ? if it present and the value is different what is the intended behavior. we can log and force set.


// 4. Verify the adopted sandbox records the adopted pod name
if val := adoptedSandbox.Annotations[sandboxv1alpha1.SandboxPodNameAnnotation]; val != adoptedSandbox.Name {
t.Errorf("expected adopted sandbox to have %q annotation %q, got %q", sandboxv1alpha1.SandboxPodNameAnnotation, adoptedSandbox.Name, val)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update the error message to log the entirety of adoptedSandbox.Annotations to provide better context on the error.

require.NoError(t, tc.CreateWithCleanup(t.Context(), warmPool))

require.Eventually(t, func() bool {
sandboxList := &sandboxv1alpha1.SandboxList{}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic to list and wait for the warm pool sandbox to become Ready=True is an exact duplicate of lines 62-77 above. Extracting this 15-line block into a helper function (e.g., waitForWarmPoolSandboxReady(tc, nsName, warmPool)) will heavily reduce the length of both E2E tests and improve maintainability.

defer cancel()

doneCh := make(chan struct {
done bool
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a critical race condition here that will cause the E2E test to flake. The framework.Watch is initiated inside a background goroutine, but the main thread immediately proceeds to call tc.CreateWithCleanup to create the claim. Because goroutine scheduling is non-deterministic, the claim might be created and adopted before the watch subscription connects to the API server. If this happens, the background watcher will miss the adoption UPDATE event entirely and will hang until the 60-second context timeout is reached, failing the test. You should implement a short delay (e.g., time.Sleep(200 * time.Millisecond)) before creating the claim, or add a synchronization mechanism to guarantee the watch has started.

@aditya-shantanu
Copy link
Copy Markdown
Contributor

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 25, 2026
@Pepper-rice
Copy link
Copy Markdown
Author

Thanks for the detailed feedback. I’ve pushed updates in <commit> to address the review comments.

Changes made

  1. Refactored the new E2E test in test/e2e/extensions/warmpool_sandbox_watcher_test.go with shared helpers and clearer comments.
  2. Updated sandbox adoption logic to reconcile agents.x-k8s.io/pod-name when the annotation is stale, and added an explicit log for that correction path.
  3. Expanded unit-test coverage for both stale and already-correct pod-name annotation cases.
  4. Improved unit-test assertion output to include the full annotations map for easier debugging.
  5. Updated the new E2E test to use a direct dynamic watch instead of framework.Watch so it does not miss early sandbox events during claim adoption.

Validation

  • go test ./test/e2e/extensions -run TestWarmPoolPodNameAnnotationBeforeReady -count=1 --parallel=1 -v
  • go test ./test/e2e/extensions -run 'TestWarmPoolSandboxWatcher|TestWarmPoolPodNameAnnotationBeforeReady' -count=1 --parallel=1 -v

Both passed locally.

Note: During manual validation for #168, I also reproduced a separate sandbox-controller issue that appears to be out of scope for this PR. I’ll track it in a follow-up issue.

Copy link
Copy Markdown

@codebot-robot codebot-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the changes look great and effectively resolve the race conditions surrounding warm-pool sandbox adoption by ensuring the pod-name annotation is stamped prior to Ready=True.

I left a few detailed notes below. The main focus areas are hardening the E2E tests against common flaky behaviors (e.g., watch channel closures, unexpected API server event types, Go loop pointer traps) and ensuring deep-copy safety during object mutation in the reconciliation loop.

(This review was generated by Overseer)

}
// Ensure the adopted sandbox records its pod name before it can be observed Ready.
if podName := adopted.Annotations[v1alpha1.SandboxPodNameAnnotation]; podName != adopted.Name {
if podName != "" {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears log.Info is being used here directly. If log is a package-level or globally scoped logger, it is recommended to retrieve the context-aware logger using logger := log.FromContext(ctx). This ensures trace IDs and other contextual request data are injected into the logs.

if podName := adopted.Annotations[v1alpha1.SandboxPodNameAnnotation]; podName != adopted.Name {
if podName != "" {
log.Info("Correcting adopted sandbox pod-name annotation", "sandbox", adopted.Name, "oldPodName", podName, "newPodName", adopted.Name)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assigning adopted.Annotations[v1alpha1.SandboxPodNameAnnotation] = adopted.Name mutates the object. While controller-runtime clients (Get/List) return deep copies by default making this safe, it is crucial to ensure adopted was not retrieved directly from a raw informer cache elsewhere in the chain. Direct mutation of cached objects causes data races and cache corruption.

name: "corrects stale pod-name annotation when adopting sandbox",
existingObjects: []client.Object{
template,
claim,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By assigning a completely new map to sb.Annotations, you overwrite any default annotations that createWarmPoolSandbox might have initialized. It's safer to check if sb.Annotations is nil, initialize if so, and assign your specific key-value pair instead of overwriting the whole map.

}

// 4. Verify the adopted sandbox records the adopted pod name
if val := adoptedSandbox.Annotations[sandboxv1alpha1.SandboxPodNameAnnotation]; val != adoptedSandbox.Name {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If adoptedSandbox.Annotations could theoretically be nil in other test cases, this manual check won't panic (as map lookup on nil is safe in Go), but using testing helpers like require.Equal(t, adoptedSandbox.Name, adoptedSandbox.Annotations[sandboxv1alpha1.SandboxPodNameAnnotation]) provides much more readable diffs on test failure.


// Wait for warm pool Sandbox to become ready
var poolSandboxName string
require.Eventually(t, func() bool {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Go versions prior to 1.22, taking the address of a loop variable (&sb) captures the same memory address across iterations, leading to unpredictable behavior if isSandboxReady or metav1.IsControlledBy persist the pointer. Consider passing by value or using &sandboxList.Items[i] for pointer safety.

t.Helper()

ctx, cancel := context.WithTimeout(t.Context(), 60*time.Second)
defer cancel()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using metav1.ListOptions{} without specifying a ResourceVersion means the watch starts from the current cluster state. While Watch establishment is synchronous, in highly loaded clusters there's a micro-race where the Sandbox might be updated right before the watch fully connects. Consider listing with a ResourceVersion and watching from that specific version to guarantee no missed events.


// Use a direct API watch to avoid the async subscription race in framework.Watch
sandboxWatcher, err := tc.DynamicClient().Resource(
sandboxv1alpha1.GroupVersion.WithResource("sandboxes"),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The helper function requirePodNameAnnotationWhenReady implies it strictly performs assertions based on the name, but it actively creates the claim object via tc.CreateWithCleanup. Consider renaming it to something like createClaimAndVerifyPodNameAnnotationBeforeReady so the side-effect of resource creation is clear to readers.


require.NoError(t, tc.CreateWithCleanup(t.Context(), claim))

for {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line require.True(t, ok, ...) immediately fails the test if the watch channel closes. Kubernetes API server watches can close naturally (e.g. from timeouts or API server restarts). For more robust E2E tests, handle channel closure by gracefully re-establishing the watch instead of a hard failure.

require.True(t, ok, "sandbox watch closed before observing adopted sandbox readiness")
require.NotEqual(t, watch.Error, event.Type, "received error event while watching sandboxes")

if event.Type == watch.Deleted {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The watch event might occasionally yield a *metav1.Status object (especially during watch errors or timeouts from the APIServer). Assuming it is always *unstructured.Unstructured and failing via require.True will cause flaky tests. It's better to explicitly check for *metav1.Status and log/retry, or safely ignore it.

sb := &sandboxv1alpha1.Sandbox{}
require.NoError(t, runtime.DefaultUnstructuredConverter.FromUnstructured(u.Object, sb))

controllerRef := metav1.GetControllerOf(sb)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If isSandboxReady(sb) returns false, the loop silently continues. If the test hangs because readiness is never reached, it will simply timeout after 60s with no clues. Adding a t.Logf("Observed sandbox %s controlled by claim, but not ready yet", sb.Name) right before the continue will greatly improve debuggability for flaky CI runs.

@Pepper-rice
Copy link
Copy Markdown
Author

Friendly ping on this PR
@vicentefb @barney-s could you please take another look when you have a moment?
I’ve addressed prior comments and updated tests/validation in the latest commit.
Thank you!

@Pepper-rice Pepper-rice requested a review from SHRUTI6991 April 1, 2026 15:43
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aditya-shantanu, Pepper-rice
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:extensions cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Controller e2e Test: pod-name annotation is present when Sandbox is Ready

8 participants