fix: ensure warm pool pod-name annotation before Sandbox Ready by Pepper-rice · Pull Request #469 · kubernetes-sigs/agent-sandbox

Pepper-rice · 2026-03-24T07:10:48Z

fixes #168

Summary:
This PR ensures an adopted warm-pool Sandbox has the agents.x-k8s.io/pod-name annotation before it can be observed as Ready=True.

What Changed:
• Set agents.x-k8s.io/pod-name during warm-pool sandbox adoption when the annotation is missing.
• Added TestWarmPoolPodNameAnnotationBeforeReady, a watch-based controller e2e test that fails immediately if an adopted sandbox is ever observed as Ready=True before the annotation is set.
• Added a unit-test assertion covering the warm-pool adoption path.

Signed-off-by: Yangyin <[email protected]>

netlify · 2026-03-24T07:10:55Z

✅ Deploy Preview for agent-sandbox canceled.

Name	Link
🔨 Latest commit	`69059a4`
🔍 Latest deploy log	https://app.netlify.com/projects/agent-sandbox/deploys/69c39e0eed277800080dbd22

linux-foundation-easycla · 2026-03-24T07:10:57Z

The committers listed above are authorized under a signed CLA.

✅ login: Pepper-rice / name: Yin (244a302, 69059a4)

k8s-ci-robot · 2026-03-24T07:10:58Z

Welcome @Pepper-rice!

It looks like this is your first PR to kubernetes-sigs/agent-sandbox 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/agent-sandbox has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2026-03-24T07:10:59Z

Hi @Pepper-rice. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

aditya-shantanu · 2026-03-24T16:00:08Z

/ok-to-test

SHRUTI6991 · 2026-03-24T16:02:46Z

test/e2e/extensions/warmpool_sandbox_watcher_test.go

 	}, 15*time.Second, 500*time.Millisecond, "sandbox should become not-ready after pod deletion")
 }
+
+func TestWarmPoolPodNameAnnotationBeforeReady(t *testing.T) {


Thank you for this PR. This PR also resolves:

Metric calculation:

agent-sandbox/extensions/controllers/sandboxclaim_controller.go

Line 738 in d9b5f86

// Existence of the SandboxPodNameAnnotation implies the pod was adopted from a warm pool.

It also resolves the client-side dependency on the pod-name annotation.

Can we simplify this test? It's long and hard to read. Can you add some comments and add helper methods wherever needed?

barney-s · 2026-03-24T16:51:15Z

extensions/controllers/sandboxclaim_controller.go

 			adopted.Annotations = make(map[string]string)
 		}
+		// Map name before Ready to prevent Pod mismatch
+		if adopted.Annotations[v1alpha1.SandboxPodNameAnnotation] == "" {


Wondering if we should force-set it. If it is present and the value is different, what is the intended behavior? We can log and force-set.
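The log-and-force-set behavior raised here can be sketched with the standard library alone. The helper name `ensurePodName` and the plain map standing in for the Sandbox's annotations are assumptions made for illustration; this is not the controller's actual code.

```go
package main

import "fmt"

// podNameAnnotation is the annotation key named in the PR description.
const podNameAnnotation = "agents.x-k8s.io/pod-name"

// ensurePodName is a hypothetical helper. It makes sure the annotation equals
// name, creating the map if needed, and reports the previous value and whether
// anything changed so the caller can decide what to log.
func ensurePodName(annotations map[string]string, name string) (map[string]string, string, bool) {
	if annotations == nil {
		annotations = make(map[string]string)
	}
	old := annotations[podNameAnnotation]
	if old == name {
		return annotations, old, false // already correct, nothing to do
	}
	annotations[podNameAnnotation] = name // force-set missing or stale values
	return annotations, old, true
}

func main() {
	ann := map[string]string{podNameAnnotation: "stale-pod"}
	ann, old, changed := ensurePodName(ann, "sandbox-a")
	if changed && old != "" {
		// Only a stale (non-empty) value is worth a correction log line.
		fmt.Printf("correcting pod-name annotation: old=%q new=%q\n", old, ann[podNameAnnotation])
	}
}
```

Returning the old value keeps the logging decision with the caller, so a missing annotation can be set silently while a mismatched one is logged.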

vicentefb · 2026-03-24T22:07:23Z

extensions/controllers/sandboxclaim_controller_test.go


+				// 4. Verify the adopted sandbox records the adopted pod name
+				if val := adoptedSandbox.Annotations[sandboxv1alpha1.SandboxPodNameAnnotation]; val != adoptedSandbox.Name {
+					t.Errorf("expected adopted sandbox to have %q annotation %q, got %q", sandboxv1alpha1.SandboxPodNameAnnotation, adoptedSandbox.Name, val)


Please update the error message to log the full adoptedSandbox.Annotations map, which gives better context when the check fails.
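A minimal sketch of what such a message could look like, assuming a hypothetical helper `annotationMismatch` that builds the failure text; in the real test the string would presumably be passed to `t.Errorf`.

```go
package main

import "fmt"

// annotationMismatch is a hypothetical helper. It returns an empty string when
// the annotation has the wanted value, and otherwise a message that includes
// the whole annotations map for context.
func annotationMismatch(annotations map[string]string, key, want string) string {
	if got := annotations[key]; got != want {
		return fmt.Sprintf("expected annotation %q to be %q, got %q; all annotations: %v",
			key, want, got, annotations)
	}
	return ""
}

func main() {
	ann := map[string]string{"k": "x", "other": "y"}
	fmt.Println(annotationMismatch(ann, "k", "v"))
}
```

Because `fmt` prints map keys in sorted order, the dumped map is stable across runs, which makes failures easier to compare.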

vicentefb · 2026-03-24T22:08:34Z

test/e2e/extensions/warmpool_sandbox_watcher_test.go

+	require.NoError(t, tc.CreateWithCleanup(t.Context(), warmPool))
+
+	require.Eventually(t, func() bool {
+		sandboxList := &sandboxv1alpha1.SandboxList{}


The logic to list and wait for the warm pool sandbox to become Ready=True is an exact duplicate of lines 62-77 above. Extracting this 15-line block into a helper function (e.g., waitForWarmPoolSandboxReady(tc, nsName, warmPool)) will heavily reduce the length of both E2E tests and improve maintainability.

vicentefb · 2026-03-24T22:09:42Z

test/e2e/extensions/warmpool_sandbox_watcher_test.go

+	defer cancel()
+
+	doneCh := make(chan struct {
+		done bool


There is a critical race condition here that will cause the E2E test to flake. The framework.Watch is initiated inside a background goroutine, but the main thread immediately proceeds to call tc.CreateWithCleanup to create the claim. Because goroutine scheduling is non-deterministic, the claim might be created and adopted before the watch subscription connects to the API server. If this happens, the background watcher will miss the adoption UPDATE event entirely and will hang until the 60-second context timeout is reached, failing the test. You should implement a short delay (e.g., time.Sleep(200 * time.Millisecond)) before creating the claim, or add a synchronization mechanism to guarantee the watch has started.
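The synchronization option mentioned above can be illustrated without a cluster. In this sketch the `bus` type is a hypothetical stand-in for the API server (subscribers only see events published after they subscribe), and `watchThenAct` is an assumed helper name; the point is only the `ready` channel that gates the triggering action.

```go
package main

import (
	"fmt"
	"sync"
)

// bus is a hypothetical stand-in for an event source: a subscriber receives
// only the events published after it subscribed.
type bus struct {
	mu   sync.Mutex
	subs []chan string
}

func (b *bus) subscribe() <-chan string {
	ch := make(chan string, 8)
	b.mu.Lock()
	b.subs = append(b.subs, ch)
	b.mu.Unlock()
	return ch
}

func (b *bus) publish(ev string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs {
		ch <- ev
	}
}

// watchThenAct starts a watcher goroutine, blocks until that goroutine has
// subscribed, and only then runs act. It returns the first event observed.
func watchThenAct(b *bus, act func()) string {
	ready := make(chan struct{})
	got := make(chan string, 1)
	go func() {
		events := b.subscribe()
		close(ready) // signal that the subscription is established
		got <- <-events
	}()
	<-ready // do not trigger the change until the watcher is listening
	act()
	return <-got
}

func main() {
	b := &bus{}
	fmt.Println(watchThenAct(b, func() { b.publish("adopted") }))
}
```

Unlike a fixed sleep, closing a channel after the subscription is in place does not depend on scheduler timing.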

aditya-shantanu · 2026-03-25T05:05:31Z

/priority critical-urgent

Pepper-rice · 2026-03-25T08:41:30Z

Thanks for the detailed feedback. I’ve pushed updates in <commit> to address the review comments.

Changes made

• Refactored the new E2E test in test/e2e/extensions/warmpool_sandbox_watcher_test.go with shared helpers and clearer comments.
• Updated sandbox adoption logic to reconcile agents.x-k8s.io/pod-name when the annotation is stale, and added an explicit log for that correction path.
• Expanded unit-test coverage for both stale and already-correct pod-name annotation cases.
• Improved unit-test assertion output to include the full annotations map for easier debugging.
• Updated the new E2E test to use a direct dynamic watch instead of framework.Watch so it does not miss early sandbox events during claim adoption.

Validation

go test ./test/e2e/extensions -run TestWarmPoolPodNameAnnotationBeforeReady -count=1 --parallel=1 -v
go test ./test/e2e/extensions -run 'TestWarmPoolSandboxWatcher|TestWarmPoolPodNameAnnotationBeforeReady' -count=1 --parallel=1 -v

Both passed locally.

Note: During manual validation for #168, I also reproduced a separate sandbox-controller issue that appears to be out of scope for this PR. I’ll track it in a follow-up issue.

codebot-robot

Overall, the changes look great and effectively resolve the race conditions surrounding warm-pool sandbox adoption by ensuring the pod-name annotation is stamped prior to Ready=True.

I left a few detailed notes below. The main focus areas are hardening the E2E tests against common flaky behaviors (e.g., watch channel closures, unexpected API server event types, Go loop pointer traps) and ensuring deep-copy safety during object mutation in the reconciliation loop.

(This review was generated by Overseer)

codebot-robot · 2026-03-26T01:07:05Z

extensions/controllers/sandboxclaim_controller.go

 		}
+		// Ensure the adopted sandbox records its pod name before it can be observed Ready.
+		if podName := adopted.Annotations[v1alpha1.SandboxPodNameAnnotation]; podName != adopted.Name {
+			if podName != "" {


It appears log.Info is being used here directly. If log is a package-level or globally scoped logger, it is recommended to retrieve the context-aware logger using logger := log.FromContext(ctx). This ensures trace IDs and other contextual request data are injected into the logs.

codebot-robot · 2026-03-26T01:07:05Z

extensions/controllers/sandboxclaim_controller.go

+		if podName := adopted.Annotations[v1alpha1.SandboxPodNameAnnotation]; podName != adopted.Name {
+			if podName != "" {
+				log.Info("Correcting adopted sandbox pod-name annotation", "sandbox", adopted.Name, "oldPodName", podName, "newPodName", adopted.Name)
+			}


Assigning adopted.Annotations[v1alpha1.SandboxPodNameAnnotation] = adopted.Name mutates the object. While controller-runtime clients (Get/List) return deep copies by default making this safe, it is crucial to ensure adopted was not retrieved directly from a raw informer cache elsewhere in the chain. Direct mutation of cached objects causes data races and cache corruption.
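The hazard described here can be shown with a toy type. `object` and its `deepCopy` method are hypothetical stand-ins for a cached API object and its generated `DeepCopy`; the sketch only demonstrates that mutating a copy leaves the shared original untouched.

```go
package main

import "fmt"

// object is a hypothetical stand-in for an object held in a shared cache.
type object struct {
	Name        string
	Annotations map[string]string
}

// deepCopy returns a copy that shares no map storage with the receiver.
func (o *object) deepCopy() *object {
	out := &object{Name: o.Name}
	if o.Annotations != nil {
		out.Annotations = make(map[string]string, len(o.Annotations))
		for k, v := range o.Annotations {
			out.Annotations[k] = v
		}
	}
	return out
}

func main() {
	cached := &object{Name: "sandbox-a", Annotations: map[string]string{"team": "x"}}

	// Mutate a private copy; the cached object keeps its original annotations.
	adopted := cached.deepCopy()
	adopted.Annotations["agents.x-k8s.io/pod-name"] = adopted.Name

	fmt.Println(len(cached.Annotations), len(adopted.Annotations))
}
```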

codebot-robot · 2026-03-26T01:07:05Z

extensions/controllers/sandboxclaim_controller_test.go

+			name: "corrects stale pod-name annotation when adopting sandbox",
+			existingObjects: []client.Object{
+				template,
+				claim,


By assigning a completely new map to sb.Annotations, you overwrite any default annotations that createWarmPoolSandbox might have initialized. It's safer to check if sb.Annotations is nil, initialize if so, and assign your specific key-value pair instead of overwriting the whole map.

codebot-robot · 2026-03-26T01:07:05Z

extensions/controllers/sandboxclaim_controller_test.go

 				}

+				// 4. Verify the adopted sandbox records the adopted pod name
+				if val := adoptedSandbox.Annotations[sandboxv1alpha1.SandboxPodNameAnnotation]; val != adoptedSandbox.Name {


If adoptedSandbox.Annotations could theoretically be nil in other test cases, this manual check won't panic (as map lookup on nil is safe in Go), but using testing helpers like require.Equal(t, adoptedSandbox.Name, adoptedSandbox.Annotations[sandboxv1alpha1.SandboxPodNameAnnotation]) provides much more readable diffs on test failure.

codebot-robot · 2026-03-26T01:07:05Z

test/e2e/extensions/warmpool_sandbox_watcher_test.go


-	// Wait for warm pool Sandbox to become ready
-	var poolSandboxName string
 	require.Eventually(t, func() bool {


In Go versions prior to 1.22, taking the address of a loop variable (&sb) captures the same memory address across iterations, leading to unpredictable behavior if isSandboxReady or metav1.IsControlledBy persist the pointer. Consider passing by value or using &sandboxList.Items[i] for pointer safety.
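The index-based form suggested above behaves the same on every Go version. The `sandbox` struct and `readyPointers` function below are hypothetical, kept to the standard library so the pattern can be run in isolation.

```go
package main

import "fmt"

// sandbox is a hypothetical, trimmed-down stand-in for a Sandbox list item.
type sandbox struct {
	Name  string
	Ready bool
}

// readyPointers returns pointers to the ready items. Indexing into the slice
// yields the address of the element itself rather than of a per-loop copy.
func readyPointers(items []sandbox) []*sandbox {
	var out []*sandbox
	for i := range items {
		if items[i].Ready {
			out = append(out, &items[i])
		}
	}
	return out
}

func main() {
	items := []sandbox{{Name: "a"}, {Name: "b", Ready: true}, {Name: "c", Ready: true}}
	for _, p := range readyPointers(items) {
		fmt.Println(p.Name)
	}
}
```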

codebot-robot · 2026-03-26T01:07:05Z

test/e2e/extensions/warmpool_sandbox_watcher_test.go

+	t.Helper()
+
+	ctx, cancel := context.WithTimeout(t.Context(), 60*time.Second)
+	defer cancel()


Using metav1.ListOptions{} without specifying a ResourceVersion means the watch starts from the current cluster state. While Watch establishment is synchronous, in highly loaded clusters there's a micro-race where the Sandbox might be updated right before the watch fully connects. Consider listing with a ResourceVersion and watching from that specific version to guarantee no missed events.
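The list-then-watch idea can be modeled with a toy change log. Everything here (`store`, `list`, `watchFrom`) is a hypothetical simulation rather than client-go; it only illustrates why resuming from the version returned by the list means a change that lands before the watch connects is still delivered.

```go
package main

import "fmt"

// event is a hypothetical versioned change record.
type event struct {
	Version int
	Note    string
}

// store is a toy in-memory stand-in for an API server's change history.
type store struct{ log []event }

// apply records a change and assigns it the next version number.
func (s *store) apply(note string) {
	s.log = append(s.log, event{Version: len(s.log) + 1, Note: note})
}

// list returns the version of the latest recorded change, playing the role of
// the resourceVersion returned by a List call.
func (s *store) list() int { return len(s.log) }

// watchFrom returns every change recorded after version rv.
func (s *store) watchFrom(rv int) []event {
	var out []event
	for _, e := range s.log {
		if e.Version > rv {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	s := &store{}
	s.apply("warm-pool sandbox created")
	rv := s.list()             // snapshot version taken by the list
	s.apply("sandbox adopted") // lands before the watch connects
	for _, e := range s.watchFrom(rv) {
		fmt.Println(e.Version, e.Note)
	}
}
```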

codebot-robot · 2026-03-26T01:07:05Z

test/e2e/extensions/warmpool_sandbox_watcher_test.go

+
+	// Use a direct API watch to avoid the async subscription race in framework.Watch
+	sandboxWatcher, err := tc.DynamicClient().Resource(
+		sandboxv1alpha1.GroupVersion.WithResource("sandboxes"),


The helper function requirePodNameAnnotationWhenReady implies it strictly performs assertions based on the name, but it actively creates the claim object via tc.CreateWithCleanup. Consider renaming it to something like createClaimAndVerifyPodNameAnnotationBeforeReady so the side-effect of resource creation is clear to readers.

codebot-robot · 2026-03-26T01:07:06Z

test/e2e/extensions/warmpool_sandbox_watcher_test.go

+
+	require.NoError(t, tc.CreateWithCleanup(t.Context(), claim))
+
+	for {


The line require.True(t, ok, ...) immediately fails the test if the watch channel closes. Kubernetes API server watches can close naturally (e.g. from timeouts or API server restarts). For more robust E2E tests, handle channel closure by gracefully re-establishing the watch instead of a hard failure.

codebot-robot · 2026-03-26T01:07:06Z

test/e2e/extensions/warmpool_sandbox_watcher_test.go

+			require.True(t, ok, "sandbox watch closed before observing adopted sandbox readiness")
+			require.NotEqual(t, watch.Error, event.Type, "received error event while watching sandboxes")
+
+			if event.Type == watch.Deleted {


The watch event might occasionally yield a *metav1.Status object (especially during watch errors or timeouts from the APIServer). Assuming it is always *unstructured.Unstructured and failing via require.True will cause flaky tests. It's better to explicitly check for *metav1.Status and log/retry, or safely ignore it.
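A sketch of a loop that tolerates both channel closure and error events, under simplifying assumptions: `event` is a hypothetical stand-in for a watch event, `replay` fakes a sequence of watch sessions, and `waitFor` is an assumed helper name. With client-go the channel would come from the watcher's `ResultChan()`, and a restart would normally resume from the last seen resourceVersion, which this sketch omits.

```go
package main

import (
	"errors"
	"fmt"
)

// event is a hypothetical stand-in for a watch event.
type event struct{ Type, Name string }

// replay returns an open function that serves one pre-recorded batch of
// events per call and then closes the channel, imitating a watch that ends.
func replay(streams [][]event) func() <-chan event {
	next := 0
	return func() <-chan event {
		var batch []event
		if next < len(streams) {
			batch = streams[next]
			next++
		}
		ch := make(chan event, len(batch))
		for _, ev := range batch {
			ch <- ev
		}
		close(ch)
		return ch
	}
}

// waitFor re-opens the watch up to maxRestarts times when the channel closes
// or an ERROR event arrives, and returns the first event accepted by match.
func waitFor(open func() <-chan event, match func(event) bool, maxRestarts int) (event, error) {
	for attempt := 0; attempt <= maxRestarts; attempt++ {
		for ev := range open() { // the range ends when the channel is closed
			if ev.Type == "ERROR" {
				break // treat as a dropped watch and re-open
			}
			if match(ev) {
				return ev, nil
			}
		}
	}
	return event{}, errors.New("watch ended before a matching event was observed")
}

func main() {
	open := replay([][]event{
		{{Type: "ADDED", Name: "sb-1"}},                        // closes early
		{{Type: "ERROR"}, {Type: "MODIFIED", Name: "ignored"}}, // error drops the session
		{{Type: "MODIFIED", Name: "sb-1"}},
	})
	isUpdate := func(e event) bool { return e.Type == "MODIFIED" }
	ev, err := waitFor(open, isUpdate, 2)
	fmt.Println(ev.Name, err)
}
```

Bounding the number of restarts keeps the test from looping forever while still surviving an occasional dropped watch.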

codebot-robot · 2026-03-26T01:07:06Z

test/e2e/extensions/warmpool_sandbox_watcher_test.go

+			sb := &sandboxv1alpha1.Sandbox{}
+			require.NoError(t, runtime.DefaultUnstructuredConverter.FromUnstructured(u.Object, sb))
+
+			controllerRef := metav1.GetControllerOf(sb)


If isSandboxReady(sb) returns false, the loop silently continues. If the test hangs because readiness is never reached, it will simply timeout after 60s with no clues. Adding a t.Logf("Observed sandbox %s controlled by claim, but not ready yet", sb.Name) right before the continue will greatly improve debuggability for flaky CI runs.

Pepper-rice · 2026-03-31T02:59:02Z

Friendly ping on this PR
@vicentefb @barney-s could you please take another look when you have a moment?
I’ve addressed prior comments and updated tests/validation in the latest commit.
Thank you!

k8s-ci-robot · 2026-04-01T21:11:13Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aditya-shantanu, Pepper-rice
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

extensions/controllers/OWNERS
~~test/OWNERS~~ [aditya-shantanu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fix: ensure warm pool pod-name annotation before Sandbox Ready

244a302

Signed-off-by: Yangyin <[email protected]>

k8s-ci-robot requested review from igooch and justinsb March 24, 2026 07:10

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 24, 2026

aditya-shantanu approved these changes Mar 24, 2026

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 24, 2026

SHRUTI6991 reviewed Mar 24, 2026

barney-s reviewed Mar 24, 2026

vicentefb reviewed Mar 24, 2026

k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 25, 2026

fix: ensure warm pool pod-name annotation before Sandbox Ready

69059a4

codebot-robot reviewed Mar 26, 2026

Pepper-rice requested review from barney-s and vicentefb March 27, 2026 02:08

Pepper-rice requested a review from SHRUTI6991 April 1, 2026 15:43

aditya-shantanu approved these changes Apr 1, 2026

justinsb added the area:extensions label Apr 3, 2026



8 participants
