fix: ensure warm pool pod-name annotation before Sandbox Ready#469
fix: ensure warm pool pod-name annotation before Sandbox Ready#469Pepper-rice wants to merge 2 commits intokubernetes-sigs:mainfrom
Conversation
Signed-off-by: Yangyin <[email protected]>
✅ Deploy Preview for agent-sandbox canceled.
|
|
Welcome @Pepper-rice! |
|
Hi @Pepper-rice. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
| }, 15*time.Second, 500*time.Millisecond, "sandbox should become not-ready after pod deletion") | ||
| } | ||
|
|
||
| func TestWarmPoolPodNameAnnotationBeforeReady(t *testing.T) { |
There was a problem hiding this comment.
Thank you for this PR. This PR also resolves:
- Metric calculation:
- Also Client side dependency for pod name annotation.
Can we simplify this test? It's long and hard to read. Can you add some comments and add helper methods wherever needed?
| adopted.Annotations = make(map[string]string) | ||
| } | ||
| // Map name before Ready to prevent Pod mismatch | ||
| if adopted.Annotations[v1alpha1.SandboxPodNameAnnotation] == "" { |
There was a problem hiding this comment.
wondering if we should force set it ? if it present and the value is different what is the intended behavior. we can log and force set.
|
|
||
| // 4. Verify the adopted sandbox records the adopted pod name | ||
| if val := adoptedSandbox.Annotations[sandboxv1alpha1.SandboxPodNameAnnotation]; val != adoptedSandbox.Name { | ||
| t.Errorf("expected adopted sandbox to have %q annotation %q, got %q", sandboxv1alpha1.SandboxPodNameAnnotation, adoptedSandbox.Name, val) |
There was a problem hiding this comment.
please update the error message to log the entirety of adoptedSandbox.Annotations to provide better context on the error.
| require.NoError(t, tc.CreateWithCleanup(t.Context(), warmPool)) | ||
|
|
||
| require.Eventually(t, func() bool { | ||
| sandboxList := &sandboxv1alpha1.SandboxList{} |
There was a problem hiding this comment.
The logic to list and wait for the warm pool sandbox to become Ready=True is an exact duplicate of lines 62-77 above. Extracting this 15-line block into a helper function (e.g., waitForWarmPoolSandboxReady(tc, nsName, warmPool)) will heavily reduce the length of both E2E tests and improve maintainability.
| defer cancel() | ||
|
|
||
| doneCh := make(chan struct { | ||
| done bool |
There was a problem hiding this comment.
There is a critical race condition here that will cause the E2E test to flake. The framework.Watch is initiated inside a background goroutine, but the main thread immediately proceeds to call tc.CreateWithCleanup to create the claim. Because goroutine scheduling is non-deterministic, the claim might be created and adopted before the watch subscription connects to the API server. If this happens, the background watcher will miss the adoption UPDATE event entirely and will hang until the 60-second context timeout is reached, failing the test. You should implement a short delay (e.g., time.Sleep(200 * time.Millisecond)) before creating the claim, or add a synchronization mechanism to guarantee the watch has started.
|
/priority critical-urgent |
|
Thanks for the detailed feedback. I’ve pushed updates in <commit> to address the review comments. Changes made
Validation
Both passed locally. Note: During manual validation for #168, I also reproduced a separate sandbox-controller issue that appears to be out of scope for this PR. I’ll track it in a follow-up issue. |
codebot-robot
left a comment
There was a problem hiding this comment.
Overall, the changes look great and effectively resolve the race conditions surrounding warm-pool sandbox adoption by ensuring the pod-name annotation is stamped prior to Ready=True.
I left a few detailed notes below. The main focus areas are hardening the E2E tests against common flaky behaviors (e.g., watch channel closures, unexpected API server event types, Go loop pointer traps) and ensuring deep-copy safety during object mutation in the reconciliation loop.
(This review was generated by Overseer)
| } | ||
| // Ensure the adopted sandbox records its pod name before it can be observed Ready. | ||
| if podName := adopted.Annotations[v1alpha1.SandboxPodNameAnnotation]; podName != adopted.Name { | ||
| if podName != "" { |
There was a problem hiding this comment.
It appears log.Info is being used here directly. If log is a package-level or globally scoped logger, it is recommended to retrieve the context-aware logger using logger := log.FromContext(ctx). This ensures trace IDs and other contextual request data are injected into the logs.
| if podName := adopted.Annotations[v1alpha1.SandboxPodNameAnnotation]; podName != adopted.Name { | ||
| if podName != "" { | ||
| log.Info("Correcting adopted sandbox pod-name annotation", "sandbox", adopted.Name, "oldPodName", podName, "newPodName", adopted.Name) | ||
| } |
There was a problem hiding this comment.
Assigning adopted.Annotations[v1alpha1.SandboxPodNameAnnotation] = adopted.Name mutates the object. While controller-runtime clients (Get/List) return deep copies by default making this safe, it is crucial to ensure adopted was not retrieved directly from a raw informer cache elsewhere in the chain. Direct mutation of cached objects causes data races and cache corruption.
| name: "corrects stale pod-name annotation when adopting sandbox", | ||
| existingObjects: []client.Object{ | ||
| template, | ||
| claim, |
There was a problem hiding this comment.
By assigning a completely new map to sb.Annotations, you overwrite any default annotations that createWarmPoolSandbox might have initialized. It's safer to check if sb.Annotations is nil, initialize if so, and assign your specific key-value pair instead of overwriting the whole map.
| } | ||
|
|
||
| // 4. Verify the adopted sandbox records the adopted pod name | ||
| if val := adoptedSandbox.Annotations[sandboxv1alpha1.SandboxPodNameAnnotation]; val != adoptedSandbox.Name { |
There was a problem hiding this comment.
If adoptedSandbox.Annotations could theoretically be nil in other test cases, this manual check won't panic (as map lookup on nil is safe in Go), but using testing helpers like require.Equal(t, adoptedSandbox.Name, adoptedSandbox.Annotations[sandboxv1alpha1.SandboxPodNameAnnotation]) provides much more readable diffs on test failure.
|
|
||
| // Wait for warm pool Sandbox to become ready | ||
| var poolSandboxName string | ||
| require.Eventually(t, func() bool { |
There was a problem hiding this comment.
In Go versions prior to 1.22, taking the address of a loop variable (&sb) captures the same memory address across iterations, leading to unpredictable behavior if isSandboxReady or metav1.IsControlledBy persist the pointer. Consider passing by value or using &sandboxList.Items[i] for pointer safety.
| t.Helper() | ||
|
|
||
| ctx, cancel := context.WithTimeout(t.Context(), 60*time.Second) | ||
| defer cancel() |
There was a problem hiding this comment.
Using metav1.ListOptions{} without specifying a ResourceVersion means the watch starts from the current cluster state. While Watch establishment is synchronous, in highly loaded clusters there's a micro-race where the Sandbox might be updated right before the watch fully connects. Consider listing with a ResourceVersion and watching from that specific version to guarantee no missed events.
|
|
||
| // Use a direct API watch to avoid the async subscription race in framework.Watch | ||
| sandboxWatcher, err := tc.DynamicClient().Resource( | ||
| sandboxv1alpha1.GroupVersion.WithResource("sandboxes"), |
There was a problem hiding this comment.
The helper function requirePodNameAnnotationWhenReady implies it strictly performs assertions based on the name, but it actively creates the claim object via tc.CreateWithCleanup. Consider renaming it to something like createClaimAndVerifyPodNameAnnotationBeforeReady so the side-effect of resource creation is clear to readers.
|
|
||
| require.NoError(t, tc.CreateWithCleanup(t.Context(), claim)) | ||
|
|
||
| for { |
There was a problem hiding this comment.
The line require.True(t, ok, ...) immediately fails the test if the watch channel closes. Kubernetes API server watches can close naturally (e.g. from timeouts or API server restarts). For more robust E2E tests, handle channel closure by gracefully re-establishing the watch instead of a hard failure.
| require.True(t, ok, "sandbox watch closed before observing adopted sandbox readiness") | ||
| require.NotEqual(t, watch.Error, event.Type, "received error event while watching sandboxes") | ||
|
|
||
| if event.Type == watch.Deleted { |
There was a problem hiding this comment.
The watch event might occasionally yield a *metav1.Status object (especially during watch errors or timeouts from the APIServer). Assuming it is always *unstructured.Unstructured and failing via require.True will cause flaky tests. It's better to explicitly check for *metav1.Status and log/retry, or safely ignore it.
| sb := &sandboxv1alpha1.Sandbox{} | ||
| require.NoError(t, runtime.DefaultUnstructuredConverter.FromUnstructured(u.Object, sb)) | ||
|
|
||
| controllerRef := metav1.GetControllerOf(sb) |
There was a problem hiding this comment.
If isSandboxReady(sb) returns false, the loop silently continues. If the test hangs because readiness is never reached, it will simply timeout after 60s with no clues. Adding a t.Logf("Observed sandbox %s controlled by claim, but not ready yet", sb.Name) right before the continue will greatly improve debuggability for flaky CI runs.
|
Friendly ping on this PR |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: aditya-shantanu, Pepper-rice The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
fixes #168
Summary:
This PR ensures an adopted warm-pool
Sandboxhas theagents.x-k8s.io/pod-nameannotation before it can be observed asReady=True.What Changed:
• Set agents.x-k8s.io/pod-name during warm-pool sandbox adoption when the annotation is missing.
• Added TestWarmPoolPodNameAnnotationBeforeReady, a watch-based controller e2e test that fails immediately if an adopted sandbox is ever observed as Ready=True before the annotation is set.
• Added a unit-test assertion covering the warm-pool adoption path.