feat: Implement ProgressDeadlineSeconds for Sandbox Resource #307
base: main
```diff
@@ -105,13 +105,11 @@ func (r *SandboxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
 		return ctrl.Result{}, nil
 	}
 
-	// Check if already marked as expired to avoid repeated operations, including cleanups
-	if sandboxMarkedExpired(sandbox) {
-		log.Info("Sandbox is already marked as expired")
-		// Note: The sandbox won't be deleted if shutdown policy is changed to delete after expiration.
-		// To delete an expired sandbox, the user should delete the sandbox instead of updating it.
-		// This keeps the controller code simple.
-		return ctrl.Result{}, nil
+	// This stops reconciliation for resources that have already hit a deadline or expired.
+	// TODO: Use sandbox phase "Failed" check instead of these helper functions PR#121
+	if sandboxMarkedExpired(sandbox) || sandboxStalled(sandbox) {
+		log.Info("Sandbox is in a terminal state (Expired or Stalled). Stopping reconciliation.")
+		return ctrl.Result{}, nil // stop trying to reconcile the resource
 	}
 
 	// Initialize trace ID for active resources missing an ID
```

**Member** (on the `TODO` comment): Please remove/rephrase this line. As I commented in #121, using phase is a legacy approach and is now an anti-pattern in Kubernetes.

**Contributor** (on the new `if` condition): Question: once `Reason=ProgressDeadlineExceeded`, we return early forever. Is that intended as a hard terminal state, even after spec updates? If yes, a short comment/doc note might help set expectations.

**Contributor (author)**: Because of limited status tracking in the Sandbox resource, the progress deadline is currently calculated from the `CreationTimestamp`. Even without the early return above, this results in a terminal stalled state that persists even if the spec is updated or the underlying pod becomes ready. While using condition transition times could allow the resource to recover, it may lead to unintended timer resets.

**Member**: This creates a terminal lock-in where the resource will never be reconciled again, even if the user fixes the underlying pod spec or if a transient infrastructure issue (e.g. node, network) resolves itself. Given that this takes inspiration from the Deployment controller, let's see how it's done there. In the Deployment controller, if progress resumes (e.g., pods become Ready after a transient infra issue), the controller updates the …
|
|
```diff
@@ -139,24 +137,44 @@ func (r *SandboxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
 	var err error
 	sandboxDeleted := false
 
-	expired, requeueAfter := checkSandboxExpiry(sandbox)
+	// Calculate lifecycle and timeout states
+	expired, expiryRequeue := checkSandboxExpiry(sandbox)
+
+	deadlineHit := false
+	var deadlineRequeue time.Duration
+	// We only check the deadline if the sandbox is not yet in a "Ready" state and hasn't expired.
+	// TODO: Only check if the Sandbox is in a "Pending" status PR#121
+	if !expired && !isSandboxReady(sandbox) {
+		deadlineHit, deadlineRequeue = checkProgressDeadline(sandbox)
+	}
 
-	// Check if sandbox has expired
+	// Handle state transitions
+	// Expiry takes precedence as it triggers a cleanup/deletion event.
 	if expired {
 		log.Info("Sandbox has expired, deleting child resources and checking shutdown policy")
 		sandboxDeleted, err = r.handleSandboxExpiry(ctx, sandbox)
+	} else if deadlineHit {
+		log.Info("Sandbox progress deadline exceeded. Marking as stalled.")
+		return ctrl.Result{}, r.handleProgressDeadline(ctx, oldStatus, sandbox) // stop trying to reconcile the resource
 	} else {
+		// Standard reconciliation for active, non-stalled resources.
 		err = r.reconcileChildResources(ctx, sandbox)
 	}
 
+	// Final Status update for non-deleted resources
 	if !sandboxDeleted {
-		// Update status
 		if statusUpdateErr := r.updateStatus(ctx, oldStatus, sandbox); statusUpdateErr != nil {
+			// Surface update error
 			err = errors.Join(err, statusUpdateErr)
 		}
 	}
-	// return errors seen
 
+	// Select the earliest requeue time to ensure the controller wakes up for the next event.
+	requeueAfter := deadlineRequeue
+	if expiryRequeue > 0 && (requeueAfter == 0 || expiryRequeue < requeueAfter) {
+		requeueAfter = expiryRequeue
+	}
+
+	// Return with the calculated minimum requeue duration
	return ctrl.Result{RequeueAfter: requeueAfter}, err
 }
```

**Contributor** (on `if !expired && !isSandboxReady(sandbox)`): Nice addition overall. Small concern: this deadline check runs whenever Ready is false, even after a sandbox was previously Ready. Since elapsed time is measured from `CreationTimestamp`, a transient later NotReady could be marked `ProgressDeadlineExceeded` immediately. Would it make sense to gate this to initial provisioning only (or skip once Ready has ever been true)?
```diff
@@ -550,7 +568,8 @@ func (r *SandboxReconciler) handleSandboxExpiry(ctx context.Context, sandbox *sa
 		allErrors = errors.Join(allErrors, fmt.Errorf("failed to delete service: %w", err))
 	}
 
-	if sandbox.Spec.ShutdownPolicy != nil && *sandbox.Spec.ShutdownPolicy == sandboxv1alpha1.ShutdownPolicyDelete {
+	if sandbox.Spec.ShutdownPolicy != nil &&
+		*sandbox.Spec.ShutdownPolicy == sandboxv1alpha1.ShutdownPolicyDelete {
 		if err := r.Delete(ctx, sandbox); err != nil && !k8serrors.IsNotFound(err) {
 			allErrors = errors.Join(allErrors, fmt.Errorf("failed to delete sandbox: %w", err))
 		} else {
```
@@ -576,10 +595,37 @@ func (r *SandboxReconciler) handleSandboxExpiry(ctx context.Context, sandbox *sa | |
| return false, allErrors | ||
| } | ||
|
|
||
| // handleProgressDeadline updates the Sandbox status on ProgressDeadlineExceeded | ||
| func (r *SandboxReconciler) handleProgressDeadline(ctx context.Context, | ||
| oldStatus *sandboxv1alpha1.SandboxStatus, sandbox *sandboxv1alpha1.Sandbox) error { | ||
|
|
||
| if r.Tracer.IsRecording(ctx) { | ||
| r.Tracer.AddEvent(ctx, "SandboxProgressDeadlineExceeded", map[string]string{ | ||
| "sandbox.Name": sandbox.Name, | ||
| }) | ||
| } | ||
|
|
||
| // TODO: update Sandbox phase to "Failed" PR#121 | ||
|
|
||
| meta.SetStatusCondition(&sandbox.Status.Conditions, metav1.Condition{ | ||
| Type: string(sandboxv1alpha1.SandboxConditionReady), | ||
| Status: metav1.ConditionFalse, | ||
| ObservedGeneration: sandbox.Generation, | ||
| Reason: sandboxv1alpha1.SandboxReasonDeadlineExceeded, | ||
| Message: "Sandbox failed to reach Ready state within the allocated deadline.", | ||
| }) | ||
|
|
||
| if updateErr := r.updateStatus(ctx, oldStatus, sandbox); updateErr != nil { | ||
| return updateErr | ||
| } | ||
| return nil | ||
| } | ||
|
|
||
| // checks if the sandbox has expired | ||
| // returns true if expired, false otherwise | ||
| // if not expired, also returns the duration to requeue after | ||
| func checkSandboxExpiry(sandbox *sandboxv1alpha1.Sandbox) (bool, time.Duration) { | ||
| // If ShutdownTime is not set, the sandbox never expires. | ||
| if sandbox.Spec.ShutdownTime == nil { | ||
| return false, 0 | ||
| } | ||
|
|
```diff
@@ -626,3 +672,35 @@ func (r *SandboxReconciler) SetupWithManager(mgr ctrl.Manager) error {
 		Owns(&corev1.Service{}, builder.WithPredicates(labelSelectorPredicate)).
 		Complete(r)
 }
+
+// checkProgressDeadline calculates if the sandbox has timed out.
+func checkProgressDeadline(sandbox *sandboxv1alpha1.Sandbox) (bool, time.Duration) {
+	deadline := sandboxv1alpha1.DefaultProgressDeadlineSeconds
+	if sandbox.Spec.ProgressDeadlineSeconds != nil {
+		deadline = *sandbox.Spec.ProgressDeadlineSeconds
+	}
+
+	// TODO: This logic will need to be updated when Sandbox pause / resume is implemented. Issue #36.
+	elapsed := time.Since(sandbox.CreationTimestamp.Time)
+
+	deadlineDuration := time.Duration(deadline) * time.Second
+
+	if elapsed >= deadlineDuration {
+		return true, 0
+	}
+
+	// Schedule the next reconciliation to trigger precisely when the deadline expires.
+	return false, deadlineDuration - elapsed
+}
+
+// sandboxStalled checks if the sandbox is already marked with a deadline failure.
+func sandboxStalled(sandbox *sandboxv1alpha1.Sandbox) bool {
+	cond := meta.FindStatusCondition(sandbox.Status.Conditions, string(sandboxv1alpha1.SandboxConditionReady))
+	return cond != nil && cond.Reason == sandboxv1alpha1.SandboxReasonDeadlineExceeded
+}
+
+// isSandboxReady returns true if the Ready condition is currently True.
+func isSandboxReady(sandbox *sandboxv1alpha1.Sandbox) bool {
+	cond := meta.FindStatusCondition(sandbox.Status.Conditions, string(sandboxv1alpha1.SandboxConditionReady))
+	return cond != nil && cond.Status == metav1.ConditionTrue
+}
```

**Member** (on `elapsed := time.Since(sandbox.CreationTimestamp.Time)`): This is not consistent with Deployment; you are counting time from creation, but the Deployment controller checks the duration since the last event.
> Introducing this new default is a breaking change. Any sandbox that takes more than 600s to become ready will be failed, even if it would eventually have become ready.

> True; the default here is 600 to be consistent with the default in Deployments. Should I remove the default?

> I agree with @janetkuo: you should not stop reconciling even after `ProgressDeadlineSeconds` is exceeded.