15 changes: 15 additions & 0 deletions api/v1alpha1/sandbox_types.go
@@ -25,9 +25,16 @@ type ConditionType string
func (c ConditionType) String() string { return string(c) }

const (
// DefaultProgressDeadlineSeconds is the default maximum time (10 minutes) for a Sandbox to reach
// the Ready state.
DefaultProgressDeadlineSeconds int32 = 600

// SandboxConditionReady indicates readiness for Sandbox
SandboxConditionReady ConditionType = "Ready"

// SandboxReasonDeadlineExceeded indicates the sandbox failed to become ready within the deadline.
SandboxReasonDeadlineExceeded = "ProgressDeadlineExceeded"

// SandboxReasonExpired indicates expired state for Sandbox
SandboxReasonExpired = "SandboxExpired"
)
@@ -135,6 +142,14 @@ const (

// Lifecycle defines the lifecycle management for the Sandbox.
type Lifecycle struct {

// ProgressDeadlineSeconds is the maximum time in seconds for a Sandbox to become ready.
// Defaults to 600 seconds.
Member: Introducing this new default is a breaking change. Any sandbox that takes >600s to become ready will be failed even if it would eventually become ready.

Contributor (author): True, the default here is 600 to be consistent with the default in Deployments. Should I remove the default?

Member: I agree with @janetkuo; you should not stop reconciling even after ProgressDeadlineSeconds.

// +kubebuilder:default=600
// +kubebuilder:validation:Minimum=0
// +optional
ProgressDeadlineSeconds *int32 `json:"progressDeadlineSeconds,omitempty"`

// ShutdownTime is the absolute time when the sandbox expires.
// +kubebuilder:validation:Format="date-time"
// +optional
5 changes: 5 additions & 0 deletions api/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default.

104 changes: 91 additions & 13 deletions controllers/sandbox_controller.go
@@ -105,13 +105,11 @@ func (r *SandboxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
return ctrl.Result{}, nil
}

// Check if already marked as expired to avoid repeated operations, including cleanups
if sandboxMarkedExpired(sandbox) {
log.Info("Sandbox is already marked as expired")
// Note: The sandbox won't be deleted if shutdown policy is changed to delete after expiration.
// To delete an expired sandbox, the user should delete the sandbox instead of updating it.
// This keeps the controller code simple.
return ctrl.Result{}, nil
// This stops reconciliation for resources that have already hit a deadline or expired.
// TODO: Use sandbox phase "Failed" check instead of these helper functions PR#121
Member: Please remove/rephrase this line. As I commented in #121, using phase is a legacy approach and now an anti-pattern in Kubernetes.

if sandboxMarkedExpired(sandbox) || sandboxStalled(sandbox) {
Contributor: Question: once Reason=ProgressDeadlineExceeded we return early forever. Is that intended as a hard terminal state even after spec updates? If so, a short comment/doc note might help set expectations.

Contributor (author): Because of limited status tracking in the Sandbox resource, the progress deadline is currently calculated from the CreationTimestamp. Even without the early return above, this results in a terminal stalled state that persists even if the spec is updated or the underlying pod becomes ready. While using condition transition times could allow the resource to recover, it may lead to unintended timer resets.

Contributor (author): @janetkuo @barney-s do you have a preference on the behavior of the Sandbox reconciler after ProgressDeadlineExceeded is hit?

Member: This creates a terminal lock-in where the resource will never be reconciled again, even if the user fixes the underlying pod spec or if a transient infrastructure issue (e.g. node, network) resolves itself.

Given that this takes inspiration from the Deployment controller, let's see how it's done there. In the Deployment controller, spec.progressDeadlineSeconds is used to handle a stuck Deployment, but the controller continues reconciling the Deployment even after its progressDeadlineSeconds has passed. Ref https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/deployment-v1/#DeploymentSpec:

    The maximum time in seconds for a deployment to make progress before it is considered to be failed. The deployment controller will continue to process failed deployments and a condition with a ProgressDeadlineExceeded reason will be surfaced in the deployment status. Note that progress will not be estimated during the time a deployment is paused. Defaults to 600s.

If progress resumes (e.g., pods become Ready after a transient infra issue), the controller updates the Progressing condition to Status: True with Reason: NewRSAvailableReason.

log.Info("Sandbox is in a terminal state (Expired or Stalled). Stopping reconciliation.")
return ctrl.Result{}, nil // stop trying to reconcile the resource
}

// Initialize trace ID for active resources missing an ID
@@ -139,24 +137,44 @@ func (r *SandboxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
var err error
sandboxDeleted := false

expired, requeueAfter := checkSandboxExpiry(sandbox)
// Calculate lifecycle and timeout states
expired, expiryRequeue := checkSandboxExpiry(sandbox)

deadlineHit := false
var deadlineRequeue time.Duration
// We only check the deadline if the sandbox is not yet in a "Ready" state and hasn't expired.
// TODO: Only check if the Sandbox is in a "Pending" status PR#121
if !expired && !isSandboxReady(sandbox) {
Contributor: Nice addition overall. Small concern: this deadline check runs whenever Ready is false, even after a sandbox was previously Ready. Since elapsed time is measured from CreationTimestamp, a transient later NotReady could be marked ProgressDeadlineExceeded immediately. Would it make sense to gate this to initial provisioning only (or skip it once Ready has ever been true)?

deadlineHit, deadlineRequeue = checkProgressDeadline(sandbox)
}

// Check if sandbox has expired
// Handle state transitions
// Expiry takes precedence as it triggers a cleanup/deletion event.
if expired {
log.Info("Sandbox has expired, deleting child resources and checking shutdown policy")
sandboxDeleted, err = r.handleSandboxExpiry(ctx, sandbox)
} else if deadlineHit {
log.Info("Sandbox progress deadline exceeded. Marking as stalled.")
return ctrl.Result{}, r.handleProgressDeadline(ctx, oldStatus, sandbox) // stop trying to reconcile the resource
} else {
// Standard reconciliation for active, non-stalled resources.
err = r.reconcileChildResources(ctx, sandbox)
}

// Final Status update for non-deleted resources
if !sandboxDeleted {
// Update status
if statusUpdateErr := r.updateStatus(ctx, oldStatus, sandbox); statusUpdateErr != nil {
// Surface update error
err = errors.Join(err, statusUpdateErr)
}
}
// return errors seen

// Select the earliest requeue time to ensure the controller wakes up for the next event.
requeueAfter := deadlineRequeue
if expiryRequeue > 0 && (requeueAfter == 0 || expiryRequeue < requeueAfter) {
requeueAfter = expiryRequeue
}

// Return with the calculated minimum requeue duration
return ctrl.Result{RequeueAfter: requeueAfter}, err
}

@@ -550,7 +568,8 @@ func (r *SandboxReconciler) handleSandboxExpiry(ctx context.Context, sandbox *sa
allErrors = errors.Join(allErrors, fmt.Errorf("failed to delete service: %w", err))
}

if sandbox.Spec.ShutdownPolicy != nil && *sandbox.Spec.ShutdownPolicy == sandboxv1alpha1.ShutdownPolicyDelete {
if sandbox.Spec.ShutdownPolicy != nil &&
*sandbox.Spec.ShutdownPolicy == sandboxv1alpha1.ShutdownPolicyDelete {
if err := r.Delete(ctx, sandbox); err != nil && !k8serrors.IsNotFound(err) {
allErrors = errors.Join(allErrors, fmt.Errorf("failed to delete sandbox: %w", err))
} else {
@@ -576,10 +595,37 @@ func (r *SandboxReconciler) handleSandboxExpiry(ctx context.Context, sandbox *sa
return false, allErrors
}

// handleProgressDeadline updates the Sandbox status on ProgressDeadlineExceeded
func (r *SandboxReconciler) handleProgressDeadline(ctx context.Context,
oldStatus *sandboxv1alpha1.SandboxStatus, sandbox *sandboxv1alpha1.Sandbox) error {

if r.Tracer.IsRecording(ctx) {
r.Tracer.AddEvent(ctx, "SandboxProgressDeadlineExceeded", map[string]string{
"sandbox.Name": sandbox.Name,
})
}

// TODO: update Sandbox phase to "Failed" PR#121

meta.SetStatusCondition(&sandbox.Status.Conditions, metav1.Condition{
Type: string(sandboxv1alpha1.SandboxConditionReady),
Status: metav1.ConditionFalse,
ObservedGeneration: sandbox.Generation,
Reason: sandboxv1alpha1.SandboxReasonDeadlineExceeded,
Message: "Sandbox failed to reach Ready state within the allocated deadline.",
})

return r.updateStatus(ctx, oldStatus, sandbox)
}

// checks if the sandbox has expired
// returns true if expired, false otherwise
// if not expired, also returns the duration to requeue after
func checkSandboxExpiry(sandbox *sandboxv1alpha1.Sandbox) (bool, time.Duration) {
// If ShutdownTime is not set, the sandbox never expires.
if sandbox.Spec.ShutdownTime == nil {
return false, 0
}
@@ -626,3 +672,35 @@ func (r *SandboxReconciler) SetupWithManager(mgr ctrl.Manager) error {
Owns(&corev1.Service{}, builder.WithPredicates(labelSelectorPredicate)).
Complete(r)
}

// checkProgressDeadline calculates if the sandbox has timed out.
func checkProgressDeadline(sandbox *sandboxv1alpha1.Sandbox) (bool, time.Duration) {

deadline := sandboxv1alpha1.DefaultProgressDeadlineSeconds
if sandbox.Spec.ProgressDeadlineSeconds != nil {
deadline = *sandbox.Spec.ProgressDeadlineSeconds
}

// TODO: This logic will need to be updated when Sandbox pause / resume is implemented. Issue #36.
elapsed := time.Since(sandbox.CreationTimestamp.Time)
Member: This is not consistent with Deployment: you are counting time from creation, but the Deployment controller checks the duration from the last event.

deadlineDuration := time.Duration(deadline) * time.Second

if elapsed >= deadlineDuration {
return true, 0
}

// Schedule the next reconciliation to trigger precisely when the deadline expires.
return false, deadlineDuration - elapsed
}

// sandboxStalled checks if the sandbox is already marked with a deadline failure.
func sandboxStalled(sandbox *sandboxv1alpha1.Sandbox) bool {
cond := meta.FindStatusCondition(sandbox.Status.Conditions, string(sandboxv1alpha1.SandboxConditionReady))
return cond != nil && cond.Reason == sandboxv1alpha1.SandboxReasonDeadlineExceeded
}

// isSandboxReady returns true if the Ready condition is currently True.
func isSandboxReady(sandbox *sandboxv1alpha1.Sandbox) bool {
cond := meta.FindStatusCondition(sandbox.Status.Conditions, string(sandboxv1alpha1.SandboxConditionReady))
return cond != nil && cond.Status == metav1.ConditionTrue
}