Conversation

Contributor

@seanlaii seanlaii commented Sep 3, 2025

Why are these changes needed?

  • Introduce the new DeletionRule in RayJob API for multi-stage deletion.
  • Implement RayJob DeletionStrategy Logic and Tests.
  • Implement Validation Logic in RayJob Controller for DeletionStrategy.

Related issue number

Closes #4019 #4020 #4021

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Design

API Changes

Old

type DeletionStrategy struct {
	OnSuccess DeletionPolicy `json:"onSuccess"`
	OnFailure DeletionPolicy `json:"onFailure"`
}
type DeletionPolicy struct {
	// Valid values are 'DeleteCluster', 'DeleteWorkers', 'DeleteSelf' or 'DeleteNone'.
	// +kubebuilder:validation:XValidation:rule="self in ['DeleteCluster', 'DeleteWorkers', 'DeleteSelf', 'DeleteNone']",message="the policy field value must be either 'DeleteCluster', 'DeleteWorkers', 'DeleteSelf', or 'DeleteNone'"
	Policy *DeletionPolicyType `json:"policy"`
}

New

// +kubebuilder:validation:XValidation:rule="!((has(self.onSuccess) || has(self.onFailure)) && has(self.deletionRules))",message="legacy policies (onSuccess/onFailure) and deletionRules cannot be used together within the same deletionStrategy"
// +kubebuilder:validation:XValidation:rule="((has(self.onSuccess) && has(self.onFailure)) || has(self.deletionRules))",message="deletionStrategy requires either BOTH onSuccess and onFailure, OR the deletionRules field (should be non-empty)"
type DeletionStrategy struct {
	// OnSuccess is the deletion policy for a successful RayJob.
	// Deprecated: Use `deletionRules` instead for more flexible, multi-stage deletion strategies.
	// This field will be removed in release 1.6.0.
	// +optional
	OnSuccess *DeletionPolicy `json:"onSuccess,omitempty"`

	// OnFailure is the deletion policy for a failed RayJob.
	// Deprecated: Use `deletionRules` instead for more flexible, multi-stage deletion strategies.
	// This field will be removed in release 1.6.0.
	// +optional
	OnFailure *DeletionPolicy `json:"onFailure,omitempty"`

	// DeletionRules is a list of deletion rules, processed based on their trigger conditions.
	// While the rules can be used to define a sequence, if multiple rules are overdue (e.g., due to controller downtime),
	// the most impactful rule (e.g., DeleteSelf) will be executed first to prioritize resource cleanup.
	// +optional
	// +listType=atomic
	// +kubebuilder:validation:MinItems=1
	DeletionRules []DeletionRule `json:"deletionRules,omitempty"`
}


type DeletionRule struct {
	Policy    DeletionPolicyType `json:"policy"`
	Condition DeletionCondition  `json:"condition"`
}

type DeletionCondition struct {
	JobStatus  JobStatus `json:"jobStatus"`
	TTLSeconds int32     `json:"ttlSeconds,omitempty"`
}

Validation

  • Preventing Mixed APIs: The validation logic ensures that the legacy (onSuccess/onFailure) and new (deletionRules) APIs cannot be used simultaneously.
  • Logical TTLs: When using deletionRules, the validation enforces a logical hierarchy for TTLs, ensuring that DeleteWorkers happens before DeleteCluster, and DeleteCluster happens before DeleteSelf.
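
The two validation rules above can be sketched roughly as follows. This is a minimal illustration, not the controller's actual validation code: the types are local stand-ins, `HasLegacy` and `validateStrategy` are hypothetical names, and the sketch assumes that equal TTLs are allowed (only a strictly shorter TTL on a more destructive policy is rejected).

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the API types; only the fields used by this sketch.
type DeletionPolicyType string

const (
	DeleteWorkers DeletionPolicyType = "DeleteWorkers"
	DeleteCluster DeletionPolicyType = "DeleteCluster"
	DeleteSelf    DeletionPolicyType = "DeleteSelf"
)

type DeletionRule struct {
	Policy     DeletionPolicyType
	JobStatus  string
	TTLSeconds int32
}

type DeletionStrategy struct {
	HasLegacy bool // hypothetical flag: true when onSuccess/onFailure is set
	Rules     []DeletionRule
}

// validateStrategy is a hypothetical helper that mirrors the two checks
// described in the PR: no mixing of APIs, and a logical TTL ordering.
func validateStrategy(s DeletionStrategy) error {
	if s.HasLegacy && len(s.Rules) > 0 {
		return errors.New("legacy policies and deletionRules cannot be used together")
	}
	// Assumed ordering, from least to most destructive.
	order := []DeletionPolicyType{DeleteWorkers, DeleteCluster, DeleteSelf}
	ttls := map[string]map[DeletionPolicyType]int32{}
	for _, r := range s.Rules {
		if ttls[r.JobStatus] == nil {
			ttls[r.JobStatus] = map[DeletionPolicyType]int32{}
		}
		ttls[r.JobStatus][r.Policy] = r.TTLSeconds
	}
	for status, byPolicy := range ttls {
		prev, havePrev := int32(0), false
		for _, p := range order {
			ttl, ok := byPolicy[p]
			if !ok {
				continue
			}
			if havePrev && ttl < prev {
				return fmt.Errorf("status %s: %s TTL (%d) is shorter than a less destructive policy's TTL (%d)", status, p, ttl, prev)
			}
			prev, havePrev = ttl, true
		}
	}
	return nil
}

func main() {
	bad := DeletionStrategy{Rules: []DeletionRule{
		{DeleteWorkers, "FAILED", 60},
		{DeleteCluster, "FAILED", 30},
	}}
	fmt.Println(validateStrategy(bad))
}
```

Grouping by job status keeps the SUCCEEDED and FAILED sequences independent, so a short TTL on one status does not constrain the other.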

Controller

The RayJob controller has been updated to support the new deletionRules.

  • Impact-Aware Deletion: If multiple deletion rules are overdue (e.g., due to controller downtime), the controller executes only the most impactful rule (e.g., DeleteCluster over DeleteWorkers) instead of executing every overdue operation in turn.
  • Idempotent Operations: The controller now checks the actual state of the cluster before performing a deletion action. This makes the reconciliation loop idempotent and prevents errors from repeated deletion attempts.
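
As a rough illustration of impact-aware selection, the sketch below picks one rule from a list of overdue rules. The priority values follow the flow diagram posted later in this thread (DeleteSelf 4, DeleteCluster 3, DeleteWorkers 2, DeleteNone 1); the types are local stand-ins and `mostImpactful` is a hypothetical name, not necessarily how the controller implements it.

```go
package main

import "fmt"

type DeletionPolicyType string

const (
	DeleteNone    DeletionPolicyType = "DeleteNone"
	DeleteWorkers DeletionPolicyType = "DeleteWorkers"
	DeleteCluster DeletionPolicyType = "DeleteCluster"
	DeleteSelf    DeletionPolicyType = "DeleteSelf"
)

// Stand-in for the API type; only the fields this sketch needs.
type DeletionRule struct {
	Policy     DeletionPolicyType
	TTLSeconds int32
}

// policyImpact ranks policies; a higher number is more destructive.
var policyImpact = map[DeletionPolicyType]int{
	DeleteNone:    1,
	DeleteWorkers: 2,
	DeleteCluster: 3,
	DeleteSelf:    4,
}

// mostImpactful returns the overdue rule with the highest impact.
// The second value is false when there are no overdue rules.
func mostImpactful(overdue []DeletionRule) (DeletionRule, bool) {
	if len(overdue) == 0 {
		return DeletionRule{}, false
	}
	best := overdue[0]
	for _, r := range overdue[1:] {
		if policyImpact[r.Policy] > policyImpact[best.Policy] {
			best = r
		}
	}
	return best, true
}

func main() {
	overdue := []DeletionRule{
		{Policy: DeleteWorkers, TTLSeconds: 30},
		{Policy: DeleteCluster, TTLSeconds: 60},
	}
	rule, ok := mostImpactful(overdue)
	fmt.Println(rule.Policy, ok) // prints "DeleteCluster true"
}
```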

User Impact

CRD & Runtime Impact (Fully Backward Compatible)

  • Non-Breaking Change: The update to the RayJob CRD is non-breaking. Existing RayJob custom resources will continue to be valid and function as expected after the operator is upgraded.
  • API Server Validation: The Kubernetes API server will correctly handle both old and new manifests. The controller's logic can process RayJobs created with the old onSuccess/onFailure structure.
  • kubectl / End-User Impact: Users interacting with RayJob resources via kubectl and YAML files will experience a seamless transition. All existing YAML files will continue to work without any changes.

Go Client / Controller Impact (Minor Breaking Change)

  • Reasoning: To allow deletionRules to be the sole policy-defining field, the legacy onSuccess and onFailure fields were changed from DeletionPolicy to *DeletionPolicy (a pointer). This makes them truly optional in the API.
  • Required Code Changes: Any Go code that directly accesses these fields must be updated to check for nil before dereferencing the pointer.
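
For Go callers, the nil check might look like the sketch below. The struct shapes mirror the new API definition above, but they are redeclared locally so the example is self-contained, and `legacyPolicy` is a hypothetical helper name.

```go
package main

import "fmt"

type DeletionPolicyType string

// Local copies of the API shapes; JSON tags omitted for brevity.
type DeletionPolicy struct {
	Policy *DeletionPolicyType
}

type DeletionStrategy struct {
	OnSuccess *DeletionPolicy
	OnFailure *DeletionPolicy
}

// legacyPolicy returns the legacy policy for the given outcome.
// The second value is false when the strategy, the field, or the
// nested policy pointer is unset, so callers never dereference nil.
func legacyPolicy(s *DeletionStrategy, succeeded bool) (DeletionPolicyType, bool) {
	if s == nil {
		return "", false
	}
	p := s.OnFailure
	if succeeded {
		p = s.OnSuccess
	}
	if p == nil || p.Policy == nil {
		return "", false
	}
	return *p.Policy, true
}

func main() {
	policy := DeletionPolicyType("DeleteCluster")
	s := &DeletionStrategy{OnSuccess: &DeletionPolicy{Policy: &policy}}

	onSuccess, ok := legacyPolicy(s, true)
	fmt.Println(onSuccess, ok) // prints "DeleteCluster true"

	_, ok = legacyPolicy(s, false)
	fmt.Println(ok) // prints "false"
}
```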

Migration

Legacy Configuration

apiVersion: ray.io/v1
kind: RayJob
spec:
  # ... other RayJob specs
  ttlSeconds: 60 # This TTL applies to both success and failure
  deletionStrategy:
    onSuccess:
      policy: DeleteCluster
    onFailure:
      policy: DeleteWorkers

New Configuration

apiVersion: ray.io/v1
kind: RayJob
spec:
  deletionStrategy:
    deletionRules:
      - policy: DeleteCluster
        condition:
          jobStatus: SUCCEEDED
          ttlSeconds: 60
      - policy: DeleteWorkers
        condition:
          jobStatus: FAILED
          ttlSeconds: 60

Multi-Stage Configuration

Delete workers after 30 seconds, and delete the cluster after 60 seconds.

deletionRules:
    - condition: 
        jobStatus: FAILED
        ttlSeconds: 30
      policy: DeleteWorkers
    - condition:
        jobStatus: FAILED
        ttlSeconds: 60
      policy: DeleteCluster
    - condition: 
        jobStatus: SUCCEEDED
        ttlSeconds: 30
      policy: DeleteWorkers
    - condition:
        jobStatus: SUCCEEDED
        ttlSeconds: 60
      policy: DeleteCluster

Member

@andrewsykim andrewsykim left a comment

I suggest combining the controller logic in the same PR as the CRD, as the controller implementation may shed some light on how the API should change

@seanlaii seanlaii changed the title [CRD][RayJob] Define new DeletionStrategy in RayJob CRD [CRD][RayJob] Enhance RayJob DeletionStrategy to Support Multi-Stage Deletion Sep 7, 2025
@seanlaii seanlaii changed the title [CRD][RayJob] Enhance RayJob DeletionStrategy to Support Multi-Stage Deletion [RayJob] Enhance RayJob DeletionStrategy to Support Multi-Stage Deletion Sep 7, 2025
@seanlaii seanlaii force-pushed the deletionrule-api branch 2 times, most recently from d853ad6 to ac38a47 Compare September 9, 2025 00:09
@seanlaii seanlaii marked this pull request as ready for review September 9, 2025 00:59
Contributor Author

seanlaii commented Sep 9, 2025

Hi @rueian @Future-Outlier @kevin85421 @andrewsykim , please help take a look when you are available. The PR is ready for review. Thanks!

Contributor Author

/retest

Member

@Future-Outlier Future-Outlier left a comment

Can you provide an example called ray-job.deletion-strategy.yaml in the folder ray-operator/config/samples/, so users can easily try out the deletion policy feature?
For example:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  deletionStrategy:
    deletionRules:
      - policy: DeleteCluster
        condition:
          jobStatus: SUCCEEDED
          ttlSecondsAfterFinished: 60
      - policy: DeleteWorkers
        condition:
          jobStatus: FAILED
          ttlSecondsAfterFinished: 60
  entrypoint: python /home/ray/samples/sample_code.py

... the rest can be the same as `ray-job.sample.yaml`

Member

Let’s wait for the other maintainers to take a look. It looks good to me, but I wasn’t involved in the design-doc discussions, so I’m worried that I might have missed something.

Member

@Future-Outlier Future-Outlier left a comment

TODO:
Check the retry scenario so that the deletion policy does not accidentally delete it.

Contributor Author

Thanks for the review! I have added a sample for DeletionRules, and addressed the comments. Please help take a look when you are available.

# 3. After 90 seconds, the RayJob custom resource itself is deleted, removing it from the Kubernetes API server.
deletionRules:
- condition:
    jobStatus: FAILED
Member

For my own education, is there a way to indicate a policy (e.g. DeleteWorkers) regardless of the job status, or do you need to specify two rules with both conditions for each?

Contributor Author

No, for now we need to specify two rules, one for each job status.

Member

I see, that's unfortunately really verbose. I wonder if we can make a case where if jobStatus is null, we apply the deletion policy after TTL, regardless of the status.

deletionRules:
- condition:
    ttlSeconds: 30
  policy: DeleteWorkers

What do you think @seanlaii @rueian @Future-Outlier ?

Contributor Author

How about using a list for status such as:

// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:Items:Enum=SUCCEEDED;FAILED
// +listType=set
JobStatuses []JobStatus `json:"jobStatuses"`

I prefer avoiding null == all, both for clarity and to keep rules from expanding unexpectedly if we add other supported statuses.
What do you think?

Member

I would think the null case actually helps to keep the policy working when new statuses are introduced. Actually I think a list of JobStatuses is good too, but if jobStatuses is null I still think we should treat it as "any statuses after TTL".

Collaborator

@rueian rueian Sep 30, 2025

I think a []JobStatus is verbose and complicates the implementation, but it might be a safer option.

If we go with null == all, then we need to have tests to make sure what all means explicitly. Specifically, the tests should ensure that if someday we expand the deletion policy to non-terminal statuses, the null case should not apply to RUNNING or other statuses.
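
To make the concern concrete, here is a minimal sketch of how a null == all condition could be pinned to terminal statuses only. It is purely hypothetical: the PR keeps the single jobStatus field, and `conditionMatches`, `terminalStatuses`, and the status constants below are illustrative names rather than anything in the codebase.

```go
package main

import "fmt"

type JobStatus string

const (
	Running   JobStatus = "RUNNING"
	Succeeded JobStatus = "SUCCEEDED"
	Failed    JobStatus = "FAILED"
)

// terminalStatuses spells out what "all" means for an empty list.
// Adding a new non-terminal status elsewhere does not change this set.
var terminalStatuses = map[JobStatus]bool{
	Succeeded: true,
	Failed:    true,
}

// conditionMatches reports whether a rule applies to the current status.
// An empty list means "any terminal status"; a non-empty list is matched exactly.
func conditionMatches(jobStatuses []JobStatus, current JobStatus) bool {
	if len(jobStatuses) == 0 {
		return terminalStatuses[current]
	}
	for _, s := range jobStatuses {
		if s == current {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(conditionMatches(nil, Failed))                  // prints "true"
	fmt.Println(conditionMatches(nil, Running))                 // prints "false"
	fmt.Println(conditionMatches([]JobStatus{Failed}, Running)) // prints "false"
}
```

A test asserting that `conditionMatches(nil, Running)` stays false is the kind of guard that would catch an accidental expansion of the empty case.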

Member

I agree with the safer option, since it will be easier to maintain in the future.
And since @seanlaii is the user of this PR, I think supporting the user’s use case is more important for now.
If many users want it, we can consider it in the future — but we must always maintain backward compatibility.

Member

cc @andrewsykim do you agree?
Thank you!

Member

I am ok with current implementation or []JobStatus, no strong preference

Collaborator

@rueian rueian Oct 3, 2025

I am also okay with the current implementation. We can have a new PR for changing to either []JobStatus or null == all with tests preventing unexpected future changes.

Contributor Author

@seanlaii seanlaii Sep 26, 2025

switch policy {
case rayv1.DeleteWorkers:
	if err := r.Get(ctx, clusterIdentifier, cluster); err != nil {
		if errors.IsNotFound(err) {
Collaborator

Or when a RayCluster has a deletionTimestamp set, it should be treated as deleted as well.
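
A small sketch of that check, assuming a stand-in type in place of the real RayCluster object (whose deletion timestamp lives in its object metadata). `rayCluster`, `markTerminating`, and `clusterGone` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// rayCluster is a stand-in for the real object; only the deletion
// timestamp matters for this sketch.
type rayCluster struct {
	DeletionTimestamp *time.Time
}

// markTerminating simulates the API server accepting a delete request,
// which sets the deletion timestamp while finalizers are still running.
func (c *rayCluster) markTerminating() {
	now := time.Now()
	c.DeletionTimestamp = &now
}

// clusterGone reports whether the cluster should be treated as deleted:
// either the lookup found nothing (nil) or deletion is already in progress.
func clusterGone(c *rayCluster) bool {
	return c == nil || c.DeletionTimestamp != nil
}

func main() {
	c := &rayCluster{}
	fmt.Println(clusterGone(c)) // prints "false"
	c.markTerminating()
	fmt.Println(clusterGone(c)) // prints "true"
}
```

Treating a terminating cluster as gone lets the reconcile loop skip a repeated delete call instead of acting on an object that is already on its way out.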

Contributor Author

Thanks, updated!

@rueian rueian requested a review from Copilot October 7, 2025 20:28
Contributor

Copilot AI left a comment

Pull Request Overview

This PR enhances the RayJob DeletionStrategy to support multi-stage deletion through the introduction of DeletionRules. The new system provides more granular control over resource cleanup timing compared to the legacy onSuccess/onFailure policies, while maintaining full backward compatibility.

  • Introduction of DeletionRules API for multi-stage deletion with per-rule TTL controls
  • Enhanced controller logic to handle rule-based deletion with impact-aware priority handling
  • Comprehensive validation system to prevent configuration conflicts between legacy and new APIs

Reviewed Changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.

Summary per file (file: description)

  • ray-operator/apis/ray/v1/rayjob_types.go: Adds new DeletionRule, DeletionCondition types and updates DeletionStrategy
  • ray-operator/controllers/ray/rayjob_controller.go: Implements multi-stage deletion handling with impact-aware rule execution
  • ray-operator/controllers/ray/utils/validation.go: Adds comprehensive validation for deletion rules and legacy policy conflicts
  • ray-operator/test/e2erayjob/rayjob_deletion_strategy_test.go: Comprehensive end-to-end tests covering all deletion scenarios
  • ray-operator/pkg/client/applyconfiguration/ray/v1/*.go: Generated client configuration for new DeletionRule types

namespace := test.NewTestNamespace()

// Job scripts - using existing counter.py for successful jobs and fail.py for failed jobs
// Note: This test suite requires the RayJobDeletionPolicy feature gate to be enabled
Copilot AI Oct 7, 2025

The comment should specify how to enable the feature gate for users running these tests manually, as this is important setup information that's not immediately obvious.

Suggested change
- // Note: This test suite requires the RayJobDeletionPolicy feature gate to be enabled
+ // Note: This test suite requires the RayJobDeletionPolicy feature gate to be enabled.
+ // To enable it when running tests manually, start the Ray operator with the following environment variable:
+ // FEATURE_GATES=RayJobDeletionPolicy=true
+ // or add "--feature-gates=RayJobDeletionPolicy=true" to the operator's startup arguments, depending on your deployment method.

Comment on lines +1371 to +1372
func requeueDelayFor(t time.Time) time.Duration {
	return time.Until(t) + 2*time.Second
Copilot AI Oct 7, 2025

The hardcoded 2-second buffer should be defined as a named constant to make it easier to adjust and understand its purpose.
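
One way the suggestion could look is sketched below. The constant name `requeueBuffer` and the helper `withRequeueBuffer` are hypothetical; the snippet under review does not state what the two-second margin is for, so the comment only describes what the code does.

```go
package main

import (
	"fmt"
	"time"
)

// requeueBuffer is the extra margin added on top of the time remaining
// until a rule's deletion time (hypothetical name for the literal 2s).
const requeueBuffer = 2 * time.Second

// withRequeueBuffer adds the buffer to a remaining duration. Splitting it
// out keeps the arithmetic testable without depending on the wall clock.
func withRequeueBuffer(remaining time.Duration) time.Duration {
	return remaining + requeueBuffer
}

// requeueDelayFor keeps the original signature and behavior.
func requeueDelayFor(t time.Time) time.Duration {
	return withRequeueBuffer(time.Until(t))
}

func main() {
	fmt.Println(withRequeueBuffer(30 * time.Second)) // prints "32s"
	fmt.Println(requeueDelayFor(time.Now().Add(time.Minute)) > time.Minute)
}
```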

Member

@Future-Outlier Future-Outlier left a comment

I will find time to do a final pass on this PR this week.
(I want to think about whether any concurrency edge cases can occur.)
Thank you!

graph TD
    A[RayJob Reaches Terminal State] --> B{DeletionStrategy Configured?}
    B -->|No| C[Use Default Behavior]
    B -->|Yes| D{Feature Gate Enabled?}
    D -->|No| E[Error: Feature Gate Required]
    D -->|Yes| F{DeletionRules or Legacy?}
    
    F -->|Legacy onSuccess/onFailure| G[handleLegacyDeletionPolicy]
    F -->|DeletionRules| H[handleDeletionRules]
    
    G --> I[Check TTLSecondsAfterFinished]
    I --> J{TTL Met?}
    J -->|No| K[Requeue for Later]
    J -->|Yes| L[Execute Single Policy]
    
    H --> M[Process Each Rule]
    M --> N{Rule Matches JobStatus?}
    N -->|No| O[Skip Rule]
    N -->|Yes| P{TTL Met?}
    P -->|No| Q[Add to Pending Rules]
    P -->|Yes| R{Action Completed?}
    R -->|Yes| S[Skip Completed Rule]
    R -->|No| T[Add to Overdue Rules]
    
    Q --> U[Calculate Next Requeue Time]
    T --> V{Overdue Rules Exist?}
    V -->|No| U
    V -->|Yes| W[selectMostImpactfulRule]
    
    W --> X[Execute Most Impactful Policy]
    X --> Y[DeleteSelf Priority: 4]
    Y --> Z[DeleteCluster Priority: 3]
    Z --> AA[DeleteWorkers Priority: 2]
    AA --> BB[DeleteNone Priority: 1]
    
    BB --> CC[executeDeletionPolicy]
    CC --> DD{Policy Type}
    DD -->|DeleteCluster| EE[Delete RayCluster CR]
    DD -->|DeleteWorkers| FF[Suspend Worker Groups]
    DD -->|DeleteSelf| GG[Delete RayJob CR]
    DD -->|DeleteNone| HH[No Action]
    
    EE --> II[Requeue for Next Rule]
    FF --> II
    GG --> JJ[Terminal - No More Rules]
    HH --> II
    
    II --> KK[Continue Processing Rules]
    KK --> M
    
    L --> LL[Single Policy Execution]
    LL --> CC
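
The rule-processing branch of the diagram (match on job status, split into overdue and pending, compute the next requeue) can be sketched as follows. This is a simplification under stated assumptions: time is reduced to elapsed seconds since the job's end time, the "Action Completed?" check is omitted, and `partitionRules` is a hypothetical name.

```go
package main

import "fmt"

// Stand-in for the API type, flattened for brevity.
type DeletionRule struct {
	Policy     string
	JobStatus  string
	TTLSeconds int32
}

// partitionRules splits the rules that match the job status into overdue
// and pending sets. nextWait is the shortest remaining wait, in seconds,
// among pending rules, or -1 when nothing is pending.
func partitionRules(rules []DeletionRule, status string, elapsed int32) (overdue, pending []DeletionRule, nextWait int32) {
	nextWait = -1
	for _, r := range rules {
		if r.JobStatus != status {
			continue // rule is for a different terminal status
		}
		if elapsed >= r.TTLSeconds {
			overdue = append(overdue, r)
			continue
		}
		pending = append(pending, r)
		if wait := r.TTLSeconds - elapsed; nextWait < 0 || wait < nextWait {
			nextWait = wait
		}
	}
	return overdue, pending, nextWait
}

func main() {
	rules := []DeletionRule{
		{"DeleteWorkers", "FAILED", 30},
		{"DeleteCluster", "FAILED", 60},
		{"DeleteWorkers", "SUCCEEDED", 30},
	}
	overdue, pending, wait := partitionRules(rules, "FAILED", 45)
	fmt.Println(len(overdue), len(pending), wait) // prints "1 1 15"
}
```

With the multi-stage FAILED rules from the PR description and 45 seconds elapsed, DeleteWorkers is overdue and DeleteCluster is due in 15 seconds.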


Co-authored-by: Copilot <[email protected]>
Signed-off-by: Wei-Cheng Lai <[email protected]>
	continue
}

deletionTime := rayJob.Status.EndTime.Add(time.Duration(rule.Condition.TTLSeconds) * time.Second)
Member

Hi, @seanlaii
is it guaranteed that rayJob.Status.EndTime will not be nil?
Should we check if rayJob.Status.EndTime is not nil?

Contributor Author

Yes, it will always be non-nil if the job is in a terminal state. We set it here: https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/rayjob_controller.go#L918.
Additionally, the original implementation assumes that it is non-nil as well: https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/rayjob_controller.go#L391.
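
For comparison, a defensive variant of the deletion-time calculation is sketched below. It is not what the PR does (the reply above explains why the controller can rely on EndTime being set); `deletionTimeFor` is a hypothetical helper, and a plain `*time.Time` stands in for the status field's actual type.

```go
package main

import (
	"fmt"
	"time"
)

// deletionTimeFor returns endTime + ttlSeconds. The second value is false
// when endTime is nil, so the caller can requeue or report an error
// instead of dereferencing a nil pointer.
func deletionTimeFor(endTime *time.Time, ttlSeconds int32) (time.Time, bool) {
	if endTime == nil {
		return time.Time{}, false
	}
	return endTime.Add(time.Duration(ttlSeconds) * time.Second), true
}

func main() {
	end := time.Date(2025, time.October, 10, 12, 0, 0, 0, time.UTC)
	when, ok := deletionTimeFor(&end, 60)
	fmt.Println(when.Format(time.RFC3339), ok) // prints "2025-10-10T12:01:00Z true"

	_, ok = deletionTimeFor(nil, 60)
	fmt.Println(ok) // prints "false"
}
```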

@rueian rueian merged commit 58a3ff0 into ray-project:master Oct 10, 2025
27 checks passed
@seanlaii seanlaii deleted the deletionrule-api branch October 12, 2025 04:10
Development

Successfully merging this pull request may close these issues.

[Feature] Define new DeletionStrategy in RayJob CRD

5 participants