Merged
8 changes: 6 additions & 2 deletions .buildkite/build-start-operator.sh
@@ -7,10 +7,14 @@
# to kick off from the release branch so tests should match up accordingly.

if [ "$IS_FROM_RAY_RELEASE_AUTOMATION" = 1 ]; then
    helm repo update && helm install kuberay/kuberay-operator
    helm repo update
    echo "Installing helm chart with test override values (feature gates enabled as needed)"
    # NOTE: The override file is CI/test-only. It is NOT part of the released chart defaults.
    helm install kuberay-operator kuberay/kuberay-operator -f ../.buildkite/values-kuberay-operator-override.yaml
    KUBERAY_TEST_RAY_IMAGE="rayproject/ray:nightly.$(date +'%y%m%d').${RAY_NIGHTLY_COMMIT:0:6}-py39" && export KUBERAY_TEST_RAY_IMAGE
else
    IMG=kuberay/operator:nightly make docker-image &&
    kind load docker-image kuberay/operator:nightly &&
    IMG=kuberay/operator:nightly make deploy
    echo "Deploying operator with test overrides (feature gates via test-overrides overlay)"
    IMG=kuberay/operator:nightly make deploy-with-override
fi
18 changes: 18 additions & 0 deletions .buildkite/values-kuberay-operator-override.yaml
@@ -0,0 +1,18 @@
# Generic Helm values override used only in CI / e2e test environments.
# Intent:
# - Allow e2e tests to turn on alpha / experimental feature gates (e.g. RayJobDeletionPolicy)
# - Provide a single place contributors can extend with additional overrides needed for tests
# - Keep the default published Helm chart behavior unchanged for normal users
# Scope / Safety:
# - This file is never referenced by the base chart; it is opt‑in via buildkite or manual helm install
# - Do NOT rename it to values.yaml or commit changes that enable unstable features by default
# Usage examples:
# helm install kuberay-operator kuberay/kuberay-operator -f ../.buildkite/values-kuberay-operator-override.yaml
# (add or remove feature gates below as e2e scenarios expand)
#
# Current overrides: enable RayJobDeletionPolicy alpha feature gate alongside the existing status conditions gate.
featureGates:
- name: RayClusterStatusConditions
  enabled: true
- name: RayJobDeletionPolicy
  enabled: true
65 changes: 60 additions & 5 deletions docs/reference/api.md
@@ -55,11 +55,28 @@ _Appears in:_



#### DeletionPolicy
#### DeletionCondition



DeletionCondition specifies the trigger conditions for a deletion action.



_Appears in:_
- [DeletionRule](#deletionrule)

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `ttlSeconds` _integer_ | TTLSeconds is the time in seconds from when the JobStatus<br />reaches the specified terminal state to when this deletion action should be triggered.<br />The value must be a non-negative integer. | 0 | Minimum: 0 <br /> |


#### DeletionPolicy



DeletionPolicy is the legacy single-stage deletion policy.
Deprecated: This struct is part of the legacy API. Use DeletionRule for new configurations.



@@ -68,7 +85,7 @@ _Appears in:_

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `policy` _[DeletionPolicyType](#deletionpolicytype)_ | Valid values are 'DeleteCluster', 'DeleteWorkers', 'DeleteSelf' or 'DeleteNone'. | | |
| `policy` _[DeletionPolicyType](#deletionpolicytype)_ | Policy is the action to take when the condition is met.<br />This field is logically required when using the legacy OnSuccess/OnFailure policies.<br />It is marked as '+optional' at the API level to allow the 'deletionRules' field to be used instead. | | Enum: [DeleteCluster DeleteWorkers DeleteSelf DeleteNone] <br /> |


#### DeletionPolicyType
@@ -81,14 +98,51 @@ _Underlying type:_ _string_

_Appears in:_
- [DeletionPolicy](#deletionpolicy)
- [DeletionRule](#deletionrule)



#### DeletionRule



DeletionRule defines a single deletion action and its trigger condition.
This is the new, recommended way to define deletion behavior.



_Appears in:_
- [DeletionStrategy](#deletionstrategy)

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `policy` _[DeletionPolicyType](#deletionpolicytype)_ | Policy is the action to take when the condition is met. This field is required. | | Enum: [DeleteCluster DeleteWorkers DeleteSelf DeleteNone] <br /> |
| `condition` _[DeletionCondition](#deletioncondition)_ | The condition under which this deletion rule is triggered. This field is required. | | |
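
As a rough illustration of the two fields above, a single `deletionRules` entry might look like the fragment below. The values are arbitrary. `jobStatus` is an assumption: the DeletionCondition table in this diff lists only `ttlSeconds`, but its description refers to a "specified terminal state", so some state selector is presumably part of the condition.

```yaml
# Sketch of one DeletionRule entry (illustrative values, not defaults).
- policy: DeleteWorkers      # one of: DeleteCluster, DeleteWorkers, DeleteSelf, DeleteNone
  condition:
    jobStatus: FAILED        # ASSUMPTION: field name and value are not shown in this diff
    ttlSeconds: 300          # seconds to wait after the terminal state; must be >= 0 (default 0)
```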


#### DeletionStrategy



DeletionStrategy configures automated cleanup after the RayJob reaches a terminal state.
Two mutually exclusive styles are supported:


Legacy: provide both onSuccess and onFailure (deprecated; removal planned for 1.6.0). May be combined with shutdownAfterJobFinishes and (optionally) global TTLSecondsAfterFinished.
Rules: provide deletionRules (non-empty list). Rules mode is incompatible with shutdownAfterJobFinishes, legacy fields, and the global TTLSecondsAfterFinished (use per‑rule condition.ttlSeconds instead).


Semantics:
- A non-empty deletionRules selects rules mode; empty lists are treated as unset.
- Legacy requires both onSuccess and onFailure; specifying only one is invalid.
- Global TTLSecondsAfterFinished > 0 requires shutdownAfterJobFinishes=true; therefore it cannot be used with rules mode or with legacy alone (no shutdown).
- Feature gate RayJobDeletionPolicy must be enabled when this block is present.


Validation:
- CRD XValidations prevent mixing legacy fields with deletionRules and enforce legacy completeness.
- Controller logic enforces rules vs shutdown exclusivity and TTL constraints.
- onSuccess/onFailure are deprecated; migration to deletionRules is encouraged.
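
A minimal sketch of the legacy style described above, written as a partial RayJob manifest. The metadata name and entrypoint are hypothetical, and the cluster configuration is left out, so this is a fragment rather than a manifest that can be applied as is.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: legacy-cleanup-example        # hypothetical name
spec:
  entrypoint: python my_script.py     # hypothetical entrypoint
  # (cluster configuration omitted from this sketch)
  deletionStrategy:
    # Legacy mode: onSuccess and onFailure must both be set.
    # Do not add deletionRules here; the two styles are mutually exclusive.
    onSuccess:
      policy: DeleteCluster
    onFailure:
      policy: DeleteWorkers
```

According to the semantics listed above, setting only one of `onSuccess` or `onFailure`, or placing `deletionRules` next to them, should be rejected, and the RayJobDeletionPolicy feature gate has to be enabled for the block to be accepted at all.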



@@ -97,8 +151,9 @@ _Appears in:_

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `onSuccess` _[DeletionPolicy](#deletionpolicy)_ | | | |
| `onFailure` _[DeletionPolicy](#deletionpolicy)_ | | | |
| `onSuccess` _[DeletionPolicy](#deletionpolicy)_ | OnSuccess is the deletion policy for a successful RayJob.<br />Deprecated: Use `deletionRules` instead for more flexible, multi-stage deletion strategies.<br />This field will be removed in release 1.6.0. | | |
| `onFailure` _[DeletionPolicy](#deletionpolicy)_ | OnFailure is the deletion policy for a failed RayJob.<br />Deprecated: Use `deletionRules` instead for more flexible, multi-stage deletion strategies.<br />This field will be removed in release 1.6.0. | | |
| `deletionRules` _[DeletionRule](#deletionrule) array_ | DeletionRules is a list of deletion rules, processed based on their trigger conditions.<br />While the rules can be used to define a sequence, if multiple rules are overdue (e.g., due to controller downtime),<br />the most impactful rule (e.g., DeleteSelf) will be executed first to prioritize resource cleanup. | | MinItems: 1 <br /> |
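
The rules style could be used for staged cleanup along the following lines. This is a sketch under assumptions, not a tested manifest: the metadata name, entrypoint, and TTL values are hypothetical, the cluster configuration is omitted, and `jobStatus` is an assumed field name for selecting the terminal state (the DeletionCondition table in this diff lists only `ttlSeconds`).

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: staged-cleanup-example        # hypothetical name
spec:
  entrypoint: python my_script.py     # hypothetical entrypoint
  # (cluster configuration omitted from this sketch)
  # Rules mode: no shutdownAfterJobFinishes, no global TTLSecondsAfterFinished,
  # and no onSuccess/onFailure alongside deletionRules.
  deletionStrategy:
    deletionRules:
    - policy: DeleteWorkers
      condition:
        jobStatus: SUCCEEDED          # ASSUMPTION: field not shown in this diff
        ttlSeconds: 60                # stage 1: release workers after 1 minute
    - policy: DeleteCluster
      condition:
        jobStatus: SUCCEEDED          # ASSUMPTION: field not shown in this diff
        ttlSeconds: 600               # stage 2: remove the cluster after 10 minutes
    - policy: DeleteSelf
      condition:
        jobStatus: SUCCEEDED          # ASSUMPTION: field not shown in this diff
        ttlSeconds: 3600              # stage 3: remove the RayJob itself after 1 hour
```

Going by the `deletionRules` description, if the controller were unavailable until all three deadlines had passed, the most impactful overdue rule (here `DeleteSelf`) would be expected to run first rather than the stages replaying in order.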



@@ -242,7 +297,7 @@ _Appears in:_
| `clusterSelector` _object (keys:string, values:string)_ | clusterSelector is used to select running rayclusters by labels | | |
| `submitterConfig` _[SubmitterConfig](#submitterconfig)_ | Configurations of submitter k8s job. | | |
| `managedBy` _string_ | ManagedBy is an optional configuration for the controller or entity that manages a RayJob.<br />The value must be either 'ray.io/kuberay-operator' or 'kueue.x-k8s.io/multikueue'.<br />The kuberay-operator reconciles a RayJob which doesn't have this field at all or<br />the field value is the reserved string 'ray.io/kuberay-operator',<br />but delegates reconciling the RayJob with 'kueue.x-k8s.io/multikueue' to the Kueue.<br />The field is immutable. | | |
| `deletionStrategy` _[DeletionStrategy](#deletionstrategy)_ | DeletionStrategy indicates what resources of the RayJob and how they are deleted upon job completion.<br />If unset, deletion policy is based on 'spec.shutdownAfterJobFinishes'.<br />This field requires the RayJobDeletionPolicy feature gate to be enabled. | | |
| `deletionStrategy` _[DeletionStrategy](#deletionstrategy)_ | DeletionStrategy automates post-completion cleanup.<br />Choose one style or omit:<br /> - Legacy: both onSuccess & onFailure (deprecated; may combine with shutdownAfterJobFinishes and TTLSecondsAfterFinished).<br /> - Rules: deletionRules (non-empty) — incompatible with shutdownAfterJobFinishes, legacy fields, and global TTLSecondsAfterFinished (use per-rule condition.ttlSeconds).<br />Global TTLSecondsAfterFinished > 0 requires shutdownAfterJobFinishes=true.<br />Feature gate RayJobDeletionPolicy must be enabled when this field is set. | | |
| `entrypoint` _string_ | Entrypoint represents the command to start execution. | | |
| `runtimeEnvYAML` _string_ | RuntimeEnvYAML represents the runtime environment configuration<br />provided as a multi-line YAML string. | | |
| `jobId` _string_ | If jobId is not set, a new jobId will be auto-generated. | | |
66 changes: 49 additions & 17 deletions helm-chart/kuberay-operator/crds/ray.io_rayjobs.yaml

Some generated files are not rendered by default.

18 changes: 17 additions & 1 deletion ray-operator/Makefile
@@ -67,7 +67,6 @@ test: ENVTEST_K8S_VERSION ?= 1.24.2
test: manifests fmt vet envtest ## Run tests.
	KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) --bin-dir $(LOCALBIN) -p path)" go test $(WHAT) -coverprofile cover.out

# You can use `go test -timeout 30m -v ./test/e2e/rayjob_test.go ./test/e2e/support.go` if you only want to run tests in `rayjob_test.go`.
test-e2e: WHAT ?= ./test/e2e
test-e2e: manifests fmt vet ## Run e2e tests.
	go test -timeout 30m -v $(WHAT)
@@ -88,6 +87,14 @@ test-sampleyaml: WHAT ?= ./test/sampleyaml
test-sampleyaml: manifests fmt vet
	go test -timeout 30m -v $(WHAT)

test-e2e-rayjob: WHAT ?= ./test/e2erayjob
test-e2e-rayjob: manifests fmt vet ## Run RayJob e2e tests.
	go test -timeout 30m -v $(WHAT)

test-e2e-rayservice: WHAT ?= ./test/e2erayservice
test-e2e-rayservice: manifests fmt vet ## Run RayService e2e tests.
	go test -timeout 30m -v $(WHAT)

sync: helm api-docs
	./hack/update-codegen.sh

@@ -136,6 +143,15 @@ deploy: manifests kustomize ## Deploy controller to the K8s cluster specified in
	cd config/default && $(KUSTOMIZE) edit set image kuberay/operator=${IMG}
	$(KUSTOMIZE) build config/default | kubectl apply --server-side=true -f -

# NOTE FOR CONTRIBUTORS:
# deploy-with-override is an e2e/CI-only deployment path. It applies a Kustomize overlay that
# enables test-only feature gates (e.g. RayJobDeletionPolicy) without changing the default
# behavior of the base Helm chart or the standard 'make deploy'. Add additional test overrides
# to the overlay (config/overlays/test-overrides) rather than modifying the base.
deploy-with-override: manifests kustomize ## Deploy controller with test-only feature gate overrides (does NOT affect default chart).
	cd config/default && $(KUSTOMIZE) edit set image kuberay/operator=${IMG}
	$(KUSTOMIZE) build config/overlays/test-overrides | kubectl apply --server-side=true -f -

undeploy: ## Undeploy controller from the K8s cluster specified in ~/.kube/config.
	$(KUSTOMIZE) build config/default | kubectl delete -f -
