Skip to content

Conversation

@AndySung320
Copy link
Contributor

Why are these changes needed?

This PR adds an e2e test covering RayCronJob suspend/resume behavior:

  • When spec.suspend=true, the RayCronJob controller should not create any new RayJobs even after the scheduled time passes.
  • After setting spec.suspend=false, the controller should resume scheduling and start creating RayJobs again.

To run this e2e test in CI, we also enable the RayCronJob feature gate in the test/CI-only operator overrides (Helm values override and kustomize test override).

Note: RayCronJob previously did not have // +genclient, so the typed client/applyconfiguration for RayCronJob was not generated. The PR adds +genclient and commits the resulting regenerated clientset/applyconfiguration/informer/lister code so the test can compile and run consistently.

Related issue number

Closes #4323

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Copy link
Contributor

@seanlaii seanlaii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

//+kubebuilder:printcolumn:name="age",type="date",JSONPath=".metadata.creationTimestamp",priority=0
//+kubebuilder:printcolumn:name="suspend",type=boolean,JSONPath=".spec.suspend",priority=0

// +genclient
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!
Maybe we can also add the following annotations as well:

// +kubebuilder:storageversion
// +kubebuilder:resource:categories=all

Copy link
Member

@win5923 win5923 Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • // +kubebuilder:storageversion: marker to indicate the GVK that should be used to store data by the API server.
  • // +kubebuilder:resource:categories=all: Puts the CRD into the all category, so it shows up in kubectl get all.

Ref:
https://book.kubebuilder.io/reference/markers/crd
https://book.kubebuilder.io/reference/generating-crd#multiple-versions

Cool!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion~

g.Expect(err).NotTo(HaveOccurred())
ownerUID := rcj.UID

// No RayJob should be created
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
// No RayJob should be created
// No RayJob should be created
LogWithTimestamp(test.T(), "Waiting to ensure no RayJobs are created while suspended")

return n
}, 130*time.Second, 5*time.Second).Should(Equal(0))

// Resume
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
// Resume
LogWithTimestamp(test.T(), "Resuming RayCronJob %s/%s", rayCronJob.Namespace, rayCronJob.Name)

Comment on lines 89 to 92
if err != nil {
return -1
}
return n
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To preserve error:

Suggested change
if err != nil {
return -1
}
return n
g.Expect(err).NotTo(HaveOccurred())
return n


// Spec.suspend should be false
g.Eventually(func() bool {
rcj, err := test.Client().Ray().RayV1().RayCronJobs(namespace.Name).Get(test.Ctx(), rayCronJob.Name, metav1.GetOptions{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can replace this with a util function similar to GetRayJob.

@Future-Outlier
Copy link
Member

cursor review

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no bugs!

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @machichima @CheyuWu @justinyeh1995 @fscnick to take a look, thank you!

@Future-Outlier Future-Outlier moved this from can be merged to Todo in @Future-Outlier's kuberay project Jan 7, 2026
@CheyuWu CheyuWu self-requested a review January 7, 2026 16:33
Comment on lines 20 to 29
func rayCronJobACTemplate(name, namespace, schedule string) *rayv1ac.RayCronJobApplyConfiguration {
return rayv1ac.RayCronJob(name, namespace).
WithSpec(
rayv1ac.RayCronJobSpec().
WithSchedule(schedule).
WithJobTemplate(
rayv1ac.RayJobSpec().
WithEntrypoint("sleep 1").
WithRayClusterSpec(
rayv1ac.RayClusterSpec().
WithHeadGroupSpec(
rayv1ac.HeadGroupSpec().
WithTemplate(
corev1ac.PodTemplateSpec().
WithSpec(
corev1ac.PodSpec().
WithContainers(
corev1ac.Container().
WithName("ray-head").
WithImage(GetRayImage()).
WithResources(
corev1ac.ResourceRequirements().
WithRequests(corev1.ResourceList{
corev1.ResourceCPU: resource.MustParse("500m"),
corev1.ResourceMemory: resource.MustParse("500Mi"),
}),
),
),
),
),
),
),
),
)
}
Copy link
Contributor

@justinyeh1995 justinyeh1995 Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for readability, would it make sense to construct it bottom-up using intermediate variables?

For example

func rayCronJobACTemplate(name, namespace, schedule string) *rayv1ac.RayCronJobApplyConfiguration {
	headContainer := corev1ac.Container().
		WithName("ray-head").
		WithImage(GetRayImage()).
		WithResources(
			corev1ac.ResourceRequirements().
				WithRequests(corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("500m"),
					corev1.ResourceMemory: resource.MustParse("500Mi"),
				}),
		)

	podTmpl := corev1ac.PodTemplateSpec().
		WithSpec(corev1ac.PodSpec().WithContainers(headContainer))

	cluster := ...
	job := ...
	spec := ...

	return rayv1ac.RayCronJob(name, namespace).WithSpec(spec)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simply use:

func rayCronJobACTemplate(name, namespace, schedule string) *rayv1ac.RayCronJobApplyConfiguration {
	return rayv1ac.RayCronJob(name, namespace).
		WithSpec(
			rayv1ac.RayCronJobSpec().
				WithSchedule(schedule).
				WithJobTemplate(
					rayv1ac.RayJobSpec().
						WithEntrypoint("sleep 1").
						WithRayClusterSpec(NewRayClusterSpec()),
				),
		)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great suggestion, thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion~

return -1
}
return n
}, 130*time.Second, 5*time.Second).Should(Equal(0))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
}, 130*time.Second, 5*time.Second).Should(Equal(0))
}, 130*time.Second, 5*time.Second).Should(BeZero())

Comment on lines +116 to +79
err = test.Client().Ray().RayV1().RayCronJobs(namespace.Name).Delete(test.Ctx(), rayCronJob.Name, metav1.DeleteOptions{})
g.Expect(err).NotTo(HaveOccurred())
LogWithTimestamp(test.T(), "Deleted RayCronJob %s/%s successfully", rayCronJob.Namespace, rayCronJob.Name)
})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we assert that the related resource has been deleted correctly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this test Delete() is only cleanup; the assertions focus on suspend/resume affecting RayJob creation. The RayJob suspend e2e asserts cascade deletion because that’s part of its feature contract. Happy to add an Eventually(Get).Should(BeNotFound()) after Delete for explicitness if you prefer.
what do you think ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moreover, it could make sure this test exits safe and sound without leaving something to interfere others.

However, some tests do it and some don't. I guess it might not be mandatory. Feel free to do it or not.

@machichima
Copy link
Collaborator

We also need to add it to https://github.com/ray-project/kuberay/blob/master/.buildkite/test-e2e.yml so that the CI can run it. Could you please add Test RayCronJob E2E (nightly operator) to test-e2e.yml?

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@fscnick fscnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Future-Outlier Future-Outlier moved this from to review to In Progress in @Future-Outlier's kuberay project Jan 15, 2026
Copy link
Collaborator

@machichima machichima left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a small nit

AndySung320 and others added 7 commits January 18, 2026 01:37
…guration

Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com>
Signed-off-by: AndySung320 <71032763+AndySung320@users.noreply.github.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
@AndySung320 AndySung320 force-pushed the raycronjob/add-suspend-e2e-test branch from 91df5ac to 9cc2e90 Compare January 18, 2026 10:14
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Signed-off-by: AndySung320 <andysung0320@gmail.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: AndySung320 <andysung0320@gmail.com>
@Future-Outlier Future-Outlier moved this from In Progress to can be merged in @Future-Outlier's kuberay project Jan 19, 2026
Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with current's implementation, but will it be better if we use label selector to filter RayJob?
https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#list-and-watch-filtering

@AndySung320
Copy link
Contributor Author

label selector could work, but I chose ownerRef UID filtering for correctness.
Using OwnerReferences.UID makes sure that we only count RayJobs created by this specific RayCronJob instance, while labels don’t provide uniqueness and could accidentally match other RayJobs if labels are reused/changed in the future. So UID-based filtering would be less error-prone for this e2e assertion.

@Future-Outlier
Copy link
Member

label selector could work, but I chose ownerRef UID filtering for correctness.

Using OwnerReferences.UID makes sure that we only count RayJobs created by this specific RayCronJob instance, while labels don’t provide uniqueness and could accidentally match other RayJobs if labels are reused/changed in the future. So UID-based filtering would be less error-prone for this e2e assertion.

But we will have unique namespace, right? So this will still work

@AndySung320
Copy link
Contributor Author

I'm wondering that should we optimize for the current test scenario (where label selector is simpler), or consider potential future extensions?
For example, if we in the future add tests that create multiple RayCronJobs in the same namespace, OwnerReference.UID would be more robust. But if we don't foresee that scenario, I'm happy to switch to label selector for simplicity.
what do you think?

@Future-Outlier
Copy link
Member

I'm wondering that should we optimize for the current test scenario (where label selector is simpler), or consider potential future extensions? For example, if we in the future add tests that create multiple RayCronJobs in the same namespace, OwnerReference.UID would be more robust. But if we don't foresee that scenario, I'm happy to switch to label selector for simplicity. what do you think?

make sense

Signed-off-by: Future-Outlier <eric901201@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Test] [RayCronJob] Add e2e tests for suspend behavior

7 participants