
Conversation

@ichekrygin
Contributor

@ichekrygin ichekrygin commented Aug 4, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Now that the core ElasticJobsViaWorkloadSlices functionality has been introduced in Kueue as part of KEP-77, this PR introduces MultiKueue support for ElasticJobsViaWorkloadSlices.

Which issue(s) this PR fixes:

Fixes #6335

Special notes for your reviewer:

Does this PR introduce a user-facing change?

MultiKueue × ElasticJobs: The elastic `batchv1/Job` supports MultiKueue.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 4, 2025
@netlify

netlify bot commented Aug 4, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: b88b4c7
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/68d6cd93a474e80008a98edc

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 4, 2025
@ichekrygin
Contributor Author

/retest

@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from 4e9a611 to 3e747fc Compare August 4, 2025 23:46
@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from 3e747fc to d23b998 Compare August 5, 2025 06:40
@ichekrygin
Contributor Author

/retest

@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from b60cde0 to 7333195 Compare August 6, 2025 06:21
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 6, 2025
@amy amy moved this to PRs for 0.14.0 in Progress in Kueue Release Tracking Aug 10, 2025
@ichekrygin
Contributor Author

Hello @mimowo, @tenzen-y, @GonzaloSaez - I updated the PR to address comments, PTAL (don't want this PR to go stale) :)

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 11, 2025
@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from 7333195 to 8548e85 Compare August 13, 2025 20:42
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 13, 2025
@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from 8548e85 to 08ed89d Compare August 13, 2025 20:50
Member

@tenzen-y tenzen-y left a comment

@ichekrygin Sorry for the delayed review. Additionally, could you update https://github.com/kubernetes-sigs/kueue/tree/main/keps/77-dynamically-sized-jobs#phase-3--enabling-workload-slicing-for-batchv1job-in-multi-cluster-configuration?

It would be better to mention the cluster nomination (dispatching) mechanism for the combination of MultiKueue and DynamicJob.

NOTE: I have not reviewed the integration test yet.

Member

Could you update UTs for TestMultiKueueAdapter in pkg/controller/jobs/job/job_multikueue_adapter_test.go?

Contributor Author

Updated, PTAL.
Note: instead of modifying the existing TestMultiKueueAdapter, I created a new Test_multiKueueAdapter_SyncJob that follows a more Go-idiomatic testing style and improves coverage. The new test reaches "100%" coverage—matching the original test while adding coverage for ElasticJobs. If we like this approach, we can replace the old test with it. If not, I can trim down the duplicated coverage and limit the new test to just the missing ElasticJobs cases.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 1, 2025
@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from ad78fe5 to 7eab699 Compare September 3, 2025 17:11
@ichekrygin
Contributor Author

/test pull-kueue-test-e2e-kueueviz-main
/test pull-kueue-test-e2e-main-1-32
/test pull-kueue-test-e2e-main-1-33
/test pull-kueue-test-e2e-main-1-34

@mimowo
Contributor

mimowo commented Sep 26, 2025

@ichekrygin please rebase, I'm going to review today

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 26, 2025
@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from e7558da to 3361203 Compare September 26, 2025 14:28
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 26, 2025
@ichekrygin
Contributor Author

@ichekrygin please rebase, I'm going to review today

Updated

Comment on lines +99 to +105
// IsElasticWorkload returns true if the workload is considered elastic,
// meaning the ElasticJobsViaWorkloadSlices feature is enabled and the
// workload has the corresponding annotation set.
func (g *wlGroup) IsElasticWorkload() bool {
return workloadslicing.IsElasticWorkload(g.local)
}

Member

Suggested change
// IsElasticWorkload returns true if the workload is considered elastic,
// meaning the ElasticJobsViaWorkloadSlices feature is enabled and the
// workload has the corresponding annotation set.
func (g *wlGroup) IsElasticWorkload() bool {
return workloadslicing.IsElasticWorkload(g.local)
}

Do we still need this? In each place, couldn't we just call workloadslicing.IsElasticWorkload(g.local)?

Contributor Author

I think this helper is consistent with other group functions (e.g., IsFinished()), where we don’t need to explicitly unpack group attributes.

That said, I can remove it if it’s a blocker.

Contributor

I'm OK either way. I like consistency.
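
As context for this thread: the helper delegates to workloadslicing.IsElasticWorkload, whose doc comment says the check amounts to "feature gate enabled plus the corresponding annotation set". The sketch below spells that out under those assumptions; the signature, import paths, and annotation-key constant are illustrative (the kueue.x-k8s.io/elastic-job annotation is the one discussed later in this review), not Kueue's actual implementation:

// Sketch only, not Kueue's actual code.
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	"sigs.k8s.io/kueue/pkg/features"
)

// elasticJobAnnotationKey is assumed here for illustration.
const elasticJobAnnotationKey = "kueue.x-k8s.io/elastic-job"

// isElasticWorkload reports whether the ElasticJobsViaWorkloadSlices feature
// gate is enabled and the object carries the elastic-job annotation.
func isElasticWorkload(obj metav1.Object) bool {
	if !features.Enabled(features.ElasticJobsViaWorkloadSlices) {
		return false
	}
	_, ok := obj.GetAnnotations()[elasticJobAnnotationKey]
	return ok
}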

Comment on lines 1240 to 1235

// Keep the same cluster assignment between slices.
wl.Status.ClusterName = oldSlice.Status.ClusterName
Member

The status field is the actual state, and spec is the desired state.
However, if we assign any clusterName to the status field w/o QuotaReservation, the .status.clusterName represents the desired state, since the .status.clusterName is no longer the actual cluster name reserved by the cluster.

So, I think we can either (1) leverage NCN in the dispatcher as I mentioned above, or (2) have the worker clusters' Kueue schedulers assign their own cluster name to the WorkloadSlice during QuotaReservation. In that case, we can keep the .status.clusterName as the actual state.

However, (2) means bringing cluster-assignment responsibilities into the Kueue scheduler, which is not good since, technically, the Kueue scheduler could be responsible for everything, but we decouple responsibilities across multiple internal components.

@ichekrygin
Contributor Author

@tenzen-y, thank you for the comment (not sure why I couldn’t quote or reply directly, so responding here in a separate thread).

The status field is the actual state, and spec is the desired state.
However, if we assign any clusterName to the status field w/o QuotaReservation, the .status.clusterName represents the desired state, since the .status.clusterName is no longer the actual cluster name reserved by the cluster.

For a new workload slice, the desired cluster name is carried forward from the old workload slice’s .status.clusterName.
To ensure consistent placement of all scaled-up workloads, we want them to continue using the original placement (i.e., the same cluster name) and to skip cluster nomination altogether. We achieve this by setting status.clusterName on the new slice.

Does this explanation make sense?

@tenzen-y
Member

@tenzen-y, thank you for the comment (not sure why I couldn’t quote or reply directly, so responding here in a separate thread).

The status field is the actual state, and spec is the desired state.
However, if we assign any clusterName to the status field w/o QuotaReservation, the .status.clusterName represents the desired state, since the .status.clusterName is no longer the actual cluster name reserved by the cluster.

For a new workload slice, the desired cluster name is carried forward from the old workload slice’s .status.clusterName. To ensure consistent placement of all scaled-up workloads, we want them to continue using the original placement (i.e., the same cluster name) and to skip cluster nomination altogether. We achieve this by setting status.clusterName on the new slice.

Does this explanation make sense?

My point is that .status.clusterName must be the actual reserved cluster placement. IIUC, in the current implementation, the WorkloadSlice's .status.clusterName is just the desired state, since the clusterName is assigned before QuotaReservation in the worker cluster?

I might be missing something. If my understanding is incorrect, please let me know.

@ichekrygin
Contributor Author

My point is that .status.clusterName must be the actual reserved cluster placement. IIUC, in the current implementation, the WorkloadSlice's .status.clusterName is just the desired state, since the clusterName is assigned before QuotaReservation in the worker cluster?

For ElasticWorkloads, that’s exactly the case. We want each new workload slice to sync to the same cluster as the original. If that workload ends up stuck in a Pending state (due to lack of quota or other reasons) on that worker cluster, that is the intended outcome under the current design. In other words, we don’t want to try different clusters.
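
To make the intended behavior concrete, here is a hedged sketch (the helper name, parameters, and package wiring are illustrative; only the assignment mirrors the diff quoted above):

// Sketch only: a replacement workload slice inherits the admitted cluster of
// the slice it replaces, so MultiKueue dispatching never re-nominates it.
package sketch

import (
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"

	"sigs.k8s.io/kueue/pkg/features"
)

func pinReplacementSlice(newSlice, oldSlice *kueue.Workload) {
	if !features.Enabled(features.ElasticJobsViaWorkloadSlices) || oldSlice == nil {
		return
	}
	// Keep the same cluster assignment between slices, exactly as in the diff:
	// the new slice is placed wherever the old slice was already admitted.
	newSlice.Status.ClusterName = oldSlice.Status.ClusterName
}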

@mimowo
Contributor

mimowo commented Sep 26, 2025

I'm also thinking that going via NominatedClusterNames can cause race conditions if the external dispatcher updates the value before it is promoted to status.clusterName. Maybe we could prevent updating status.nominatedClusterNames for prebuilt workloads to avoid that, but it seems ad hoc. Requiring all external dispatchers to avoid mutating prebuilt workloads is not preferred if it can be avoided.

@ichekrygin
Contributor Author

I'm also thinking that going via NominatedClusterNames can cause race conditions if the external dispatcher updates the value before it is promoted to status.clusterName. Maybe we could prevent updating status.nominatedClusterNames for prebuilt workloads to avoid that, but it seems ad hoc. Requiring all external dispatchers to avoid mutating prebuilt workloads is not preferred if it can be avoided.

I might be missing part of your point here. Do you feel we’re aligned with the PR change, or do you still see a concern? If it’s the latter, could you help point me to the specific area and what you’d recommend?

…for `batch/v1.Job` workloads when ElasticJob functionality is enabled.

Signed-off-by: Illya Chekrygin <[email protected]>
@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from 3361203 to b88b4c7 Compare September 26, 2025 17:29
@mimowo
Contributor

mimowo commented Sep 26, 2025

I might be missing part of your point here. Do you feel we’re aligned with the PR change, or do you still see a concern? If it’s the latter, could you help point me to the specific area and what you’d recommend?

I think we are aligned on the part about skipping the nomination phase (status.nominatedClusterNames). My comment was actually describing that there could be race conditions if we were going via the nomination phase (as suggested by @tenzen-y in #6445 (comment)).

Even if we skip the nomination phase, a race condition exists where the external dispatcher could nominate a cluster name and the scheduler could admit before we propagate .status.clusterName. As suggested in #6445 (comment), one idea for solving the issue is to prevent setting status.nominatedClusterNames entirely for elastic prebuilt workloads.
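
For illustration only, the alternative mentioned above (which this PR does not implement) could look roughly like the validation sketch below; the helper, the error message, and the status field access are assumptions:

// Sketch of the alternative discussed above: reject nominations on elastic
// prebuilt workloads so external dispatchers cannot race with the
// carried-forward .status.clusterName. Not part of this PR.
package sketch

import (
	"errors"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// isElasticPrebuilt is a hypothetical helper combining the elastic-job
// annotation check with the prebuilt-workload check.
func isElasticPrebuilt(wl *kueue.Workload) bool { return false }

func validateNomination(wl *kueue.Workload) error {
	if !isElasticPrebuilt(wl) {
		return nil
	}
	if len(wl.Status.NominatedClusterNames) > 0 {
		return errors.New("nominatedClusterNames must not be set for elastic prebuilt workloads")
	}
	return nil
}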

@ichekrygin
Contributor Author

/test pull-kueue-test-e2e-main-1-34

// In a single-cluster context, this should lead to Job suspension.
// In a MultiKueue context, this should also trigger removal of remote workload/Job objects.
if features.Enabled(features.ElasticJobsViaWorkloadSlices) && oldWorkloadSlice != nil {
e.Obj.Status.ClusterName = oldWorkloadSlice.WorkloadInfo.Obj.Status.ClusterName
Contributor

Oh, I somehow missed this (I think it wasn't here in the first revision). I think this actually eliminates the race condition I was describing in #6445.

I think this actually does not increase the scheduler logic significantly (no ifs or branching), and it neatly prevents the races, as we guarantee the replacement workload sticks to the same clusterName without a time window in which external dispatchers could choose differently.

@mimowo
Contributor

mimowo commented Sep 26, 2025

Even if we skip the nomination phase, a race condition exists where the external dispatcher could nominate a cluster name and the scheduler could admit before we propagate .status.clusterName. As suggested in #6445 (comment), one idea for solving the issue is to prevent setting status.nominatedClusterNames entirely for elastic prebuilt workloads.

Actually, after reading this PR more carefully and syncing with @ichekrygin, I understand why the race condition I considered is now solved, thanks to #6445 (comment).

Basically, setting the clusterName on the new slice at scheduling time leaves no window for external dispatchers to mess up the assignment. I like this neat and simple approach.

@mimowo
Contributor

mimowo commented Sep 26, 2025

I'm happy with the PR, especially as this is Alpha. I think this approach eliminates the race condition I considered, and I appreciate that @ichekrygin considered the race and fixed it in the latest revision. I don't see any more practical issues with this PR.

I acknowledge there are two "design cleanliness" issues raised by @tenzen-y and me during review:

  1. Always going via the nomination phase and offloading the scheduler logic, see comment. However, I think this is very hard to make race-free in practice. The main issue is that, while we are going via the nomination phase, the external (or internal) dispatchers could mess up the nomination. We could maybe prevent that with validation, but it is certainly more complex than the one-liner in the scheduler, see comment (in well-isolated code behind a feature gate).
  2. Reuse of the kueue.x-k8s.io/elastic-job annotation for both Jobs and Workloads. One idea would be to introduce a dedicated Workload annotation with a different name, but this sounds like overkill. Another idea mentioned was a generic filtering mechanism to copy selected annotations; I also feel this is overkill, as mentioned in the thread.

So, while I agree raising these questions was worthwhile, and I don't claim we have the best answers in the PR, I don't see any better proposals from the discussions than what is implemented, or at least none achievable for 0.14. If we come up with better ideas, we can revisit for Beta, as the feature remains Alpha.

Thank you folks for iterating on the PR.
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 26, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: e1c6e9558a581dc7f198134228dc89f621699859

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ichekrygin, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 26, 2025
@k8s-ci-robot k8s-ci-robot merged commit 831b2d6 into kubernetes-sigs:main Sep 26, 2025
23 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.14 milestone Sep 26, 2025
@mimowo
Contributor

mimowo commented Sep 29, 2025

@ichekrygin ptal: #7033

@tenzen-y
Member

/release-note-edit

MultiKueue × ElasticJobs: The elastic `batchv1/Job` supports MultiKueue.

@mimowo mimowo mentioned this pull request Sep 30, 2025
36 tasks

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Add MultiKueue support for WorkloadSlices (KEP-77)

5 participants