
Conversation

@ichekrygin
Contributor

@ichekrygin ichekrygin commented Aug 4, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Now that the core ElasticJobsViaWorkloadSlices functionality has been introduced in Kueue as part of KEP-77, this PR introduces MultiKueue support for ElasticJobsViaWorkloadSlices.

Which issue(s) this PR fixes:

Fixes #6335

Special notes for your reviewer:

Does this PR introduce a user-facing change?

MultiKueue × ElasticJobs: The elastic `batchv1/Job` supports MultiKueue.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 4, 2025
@netlify

netlify bot commented Aug 4, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: b88b4c7
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/68d6cd93a474e80008a98edc

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 4, 2025
@ichekrygin
Contributor Author

/retest

@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from 4e9a611 to 3e747fc Compare August 4, 2025 23:46
@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from 3e747fc to d23b998 Compare August 5, 2025 06:40
@ichekrygin
Contributor Author

/retest

@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from b60cde0 to 7333195 Compare August 6, 2025 06:21
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 6, 2025
@amy amy moved this to PRs for 0.14.0 in Progress in Kueue Release Tracking Aug 10, 2025
@ichekrygin
Contributor Author

Hello @mimowo, @tenzen-y, @GonzaloSaez - I updated the PR to address comments, PTAL (don't want this PR to go stale) :)

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 11, 2025
@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from 7333195 to 8548e85 Compare August 13, 2025 20:42
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 13, 2025
@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from 8548e85 to 08ed89d Compare August 13, 2025 20:50
Member

@tenzen-y tenzen-y left a comment

@ichekrygin Sorry for the delayed review. Additionally, could you update https://github.com/kubernetes-sigs/kueue/tree/main/keps/77-dynamically-sized-jobs#phase-3--enabling-workload-slicing-for-batchv1job-in-multi-cluster-configuration?

It would be better to mention the cluster nomination (dispatching) mechanism for the combination of MultiKueue and DynamicJob.

NOTE: I have not reviewed the integration test yet.

Member

Could you update UTs for TestMultiKueueAdapter in pkg/controller/jobs/job/job_multikueue_adapter_test.go?

Contributor Author

Updated, PTAL.
Note: instead of modifying the existing TestMultiKueueAdapter, I created a new Test_multiKueueAdapter_SyncJob that follows a more Go-idiomatic testing style and improves coverage. The new test reaches "100%" coverage—matching the original test while adding coverage for ElasticJobs. If we like this approach, we can replace the old test with it. If not, I can trim down the duplicated coverage and limit the new test to just the missing ElasticJobs cases.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 1, 2025
@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from ad78fe5 to 7eab699 Compare September 3, 2025 17:11
@ichekrygin
Contributor Author

/test pull-kueue-test-e2e-kueueviz-main
/test pull-kueue-test-e2e-main-1-32
/test pull-kueue-test-e2e-main-1-33
/test pull-kueue-test-e2e-main-1-34

@mimowo
Contributor

mimowo commented Sep 26, 2025

@ichekrygin please rebase, I'm going to review today

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 26, 2025
@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from e7558da to 3361203 Compare September 26, 2025 14:28
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 26, 2025
@ichekrygin
Contributor Author

@ichekrygin please rebase, I'm going to review today

Updated

Comment on lines +99 to +105
// IsElasticWorkload returns true if the workload is considered elastic,
// meaning the ElasticJobsViaWorkloadSlices feature is enabled and the
// workload has the corresponding annotation set.
func (g *wlGroup) IsElasticWorkload() bool {
return workloadslicing.IsElasticWorkload(g.local)
}

Member

Suggested change
// IsElasticWorkload returns true if the workload is considered elastic,
// meaning the ElasticJobsViaWorkloadSlices feature is enabled and the
// workload has the corresponding annotation set.
func (g *wlGroup) IsElasticWorkload() bool {
return workloadslicing.IsElasticWorkload(g.local)
}

Do we still need this? In each place, couldn't we just call workloadslicing.IsElasticWorkload(g.local)?

Contributor Author

I think this helper is consistent with other group functions (e.g., IsFinished()), where we don’t need to explicitly unpack group attributes.

That said, I can remove it if it’s a blocker.

Contributor

I'm OK either way. I like consistency.
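
As context for this thread: the helper delegates to workloadslicing.IsElasticWorkload, whose doc comment says the check amounts to "feature gate enabled plus the corresponding annotation set". The sketch below spells that out under those assumptions; the signature, import paths, and annotation-key constant are illustrative (the kueue.x-k8s.io/elastic-job annotation is the one discussed later in this review), not Kueue's actual implementation:

// Sketch only, not Kueue's actual code.
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	"sigs.k8s.io/kueue/pkg/features"
)

// elasticJobAnnotationKey is assumed here for illustration.
const elasticJobAnnotationKey = "kueue.x-k8s.io/elastic-job"

// isElasticWorkload reports whether the ElasticJobsViaWorkloadSlices feature
// gate is enabled and the object carries the elastic-job annotation.
func isElasticWorkload(obj metav1.Object) bool {
	if !features.Enabled(features.ElasticJobsViaWorkloadSlices) {
		return false
	}
	_, ok := obj.GetAnnotations()[elasticJobAnnotationKey]
	return ok
}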

Comment on lines 1240 to 1235

// Keep the same cluster assignment between slices.
wl.Status.ClusterName = oldSlice.Status.ClusterName
Member

The status field is the actual state, and spec is the desired state.
However, if we assign any clusterName to the status field w/o QuotaReservation, the .status.clusterName represents the desired state, since the .status.clusterName is no longer the actual cluster name reserved by the cluster.

So, I think we can either (1) leverage NCN in the dispatcher as I mentioned above, or (2) have the worker clusters' Kueue schedulers assign their own cluster name to the WorkloadSlice during QuotaReservation. In that case, we can keep the .status.clusterName as the actual state.

However, (2) means bringing cluster-assignment responsibilities into the Kueue scheduler, which is not good since, technically, the Kueue scheduler could be responsible for everything, but we decouple responsibilities across multiple internal components.

@ichekrygin
Contributor Author

@tenzen-y, thank you for the comment (not sure why I couldn’t quote or reply directly, so responding here in a separate thread).

The status field is the actual state, and spec is the desired state.
However, if we assign any clusterName to the status field w/o QuotaReservation, the .status.clusterName represents the desired state, since the .status.clusterName is no longer the actual cluster name reserved by the cluster.

For a new workload slice, the desired cluster name is carried forward from the old workload slice’s .status.clusterName.
To ensure consistent placement of all scaled-up workloads, we want them to continue using the original placement (i.e., the same cluster name) and to skip cluster nomination altogether. We achieve this by setting status.clusterName on the new slice.

Does this explanation make sense?

@tenzen-y
Member

@tenzen-y, thank you for the comment (not sure why I couldn’t quote or reply directly, so responding here in a separate thread).

The status field is the actual state, and spec is the desired state.
However, if we assign any clusterName to the status field w/o QuotaReservation, the .status.clusterName represents the desired state, since the .status.clusterName is no longer the actual cluster name reserved by the cluster.

For a new workload slice, the desired cluster name is carried forward from the old workload slice’s .status.clusterName. To ensure consistent placement of all scaled-up workloads, we want them to continue using the original placement (i.e., the same cluster name) and to skip cluster nomination altogether. We achieve this by setting status.clusterName on the new slice.

Does this explanation make sense?

My point is that .status.clusterName must be the actual reserved cluster placement. IIUC, in the current implementation, the WorkloadSlice's .status.clusterName is just the desired state, since the clusterName is assigned before QuotaReservation in the worker cluster?

I might be missing something. If my understanding is incorrect, please let me know.

@ichekrygin
Contributor Author

My point is that .status.clusterName must be the actual reserved cluster placement. IIUC, in the current implementation, the WorkloadSlice's .status.clusterName is just the desired state, since the clusterName is assigned before QuotaReservation in the worker cluster?

For ElasticWorkloads, that’s exactly the case. We want each new workload slice to sync to the same cluster as the original. If that workload ends up stuck in a Pending state (due to lack of quota or other reasons) on that worker cluster, that is the intended outcome under the current design. In other words, we don’t want to try different clusters.
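
To make the intended behavior concrete, here is a hedged sketch (the helper name, parameters, and package wiring are illustrative; only the assignment mirrors the diff quoted above):

// Sketch only: a replacement workload slice inherits the admitted cluster of
// the slice it replaces, so MultiKueue dispatching never re-nominates it.
package sketch

import (
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"

	"sigs.k8s.io/kueue/pkg/features"
)

func pinReplacementSlice(newSlice, oldSlice *kueue.Workload) {
	if !features.Enabled(features.ElasticJobsViaWorkloadSlices) || oldSlice == nil {
		return
	}
	// Keep the same cluster assignment between slices, exactly as in the diff:
	// the new slice is placed wherever the old slice was already admitted.
	newSlice.Status.ClusterName = oldSlice.Status.ClusterName
}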

@mimowo
Contributor

mimowo commented Sep 26, 2025

I'm also thinking that going via NominatedClusterNames can cause race conditions if the external dispatcher updates the value before it is promoted to status.clusterName. Maybe we could prevent updating status.nominatedClusterNames for prebuilt workloads to avoid that, but it seems ad hoc. Requiring all external dispatchers to avoid mutating prebuilt workloads is not preferred if it can be avoided.

@ichekrygin
Contributor Author

I'm also thinking that going via NominatedClusterNames can cause race conditions if the external dispatcher updates the value before it is promoted to status.clusterName. Maybe we could prevent updating status.nominatedClusterNames for prebuilt workloads to avoid that, but it seems ad hoc. Requiring all external dispatchers to avoid mutating prebuilt workloads is not preferred if it can be avoided.

I might be missing part of your point here. Do you feel we’re aligned with the PR change, or do you still see a concern? If it’s the latter, could you help point me to the specific area and what you’d recommend?

…for `batch/v1.Job` workloads when ElasticJob functionality is enabled.

Signed-off-by: Illya Chekrygin <[email protected]>
@ichekrygin ichekrygin force-pushed the wl-slices-multicluster branch from 3361203 to b88b4c7 Compare September 26, 2025 17:29
@mimowo
Contributor

mimowo commented Sep 26, 2025

I might be missing part of your point here. Do you feel we’re aligned with the PR change, or do you still see a concern? If it’s the latter, could you help point me to the specific area and what you’d recommend?

I think we are aligned on the part about skipping the nomination phase (status.nominatedClusterNames). My comment was actually describing that there could be race conditions if we were going via the nomination phase (as suggested by @tenzen-y in #6445 (comment)).

Even if we skip the nomination phase, a race condition exists where the external dispatcher could nominate a cluster name and the scheduler could admit before we propagate .status.clusterName. As suggested in #6445 (comment), one idea for solving the issue is to prevent setting status.nominatedClusterNames entirely for elastic prebuilt workloads.
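
For illustration only, the alternative mentioned above (which this PR does not implement) could look roughly like the validation sketch below; the helper, the error message, and the status field access are assumptions:

// Sketch of the alternative discussed above: reject nominations on elastic
// prebuilt workloads so external dispatchers cannot race with the
// carried-forward .status.clusterName. Not part of this PR.
package sketch

import (
	"errors"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// isElasticPrebuilt is a hypothetical helper combining the elastic-job
// annotation check with the prebuilt-workload check.
func isElasticPrebuilt(wl *kueue.Workload) bool { return false }

func validateNomination(wl *kueue.Workload) error {
	if !isElasticPrebuilt(wl) {
		return nil
	}
	if len(wl.Status.NominatedClusterNames) > 0 {
		return errors.New("nominatedClusterNames must not be set for elastic prebuilt workloads")
	}
	return nil
}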

@ichekrygin
Contributor Author

/test pull-kueue-test-e2e-main-1-34

// In a single-cluster context, this should lead to Job suspension.
// In a MultiKueue context, this should also trigger removal of remote workload/Job objects.
if features.Enabled(features.ElasticJobsViaWorkloadSlices) && oldWorkloadSlice != nil {
e.Obj.Status.ClusterName = oldWorkloadSlice.WorkloadInfo.Obj.Status.ClusterName
Contributor

Oh, I somehow missed this (I think it wasn't here in the first revision). I think this actually eliminates the race condition I was describing in #6445.

I think this actually does not increase the scheduler logic significantly (no ifs or branching), and it neatly prevents the races, as we guarantee the replacement workload sticks to the same clusterName without a time window in which external dispatchers could choose differently.

@mimowo
Contributor

mimowo commented Sep 26, 2025

Even if we skip the nomination phase, a race condition exists where the external dispatcher could nominate a cluster name and the scheduler could admit before we propagate .status.clusterName. As suggested in #6445 (comment), one idea for solving the issue is to prevent setting status.nominatedClusterNames entirely for elastic prebuilt workloads.

Actually, after reading this PR more carefully and syncing with @ichekrygin, I understand why the race condition I considered is now solved, thanks to #6445 (comment).

Basically, setting the clusterName on the new slice at scheduling time leaves no window for external dispatchers to mess up the assignment. I like this neat and simple approach.

@mimowo
Contributor

mimowo commented Sep 26, 2025

I'm happy with the PR, especially as this is Alpha. I think this approach eliminates the race condition I considered, and I appreciate that @ichekrygin considered the race and fixed it in the latest revision. I don't see any more practical issues with this PR.

I acknowledge there are two "design cleanliness" issues raised by @tenzen-y and me during review:

  1. Always going via the nomination phase and offloading the scheduler logic, see comment. However, I think this is very hard to make race-free in practice. The main issue is that, while we are going via the nomination phase, the external (or internal) dispatchers could mess up the nomination. We could maybe prevent that with validation, but it is certainly more complex than the one-liner in the scheduler, see comment (in well-isolated code behind a feature gate).
  2. Reuse of the kueue.x-k8s.io/elastic-job annotation for both Jobs and Workloads. One idea would be to introduce a dedicated Workload annotation with a different name, but this sounds like overkill. Another idea mentioned was a generic filtering mechanism to copy selected annotations; I also feel this is overkill, as mentioned in the thread.

So, while I agree raising these questions was worthwhile, and I don't claim we have the best answers in the PR, I don't see any better proposals from the discussions than what is implemented, or at least none achievable for 0.14. If we come up with better ideas, we can revisit for Beta, as the feature remains Alpha.

Thank you folks for iterating on the PR.
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 26, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: e1c6e9558a581dc7f198134228dc89f621699859

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ichekrygin, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 26, 2025
@k8s-ci-robot k8s-ci-robot merged commit 831b2d6 into kubernetes-sigs:main Sep 26, 2025
23 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.14 milestone Sep 26, 2025
@mimowo
Contributor

mimowo commented Sep 29, 2025

@ichekrygin ptal: #7033

@tenzen-y
Member

/release-note-edit

MultiKueue × ElasticJobs: The elastic `batchv1/Job` supports MultiKueue.

@mimowo mimowo mentioned this pull request Sep 30, 2025
36 tasks

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Add MultiKueue support for WorkloadSlices (KEP-77)

5 participants