Introduce multi-Kueue support for ElasticJobs, with specific support for batch/v1.Job workloads when ElasticJob functionality is enabled.
#6445
Conversation
/retest
/retest
Hello @mimowo, @tenzen-y, @GonzaloSaez - I updated the PR to address comments, PTAL (don't want this PR to go stale) :)
tenzen-y left a comment:
@ichekrygin Sorry for the delayed review. Additionally, could you update https://github.com/kubernetes-sigs/kueue/tree/main/keps/77-dynamically-sized-jobs#phase-3--enabling-workload-slicing-for-batchv1job-in-multi-cluster-configuration?
It would be better to mention the cluster nomination (dispatching) mechanism for the combination of MultiKueue and DynamicJob.
NOTE: I have not reviewed the integration test yet.
Could you update UTs for TestMultiKueueAdapter in pkg/controller/jobs/job/job_multikueue_adapter_test.go?
Updated, PTAL.
Note: instead of modifying the existing TestMultiKueueAdapter, I created a new Test_multiKueueAdapter_SyncJob that follows a more Go-idiomatic testing style and improves coverage. The new test reaches "100%" coverage—matching the original test while adding coverage for ElasticJobs. If we like this approach, we can replace the old test with it. If not, I can trim down the duplicated coverage and limit the new test to just the missing ElasticJobs cases.
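For context, a minimal sketch of the table-driven style the new test follows. This is not the actual code in job_multikueue_adapter_test.go; the syncJobFunc stand-in, the fake behavior, and the case names are hypothetical and only illustrate the layout:

```go
package jobtest

import (
	"context"
	"errors"
	"testing"
)

// syncJobFunc is a hypothetical stand-in for the adapter's SyncJob behavior,
// used only to illustrate the table-driven test layout.
type syncJobFunc func(ctx context.Context, elastic bool) error

func Test_multiKueueAdapter_SyncJob_sketch(t *testing.T) {
	// Hypothetical fake: elastic workloads sync successfully, others fail,
	// so both branches of the table are exercised.
	sync := syncJobFunc(func(_ context.Context, elastic bool) error {
		if !elastic {
			return errors.New("remote job already exists")
		}
		return nil
	})

	cases := map[string]struct {
		elastic bool
		wantErr bool
	}{
		"elastic workload slice is synced to the remote cluster": {elastic: true},
		"non-elastic workload returns the underlying error":      {elastic: false, wantErr: true},
	}
	for name, tc := range cases {
		t.Run(name, func(t *testing.T) {
			err := sync(context.Background(), tc.elastic)
			if gotErr := err != nil; gotErr != tc.wantErr {
				t.Errorf("SyncJob() error = %v, wantErr %v", err, tc.wantErr)
			}
		})
	}
}
```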
/test pull-kueue-test-e2e-kueueviz-main
@ichekrygin please rebase; I'm going to review today.
Updated.
```go
// IsElasticWorkload returns true if the workload is considered elastic,
// meaning the ElasticJobsViaWorkloadSlices feature is enabled and the
// workload has the corresponding annotation set.
func (g *wlGroup) IsElasticWorkload() bool {
	return workloadslicing.IsElasticWorkload(g.local)
}
```
Do we still need this? In each place, couldn't we just call workloadslicing.IsElasticWorkload(g.local) directly?
I think this helper is consistent with other group functions (e.g., IsFinished()), where we don’t need to explicitly unpack group attributes.
That said, I can remove it if it’s a blocker.
I'm OK either way. I like the consistency.
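For illustration only, a self-contained sketch of the two call styles under discussion. The types here are simplified hypothetical stand-ins: Workload and wlGroup are not the real Kueue structs, isElastic substitutes for workloadslicing.IsElasticWorkload, and the annotation key is made up:

```go
package main

import "fmt"

// Workload is a simplified stand-in for the Kueue Workload object.
type Workload struct {
	Annotations map[string]string
}

// isElastic stands in for workloadslicing.IsElasticWorkload; the annotation
// key below is illustrative, not the real one.
func isElastic(wl *Workload) bool {
	_, ok := wl.Annotations["example.com/elastic-job"]
	return ok
}

// wlGroup mirrors the helper-method style: accessors such as IsFinished()
// and IsElasticWorkload() hide the group's internal fields from callers.
type wlGroup struct {
	local *Workload
}

func (g *wlGroup) IsElasticWorkload() bool {
	return isElastic(g.local)
}

func main() {
	g := &wlGroup{local: &Workload{Annotations: map[string]string{"example.com/elastic-job": "true"}}}

	// Helper style: callers never unpack group attributes.
	fmt.Println(g.IsElasticWorkload())

	// Direct style: each call site reaches into g.local itself.
	fmt.Println(isElastic(g.local))
}
```

Both forms behave identically; the trade-off is one extra method on wlGroup versus callers needing to know about g.local.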
```go
// Keep the same cluster assignment between slices.
wl.Status.ClusterName = oldSlice.Status.ClusterName
```
The status field is the actual state, and the spec is the desired state. However, if we assign a clusterName to the status field without QuotaReservation, .status.clusterName represents the desired state, since it is no longer the actual cluster name reserved by the cluster.
So, I think we can either (1) leverage NCN (nominatedClusterNames) in the dispatcher as I mentioned above, or (2) have the worker clusters' Kueue scheduler assign their cluster name to the WorkloadSlice themselves during QuotaReservation. In that case, we can keep .status.clusterName as the actual state.
However, (2) would bring cluster-assignment responsibilities into the Kueue scheduler, which is not ideal: technically the Kueue scheduler could be responsible for everything, but we deliberately decouple responsibilities across multiple internal components.
@tenzen-y, thank you for the comment (not sure why I couldn’t quote or reply directly, so responding here in a separate thread).
For a new workload slice, the desired cluster name is carried forward from the old workload slice's .status.clusterName. Does this explanation make sense?
My point is that I might be missing something. If my understanding is incorrect, please let me know.
For ElasticWorkloads, that's exactly the case. We want each new workload slice to sync to the same cluster as the original. If that workload ends up stuck in a
I'm also thinking that going via NominatedClusterNames can cause race conditions if the external dispatcher updates the value before it is promoted to status.clusterName. Maybe we could prevent updating status.nominatedClusterNames for prebuilt workloads to avoid that phenomenon, but it seems ad hoc. Requiring all external Dispatchers to avoid mutating prebuilt workloads is not preferred if it can be avoided.
I might be missing part of your point here. Do you feel we're aligned with the PR change, or do you still see a concern? If it's the latter, could you help point me to the specific area and what you'd recommend?
I think we are aligned on the part about skipping the nomination phase (status.nominatedClusterNames). My comment was actually describing that there could be race conditions if we were going via the nomination phase (as suggested by @tenzen-y in #6445 (comment)). Even if we skip the nomination phase, a race condition exists where the external dispatcher could nominate a cluster name and the scheduler could admit before we propagate .status.clusterName. As suggested in #6445 (comment), one idea for solving the issue is to prevent setting
/test pull-kueue-test-e2e-main-1-34
```go
// In a single-cluster context, this should lead to Job suspension.
// In a MultiKueue context, this should also trigger removal of remote workload/Job objects.
if features.Enabled(features.ElasticJobsViaWorkloadSlices) && oldWorkloadSlice != nil {
	e.Obj.Status.ClusterName = oldWorkloadSlice.WorkloadInfo.Obj.Status.ClusterName
```
Oh, I somehow missed this (I think it wasn't here in the first revision). I think this actually eliminates the race condition I was describing in #6445.
I think this actually does not increase the scheduler logic significantly (no ifs or branching), and it neatly prevents the races, as we guarantee the replacement workload sticks to the same clusterName without a time window in which external dispatchers could choose differently.
Actually, after reading this PR more carefully and syncing with @ichekrygin, I understand why the race condition I considered is now solved thanks to #6445 (comment). Basically, setting the clusterName on the new slice at scheduling time gives external dispatchers no time to mess up the assignment. I like this neat and simple approach.
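To summarize the agreed behavior, a minimal sketch of pinning a replacement slice to the old slice's cluster at admission time. The Workload and WorkloadStatus types below are simplified stand-ins for the Kueue API, and pinReplacementSlice is a hypothetical helper, not the actual scheduler code:

```go
package main

import "fmt"

// WorkloadStatus is a simplified stand-in for the Workload status fields
// referenced in this thread.
type WorkloadStatus struct {
	ClusterName string
}

type Workload struct {
	Name   string
	Status WorkloadStatus
}

// pinReplacementSlice copies the old slice's cluster assignment onto the new
// slice at admission time, leaving no window in which an external dispatcher
// could nominate a different cluster for the replacement.
func pinReplacementSlice(newSlice, oldSlice *Workload, elasticJobsEnabled bool) {
	if !elasticJobsEnabled || oldSlice == nil {
		return
	}
	newSlice.Status.ClusterName = oldSlice.Status.ClusterName
}

func main() {
	oldSlice := &Workload{Name: "job-slice-1", Status: WorkloadStatus{ClusterName: "worker-1"}}
	newSlice := &Workload{Name: "job-slice-2"}
	pinReplacementSlice(newSlice, oldSlice, true)
	fmt.Println(newSlice.Status.ClusterName) // worker-1
}
```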
I'm happy with the PR, especially as this is Alpha. I think this approach eliminates the race condition I considered, and I appreciate that @ichekrygin considered the race and fixed it in the latest revision. I don't see any more practical issues with this PR. I acknowledge there are two "design cleanliness" issues raised by @tenzen-y and me during review (discussed in the threads above).
So, while I agree raising these questions was worthwhile, and I don't claim we have the best answers in the PR, I don't see any better proposals from the discussions than what is implemented, or at least none achievable for 0.14. If we come up with better ideas we can revisit for Beta, as the feature remains Alpha. Thank you folks for iterating on the PR.
LGTM label has been added. Git tree hash: e1c6e9558a581dc7f198134228dc89f621699859
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ichekrygin, mimowo. The full list of commands accepted by this bot can be found here; the pull request process is described here.
@ichekrygin PTAL: #7033
/release-note-edit |
What type of PR is this?
/kind feature
What this PR does / why we need it:
Now that the core ElasticJobsViaWorkloadSlices functionality has been introduced in Kueue as part of KEP-77, this PR introduces MultiKueue support for ElasticJobsViaWorkloadSlices.
Which issue(s) this PR fixes:
Fixes #6335
Special notes for your reviewer:
Does this PR introduce a user-facing change?