proposal: add enhance mid-tier resource proposal #1762
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main    #1762    +/-   ##
========================================
  Coverage   66.11%   66.11%           
========================================
  Files         388      390       +2  
  Lines       42425    42589     +164  
========================================
+ Hits        28048    28159     +111  
- Misses      12305    12346      +41  
- Partials     2072     2084      +12  
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
#### Story 1

There are low-priority online-service tasks whose performance requirements are the same as Prod+LS; they should not be suppressed, but can tolerate being evicted when machine usage spikes.
If Mid pods are allowed to allocate the Prod-unallocated resources, the Prod apps can be affected, since ProdPeak + MidAllocated > NodeAllocatable is possible. To avoid this impact, there should be a design for when and how Prod pods preempt/evict Mid pods when they want to win back the resources that Prod left unallocated but Mid allocated.
Yes. Because Prod pods use cpu/memory resources while Mid pods use mid-cpu/mid-memory, Prod can't preempt Mid directly.
How about evicting based on mid-allocated / mid-allocatable?
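A minimal sketch of what such a ratio-based trigger could look like; `midEvictionRatio` and the helper name are hypothetical, not part of the proposal:

```go
package main

import "fmt"

// midEvictionRatio is a hypothetical threshold: when the ratio of allocated
// Mid resources to Mid allocatable exceeds it, Mid pods become candidates
// for eviction so Prod can win back the unallocated capacity.
const midEvictionRatio = 0.9

// shouldEvictMid reports whether Mid pods should be evicted, based on the
// mid-allocated / mid-allocatable ratio suggested above.
func shouldEvictMid(midAllocated, midAllocatable int64) bool {
	if midAllocatable <= 0 {
		// No Mid capacity left on the node; any Mid allocation is over.
		return midAllocated > 0
	}
	return float64(midAllocated)/float64(midAllocatable) > midEvictionRatio
}

func main() {
	fmt.Println(shouldEvictMid(950, 1000)) // true: 95% > 90%
	fmt.Println(shouldEvictMid(500, 1000)) // false
}
```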
j4ckstraw force-pushed from 0da22aa to 5f4f670, then from 5f4f670 to d41eed8 (Signed-off-by: j4ckstraw <[email protected]>).
I have one more question: How does the scheduler's LoadAware Scheduling plugin support middle tiers?
### Prerequisites

Must use koordinator node reservation if someone wants to use Mid+LS.
Can you provide more details about the prerequisites? Why must one use koordinator node reservation? Is it the Reservation described in this document (20221227-node-resource-reservation.md) or the Reservation defined by the Koordinator SLO?
**native resource or extended resource**

*native resource*:
Hijack the node update and change `node.Status.allocatable`; Mid pods also use native resources. In this situation, Mid is equivalent to a sub-priority within Prod, and resource quota needs adaptive modification.
I don’t quite understand the logic described in this paragraph. Why do we need to hijack node update?
Hijack the node update to add prod-reclaimable on top of the original `node.Status.allocatable`.
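A rough sketch of what that amplification could mean, assuming a hypothetical `prodReclaimable` input produced by the node resource estimation (not the proposal's actual code):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// amplifyAllocatable adds the reclaimable Prod resources on top of the
// original node.Status.Allocatable, so Mid pods can be scheduled against
// native cpu/memory. prodReclaimable is a hypothetical input here.
func amplifyAllocatable(node *corev1.Node, prodReclaimable corev1.ResourceList) {
	for name, extra := range prodReclaimable {
		alloc := node.Status.Allocatable[name]
		alloc.Add(extra)
		node.Status.Allocatable[name] = alloc
	}
}

func main() {
	node := &corev1.Node{}
	node.Status.Allocatable = corev1.ResourceList{
		corev1.ResourceCPU: resource.MustParse("32"),
	}
	amplifyAllocatable(node, corev1.ResourceList{
		corev1.ResourceCPU: resource.MustParse("4"),
	})
	fmt.Println(node.Status.Allocatable.Cpu().String()) // 36
}
```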
For Mid+BE pods, they can be placed in the Burstable, or even Guaranteed, QoS class, disobeying the QoS-level policy.

*extended resource*:
Add mid-cpu/mid-memory and insert the extended-resource fields via webhook.
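For illustration, a minimal sketch of such a mutation; the resource names here follow the existing `kubernetes.io/batch-cpu` convention and are assumptions, as is the helper name:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Assumed resource names, modeled on the Batch extended resources.
const (
	MidCPU    corev1.ResourceName = "kubernetes.io/mid-cpu"
	MidMemory corev1.ResourceName = "kubernetes.io/mid-memory"
)

// mutateMidResources sketches what the webhook mutation could do: translate
// native cpu/memory requests of a Mid pod into mid-cpu/mid-memory extended
// resources (mid-cpu accounted in milli-cores, as Batch does).
func mutateMidResources(container *corev1.Container) {
	requests := container.Resources.Requests
	if cpu, ok := requests[corev1.ResourceCPU]; ok {
		requests[MidCPU] = *resource.NewQuantity(cpu.MilliValue(), resource.DecimalSI)
		delete(requests, corev1.ResourceCPU)
	}
	if mem, ok := requests[corev1.ResourceMemory]; ok {
		requests[MidMemory] = mem
		delete(requests, corev1.ResourceMemory)
	}
}

func main() {
	c := &corev1.Container{}
	c.Resources.Requests = corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("500m"),
		corev1.ResourceMemory: resource.MustParse("1Gi"),
	}
	mutateMidResources(c)
	fmt.Println(c.Resources.Requests)
}
```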
And here, do you want to add a new webhook plugin?
Nope.
I need to think about it.
Let us look at the scenario without overselling.

If Prod and Mid pods share a resource account, preemption is required for an upcoming Prod pod; koord-scheduler needs Filter and Preempt plugins to handle this.
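A hedged sketch of what that Filter/Preempt decision could look like, with made-up field names and plain milli-CPU accounting:

```go
package main

import "fmt"

// nodeUsage is a simplified per-node accounting view for the case where
// Prod and Mid share one resource account. All fields are illustrative
// (milli-CPU), not the proposal's actual types.
type nodeUsage struct {
	allocatable   int64
	prodRequested int64
	midAllocated  int64
}

// filterProdPod sketches a possible Filter rule: an upcoming Prod pod fits
// only if Prod requests plus Mid allocations still fit the node. If not,
// it reports whether preempting Mid pods could free enough room.
func filterProdPod(n nodeUsage, podRequest int64) (fits, needPreempt bool) {
	free := n.allocatable - n.prodRequested - n.midAllocated
	if podRequest <= free {
		return true, false
	}
	return false, podRequest <= free+n.midAllocated
}

func main() {
	n := nodeUsage{allocatable: 32000, prodRequested: 28000, midAllocated: 3000}
	fits, needPreempt := filterProdPod(n, 2000)
	fmt.Println(fits, needPreempt) // false true: Mid pods must be preempted
}
```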
Please clarify the specific rules of filtering and preemption.
**share resource account or not**

Let us look at the scenario without overselling.
What does "overselling" mean here?
**cpuShares**

Configured according to requests.mid-cpu:
- for Mid+LS, same as Prod+LS
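For reference, a sketch of the standard kubelet-style milli-CPU to cpu.shares conversion that requests.mid-cpu would presumably follow (an assumption, not the proposal's code):

```go
package main

import "fmt"

const (
	sharesPerCPU  = 1024 // cpu.shares granted per whole CPU
	milliCPUToCPU = 1000 // milli-CPU units per CPU
	minShares     = 2    // kernel minimum for cpu.shares
)

// milliCPUToShares mirrors the standard kubelet conversion from milli-CPU
// to cgroup cpu.shares; requests.mid-cpu is accounted in milli-cores, so
// the same formula would apply.
func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := milliCPU * sharesPerCPU / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

func main() {
	fmt.Println(milliCPUToShares(500)) // 512
}
```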
I think both the pod-level and container-level of the Mid pods follow the same rule as the Batch extended resources if the pods allocate Mid extended resources. So this statement is confusing. I am not sure if you are talking about the QoS-level cgroups. We'd better either make a concise and clear expression or add a comprehensive diagram to clarify the design.
**CPU Eviction**

CPU eviction is currently linked to pod satisfaction.
In the long term, however, it should be done from the perspective of the operating system, like memory eviction.
Could you please provide more information on why and how CPU eviction would be done from the perspective of the OS?
Eviction is sorted by priority and resource model (see the sketch below):
- Batch first and then Mid.
- Mid+LS first and then Mid+BE; for Mid pods, request and usage should be taken into account when evicting, for fairness reasons.
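A minimal sketch of such an eviction ordering; the candidate type and scoring are illustrative assumptions, not the proposal's implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// evictionCandidate is a simplified view of a pod considered for eviction;
// the fields are illustrative, not the proposal's actual types.
type evictionCandidate struct {
	name     string
	priority int // e.g. Batch < Mid
	usage    int64
	request  int64
}

// sortForEviction orders candidates so lower-priority pods are evicted
// first (Batch before Mid); within the same priority, pods using more
// relative to their request go first, for fairness. Cross-multiplication
// avoids dividing by a zero request.
func sortForEviction(pods []evictionCandidate) {
	sort.Slice(pods, func(i, j int) bool {
		if pods[i].priority != pods[j].priority {
			return pods[i].priority < pods[j].priority
		}
		return pods[i].usage*pods[j].request > pods[j].usage*pods[i].request
	})
}

func main() {
	pods := []evictionCandidate{
		{"mid-a", 2, 900, 1000},
		{"batch-a", 1, 100, 1000},
		{"mid-b", 2, 400, 1000},
	}
	sortForEviction(pods)
	fmt.Println(pods) // batch-a, then mid-a, then mid-b
}
```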
> Mid+LS first and then Mid+BE
It may not work well if you deploy online services as Mid+LS pods and deploy stream-computing jobs as Mid+BE. Think about the online service pods being evicted earlier than the job pods. Though the Mid+BE pods can be suppressed to reduce interference with Prod resources, it cannot be a reason for Mid+LS to be a lower priority in eviction. Please avoid unnecessary coupling of the priority and QoS if there is no proper design.
Got it, thank you for your advice.
/milestone 1.5

@ZiMengSheng: The provided milestone is not valid for this repository. Milestones in this repository: […] Use `/milestone clear` to clear the milestone. In response to this: /milestone 1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/milestone v1.5
At the moment we need to change this to

```
Allocatable[Mid] := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio) + Unallocated[Mid]
```
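A small sketch evaluating this formula with plain integers (the real implementation would presumably use `resource.Quantity`):

```go
package main

import "fmt"

// midAllocatable computes the proposed formula
//   Allocatable[Mid] := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio) + Unallocated[Mid]
// using int64 milli-units for brevity.
func midAllocatable(reclaimable, nodeAllocatable, unallocated int64, thresholdRatio float64) int64 {
	capped := int64(float64(nodeAllocatable) * thresholdRatio)
	if reclaimable < capped {
		capped = reclaimable
	}
	return capped + unallocated
}

func main() {
	// 32-core node, 6 cores reclaimable from Prod, 4 cores unallocated;
	// a threshold ratio of 0.1 caps the reclaimable part at 3.2 cores.
	fmt.Println(midAllocatable(6000, 32000, 4000, 0.1)) // 7200
}
```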
According to my understanding, adding Unallocated here is to allow Mid+LS to also use unallocated resources in the cluster, but there is a problem: using unallocated resources will affect the view of Prod resources, and ultimately we need to support Prod's preemption of Mid, which in turn affects the stability of Mid resources.
We are also considering applying node prediction to the amplification factor of the Node, so that Prod can be directly oversold and Quota management and priority preemption stay consistent with the native semantics of Kubernetes.
Does this satisfy your Mid+LS need?
Consider such a scenario:
A user deployed a Mid+LS pod while the Mid resources of the cluster were insufficient, so the cluster autoscaler scaled out one new node.
The problem is that there are no Prod pods on the new node, so no Mid resources are available either. That is why we want to allow Mid+LS to use unallocated resources in the cluster.
Ⅰ. Describe what this PR does
Ⅱ. Does this pull request fix one issue?
Ⅲ. Describe how to verify it
Ⅳ. Special notes for reviews
V. Checklist
make test