Commit d41eed8: fix typo
Signed-off-by: j4ckstraw <[email protected]>
j4ckstraw committed Nov 29, 2023 (parent: 9534906)
1 changed file: docs/proposals/20231123-enhance-mid-tier-resource.md (70 additions, 72 deletions)

# Enhance Mid-tier resources

## Table of Contents

<!--ts-->
- [Enhance Mid-tier resources](#enhance-mid-tier-resources)
  - [Table of Contents](#table-of-contents)
  - [Summary](#summary)
  - [Motivation](#motivation)
    - [Goals](#goals)
    - [Non-Goals/Future Work](#non-goalsfuture-work)
  - [Proposal](#proposal)
    - [User Stories](#user-stories)
      - [Story 1](#story-1)
      - [Story 2](#story-2)
    - [Prerequisites](#prerequisites)
    - [Design Principles](#design-principles)
    - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
      - [Cgroup basic configuration](#cgroup-basic-configuration)
      - [QoS Policy](#qos-policy)
      - [Node QoS](#node-qos)
    - [Risks and Mitigations](#risks-and-mitigations)
  - [Alternatives](#alternatives)
  - [Upgrade Strategy](#upgrade-strategy)
  - [Additional Details](#additional-details)
  - [Implementation History](#implementation-history)
<!--te-->

## Summary

*Mid-tier resources* are proposed to improve node utilization while avoiding node overload; they rely on [node prediction](https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/20230613-node-prediction.md).

While *node prediction* clarifies how the Mid-tier resources are calculated from the prediction, this proposal clarifies the Mid-tier cgroup and QoS design as well as the suppression and eviction policies.

## Motivation

Here I would like to explain some concepts:
1. koord-QoS

Quality of Service. We assume that workloads at the same QoS level have similar operational performance and quality.

2. koord-priority

Scheduling priority; a high-priority pod can preempt a low-priority pod by default.

3. Resource model

We now have four resource models: prod, mid, batch, and free. The resource model determines whether the resource is oversold and whether it is stable, which affects pod eviction.

Koordinator binds koord-priority to the resource model; different priorities map to different resource models.

This proposal introduces Mid+LS and Mid+BE to fill the gap between Prod+LS and Batch+BE and to meet the requirements of different types of tasks.

### Goals

- Define how to calculate and update the Mid-tier resource amount of a node.
- Clarify the Mid-tier cgroup and QoS design.
- Clarify the Mid-tier suppression and eviction policy.
- Describe the optimizations needed in the scheduler/descheduler.

### Non-Goals/Future Work

- Replace Batch-tier resources
- Add new QoS type

## Proposal

### User Stories

#### Story 1

There are low-priority online-service tasks whose performance requirements are the same as Prod+LS; they do not want to be suppressed, but can tolerate being evicted when machine usage spikes.

Mid+LS can handle them.

#### Story 2

There are resource-intensive tasks, such as AI or stream computing (e.g. Apache Spark), that may consume a lot of resources. They need stable resources, can tolerate being suppressed, and do not want to be evicted.

Mid+BE can handle them.

### Prerequisites

Koordinator node reservation must be used if anyone wants to use Mid+LS.

### Design Principles

**QoS**

By default, the performance at the same QoS level should be similar: Prod+LS and Mid+LS should be *basically* the same, and likewise Mid+BE and Batch+BE.

QoS policy adaptation needs to be considered case by case. When the QoS level is the same and only the priorities differ, there is generally no need for a big difference: aspects such as eviction priority and some advanced memory-reclamation settings normally require no extra configuration.
As for CPU Group Identity, there is no finer configuration granularity, so no additional adaptation is required for the time being.
In the future, advanced capabilities such as core scheduling and CPU idle need to be considered.

Alternatively, a QoS level such as MID could be introduced between LS and BE to implement features such as fine-grained QoS adjustment.

**resource overselling**

The Mid-tier resource is currently calculated as
```
Allocatable[Mid] := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio)
```
We now need to change it to

```
Allocatable[Mid] := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio) + Unallocated[Mid]
Unallocated[Mid] = max(NodeAllocatable - Allocated[Prod], 0)
```

Reference: [mid-tier-overcommitment](https://koordinator.sh/docs/designs/node-prediction/#mid-tier-overcommitment)

In the long term, non-overselling can be supported behind a feature gate:
```
Unallocated * thresholdRatio
```
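
For illustration, here is a minimal sketch in Go of the proposed calculation; the function and parameter names are hypothetical and the actual koordinator implementation may differ.

```go
package main

import (
	"fmt"
	"math"
)

// midTierAllocatable sketches the proposed formula:
//   Allocatable[Mid] = min(Reclaimable[Mid], NodeAllocatable * thresholdRatio) + Unallocated[Mid]
//   Unallocated[Mid] = max(NodeAllocatable - Allocated[Prod], 0)
// All values are in the same unit, e.g. milli-cores or bytes.
func midTierAllocatable(reclaimable, nodeAllocatable, prodAllocated, thresholdRatio float64) float64 {
	unallocated := math.Max(nodeAllocatable-prodAllocated, 0)
	return math.Min(reclaimable, nodeAllocatable*thresholdRatio) + unallocated
}

func main() {
	// Example: a 32-core node (32000m), Prod requests of 20000m,
	// predicted reclaimable 6000m, thresholdRatio 0.6.
	// min(6000, 19200) + max(32000-20000, 0) = 6000 + 12000 = 18000m
	fmt.Println(midTierAllocatable(6000, 32000, 20000, 0.6))
}
```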

**native resource or extended resource**

*native resource*:
Hijack the node status update and change `node.Status.allocatable`; mid pods then also use the native resources. In this situation, Mid is equivalent to a sub-priority within Prod, and resource quota needs adaptive modification.
Mid+BE pods could end up in the Burstable or even Guaranteed k8s QoS class, which violates the QoS level policy.

*extended resource*:
Add mid-cpu/mid-memory and inject the extended resource fields via webhook.
Mid+BE pods will be placed in the BestEffort k8s QoS class.
Mid+LS pods will be placed in Burstable by reserving limits.cpu/limits.memory.
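
As a rough sketch of the webhook mutation for a Mid+LS container (one possible interpretation only; the extended resource names and the function below are assumptions, not the actual koordinator API):

```go
package webhook

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Assumed extended resource names; the real names used by koordinator may differ.
const (
	MidCPU    corev1.ResourceName = "kubernetes.io/mid-cpu"    // interpreted as milli-cores
	MidMemory corev1.ResourceName = "kubernetes.io/mid-memory" // interpreted as bytes
)

// mutateMidLSContainer sketches one way the webhook could handle a Mid+LS
// container that requests the mid extended resources: inject limits.cpu and
// limits.memory derived from requests.mid-cpu/requests.mid-memory so that the
// pod lands in the Burstable class and the amount is reserved.
func mutateMidLSContainer(c *corev1.Container) {
	midCPU := c.Resources.Requests[MidCPU]
	midMem := c.Resources.Requests[MidMemory]

	if c.Resources.Limits == nil {
		c.Resources.Limits = corev1.ResourceList{}
	}
	c.Resources.Limits[corev1.ResourceCPU] = *resource.NewMilliQuantity(midCPU.Value(), resource.DecimalSI)
	c.Resources.Limits[corev1.ResourceMemory] = *resource.NewQuantity(midMem.Value(), resource.BinarySI)
}
```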

**share resource account or not**

Let's consider the non-overselling scenario.

If prod and mid pods share a resource account, preemption is needed when a prod pod is pending; koord-scheduler needs filter and preempt plugins to handle it.

If prod and mid pods do not share a resource account, a mechanism is needed for standalone components or the descheduler to migrate mid pods when necessary, for example on hotspots.

In this scenario, Mid+LS may affect Prod+LS; it depends on the calculation strategy of the Mid-tier resource and the degree of interference introduced by mid pods.
The calculation of the Mid-tier resource should be relatively conservative in the intended target scenario, and the application types should not introduce too much interference.

**trade-off result**
- Keep Mid+LS and Mid+BE; do not introduce a new QoS level.
- Change the Mid resource calculation method: add unallocated resources into the Mid resource.
- Use the extended resources mid-cpu/mid-memory.
- Do not share a resource account.

### Implementation Details/Notes/Constraints

Configured according to requests.mid-cpu.
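
For illustration, a small sketch of how a requests.mid-cpu value (in milli-cores) could be mapped to cgroup cpu.shares, using the conventional kubelet conversion; the actual koordlet behaviour may differ.

```go
package main

import "fmt"

const (
	minShares    = 2
	sharesPerCPU = 1024
	milliPerCPU  = 1000
)

// milliCPUToShares follows the conventional milli-core to cpu.shares conversion.
func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := milliCPU * sharesPerCPU / milliPerCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

func main() {
	// A Mid container requesting 1500 mid-cpu (1.5 cores) gets cpu.shares = 1536.
	fmt.Println(milliCPUToShares(1500))
}
```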

**cgroup hierarchy**

- Mid+LS: inject limits.cpu/limits.memory via webhook, so the pod is placed in Burstable.
- Mid+BE: placed in BestEffort by default.

*Note*: the Burstable cpuShares/memoryLimit may be updated by kubelet periodically.

#### QoS Policy

Configured according to koord-QoS:
- LS for Mid+LS
- BE for Mid+BE

#### Node QoS

**CPU Suppress**

- Mid+LS pods are not suppressed by default; if task performance does not meet the SLA, eviction is acceptable.
- Mid+BE pods can be suppressed and should not be evicted frequently.

CPU suppression should take both Batch+BE and Mid+BE into account.
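
A minimal sketch of a suppression target that covers both Batch+BE and Mid+BE; the formula follows the spirit of the existing BE CPU suppress and is an assumption, not the actual koordlet policy.

```go
package main

import (
	"fmt"
	"math"
)

// beCPUSuppressTarget sketches a CPU budget shared by all BE pods
// (Batch+BE and Mid+BE), in milli-cores:
//   target = nodeTotal * thresholdPercent/100 - lsUsed - systemUsed
func beCPUSuppressTarget(nodeTotal, lsUsed, systemUsed, thresholdPercent float64) float64 {
	target := nodeTotal*thresholdPercent/100 - lsUsed - systemUsed
	return math.Max(target, 0)
}

func main() {
	// A 32-core node (32000m) with LS usage 18000m, system usage 2000m and a
	// 65% threshold leaves 800m for Batch+BE and Mid+BE pods together.
	fmt.Println(beCPUSuppressTarget(32000, 18000, 2000, 65))
}
```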

**CPU Eviction**

CPU eviction is currently tied to pod satisfaction.
In the long term, however, it should be done from the perspective of the OS, like memory eviction.

**Memory Eviction**

Eviction is sorted by priority and resource model:
- Batch first, then Mid.
- Mid+LS first, then Mid+BE. For Mid pods, both request and usage should be taken into account when evicting, for fairness.
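
A minimal sketch of this eviction ordering; the tier/QoS encoding and the fairness score are illustrative assumptions only.

```go
package main

import (
	"fmt"
	"sort"
)

// evictionCandidate is an illustrative view of a pod for eviction sorting.
type evictionCandidate struct {
	Name    string
	Tier    int     // 0 = Batch, 1 = Mid; Batch is evicted before Mid
	QoS     int     // within Mid: 0 = LS, 1 = BE; Mid+LS is evicted before Mid+BE
	Request float64 // requested mid resource
	Usage   float64 // current usage, used for fairness among equals
}

// sortForEviction orders candidates so that earlier entries are evicted first.
func sortForEviction(pods []evictionCandidate) {
	sort.SliceStable(pods, func(i, j int) bool {
		if pods[i].Tier != pods[j].Tier {
			return pods[i].Tier < pods[j].Tier // Batch before Mid
		}
		if pods[i].QoS != pods[j].QoS {
			return pods[i].QoS < pods[j].QoS // Mid+LS before Mid+BE
		}
		// Fairness among equals: evict the pod furthest above its request first.
		return pods[i].Usage-pods[i].Request > pods[j].Usage-pods[j].Request
	})
}

func main() {
	pods := []evictionCandidate{
		{Name: "mid-be", Tier: 1, QoS: 1, Request: 2, Usage: 3},
		{Name: "batch-be", Tier: 0, QoS: 1, Request: 2, Usage: 2},
		{Name: "mid-ls", Tier: 1, QoS: 0, Request: 2, Usage: 2},
	}
	sortForEviction(pods)
	fmt.Println(pods[0].Name, pods[1].Name, pods[2].Name) // batch-be mid-ls mid-be
}
```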

### Risks and Mitigations

- Burstable cpuShares may be updated by kubelet periodically, which conflicts with the koordlet update. [reference](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/qos_container_manager_linux.go#L170)

We can choose not to update the Burstable cpuShares; then Prod+LS and Mid+LS may interfere with each other only under high load.

- Burstable memory limit may be updated by kubelet periodically, which conflicts with the koordlet update. [reference](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/qos_container_manager_linux.go#L343)

We should disable the kubefeatures.QOSReserved feature for the memory resource; see [Prerequisites](#prerequisites).

## Alternatives

**Add Mid QoS**

Introduce a new QoS level that allows fine-grained adjustment of the Mid-tier pod QoS.

## Upgrade Strategy

- [ ] add a Mid-resource runtime hook to configure cgroups
- [ ] update the Mid-tier calculation policy
- [ ] update BE suppression to support Mid+BE
- [ ] update CPU/Memory eviction to support Mid-tier
- [ ] scheduler/descheduler filters and policies, for example do not over-allocate Mid+Prod resources if a certain feature gate is enabled, and migrate some mid pods off hotspot nodes for load balancing

## Additional Details

With the Mid-tier resource enhanced, the panorama is as follows:

koord-priority | resource model | koord-QoS | k8s-QoS | scenario |
-- | -- | -- | -- | -- |
koord-prod | cpu/memory | LSE | guaranteed | middleware |
koord-prod | cpu/memory | LSR | guaranteed | high-priority online-service, CPU bind |
