---
title: Edge Load Balancer (ELB) for OpenYurt Nodepool
authors:
- "@lindayu17"
- "@gnunu"
- "@zzguang"
reviewers:
- "@rambohe-ch"
- "@kadisi"
creation-date: 2021-10-17
status: provisional
---

# An edge load balancer for OpenYurt nodepool

## Table of Contents

[Tools for generating](https://github.com/ekalinin/github-markdown-toc) a table of contents from markdown are available.

- [Title](#title)
- [Table of Contents](#table-of-contents)
- [Glossary](#glossary)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals/Future Work](#non-goalsfuture-work)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
    - [Story 4](#story-4)
  - [Requirements (Optional)](#requirements-optional)
    - [Functional Requirements](#functional-requirements)
      - [FR1](#fr1)
      - [FR2](#fr2)
      - [FR3](#fr3)
    - [Non-Functional Requirements](#non-functional-requirements)
      - [NFR1](#nfr1)
      - [NFR2](#nfr2)
  - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Alternatives](#alternatives)
- [Upgrade Strategy](#upgrade-strategy)
- [Additional Details](#additional-details)
  - [Test Plan [optional]](#test-plan-optional)
- [Implementation History](#implementation-history)

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

## Summary

The Edge Load Balancer (ELB) is a key feature of a cloud/edge native cluster. Requests, after being dispatched by the edge ingress, should be further routed to the most appropriate Pods based on various criteria:

1) nodes/Pods with specific devices, such as GPUs or other accelerators for AI inference;
2) the currently available resources of Pods, including CPU, memory, GPU, etc.;
3) other considerations, such as debugging, testing, fault injection, rate limiting, etc.

## Motivation

Some kinds of workloads, typically video analytics and cloud gaming, issue requests that are not simple network calls; instead they incur sustained consumption of CPU, memory, GPU, and other resources. These workloads fit edge deployments especially well and need traffic management that accounts for the currently available resources of the backend Pods and nodes. A dynamic edge load balancer, placed after the ingress and before the workload Pods, should be inserted to perform this traffic management for optimal performance of the edge cluster.

### Goals

- Allow users to specify request routing policies;
- Collect system metrics through metrics monitoring services such as Prometheus;
- Analyze and verify requests and match them with the cluster's system capabilities;
- Route requests to the proper Pods according to the user-specified device priority and policies.

### Non-Goals/Future Work

- Metrics services for OpenYurt are not part of this proposal.

## Proposal

The Edge Load Balancer (ELB) CRD definition is listed below:

```go
// Policy defines how to distribute workloads among different Pods/nodes.
// balance: schedule the workload to the Pod/node with the most compute
// resources of the kind specified by the Devices field;
// round-robin: schedule workloads to the Pods/nodes in round-robin mode;
// the threshold (e.g., FPS) should be taken into consideration as well.
// If the threshold runs lower than a watermark, the next candidate will be evaluated;
// squeeze: schedule workloads onto as few Pods/nodes as possible; an
// inadequate threshold indicates that a new node should be invoked;
// random: simply schedule requests according to a generated random number.

// ELBSpec defines the desired state of ELB
type ELBSpec struct {
	Usecase     string   `json:"usecase"`
	Devices     []string `json:"devices,omitempty"`     // device priority list
	Policy      string   `json:"policy,omitempty"`      // balance (default), round-robin, squeeze, random
	Performance int32    `json:"performance,omitempty"` // fps
}

// ELBStatus defines the observed state of ELB
type ELBStatus struct {
	Usecase   string   `json:"usecase"`
	Endpoints []string `json:"endpoints,omitempty"` // available endpoints
}

type ELB struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ELBSpec   `json:"spec,omitempty"`
	Status ELBStatus `json:"status,omitempty"`
}
```
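
For illustration, a CR instantiating this spec might look as follows. The field values are made up, and the API group/version shown is an assumption, since the proposal does not fix one:

```yaml
# Hypothetical example; apps.openyurt.io/v1alpha1 is an assumed group/version.
apiVersion: apps.openyurt.io/v1alpha1
kind: ELB
metadata:
  name: video-analytics-elb
spec:
  usecase: video-analytics
  devices: ["GPU", "VPU", "CPU"]   # device priority list
  policy: balance
  performance: 30                  # target fps
```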

Development plan: hopefully this feature can be implemented and merged into OpenYurt release 1.0.

### User Stories

#### Story 1

Requests should be routed to the proper Pods according to specific load-balancing policies.
In one shopping mall scenario, many cameras have been deployed to gather crowd statistics. The streams are fed to backend compute servers for in-field video analysis. The operators of the system want the workloads to be evenly distributed across the compute resources.

#### Story 2

Users want to be able to customize request routing rules.
In the aforementioned shopping mall case, users may want to prioritize among various compute devices, such as CPU, GPU, VPU, FPGA, etc.

#### Story 3

Users want to be able to decide the target device for their workloads, and may want to specify a concrete device to do the computation.

#### Story 4

Users want to run workloads with an optimal performance expectation. For example, users may specify a concrete performance target such as fps, and the ELB will try to fulfill this goal by routing requests to the most capable Pod.
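
To make this story concrete, one hypothetical way to honor an fps target is to filter candidate endpoints by a per-endpoint fps estimate derived from metrics. The `candidate` type, `meetTarget` function, and the fps estimates below are illustrative assumptions, not part of the proposal:

```go
package main

import "fmt"

// candidate pairs a backend endpoint with a hypothetical estimate of the
// frame rate a new stream would get there, derived from metrics data.
type candidate struct {
	Addr string
	FPS  int32
}

// meetTarget returns the endpoints expected to satisfy the user's fps target
// (the Performance field of ELBSpec); if none qualifies, the most capable
// endpoint is returned as a best effort.
func meetTarget(cands []candidate, target int32) []candidate {
	var ok []candidate
	best := cands[0]
	for _, c := range cands {
		if c.FPS >= target {
			ok = append(ok, c)
		}
		if c.FPS > best.FPS {
			best = c
		}
	}
	if len(ok) == 0 {
		return []candidate{best}
	}
	return ok
}

func main() {
	cands := []candidate{{"10.0.0.1:80", 15}, {"10.0.0.2:80", 45}}
	// Only the endpoint predicted to reach the 30 fps target survives.
	fmt.Println(meetTarget(cands, 30))
}
```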

### Requirements (Optional)

#### Functional Requirements

##### FR1

The ELB controller reconciles ELB CRs (including the use case name, the user-defined policy, etc.), collects compute resource metrics, and pushes them to the ELB proxy.

##### FR2

The ELB proxy receives requests from YurtIngress, then analyzes and verifies the requests and matches them with the nodepool's system capabilities.

##### FR3

Based on the metrics data and the specified policy, the ELB proxy routes the requests to the proper Pods.
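
As a sketch of how the proxy's routing step could apply the policies named in the CRD comments (the `Endpoint` type, the capacity units, and the function names are hypothetical, not part of the proposal's API):

```go
package main

import (
	"fmt"
	"math/rand"
)

// Endpoint is a hypothetical view of a backend Pod held by the ELB proxy:
// its address plus the free capacity on the user-preferred device, as
// reported by the metrics pipeline (arbitrary units).
type Endpoint struct {
	Addr    string
	FreeCap int
}

// pickBalance implements the "balance" policy: route to the endpoint with
// the most free compute resources.
func pickBalance(eps []Endpoint) string {
	best := eps[0]
	for _, e := range eps[1:] {
		if e.FreeCap > best.FreeCap {
			best = e
		}
	}
	return best.Addr
}

// pickRoundRobin implements the "round-robin" policy with a watermark: the
// next endpoint in order is used only if its free capacity is at or above
// the watermark; otherwise the following candidates are evaluated.
func pickRoundRobin(eps []Endpoint, next, watermark int) (string, int) {
	for i := 0; i < len(eps); i++ {
		e := eps[(next+i)%len(eps)]
		if e.FreeCap >= watermark {
			return e.Addr, (next + i + 1) % len(eps)
		}
	}
	return pickBalance(eps), next // all below watermark: fall back to least loaded
}

// pickRandom implements the "random" policy.
func pickRandom(eps []Endpoint) string {
	return eps[rand.Intn(len(eps))].Addr
}

func main() {
	eps := []Endpoint{{"10.0.0.1:80", 10}, {"10.0.0.2:80", 70}, {"10.0.0.3:80", 40}}
	fmt.Println(pickBalance(eps)) // 10.0.0.2:80
	addr, _ := pickRoundRobin(eps, 0, 30)
	fmt.Println(addr) // 10.0.0.2:80 (first endpoint is below the watermark)
}
```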

#### Non-Functional Requirements

##### NFR1

We assume the metrics service works correctly in the OpenYurt nodepool environment, so metrics services for OpenYurt are not part of this proposal.

##### NFR2

We assume that the required xPU device plugins are available, e.g., those for Intel's discrete GPUs.

### Implementation Details/Notes/Constraints

The ELB sits after the ingress and, like the ingress, runs at the nodepool level. So, as with the ingress, we also need an operator to handle deploying, deleting, and upgrading the ELB for nodepools.

The ELB consists of two parts: a controller and a proxy (worker). For the proxy, we can reuse off-the-shelf solutions such as [Envoy](https://github.com/envoyproxy/envoy) or [linkerd2-proxy](https://github.com/linkerd/linkerd2-proxy). The reason is that the ELB worker is essentially a proxy for traffic management, and its core function overlaps with the mentioned products. Given that they implement HTTP/gRPC and L4 traffic management, and that they are designed to handle networking transparently and elegantly (as a sidecar, for instance), we can augment them with metrics-based traffic management and the ingestion of the corresponding configuration information. This is the data plane.

The controller, which runs in the control plane, will handle metrics collection, workload backend collection, and configuration management. It will push configuration and metrics data updates to the proxy.
![ELB Architecture](../img/elb.png)
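
As a rough sketch of this control-plane flow (the `BackendMetrics` type and the weighting scheme are illustrative assumptions, not a committed design), the controller could turn collected metrics into per-endpoint load-balancing weights and push them to the data-plane proxy, e.g. as an Envoy-style weighted-cluster update:

```go
package main

import "fmt"

// BackendMetrics is a hypothetical snapshot the controller assembles from a
// metrics service (e.g. Prometheus): free capacity per workload endpoint.
type BackendMetrics struct {
	Addr    string
	FreeCap int
}

// computeWeights converts free-capacity metrics into traffic weights
// proportional to free capacity, so less-loaded endpoints receive more
// traffic; with no signal at all, it falls back to equal weights.
func computeWeights(ms []BackendMetrics) map[string]int {
	total := 0
	for _, m := range ms {
		total += m.FreeCap
	}
	weights := make(map[string]int, len(ms))
	for _, m := range ms {
		if total == 0 {
			weights[m.Addr] = 1
			continue
		}
		weights[m.Addr] = 100 * m.FreeCap / total
	}
	return weights
}

func main() {
	ms := []BackendMetrics{{"10.0.0.1:80", 20}, {"10.0.0.2:80", 80}}
	// The controller would push these weights to the proxy (for Envoy, via
	// an xDS configuration update); here we just print them.
	fmt.Println(computeWeights(ms))
}
```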

### Risks and Mitigations

- What are the risks of this proposal and how do we mitigate them? Think broadly.
  If we reuse an off-the-shelf product, we need to do the augmentation elegantly, for easy development and maintenance.
- How will UX be reviewed and by whom?
  The ELB is best used via an end-to-end deployment operator for automatic CR injection.
- How will security be reviewed and by whom?
  Security is addressed by reusing the proxy products.
- Consider including folks that also work outside the SIG or subproject.

## Alternatives

The `Alternatives` section is used to highlight and record other possible approaches to delivering the value proposed by a proposal.

## Upgrade Strategy

If applicable, how will the component be upgraded? Make sure this is in the test plan.

Consider the following in developing an upgrade strategy for this enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade in order to keep previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade in order to make use of the enhancement?

If the hardware is updated (especially when a new CPU/GPU is engaged), or a new inference algorithm is involved, we may need to upgrade the software, since the ELB algorithm depends on them.

## Additional Details

### Test Plan [optional]

## Implementation History

- [x] 06/25/2022: Revision
- [ ] MM/DD/YYYY: Compile a Google Doc following the CAEP template (link here)
- [ ] MM/DD/YYYY: First round of feedback from community
- [ ] MM/DD/YYYY: Present proposal at a [community meeting]
- [ ] MM/DD/YYYY: Open proposal PR
