proposal: add dynamic load balancer support for openyurt (#400)
LindaYu17 committed Nov 10, 2021
1 parent 754f4c2 commit e463a79
Showing 1 changed file with 303 additions and 0 deletions.
303 changes: 303 additions & 0 deletions docs/proposals/20211017-adding-dynamic-load-balancer.md
---
title: A dynamic load balancer for edge cluster
authors:
- "@lindayu17"
- "@gnunu"
- "@zzguang"
reviewers:
- "@rambohe-ch"
- "@Fei-Guo"
- "@kadisi"
creation-date: 2021-10-17
status: provisional
---

# A dynamic load balancer for edge cluster

## Table of Contents

[Tools for generating](https://github.com/ekalinin/github-markdown-toc) a table of contents from markdown are available.

- [Title](#title)
  - [Table of Contents](#table-of-contents)
  - [Glossary](#glossary)
  - [Summary](#summary)
  - [Motivation](#motivation)
    - [Goals](#goals)
    - [Non-Goals/Future Work](#non-goalsfuture-work)
  - [Proposal](#proposal)
    - [User Stories](#user-stories)
      - [Story 1](#story-1)
      - [Story 2](#story-2)
      - [Story 3](#story-3)
      - [Story 4](#story-4)
    - [Requirements (Optional)](#requirements-optional)
      - [Functional Requirements](#functional-requirements)
        - [FR1](#fr1)
        - [FR2](#fr2)
        - [FR3](#fr3)
        - [FR4](#fr4)
      - [Non-Functional Requirements](#non-functional-requirements)
        - [NFR1](#nfr1)
        - [NFR2](#nfr2)
    - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
    - [Risks and Mitigations](#risks-and-mitigations)
  - [Alternatives](#alternatives)
  - [Upgrade Strategy](#upgrade-strategy)
  - [Additional Details](#additional-details)
    - [Test Plan [optional]](#test-plan-optional)
  - [Implementation History](#implementation-history)

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

## Summary

Dynamic Load Balancer (DLB) is a key feature of a cloud/edge native cluster. After being dispatched by the edge ingress, requests should be further routed to the most appropriate PODs based on various criteria:
1) nodes/PODs with specific devices, such as GPUs or other accelerators for AI inference;
2) currently available resources of the PODs, including CPU, memory, GPU, etc.;
3) other considerations such as debugging, testing, fault injection, rate limiting, etc.

## Motivation

Some workloads issue requests that are not simple network exchanges: each request incurs sustained consumption of CPU, memory, GPU and other resources. Video analytics and cloud gaming are typical examples. These workloads are a good fit for deployment in edge environments, and they need traffic management that takes into account the currently available resources of the backend PODs and nodes. A dynamic load balancer should therefore be inserted after the ingress and before the PODs to manage traffic for optimal performance of the edge cluster.

### Goals

- Allow users to specify request routing policies;
- Collect system metrics through metrics monitoring services such as Prometheus;
- Analyze and verify the requests and match them with the cluster's system capabilities;
- Route requests to the proper PODs according to user-specified algorithms and policies.

### Non-Goals/Future Work

- Metrics services for OpenYurt are not part of this proposal.

## Proposal

This proposal introduces a Dynamic Load Balancer (DLB) operator. Its CRD definition is listed below:

```go

// Device setting description:
// The Device setting defines how to route workloads across different xPUs.
// cpu: use only the CPU as the compute device;
// gpu: use only the GPU as the compute device;
// auto:cpu,gpu: multiple devices can be used as compute devices automatically;
// this setting means the CPU is used first, until it is exhausted.

// Algorithm setting description:
// The Algorithm setting defines how to distribute workloads among different PODs/nodes.
// balance: schedule the workload to the POD/node with the most of the compute resource specified by the Device field;
// round-robin: schedule workloads to the PODs/nodes in round-robin mode; the threshold (e.g., FPS) should be taken into consideration as well.
// If the threshold falls below a watermark, the next candidate is evaluated;
// squeeze: schedule workloads to as few PODs/nodes as possible; an inadequate threshold triggers bringing in a new node;
// manual: node1,node2,...: schedule workloads to the assigned nodes, mostly used for debugging.

type DynamicLBPolicy struct {
	Device    string `json:"device,omitempty"`    // cpu(default), gpu, auto:cpu,gpu...
	Algorithm string `json:"algorithm,omitempty"` // balance(default), round-robin, squeeze, manual:nodename
	Threshold string `json:"threshold"`           // e.g. fps:24
}

type UseCase struct {
	UseCaseClass string          `json:"useCaseClass"` // AI, media, gaming...
	UseCaseName  string          `json:"useCaseName"`  // detect, classify...
	DLBPolicy    DynamicLBPolicy `json:"dlbPolicy"`
}

type Resource struct {
	CPU int32 `json:"cpu,omitempty"`
	GPU int32 `json:"gpu,omitempty"`
	MEM int32 `json:"mem,omitempty"`
	FPS int32 `json:"fps,omitempty"`
}

type ReqStat struct {
	ReqID   string   `json:"reqID,omitempty"`
	PodID   string   `json:"podID,omitempty"`
	Content string   `json:"content,omitempty"` // full request content
	Status  string   `json:"status,omitempty"`  // receiving, running, stopping
	ReqTop  Resource `json:"reqTop,omitempty"`  // resource consumption of this request
}

type PodStat struct {
	PodID    string   `json:"podID,omitempty"`
	NodeID   string   `json:"nodeID,omitempty"`
	PodQuota Resource `json:"podQuota,omitempty"` // assigned resources of this POD
	PodTop   Resource `json:"podTop,omitempty"`   // resource consumption of this POD
}

type NodeStat struct {
	NodeID    string   `json:"nodeID,omitempty"`
	NodeQuota Resource `json:"nodeQuota,omitempty"` // all resources of this node
	NodeTop   Resource `json:"nodeTop,omitempty"`   // resource consumption of this node
}

type UseCaseStat struct {
	Usecase UseCase   `json:"usecase,omitempty"`
	ReqStat []ReqStat `json:"reqStat,omitempty"` // every request for the usecase
	PodStat []PodStat `json:"podStat,omitempty"` // every pod stat for the usecase
}

// DynamicLBSpec defines the desired state of DynamicLB
type DynamicLBSpec struct {
	// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
	// Important: Run "make" to regenerate code after modifying this file

	Usecase UseCase `json:"usecase"`
}

// DynamicLBStatus defines the observed state of DynamicLB
type DynamicLBStatus struct {
	// INSERT ADDITIONAL STATUS FIELD - define observed state of cluster
	// Important: Run "make" to regenerate code after modifying this file
	UsecaseStatList []UseCaseStat `json:"usecaseStatList,omitempty"`
	NodeStatList    []NodeStat    `json:"nodeStatList,omitempty"`
	Watermark       float32       `json:"watermark,omitempty"` // total resource consumption percentage
}

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status

// DynamicLB is the Schema for the dynamiclbs API
type DynamicLB struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   DynamicLBSpec   `json:"spec,omitempty"`
	Status DynamicLBStatus `json:"status,omitempty"`
}

//+kubebuilder:object:root=true

// DynamicLBList contains a list of DynamicLB
type DynamicLBList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []DynamicLB `json:"items"`
}

```
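
The `Device` and `Threshold` fields are plain strings with a small embedded syntax (`auto:cpu,gpu`, `fps:24`), so the controller has to parse them before it can act on a policy. The proposal does not define that parsing; the following is a minimal sketch of one way to do it. The helper names `parseDevice` and `parseThreshold`, and the decision to reject empty entries, are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDevice turns a Device value such as "cpu", "gpu" or "auto:cpu,gpu"
// into an ordered preference list. An empty value falls back to the
// documented default, "cpu". (Hypothetical helper, not part of the CRD.)
func parseDevice(s string) ([]string, error) {
	if s == "" {
		return []string{"cpu"}, nil
	}
	devs := strings.Split(strings.TrimPrefix(s, "auto:"), ",")
	for _, d := range devs {
		if d == "" {
			return nil, fmt.Errorf("device %q contains an empty entry", s)
		}
	}
	if len(devs) > 1 && !strings.HasPrefix(s, "auto:") {
		return nil, fmt.Errorf("device %q lists several devices without the auto: prefix", s)
	}
	return devs, nil
}

// parseThreshold splits a Threshold value such as "fps:24" into the metric
// name and its integer target. (Hypothetical helper, not part of the CRD.)
func parseThreshold(s string) (string, int32, error) {
	parts := strings.SplitN(s, ":", 2)
	if len(parts) != 2 || parts[0] == "" {
		return "", 0, fmt.Errorf("threshold %q is not of the form name:value", s)
	}
	v, err := strconv.ParseInt(parts[1], 10, 32)
	if err != nil {
		return "", 0, fmt.Errorf("threshold %q: %w", s, err)
	}
	return parts[0], int32(v), nil
}

func main() {
	devs, _ := parseDevice("auto:cpu,gpu")
	name, target, _ := parseThreshold("fps:24")
	fmt.Println(devs, name, target) // prints: [cpu gpu] fps 24
}
```

Validating these strings when a CR is admitted, rather than at routing time, would let malformed policies be rejected before they reach the worker.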

Development plan: we hope this feature can be implemented and merged into OpenYurt release 0.8.

### User Stories

#### Story 1

Requests of workloads will be routed to proper PODs according to specific load balance policies.

#### Story 2

Users want to be able to customize request routing rules.

#### Story 3

Users want to be able to decide the target device for their workloads.

#### Story 4

Users want their workloads to run with optimal performance.

### Requirements (Optional)

#### Functional Requirements

##### FR1

The DLB controller collects configuration information, such as use case descriptions and user-defined rules from CRs, the exposing service, and the nodes' basic information, and stores it in ConfigMaps to share with the worker container. Other data-sharing methods can be used as well.
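
A ConfigMap's `data` field is a map of string keys to string values, so one simple way to share a use case with the worker is to serialize it as JSON under a single key. The sketch below uses only the standard library and local copies of the CRD types; the key name `usecase.json` and the helpers `configMapData` and `loadUseCase` are assumptions for illustration, not part of the proposal.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type DynamicLBPolicy struct {
	Device    string `json:"device,omitempty"`
	Algorithm string `json:"algorithm,omitempty"`
	Threshold string `json:"threshold"`
}

type UseCase struct {
	UseCaseClass string          `json:"useCaseClass"`
	UseCaseName  string          `json:"useCaseName"`
	DLBPolicy    DynamicLBPolicy `json:"dlbPolicy"`
}

// useCaseKey is a placeholder key name chosen for this sketch.
const useCaseKey = "usecase.json"

// configMapData renders a use case as the string map that would be stored
// in a ConfigMap's data field (controller side).
func configMapData(uc UseCase) (map[string]string, error) {
	raw, err := json.Marshal(uc)
	if err != nil {
		return nil, err
	}
	return map[string]string{useCaseKey: string(raw)}, nil
}

// loadUseCase is the inverse of configMapData (worker side).
func loadUseCase(data map[string]string) (UseCase, error) {
	var uc UseCase
	raw, ok := data[useCaseKey]
	if !ok {
		return uc, fmt.Errorf("key %s not found", useCaseKey)
	}
	err := json.Unmarshal([]byte(raw), &uc)
	return uc, err
}

func main() {
	uc := UseCase{
		UseCaseClass: "AI",
		UseCaseName:  "detect",
		DLBPolicy:    DynamicLBPolicy{Device: "gpu", Algorithm: "balance", Threshold: "fps:24"},
	}
	data, err := configMapData(uc)
	if err != nil {
		panic(err)
	}
	fmt.Println(data[useCaseKey])
}
```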

##### FR2

The DLB receives requests from the ingress of the cluster, then analyzes and verifies them and matches them with the cluster's system capabilities.
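
One reading of "matching requests with system capabilities" is a headroom check: a node qualifies if its unused resources (`NodeQuota` minus `NodeTop`) cover the request's estimated demand. The sketch below illustrates that reading. The helpers `headroom`, `fits` and `capableNodes` are hypothetical, and the units of the `Resource` fields are left unspecified, as in the proposal.

```go
package main

import "fmt"

// Resource and NodeStat mirror the CRD types, trimmed to what the check needs.
type Resource struct {
	CPU int32
	GPU int32
	MEM int32
}

type NodeStat struct {
	NodeID    string
	NodeQuota Resource // all resources of this node
	NodeTop   Resource // resources currently consumed on this node
}

// headroom returns what is still unused on the node.
func headroom(n NodeStat) Resource {
	return Resource{
		CPU: n.NodeQuota.CPU - n.NodeTop.CPU,
		GPU: n.NodeQuota.GPU - n.NodeTop.GPU,
		MEM: n.NodeQuota.MEM - n.NodeTop.MEM,
	}
}

// fits reports whether the estimated demand fits into the free resources.
func fits(need, free Resource) bool {
	return need.CPU <= free.CPU && need.GPU <= free.GPU && need.MEM <= free.MEM
}

// capableNodes returns the IDs of the nodes that can take the request.
func capableNodes(need Resource, nodes []NodeStat) []string {
	var ids []string
	for _, n := range nodes {
		if fits(need, headroom(n)) {
			ids = append(ids, n.NodeID)
		}
	}
	return ids
}

func main() {
	nodes := []NodeStat{
		{NodeID: "node1", NodeQuota: Resource{CPU: 8, GPU: 0, MEM: 16}, NodeTop: Resource{CPU: 2, MEM: 4}},
		{NodeID: "node2", NodeQuota: Resource{CPU: 8, GPU: 1, MEM: 16}, NodeTop: Resource{CPU: 6, MEM: 8}},
	}
	// A request that needs one GPU can only go to node2.
	fmt.Println(capableNodes(Resource{CPU: 1, GPU: 1, MEM: 2}, nodes)) // prints: [node2]
}
```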

##### FR3

Based on the metrics data and the specified algorithm, the DLB worker container routes the requests to the proper PODs.
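
To make the `balance`, `squeeze` and `round-robin` algorithms concrete, here is a minimal sketch of each selection rule over a flattened candidate list. It is an interpretation, not the proposal's implementation: the `Candidate` type is hypothetical, `Free` stands for the unused amount of whichever device the policy's `Device` field selects, and a pod is assumed to be eligible only when `Free` covers the request's demand.

```go
package main

import "fmt"

// Candidate is a hypothetical, flattened view of one backend pod.
type Candidate struct {
	PodID string
	Free  int32 // unused capacity of the device chosen by the policy
}

// pickBalance returns the eligible pod with the most free capacity.
// ok is false when no pod can absorb the request.
func pickBalance(pods []Candidate, need int32) (id string, ok bool) {
	var best int32
	for _, p := range pods {
		if p.Free >= need && (!ok || p.Free > best) {
			id, best, ok = p.PodID, p.Free, true
		}
	}
	return id, ok
}

// pickSqueeze packs work onto the fullest pod that still fits, keeping the
// number of busy pods small. ok == false is the signal to bring in a new node.
func pickSqueeze(pods []Candidate, need int32) (id string, ok bool) {
	var best int32
	for _, p := range pods {
		if p.Free >= need && (!ok || p.Free < best) {
			id, best, ok = p.PodID, p.Free, true
		}
	}
	return id, ok
}

// roundRobin cycles through the pods and skips those that cannot take the
// request, which corresponds to evaluating the next candidate.
type roundRobin struct{ next int }

func (r *roundRobin) pick(pods []Candidate, need int32) (string, bool) {
	for i := 0; i < len(pods); i++ {
		p := pods[(r.next+i)%len(pods)]
		if p.Free >= need {
			r.next = (r.next + i + 1) % len(pods)
			return p.PodID, true
		}
	}
	return "", false
}

func main() {
	pods := []Candidate{{"pod-a", 4}, {"pod-b", 10}, {"pod-c", 6}}
	b, _ := pickBalance(pods, 5)
	s, _ := pickSqueeze(pods, 5)
	fmt.Println(b, s) // prints: pod-b pod-c
}
```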

#### Non-Functional Requirements

##### NFR1

We assume that the metrics service works correctly in the OpenYurt nodepool environment, so metrics services for OpenYurt are not part of this proposal.

##### NFR2

We assume that the required xPU device plugins are available, e.g., the ones for Intel's discrete GPUs.

### Implementation Details/Notes/Constraints

The DLB is located after the Ingress and, like the ingress, runs at the nodepool level. As with the ingress, we therefore need an operator to handle deploying, deleting and upgrading the DLB for nodepools.

The DLB consists of two parts, a controller and a worker. For the worker container, we can reuse off-the-shelf proxy solutions such as [Envoy](https://github.com/envoyproxy/envoy) or [linkerd](https://linkerd.io/). The reason is that the DLB worker container is essentially a proxy for traffic management, and its core function overlaps with these products. They already implement HTTP/gRPC and L4 traffic management and are designed to do the networking transparently, for instance as a sidecar, so we can augment them with metrics-based traffic management and the ingestion of the corresponding configuration information. This is the data plane.

The controller, which runs in the control plane, handles metrics collection and configuration management through [go-control-plane](https://github.com/envoyproxy/go-control-plane). It pushes configuration and metrics data updates to the worker container, which routes the requests to the appropriate PODs/nodes based on the traffic routing rules and the current metrics.
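
The push from controller to worker can be pictured as a versioned snapshot that the worker swaps in atomically and that ignores stale updates. The sketch below shows only that idea in plain Go. It deliberately does not use the go-control-plane API, and the `Snapshot` and `Store` types and their fields are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshot is a hypothetical bundle of everything the worker needs to route.
type Snapshot struct {
	Version   int
	Algorithm string           // e.g. "balance"
	PodFree   map[string]int32 // latest free capacity per pod
}

// Store holds the latest snapshot. The controller side calls Push, the
// routing path calls Current. Callers must not mutate PodFree after Push,
// because the map is shared rather than copied.
type Store struct {
	mu  sync.RWMutex
	cur Snapshot
}

// Push installs snap if it is newer than the current snapshot and reports
// whether it was applied. Stale or duplicate versions are dropped.
func (s *Store) Push(snap Snapshot) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if snap.Version <= s.cur.Version {
		return false
	}
	s.cur = snap
	return true
}

// Current returns the snapshot the worker should route with right now.
func (s *Store) Current() Snapshot {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.cur
}

func main() {
	var s Store
	fmt.Println(s.Push(Snapshot{Version: 1, Algorithm: "balance"})) // true
	fmt.Println(s.Push(Snapshot{Version: 1, Algorithm: "squeeze"})) // false, same version
	fmt.Println(s.Current().Algorithm)                              // balance
}
```

Versioning the whole snapshot, rather than patching individual fields, keeps the routing rules and the metrics they were computed from consistent with each other.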

Since traffic should be routed to the PODs chosen by the specified policies when the DLB is enabled, the original load balancing function of the K8S Service should be bypassed. To ease the use of the DLB, the worker proxy could be injected, automatically or manually, to replace the Service of a deployment. The proxy would then discover the Service's backend PODs and route traffic to them using the augmented algorithms.

Figure: DLB deployment in an OpenYurt node pool. Traffic enters through the Ingress and the service and reaches the Envoy proxy, which acts as the dynamic load balancer on one node; from there it is routed to the Workload services running on three other nodes.

### Risks and Mitigations

- What are the risks of this proposal and how do we mitigate? Think broadly.
  If we reuse an off-the-shelf product, we need to augment it cleanly so that development and maintenance stay easy.
- How will UX be reviewed and by whom?
  The DLB is best used through an end-to-end deployment operator that injects the proxy automatically.
- How will security be reviewed and by whom?
  Security is handled by reusing existing proxy products.
- Consider including folks that also work outside the SIG or subproject.

## Alternatives

The `Alternatives` section is used to highlight and record other possible approaches to delivering the value proposed by a proposal.

## Upgrade Strategy

If applicable, how will the component be upgraded? Make sure this is in the test plan.

Consider the following in developing an upgrade strategy for this enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade in order to keep previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade in order to make use of the enhancement?
If the hardware is updated, especially when a new CPU/GPU is introduced, or if a new inference algorithm is involved, for example, we may need to upgrade the software, since the DLB algorithm depends on them.

## Additional Details

### Test Plan [optional]

## Implementation History

- [ ] MM/DD/YYYY: Proposed idea in an issue or [community meeting]
- [ ] MM/DD/YYYY: Compile a Google Doc following the CAEP template (link here)
- [ ] MM/DD/YYYY: First round of feedback from community
- [ ] MM/DD/YYYY: Present proposal at a [community meeting]
- [ ] MM/DD/YYYY: Open proposal PR
