WIP: docs: Add RFC for multi-node consolidation partitioning #1547
Open. cnmcavoy wants to merge 1 commit into kubernetes-sigs:main from cnmcavoy:cmcavoy/rfc-multi-node-consolidation-partitioning
# Multi-node Consolidation Partitioning

## Background

Multi-node consolidation actions currently attempt to replace N nodes with a single node that can satisfy all the tenant workloads. This is desirable because each node in a cluster carries a daemonset cost and a baseline kubelet overhead, so reducing the node count by merging smaller nodes can be cheaper even if node costs are equal.

## Problem Statement

In a cluster that has many non-homogeneous nodepools (typically due to multi-tenancy), or where nodepools support multiple CPU architectures, multi-node consolidation fails because tenant workloads are unlikely to migrate successfully to a different nodepool or a different CPU architecture.

While tenant workloads may support multiple CPU architectures, in practice we have rarely observed workloads configured to spread evenly across multiple CPU architectures. Tenant workloads typically prefer the CPU architecture they were most optimized for, and customers rarely pursue a multi-architecture deployment strategy. Instead, one CPU architecture is typically favored for performance, cost, availability, or other reasons. When a cluster has multiple such tenants with different CPU architecture choices, the result is groups of nodes that cannot be consolidated with each other but could still be consolidated within their own group.

A similar problem occurs with non-homogeneous nodepools. In multi-tenant clusters, the nodepools for each customer often have differing requirements. Each tenant workload will target nodes that match its nodepool requirements. The more multi-tenancy in a cluster, the more likely it is that many nodes with incompatible requirements will exist.

As a result, multi-node consolidation suffers when the cluster is not homogeneous. In non-homogeneous clusters, candidate nodes will contain tenant workloads that have few or no valid destinations because of the CPU architecture or nodepool mismatch. Ultimately, this results in extra daemonset cost and kubelet overhead on the control plane, as non-homogeneous clusters accumulate many tiny nodes that fail to be consolidated.

## Proposal

Multi-node consolidation should be partitioned into multiple consolidation actions, each of which is responsible for consolidating nodes that are homogeneous in terms of CPU architecture and nodepool. This will allow more consolidation actions to succeed and will reduce the number of nodes in the cluster more effectively.

A naive, simple approach would be to have Karpenter create M bins for multi-node consolidation, based on the CPU architecture and the nodepool of each node. Multi-node consolidation would sort the bins by the number of nodes they contain and attempt multi-node consolidation on the nodes within each bin, starting with the largest bin and proceeding in descending order of size. As soon as any solution is found, that solution is returned.
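The naive binning described above could be sketched roughly as follows. This is an illustration only, not Karpenter's actual implementation: `Candidate`, `binKey`, `PartitionCandidates`, and `FirstSolution` are hypothetical names, and the `attempt` callback stands in for whatever the existing multi-node consolidation logic would do with one bin.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate is a hypothetical, simplified stand-in for a consolidation candidate.
type Candidate struct {
	Name     string
	NodePool string
	Arch     string
}

// binKey is the assumed partition boundary: nodepool plus CPU architecture.
type binKey struct {
	NodePool string
	Arch     string
}

// PartitionCandidates groups candidates by (nodepool, architecture) and returns
// the bins ordered largest first. The incoming order of candidates is preserved
// inside each bin, and equally sized bins keep their first-seen order.
func PartitionCandidates(candidates []Candidate) [][]Candidate {
	bins := map[binKey][]Candidate{}
	var keys []binKey
	for _, c := range candidates {
		k := binKey{NodePool: c.NodePool, Arch: c.Arch}
		if _, seen := bins[k]; !seen {
			keys = append(keys, k)
		}
		bins[k] = append(bins[k], c)
	}
	sort.SliceStable(keys, func(i, j int) bool {
		return len(bins[keys[i]]) > len(bins[keys[j]])
	})
	out := make([][]Candidate, 0, len(keys))
	for _, k := range keys {
		out = append(out, bins[k])
	}
	return out
}

// FirstSolution tries each bin in order and returns the first bin for which
// attempt reports success. Bins with fewer than two nodes are skipped because
// they cannot be the subject of a multi-node action.
func FirstSolution(bins [][]Candidate, attempt func([]Candidate) bool) ([]Candidate, bool) {
	for _, b := range bins {
		if len(b) < 2 {
			continue
		}
		if attempt(b) {
			return b, true
		}
	}
	return nil, false
}

func main() {
	candidates := []Candidate{
		{Name: "node-a", NodePool: "tenant-1", Arch: "amd64"},
		{Name: "node-b", NodePool: "tenant-1", Arch: "arm64"},
		{Name: "node-c", NodePool: "tenant-1", Arch: "amd64"},
		{Name: "node-d", NodePool: "tenant-2", Arch: "amd64"},
	}
	bins := PartitionCandidates(candidates)
	for _, b := range bins {
		fmt.Println(b[0].NodePool, b[0].Arch, len(b))
	}
	// Placeholder attempt that always succeeds; a real attempt would simulate
	// scheduling the bin's pods onto a replacement node.
	if sol, ok := FirstSolution(bins, func([]Candidate) bool { return true }); ok {
		fmt.Println("consolidate", len(sol), "nodes")
	}
}
```

Because the sketch keeps the incoming order inside each bin, any existing candidate ordering would carry over unchanged within a partition.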

## Drawbacks

### Homogeneous Nodepools
Partitioning multi-node consolidation may result in fewer successful consolidations if a cluster has many nodepools that are homogeneous (e.g., weighted fallback nodepools). Under the naive proposal, potentially consolidatable nodes in two or more different nodepools would not be consolidated together.

One possible way to mitigate this would be to make Karpenter smart enough to know whether two nodepools' requirements are similar, and then treat those nodepools as one bin. The simplest approach would be to look for equality in the nodepool requirements.
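As a rough illustration of the equality check, the sketch below computes an order-insensitive fingerprint of a nodepool's requirements so that two nodepools with the same requirements share one bin. The `Requirement` type and the `RequirementsKey` and `SameBin` helpers are hypothetical simplifications, not Karpenter's own types, and the comparison is literal equality rather than semantic equivalence.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Requirement is a hypothetical, simplified nodepool requirement.
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

// canonical renders one requirement in a form that ignores value order.
func canonical(r Requirement) string {
	vals := append([]string(nil), r.Values...) // copy so the caller's slice is not reordered
	sort.Strings(vals)
	return r.Key + "\x00" + r.Operator + "\x00" + strings.Join(vals, "\x01")
}

// RequirementsKey returns a fingerprint that is equal for two requirement
// lists containing the same requirements, regardless of ordering.
func RequirementsKey(reqs []Requirement) string {
	parts := make([]string, 0, len(reqs))
	for _, r := range reqs {
		parts = append(parts, canonical(r))
	}
	sort.Strings(parts)
	return strings.Join(parts, "\x02")
}

// SameBin reports whether two nodepools could share a bin under the
// "equal requirements" rule.
func SameBin(a, b []Requirement) bool {
	return RequirementsKey(a) == RequirementsKey(b)
}

func main() {
	primary := []Requirement{
		{Key: "kubernetes.io/arch", Operator: "In", Values: []string{"amd64"}},
		{Key: "example.com/tier", Operator: "In", Values: []string{"a", "b"}},
	}
	fallback := []Requirement{
		{Key: "example.com/tier", Operator: "In", Values: []string{"b", "a"}},
		{Key: "kubernetes.io/arch", Operator: "In", Values: []string{"amd64"}},
	}
	arm := []Requirement{
		{Key: "kubernetes.io/arch", Operator: "In", Values: []string{"arm64"}},
	}
	fmt.Println(SameBin(primary, fallback))
	fmt.Println(SameBin(primary, arm))
}
```

A fingerprint like this could replace the nodepool name in the bin key, so weighted fallback nodepools with identical requirements would fall into the same partition.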

I think the issue is more around how we order our nodes in consolidation, right?
We order our nodes by the estimated cost to disrupt the pods on that node, and then find some contiguous group of nodes from 0 that are compatible. If the set of candidates is ordered in such a way that each alternating node has a different architecture, you'll never be able to get multi-node consolidation. How were you thinking this might impact that complication today?

Does the estimated disruption cost ordering need to be changed? The disruption cost is independent of whether the node is compatible to be consolidated with another node.
What I am proposing is to bin nodes together with the nodes they are compatible with. This part is hand-wavy and a bit arbitrary; Karpenter has to decide where the boundaries are for which candidates should be consolidated together. If Karpenter picks bad boundaries (the current situation), then there are too many or too few bins, and consolidation fails. This is also where I am most interested in input... what should the boundaries be for each roughly compatible bin? Just CPU architecture? CPU architecture and nodepool? Maybe the node's operating system should be a boundary.

The problem is more about compatibility between pods, right? I can see what you're saying: right now, every node is in its own bucket (no pre-defined ordering). We can create arbitrary buckets/groupings and use that as a heuristic for grouping nodes in our consolidation ordering. The closest generalization for partitioning would be to group all nodes from the same nodepool together, but I think even then you're still hitting the same multi-arch issue on your NodePool, so you actually probably want to make the grouping separate from the NodePool.
If we just partition on CPU architecture, you probably solve most of the issues here, but I need to think more about how this might work in practice.