KEP-4962: Standardizing the Representation of Cluster Network Topology #4965
Conversation
/cc @aojea
@dmitsh: GitHub didn't allow me to request PR reviews from the following users: tardieu, arsenetar, brianhammons. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
### Network QoS Annotation
Format: `network.qos.kubernetes.io/switches: <QoS>`
- `<QoS>`: A JSON object where each key is a switch name (matching the network topology label) with a value containing:
So this object contains N items (of the structure below), where N is the number of predefined topology units (accelerator, block, datacenter, zone), right?
What I want to ensure is whether these need to be changed/updated when the cluster grows/shrinks (i.e. that they don't define distances between nodes themselves). But given that for a given node its contents don't depend on other nodes (rather on placement in the physical network), this seems to be fine.
@wojtek-t , that's correct, each node will contain QoS metrics between the node and every reachable switch.
We simplified this proposal - removed custom labels and removed annotations.
/assign @johnbelamaric for sig-architecture
- `<switch-name>`: Unique identifier for the switch
### Network QoS Annotation
Format: `network.qos.kubernetes.io/switches: <QoS>` |
This is not really the switch, right? It is the interface on the node that connects to the switch... and we already have properties to define attributes on network interfaces with DRA, see slide 14 https://docs.google.com/presentation/d/1Vdr7BhbYXeWjwmLjGmqnUkvJr_eOUdU0x-JxfXWxUT8/edit#slide=id.g2f750386db2_5_0, so I don't feel we need this additional annotation here.
It is not a NIC on the node. These are QoS metrics from the node to every reachable switch. Also, the "switch" in this context could be a physical network device, or an aggregated entity defined by a CSP. For example, AWS returns 3 levels of switches per node, but the actual number of physical switches is unknown.
> These are QoS metrics from the node to every reachable switch

How is the Node connected to the first switch :) ?
```
network.qos.kubernetes.io/switches: {
  "nvl10": {
    "latency": "2us",
    "bandwidth": "100Gbps"
  },
  "sw11": {
    "latency": "50us",
    "bandwidth": "40Gbps"
  },
  "sw21": {
    "latency": "500us",
    "bandwidth": "20Gbps"
  },
  "sw31": {
    "latency": "1ms",
    "bandwidth": "10Gbps"
  }
}
```
These are network interfaces on the node; I think we should better model them with DRA (https://github.com/kubernetes/enhancements/pull/4965/files#r1846865095), which also allows us to provide dynamic capabilities to the interfaces.
Again, as mentioned in the earlier comment, these QoS numbers represent node-to-reachable-switch metrics. They are not per NIC.
Does it mean that some switches may not be connected directly?
How are the latency and bandwidth then obtained, so that those values can be guaranteed?
Format: `network.topology.kubernetes.io/<nw-switch-type>: <switch-name>`
- `<nw-switch-type>`: Logical type of the network switch (can be one of the reserved names or a custom name)
- Reserved names: `accelerator`, `block`, `datacenter`, `zone` |
This is the part where we need to loop in SIG Architecture. I briefly touched on this with @thockin and it seems it took some time to settle on region/zone.
So my understanding is that we need to model a hierarchy. This KEP suggests using nested structures: zone > datacenter > block > accelerator, but we should at least describe in the alternatives why weights are not better than this. It seems to me that weights are easier to standardize across different topologies, as you only focus on the distance between layers and can support different architectures with multiple layers.
This KEP proposes using reserved network types for typical network architectures, while allowing the network topology to be extended with custom network types.
We provide the means for a weighted approach by specifying distance and/or bandwidth, latency, or other metrics.
These are actual, measurable physical characteristics of the network and will be more accurate than specifying static weights.
Once again, we are providing QoS between a node and a switch, so the distance is the number of hops between the node and the switch. The same goes for bandwidth/latency.
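To make the weighted reading concrete, here is a hedged sketch of how the hop-count distance could appear alongside latency and bandwidth in the QoS annotation; the switch names reuse the earlier example, and the exact `distance` field spelling is an assumption based on this thread, not text from the KEP:

```yaml
metadata:
  annotations:
    # Per-node QoS metrics toward each reachable switch (illustrative values).
    network.qos.kubernetes.io/switches: |
      {
        "nvl10": { "distance": 1, "latency": "2us",   "bandwidth": "100Gbps" },
        "sw11":  { "distance": 1, "latency": "50us",  "bandwidth": "40Gbps" },
        "sw21":  { "distance": 2, "latency": "500us", "bandwidth": "20Gbps" },
        "sw31":  { "distance": 3, "latency": "1ms",   "bandwidth": "10Gbps" }
      }
```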
This proposal is designed with extensibility in mind, enabling the use of custom network types. This ensures that the standard can adapt to future advancements in cluster networking without requiring significant overhauls.
For custom network types, Network QoS Annotations are required, with distance being the minimum mandatory metric. Specifying latency and bandwidth is optional, but including them can offer a more detailed view of link performance, enabling more efficient scheduling decisions. |
IIUC, the use of custom network types means that environments using them may not be compatible with other environments, which practically removes all the benefits of standardization; this is another place where I think a weighted/distance model can help better.
this was already addressed.
The same network topology depicted in Example 2 can be represented using custom network types.
Let's use `tor` for top-of-rack switches, `area` for the second level of switches, and `center` for the third level. |
After seeing this example I'm definitively not in favor of custom types, as it is impossible for a generic tool to infer the distance between these custom types... It will also create fragmentation and incompatibility, as multiple tools can define the same name with different meanings. The example also talks about levels, which reinforces my idea of weights, so something like `network.topology.kubernetes.io/tier: 1`.

In this example, `network.topology.kubernetes.io/tor: sw13` would become:

- Node:

```yaml
network.topology.kubernetes.io/tier: 1
```

- ResourceSlice:

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlice
…
spec:
  devices:
  - basic:
      attributes:
        tier:
          int: 1
        type:
          string: nvlink
        ip:
          string: 10.1.1.1/24
        latency:
          string: "50us"
        bandwidth:
          string: "100gbps"
```
When using custom network types, it is mandatory to define distance in the QoS annotation.
We explicitly expressed that in the KEP.
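For illustration only, a hedged sketch of how the custom types from the example (`tor`, `area`, `center`) might be combined with the mandatory distance metric on a node; the switch names and exact field spellings are assumptions, not text from the KEP:

```yaml
metadata:
  labels:
    # Custom network types, innermost to outermost (hypothetical switch names).
    network.topology.kubernetes.io/tor: sw13
    network.topology.kubernetes.io/area: sw22
    network.topology.kubernetes.io/center: sw31
  annotations:
    # Distance (hop count) is mandatory for custom types per this thread.
    network.qos.kubernetes.io/switches: |
      {
        "sw13": { "distance": 1 },
        "sw22": { "distance": 2 },
        "sw31": { "distance": 3 }
      }
```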
- Gang-scheduling auto-scaler
- DRA scheduler plugin
### Goals |
Are there existing topology formats or consumers that we want Kubernetes to integrate with? If so, are these integrations goals or non-goals?
To the best of my knowledge, only EKS exposes their custom node labels for network layers.
The goal of this KEP is to create a standard way of describing switch network topology.
Ultimately, we want to use this standard in the development of a Kubernetes-native, network-aware scheduler plugin for multi-node workloads. We list that work as a non-goal, as it would be an independent effort.
> Ultimately, we want to use this standard in development of Kubernetes-native network-aware scheduler plugin for multi-node workloads.
I'm (still) not seeing why that development work requires a KEP. An out-of-tree, out-of-project custom scheduler could use labels with keys outside of Kubernetes namespaces.
> Kubernetes-native network-aware scheduler plugin for multi-node workloads

Perhaps a reference implementation to show how the end-to-end flow works, or even use it as part of the integration tests.
## Summary
This document proposes a standard for declaring switch network topology in Kubernetes clusters, representing the hierarchy of nodes, switches, and interconnects. In this context, a switch can refer to a physical network device or a collection of such devices with close proximity and functionality. |
Kubernetes has generally shied away from describing physical attributes vs user intent.
One of the challenges that this proposal needs to address is how this can be future-proofed against changes in underlying technology.
Beyond CSPs, certain on-premises clusters support network topology discovery, though this capability depends on the features of the underlying switch network vendors.
An open-source project, [Topograph](https://github.com/NVIDIA/topograph), has implemented these approaches and is successfully deployed in production environments. |
It would be good to discuss what this proposal does differently from Topograph:
- Is it porting over the general concepts? What worked/did not?
- Does it address the problems with the Topograph project?
This proposal establishes a standard for node labels.
Topograph will assign these labels to cluster nodes.
Having these labels available in Kubernetes clusters will help in designing cloud-agnostic scheduling systems.
The scheduler will prioritize switches according to the order outlined above, providing a standardized approach for network-aware scheduling across a range of configurations.
### User Stories (Optional) |
Suggested change: `### User Stories (Optional)` → `### User Stories`
The scheduler plugin reconstructs the cluster network topology by interpreting `network.topology.kubernetes.io/...` node labels.
Using this topology information, it optimally binds pods to suitable nodes, reducing overall latency and improving performance. |
This is the solution more than part of the story.
addressed
It would be good for this proposal to talk about how a provider would validate whether their use of the labels meets the requirements for the layer types. In other words -- is it possible to write a conformance test for this feature? The various terms may not line up exactly with what is surfaced across providers, or the latency/bandwidth tradeoffs may differ depending on the technology stack. As a contract for the consuming systems, do we have a really good understanding of what the "standard" behavior is?

Also, is this required for providers to populate? If some levels are missing, is this a problem?

How do we future-proof the definitions -- how would a CSP add/remove a layer of the hierarchy? What are the implications for systems that consume the hierarchy in this case?

Also -- there are region/zone labels already: https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesioregion It would be good to discuss the relationship between those and this proposal.
Why is this file empty?
Is the intent to check the accuracy of node labeling by CSPs?
> Also -- there are region/zone labels already

Here are the updated proposed labels:
- `network.topology.kubernetes.io/accelerator`
- `network.topology.kubernetes.io/block`
- `network.topology.kubernetes.io/spine`
- `network.topology.kubernetes.io/datacenter`
Keeping the core of Kubernetes simple enough to test and maintain feels valuable, and building a specific kind of complexity into a stable API (Pod) could end up with us having to keep a compatibility promise we hadn't planned to make.
We are not suggesting changing the built-in scheduler. Adding extra node labels doesn't change any existing functionality.
OK, so: we need to make a clear case for why this needs a KEP. Imagine someone wants to make a new component, acceleratorsched.example, and base it on kube-scheduler but with custom plugins. Great, they can. But the extra labels could live under that component's own domain.

Right now I don't even see that approach listed as an alternative, @dmitsh, and that concerns me. There's an alternative about relying on cloud providers, but I'm suggesting something vendor-neutral. Vendor-neutral doesn't have to imply "part of Kubernetes".
- Development of Kubernetes-native scheduler plugins for network-topology-aware scheduling, such as:
  - Topology-aware gang-scheduling plugin.
  - Gang-scheduling auto-scaler.
  - Device resource allocation (DRA) scheduler plugin.
nit: DRA = Dynamic Resource Allocation
Thanks! Fixed.
operator: In
values:
- training
topologyKey: network.topology.kubernetes.io/accelerator |
Does this proposal address this user story? If so, can you add an example:
As a data scientist, I want to ensure the pods are always placed within the same nvlink domain first before choosing another nvlink domain and I do not want to modify the job specification when migrating the job across different Kubernetes environments.
In the current example, the `network.topology.kubernetes.io/accelerator` value would be unique for each nvlink domain, right? If so, users would need to modify the job spec when moving across different Kubernetes environments.
Right, the `network.topology.kubernetes.io/accelerator` label value is unique per nvlink domain. But the user only provides the label name, i.e. `network.topology.kubernetes.io/accelerator`. So the scheduler will try to find a set of nodes with the same label value, hence placing the jobs in the same nvlink domain. There is no need to modify the job spec when migrating environments. User Story 1 describes this exact scenario.
NVM, I mixed up the labelSelector used in the example with the topologyKey. It sounds like when the same topologyKey is used, it is assumed that the scheduler will ensure pods are placed on nodes with the required resources AND the same topologyKey value.
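To illustrate that distinction, a hedged sketch with hypothetical node names and a hypothetical second domain value: the job spec references only the label key, while the per-domain values differ across nodes.

```yaml
# node-a and node-b share an nvlink domain, so they carry the same label value;
# node-c sits in a different domain and carries a different value.
# Pod specs never hard-code these values, only the label key.
apiVersion: v1
kind: Node
metadata:
  name: node-a
  labels:
    network.topology.kubernetes.io/accelerator: nvl10
---
apiVersion: v1
kind: Node
metadata:
  name: node-b
  labels:
    network.topology.kubernetes.io/accelerator: nvl10
---
apiVersion: v1
kind: Node
metadata:
  name: node-c
  labels:
    network.topology.kubernetes.io/accelerator: nvl20
```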
Additionally, I do not want to modify the job specification when migrating the job across different Kubernetes environments.
To achieve this, I can leverage the pod affinity feature of the default Kubernetes scheduler with topology keys: |
I think using the beta DRA APIs would be a better way to represent this. By definition, `preferredDuringSchedulingIgnoredDuringExecution` is not a hard constraint, and would enable the scheduler to ignore these requirements if they were unavailable on the cluster and instead schedule on a node that has a non-working or degraded topology.

Additionally, this would suggest that the scope of the KEP could expand beyond just node labels, and instead define a set of well-known standard namespaces that could be applied in any suitable API data model (e.g., node label, ResourceSlice [a DRA-provided resource], DeviceClass [another DRA-provided resource], etc.).

wdyt?
You can always use `requiredDuringSchedulingIgnoredDuringExecution` for a hard constraint. I used `preferredDuringSchedulingIgnoredDuringExecution` as an example.
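As a concrete follow-up, a hedged, self-contained sketch of the hard-constraint variant using the proposed label key; the pod name, app label value, and image are illustrative, not taken from the KEP:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
  labels:
    app: training
spec:
  affinity:
    podAffinity:
      # Hard constraint: co-schedule with other 'training' pods
      # inside the same accelerator (e.g. nvlink) domain.
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - training
        topologyKey: network.topology.kubernetes.io/accelerator
  containers:
  - name: worker
    image: registry.example/training:latest  # hypothetical image
```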
## Summary
This document proposes a standard for declaring network topology in Kubernetes clusters, |
Is "proposes a standard set of terminologies for declaring network topology"... more clear?
thanks, updated!
scheduling pods in close network proximity is becoming essential.
Examples of such workloads include AI/ML training jobs or sets of interdependent, data-intensive services.
However, Kubernetes currently lacks a standard method to describe cluster network topology, which is a key area for improvement. |
I'm not sure this KEP is proposing a standard method, but rather a set of common terms. A method would be "label your nodes using label XYZ" or "implement a custom operator". We don't seem to be advocating for one method or another, but rather we want to define a standard terminology so that any chosen method can render the underlying network topology in a language that we all understand. We anticipate that different environments may have good reason to prefer one method over another (device-plugin + node labels vs DRA vs custom scheduler vs custom operator).
Do other folks agree? I think "Kubernetes currently lacks a standard set of terms to describe network topology" would be a clearer expression of whatever we're doing, if so.
changed the wording, thank you!
### Goals
- Introduce a standard way of representing network topology in Kubernetes clusters |
I know this section was bigger before, but now it's too small. There has to be some goal of using that topology representation for something. Otherwise why would we want it?
We propose to use the following four network hierarchy layer types:
1. `accelerator`: Network interconnect for direct accelerator communication (e.g., Multi-node NVLink interconnect between NVIDIA GPUs)
2. `block`: Rack-level switches connecting hosts in one or more racks as a block.
3. `spine`: Spine-level switches connecting multiple blocks inside a datacenter.
4. `datacenter`: Zonal switches connecting multiple datacenters inside an availability zone.
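For illustration, a hedged sketch of how a single node's labels might carry all four layers; the switch identifiers reuse names from the earlier QoS example and are not prescribed by the KEP:

```yaml
metadata:
  labels:
    # Innermost to outermost layer (hypothetical switch identifiers).
    network.topology.kubernetes.io/accelerator: nvl10
    network.topology.kubernetes.io/block: sw11
    network.topology.kubernetes.io/spine: sw21
    network.topology.kubernetes.io/datacenter: sw31
```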
This seems like a weird breakdown to me. I feel like every time people have talked about adding more topology labels beyond "region" and "zone", someone brings up "rack", but you just skip right over that layer...
In VM-based environments, you would want a layer for "hypervisor", but again, that doesn't correspond to anything here.
And you talk about people wanting to use these generically, but what if a given network is missing one of these layers? E.g., if my datacenter has no "block"s, but someone deploys a job that tries to use affinity by block, then what happens?
For that matter, if my application wants 50ms of latency between pods, then what level should I use for the pod affinity? Presumably that would be spine-level latency in some datacenters, block-level latency in others, and unachievable in yet others. With the existing "zone" and "region" levels, the assumption is that there's an order of magnitude of difference between the levels, so you can think about them in an agnostic way, but that doesn't really seem to work for these lower topology levels.
So, it seems to me like either:
- The labels have cluster-specific-meanings (in which case they don't need to be standardized), or
- The scheduling should be based not on topology per se, but on the actual latency/distance metrics that are computed from the topology (in which case the latency/distance information needs to be provided in a standardized way but the topology itself does not).
3. `spine`: Spine-level switches connecting multiple blocks inside a datacenter.
4. `datacenter`: Zonal switches connecting multiple datacenters inside an availability zone.
These types will accommodate the majority of common network hierarchies across different CSP and on-prem environments. |
Anyone writing this KEP two years ago would not have included "accelerator" as a level... so I worry about how future-proof this approach is.
As technology advances, things change ...
As a data scientist running a data-intensive large-scale AI training job, I want to optimize the runtime
by binding pods to nodes that are in close network proximity.
This ensures better performance for my distributed workloads. |
as others have noted, DRA is already addressing this use case
I'm not sure DRA can be used to represent the topology of the entire cluster.
Discussed during the SIG Architecture Meeting on 12 Dec 2024: https://docs.google.com/document/d/1BlmHq5uPyBUDlppYqAAzslVbAO8hilgjqZUTaNXUhKM/edit?tab=t.0

The overall agreement seems to be that we're going to hold off on implementing this, mainly because the ecosystem is still a bit wild west and it's not clear what the "standard" will look like in the future. New patterns or technologies may succeed or become outdated quickly, and that will cause problems later, because enforcing a specific way to represent network topology could limit that flexibility and might even break compatibility with other projects.

/hold
@dmitsh if you want help building an out-of-project but vendor-neutral way to capture this - and maybe to provide hints to the scheduler - then I think we can find some help for you. The underlying ambition is important, even if right now this solution doesn't feel right.