KEP-4962: Standardizing the Representation of Cluster Network Topology #4965
Conversation
/cc @aojea
@dmitsh: GitHub didn't allow me to request PR reviews from the following users: tardieu, arsenetar, brianhammons. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
### Network QoS Annotation
Format: `network.qos.kubernetes.io/switches: <QoS>`
- `<QoS>`: A JSON object where each key is a switch name (matching the network topology label) with a value containing:
So this object contains N items (of the structure below), where N is the number of predefined topology units (accelerator, block, datacenter, zone), right?
What I want to ensure is whether these need to be changed/updated when the cluster grows/shrinks (i.e. that they don't define distances between nodes themselves). But given that for a given node its contents don't depend on other nodes (rather on placement in the physical network), this seems to be fine.
@wojtek-t , that's correct, each node will contain QoS metrics between the node and every reachable switch.
We simplified this proposal - removed custom labels and removed annotations.
/assign @johnbelamaric for sig-architecture
- `<switch-name>`: Unique identifier for the switch
### Network QoS Annotation
Format: `network.qos.kubernetes.io/switches: <QoS>` |
This is not really the switch, right? It is the interface on the node that connects to the switch... and we already have properties to define attributes on network interfaces with DRA, see slide 14 https://docs.google.com/presentation/d/1Vdr7BhbYXeWjwmLjGmqnUkvJr_eOUdU0x-JxfXWxUT8/edit#slide=id.g2f750386db2_5_0, so I don't feel we need this additional annotation here.
It is not a NIC on the node. These are QoS metrics from the node to every reachable switch. Also, the "switch" in this context could be a physical network device, or an aggregated entity defined by a CSP. For example, AWS returns 3 levels of switches per node, but the actual number of physical switches is unknown.
> These are QoS metrics from the node to every reachable switch

How is the Node connected to the first switch :) ?
```
network.qos.kubernetes.io/switches: {
  "nvl10": {
    "latency": "2us",
    "bandwidth": "100Gbps"
  },
  "sw11": {
    "latency": "50us",
    "bandwidth": "40Gbps"
  },
  "sw21": {
    "latency": "500us",
    "bandwidth": "20Gbps"
  },
  "sw31": {
    "latency": "1ms",
    "bandwidth": "10Gbps"
  }
}
```
These are network interfaces on the node; I think we should better model them with DRA (https://github.com/kubernetes/enhancements/pull/4965/files#r1846865095), which also allows us to provide dynamic capabilities to the interfaces.
Again, as mentioned in the earlier comment, these QoS numbers represent node-to-reachable-switch metrics. They are not per NIC.
Does it mean that some switches may not be connected directly?
How are the latency and bandwidth then obtained, so that those values can be guaranteed?
Format: `network.topology.kubernetes.io/<nw-switch-type>: <switch-name>`
- `<nw-switch-type>`: Logical type of the network switch (can be one of the reserved names or a custom name)
- Reserved names: `accelerator`, `block`, `datacenter`, `zone` |
This is the part where we need to loop in SIG Architecture. I briefly touched on this with @thockin and it seems it took some time to settle on region/zone.
So my understanding is that we need to model a hierarchy. This KEP suggests using nested structures: zone > datacenter > block > accelerator, but we should at least describe in the alternatives why weights are not better than this. It seems to me that weights are easier to standardize across different topologies, as you only focus on the distance between layers and can support different architectures with multiple layers.
This KEP proposes using reserved network types for typical network architectures, while allowing the network topology to be extended with custom network types.
We provide the means for a weighted approach by specifying distance and/or bandwidth, latency, or other metrics.
These are actual, measurable physical characteristics of the network and will be more accurate than specifying static weights.
Once again, we are providing QoS between a node and a switch, so the distance is the number of hops between the node and the switch. The same goes for bandwidth/latency.
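To make the weighted reading concrete, here is a hedged sketch of how the hop-count distance could appear alongside latency and bandwidth in the QoS annotation; the switch names reuse the earlier example, and the exact `distance` field spelling is an assumption based on this thread, not text from the KEP:

```yaml
metadata:
  annotations:
    # Per-node QoS metrics toward each reachable switch (illustrative values).
    network.qos.kubernetes.io/switches: |
      {
        "nvl10": { "distance": 1, "latency": "2us",   "bandwidth": "100Gbps" },
        "sw11":  { "distance": 1, "latency": "50us",  "bandwidth": "40Gbps" },
        "sw21":  { "distance": 2, "latency": "500us", "bandwidth": "20Gbps" },
        "sw31":  { "distance": 3, "latency": "1ms",   "bandwidth": "10Gbps" }
      }
```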
This proposal is designed with extensibility in mind, enabling the use of custom network types. This ensures that the standard can adapt to future advancements in cluster networking without requiring significant overhauls.
For custom network types, Network QoS Annotations are required, with distance being the minimum mandatory metric. Specifying latency and bandwidth is optional, but including them can offer a more detailed view of link performance, enabling more efficient scheduling decisions. |
IIUC, the use of custom network types means that environments using them may not be compatible with other environments, which practically removes all the benefits of standardization; this is another place where I think a weighted/distance model can help better.
this was already addressed.
The same network topology depicted in Example 2 can be represented using custom network types.
Let's use `tor` for top-of-rack switches, `area` for the second level of switches, and `center` for the third level. |
After seeing this example I'm definitively not in favor of custom types, as it is impossible for a generic tool to infer the distance between these custom types... It will also create fragmentation and incompatibility, as multiple tools can define the same name with different meanings. The example also talks about levels, which reinforces my idea of weights, so something like `network.topology.kubernetes.io/tier: 1`.

In this example, `network.topology.kubernetes.io/tor: sw13` would become:

- Node:

```yaml
network.topology.kubernetes.io/tier: 1
```

- ResourceSlice:

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlice
…
spec:
  devices:
  - basic:
      attributes:
        tier:
          int: 1
        type:
          string: nvlink
        ip:
          string: 10.1.1.1/24
        latency:
          string: "50us"
        bandwidth:
          string: "100gbps"
```
When using custom network types, it is mandatory to define distance in the QoS annotation.
We explicitly expressed that in the KEP.
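For illustration only, a hedged sketch of how the custom types from the example (`tor`, `area`, `center`) might be combined with the mandatory distance metric on a node; the switch names and exact field spellings are assumptions, not text from the KEP:

```yaml
metadata:
  labels:
    # Custom network types, innermost to outermost (hypothetical switch names).
    network.topology.kubernetes.io/tor: sw13
    network.topology.kubernetes.io/area: sw22
    network.topology.kubernetes.io/center: sw31
  annotations:
    # Distance (hop count) is mandatory for custom types per this thread.
    network.qos.kubernetes.io/switches: |
      {
        "sw13": { "distance": 1 },
        "sw22": { "distance": 2 },
        "sw31": { "distance": 3 }
      }
```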
- Gang-scheduling auto-scaler
- DRA scheduler plugin
### Goals |
Are there existing topology formats or consumers that we want Kubernetes to integrate with? If so, are these integrations goals or non-goals?
To the best of my knowledge, only EKS exposes their custom node labels for network layers.
The goal of this KEP is to create a standard way of describing switch network topology.
Ultimately, we want to use this standard in the development of a Kubernetes-native, network-aware scheduler plugin for multi-node workloads. We list that work as a non-goal, as it would be an independent effort.
> Ultimately, we want to use this standard in development of Kubernetes-native network-aware scheduler plugin for multi-node workloads.
I'm (still) not seeing why that development work requires a KEP. An out-of-tree, out-of-project custom scheduler could use labels with keys outside of Kubernetes namespaces.
> Kubernetes-native network-aware scheduler plugin for multi-node workloads

Perhaps a reference implementation to show how the end-to-end flow works, or even use it as part of the integration tests.
## Summary
This document proposes a standard for declaring switch network topology in Kubernetes clusters, representing the hierarchy of nodes, switches, and interconnects. In this context, a switch can refer to a physical network device or a collection of such devices with close proximity and functionality. |
Kubernetes has generally shied away from describing physical attributes vs user intent.
One of the challenges that this proposal needs to address is how this can be future-proofed against changes in underlying technology.
Beyond CSPs, certain on-premises clusters support network topology discovery, though this capability depends on the features of the underlying switch network vendors.
An open-source project, [Topograph](https://github.com/NVIDIA/topograph), has implemented these approaches and is successfully deployed in production environments. |
It would be good to discuss what this proposal does differently from Topograph:
- Is it porting over the general concepts? What worked/did not?
- Does it address the problems with the Topograph project?
This proposal establishes a standard for node labels.
Topograph will assign these labels to cluster nodes.
Having these labels available in Kubernetes clusters will help in designing cloud-agnostic scheduling systems.
The scheduler will prioritize switches according to the order outlined above, providing a standardized approach for network-aware scheduling across a range of configurations.
### User Stories (Optional) |
Suggested change: `### User Stories (Optional)` → `### User Stories`
The scheduler plugin reconstructs the cluster network topology by interpreting `network.topology.kubernetes.io/...` node labels.
Using this topology information, it optimally binds pods to suitable nodes, reducing overall latency and improving performance. |
This is the solution more than part of the story.
addressed
It would be good for this proposal to talk about how a provider would validate whether their use of the labels meets the requirements for the layer types. In other words -- is it possible to write a conformance test for this feature? The various terms may not line up exactly with what is surfaced across providers, or the latency/bandwidth tradeoffs may differ depending on the technology stack. As a contract for the consuming systems, do we have a really good understanding of what the "standard" behavior is?

Also, is this required for providers to populate? If some levels are missing, is this a problem?

How do we future-proof the definitions -- how would a CSP add/remove a layer of the hierarchy? What are the implications for systems that consume the hierarchy in this case?

Also -- there are region/zone labels already: https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesioregion It would be good to discuss the relationship between those and this proposal.
Why is this file empty?
Is the intent to check the accuracy of node labeling by CSPs?
> Also -- there are region/zone labels already

Here are the updated proposed labels:
- `network.topology.kubernetes.io/accelerator`
- `network.topology.kubernetes.io/block`
- `network.topology.kubernetes.io/spine`
- `network.topology.kubernetes.io/datacenter`
Keeping the core of Kubernetes simple enough to test and maintain feels valuable, and building a specific kind of complexity into a stable API (Pod) could end up with us having to keep a compatibility promise we hadn't planned to make.
We are not suggesting changing the built-in scheduler. Adding extra node labels doesn't change any existing functionality.
OK, so: we need to make a clear case for why this needs a KEP. Imagine someone wants to make a new component, acceleratorsched.example, and base it on kube-scheduler but with custom plugins. Great, they can. But the extra labels could live under that component's own domain.

Right now I don't even see that approach listed as an alternative, @dmitsh, and that concerns me. There's an alternative about relying on cloud providers, but I'm suggesting something vendor-neutral. Vendor-neutral doesn't have to imply "part of Kubernetes".
- Development of Kubernetes-native scheduler plugins for network-topology-aware scheduling, such as:
  - Topology-aware gang-scheduling plugin.
  - Gang-scheduling auto-scaler.
  - Device resource allocation (DRA) scheduler plugin.
nit: DRA = Dynamic Resource Allocation
Thanks! Fixed.
operator: In
values:
- training
topologyKey: network.topology.kubernetes.io/accelerator |
Does this proposal address this user story? If so, can you add an example:
As a data scientist, I want to ensure the pods are always placed within the same nvlink domain first before choosing another nvlink domain and I do not want to modify the job specification when migrating the job across different Kubernetes environments.
In the current example, the `network.topology.kubernetes.io/accelerator` value would be unique for each nvlink domain, right? If so, users would need to modify the job spec when moving across different Kubernetes environments.
Right, the `network.topology.kubernetes.io/accelerator` label value is unique per nvlink domain. But the user only provides the label name, i.e. `network.topology.kubernetes.io/accelerator`. So the scheduler will try to find a set of nodes with the same label value, hence placing the jobs in the same nvlink domain. There is no need to modify the job spec when migrating environments. User Story 1 describes this exact scenario.
NVM, I mixed up the labelSelector used in the example with the topologyKey. It sounds like when the same topologyKey is used, it is assumed that the scheduler will ensure pods are placed on nodes with the required resources AND the same topologyKey value.
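To illustrate that distinction, a hedged sketch with hypothetical node names and a hypothetical second domain value: the job spec references only the label key, while the per-domain values differ across nodes.

```yaml
# node-a and node-b share an nvlink domain, so they carry the same label value;
# node-c sits in a different domain and carries a different value.
# Pod specs never hard-code these values, only the label key.
apiVersion: v1
kind: Node
metadata:
  name: node-a
  labels:
    network.topology.kubernetes.io/accelerator: nvl10
---
apiVersion: v1
kind: Node
metadata:
  name: node-b
  labels:
    network.topology.kubernetes.io/accelerator: nvl10
---
apiVersion: v1
kind: Node
metadata:
  name: node-c
  labels:
    network.topology.kubernetes.io/accelerator: nvl20
```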
Additionally, I do not want to modify the job specification when migrating the job across different Kubernetes environments.
To achieve this, I can leverage the pod affinity feature of the default Kubernetes scheduler with topology keys: |
I think using the beta DRA APIs would be a better way to represent this. By definition, `preferredDuringSchedulingIgnoredDuringExecution` is not a hard constraint, and would enable the scheduler to ignore these requirements if they were unavailable on the cluster and instead schedule on a node that has a non-working or degraded topology.

Additionally, this would suggest that the scope of the KEP could expand beyond just node labels, and instead define a set of well-known standard namespaces that could be applied in any suitable API data model (e.g., node label, ResourceSlice [a DRA-provided resource], DeviceClass [another DRA-provided resource], etc.).

wdyt?
You can always use `requiredDuringSchedulingIgnoredDuringExecution` for a hard constraint. I used `preferredDuringSchedulingIgnoredDuringExecution` as an example.
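As a concrete follow-up, a hedged, self-contained sketch of the hard-constraint variant using the proposed label key; the pod name, app label value, and image are illustrative, not taken from the KEP:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
  labels:
    app: training
spec:
  affinity:
    podAffinity:
      # Hard constraint: co-schedule with other 'training' pods
      # inside the same accelerator (e.g. nvlink) domain.
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - training
        topologyKey: network.topology.kubernetes.io/accelerator
  containers:
  - name: worker
    image: registry.example/training:latest  # hypothetical image
```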
## Summary
This document proposes a standard for declaring network topology in Kubernetes clusters, |
Is "proposes a standard set of terminologies for declaring network topology"... more clear?
thanks, updated!
scheduling pods in close network proximity is becoming essential.
Examples of such workloads include AI/ML training jobs or sets of interdependent, data-intensive services.
However, Kubernetes currently lacks a standard method to describe cluster network topology, which is a key area for improvement. |
I'm not sure this KEP is proposing a standard method, but rather a set of common terms. A method would be "label your nodes using label XYZ" or "implement a custom operator". We don't seem to be advocating for one method or another, but rather we want to define a standard terminology so that any chosen method can render the underlying network topology in a language that we all understand. We anticipate that different environments may have good reason to prefer one method over another (device-plugin + node labels vs DRA vs custom scheduler vs custom operator).
Do other folks agree? I think "Kubernetes currently lacks a standard set of terms to describe network topology" would be a clearer expression of whatever we're doing, if so.
changed the wording, thank you!
### Goals
- Introduce a standard way of representing network topology in Kubernetes clusters |
I know this section was bigger before, but now it's too small. There has to be some goal of using that topology representation for something. Otherwise why would we want it?
We propose to use the following four network hierarchy layer types:
1. `accelerator`: Network interconnect for direct accelerator communication (e.g., Multi-node NVLink interconnect between NVIDIA GPUs)
2. `block`: Rack-level switches connecting hosts in one or more racks as a block.
3. `spine`: Spine-level switches connecting multiple blocks inside a datacenter.
4. `datacenter`: Zonal switches connecting multiple datacenters inside an availability zone.
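For illustration, a hedged sketch of how a single node's labels might carry all four layers; the switch identifiers reuse names from the earlier QoS example and are not prescribed by the KEP:

```yaml
metadata:
  labels:
    # Innermost to outermost layer (hypothetical switch identifiers).
    network.topology.kubernetes.io/accelerator: nvl10
    network.topology.kubernetes.io/block: sw11
    network.topology.kubernetes.io/spine: sw21
    network.topology.kubernetes.io/datacenter: sw31
```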
This seems like a weird breakdown to me. I feel like every time people have talked about adding more topology labels beyond "region" and "zone", someone brings up "rack", but you just skip right over that layer...
In VM-based environments, you would want a layer for "hypervisor", but again, that doesn't correspond to anything here.
And you talk about people wanting to use these generically, but what if a given network is missing one of these layers? E.g., if my datacenter has no "block"s, but someone deploys a job that tries to use affinity by block, then what happens?
For that matter, if my application wants 50ms of latency between pods, then what level should I use for the pod affinity? Presumably that would be spine-level latency in some datacenters, block-level latency in others, and unachievable in yet others. With the existing "zone" and "region" levels, the assumption is that there's an order of magnitude of difference between the levels, so you can think about them in an agnostic way, but that doesn't really seem to work for these lower topology levels.
So, it seems to me like either:
- The labels have cluster-specific-meanings (in which case they don't need to be standardized), or
- The scheduling should be based not on topology per se, but on the actual latency/distance metrics that are computed from the topology (in which case the latency/distance information needs to be provided in a standardized way but the topology itself does not).
3. `spine`: Spine-level switches connecting multiple blocks inside a datacenter.
4. `datacenter`: Zonal switches connecting multiple datacenters inside an availability zone.
These types will accommodate the majority of common network hierarchies across different CSP and on-prem environments. |
Anyone writing this KEP two years ago would not have included "accelerator" as a level... so I worry about how future-proof this approach is.
As technology advances, things change ...
As a data scientist running a data-intensive large-scale AI training job, I want to optimize the runtime
by binding pods to nodes that are in close network proximity.
This ensures better performance for my distributed workloads. |
as others have noted, DRA is already addressing this use case
I'm not sure DRA can be used to represent the topology of the entire cluster.
Discussed during the SIG Architecture Meeting on 12 Dec 2024: https://docs.google.com/document/d/1BlmHq5uPyBUDlppYqAAzslVbAO8hilgjqZUTaNXUhKM/edit?tab=t.0

The overall agreement seems to be that we're going to hold off on implementing this, mainly because the ecosystem is still a bit wild west and it's not clear what the "standard" will look like in the future. New patterns or technologies may succeed or become outdated quickly, and that will cause problems later, because enforcing a specific way to represent network topology could limit that flexibility and might even break compatibility with other projects.

/hold
@dmitsh if you want help building an out-of-project but vendor-neutral way to capture this - and maybe to provide hints to the scheduler - then I think we can find some help for you. The underlying ambition is important, even if right now this solution doesn't feel right.