
KEP-4962: Standardizing the Representation of Cluster Network Topology #4965

Open · wants to merge 11 commits into base: master
Conversation

@dmitsh dmitsh commented Nov 15, 2024

  • One-line PR description:
    Standardizing Cluster Network Topology Representation
  • Other comments:

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA) and kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory) labels on Nov 15, 2024.
@k8s-ci-robot added the sig/network (Categorizes an issue or PR as relevant to SIG Network) label on Nov 15, 2024.
@k8s-ci-robot (Contributor):

Welcome @dmitsh!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test) label on Nov 15, 2024.
@k8s-ci-robot (Contributor):

Hi @dmitsh. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files) label on Nov 15, 2024.

dmitsh commented Nov 15, 2024

/cc @aojea

@k8s-ci-robot k8s-ci-robot requested a review from aojea November 15, 2024 17:31

dmitsh commented Nov 15, 2024

/cc @brianhammons @mwielgus @tardieu @mickeyboxell @arsenetar

@k8s-ci-robot (Contributor):

@dmitsh: GitHub didn't allow me to request PR reviews from the following users: tardieu, arsenetar, brianhammons.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @brianhammons @mwielgus @tardieu @mickeyboxell @arsenetar

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


### Network QoS Annotation
Format: `network.qos.kubernetes.io/switches: <QoS>`
- `<QoS>`: A JSON object where each key is a switch name (matching the network topology label) with a value containing:
Member:

So this object contains N items (of the structure below), where N is the number of predefined topology units (accelerator, block, datacenter, zone), right?

What I want to ensure is whether these need to be changed/updated when the cluster grows/shrinks (i.e., that they don't define distances between nodes themselves). But given that, for a given node, its contents don't depend on other nodes (rather on its placement in the physical network), this seems to be fine.

@dmitsh (Author), Nov 18, 2024:

@wojtek-t , that's correct, each node will contain QoS metrics between the node and every reachable switch.

Author:

We simplified this proposal - removed custom labels and removed annotations.


aojea commented Nov 18, 2024

/assign @johnbelamaric

for sig-architecture

- `<switch-name>`: Unique identifier for the switch

### Network QoS Annotation
Format: `network.qos.kubernetes.io/switches: <QoS>`
@aojea (Member), Nov 18, 2024:

This is not really the switch, right? It is the interface on the node that connects to the switch ... and we already have properties to define attributes on network interfaces with DRA, see slide 14 https://docs.google.com/presentation/d/1Vdr7BhbYXeWjwmLjGmqnUkvJr_eOUdU0x-JxfXWxUT8/edit#slide=id.g2f750386db2_5_0 so I don't feel we need this additional annotation here.

cc: @johnbelamaric @thockin

Author:

It is not a NIC on the node. These are QoS metrics from the node to every reachable switch. Also, the "switch" in this context could be a physical network device, or an aggregated entity defined by a CSP. For example, AWS returns 3 levels of switches per node, but the actual number of physical switches is unknown.

@aojea (Member), Nov 19, 2024:

> These are QoS metrics from the node to every reachable switch

How is the node connected to the first switch, then :) ?

Comment on lines 270 to 286
network.qos.kubernetes.io/switches: {
  "nvl10": {
    "latency": "2us",
    "bandwidth": "100Gbps"
  },
  "sw11": {
    "latency": "50us",
    "bandwidth": "40Gbps"
  },
  "sw21": {
    "latency": "500us",
    "bandwidth": "20Gbps"
  },
  "sw31": {
    "latency": "1ms",
    "bandwidth": "10Gbps"
  }
Member:

These are network interfaces on the node. I think we should better model them with DRA (https://github.com/kubernetes/enhancements/pull/4965/files#r1846865095), which also allows us to provide dynamic capabilities on the interfaces.

Author:

Again, as mentioned in the earlier comment, these QoS numbers represent node-to-reachable-switch metrics. They are not per NIC.

Member:

Does this mean that some switches may not be connected directly?
How are the latency and bandwidth then obtained to guarantee those values?

Comment on lines 188 to 190
Format: `network.topology.kubernetes.io/<nw-switch-type>: <switch-name>`
- `<nw-switch-type>`: Logical type of the network switch (can be one of the reserved names or a custom name)
- Reserved names: `accelerator`, `block`, `datacenter`, `zone`
Member:

This is the part where we need to loop in SIG Architecture. I briefly touched on this with @thockin, and it seems it took some time to settle on region/zone.

So my understanding is that we need to model a hierarchy. This KEP suggests using nested structures (zone > datacenter > block > accelerator), but we should at least describe in the alternatives why weights are not better than this. It seems to me that weights are easier to standardize across different topologies, since you only focus on the distance between layers and can support different architectures with multiple layers.

Author:

This KEP proposes using reserved network types for typical network architectures, while allowing the network topology to be extended with custom network types.
We provide the means for a weighted approach by specifying distance and/or bandwidth, latency, or other metrics.
These are actual measurable physical characteristics of the network and will be more accurate than specifying static weights.
Once again, we are providing QoS between a node and a switch, so the distance is the number of hops between the node and the switch. The same goes for bandwidth/latency.
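
To make this concrete, here is a minimal sketch (not an excerpt from the KEP; the switch names and metric values are hypothetical) of how a node could carry both the reserved topology labels and the per-switch QoS metrics, with distance expressed as the hop count from the node:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-1                                        # hypothetical node name
  labels:
    network.topology.kubernetes.io/block: sw11        # reserved type: rack/block-level switch
    network.topology.kubernetes.io/datacenter: sw21   # reserved type: datacenter-level switch
  annotations:
    # JSON object keyed by switch name; "distance" here is the hop count from the node
    network.qos.kubernetes.io/switches: |
      {
        "sw11": {"distance": 1, "latency": "50us",  "bandwidth": "40Gbps"},
        "sw21": {"distance": 3, "latency": "500us", "bandwidth": "20Gbps"}
      }
```

The label values identify which switch the node hangs off at each layer, while the annotation carries the measured (or provider-reported) metrics toward each reachable switch.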

Comment on lines 327 to 329
This proposal is designed with extensibility in mind, enabling the use of custom network types. This ensures that the standard can adapt to future advancements in cluster networking without requiring significant overhauls.

For custom network types, Network QoS Annotations are required, with distance being the minimum mandatory metric. Specifying latency and bandwidth is optional, but including them can offer a more detailed view of link performance, enabling more efficient scheduling decisions.
Member:

IIUIC, the use of custom network types means that environments that use them may not be compatible with other environments, which practically removes all the benefits of standardization; another place where I think a weights/distances model can help better.

Author:

this was already addressed.


The same network topology depicted in Example 2 can be represented using custom network types.

Let's use `tor` for top-of-rack switches, `area` for the second level of switches, and `center` for the third level.
@aojea (Member), Nov 18, 2024:

After seeing this example I'm definitely not in favor of custom types, as it is impossible for a generic tool to infer the distance between these custom types ... it will also create fragmentation and incompatibility, as multiple tools can define the same name with different meanings ... the example also talks about levels, which reinforces my idea of weights, so something like `network.topology.kubernetes.io/tier: 1`.

In this example, `network.topology.kubernetes.io/tor: sw13` would become:

- Node: `network.topology.kubernetes.io/tier: 1`
- ResourceSlice:

  apiVersion: resource.k8s.io/v1alpha3
  kind: ResourceSlice
  …
  spec:
    devices:
    - basic:
        attributes:
          tier:
            int: 1
          type:
            string: nvlink
          ip:
            string: 10.1.1.1/24
          latency:
            string: "50us"
          bandwidth:
            string: "100gbps"

Author:

When using custom network types, it is mandatory to define the distance in the QoS annotation.
We state this explicitly in the KEP.
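
As an illustration only (hypothetical switch identifiers and values, not text from the KEP), a node labeled with the custom types from the example above, carrying the mandatory distance metric in its QoS annotation, might look like this:

```yaml
metadata:
  labels:
    network.topology.kubernetes.io/tor: sw13      # custom type: top-of-rack switch
    network.topology.kubernetes.io/area: sw22     # custom type: second-level switch
    network.topology.kubernetes.io/center: sw31   # custom type: third-level switch
  annotations:
    # For custom types, "distance" is the mandatory metric; latency/bandwidth remain optional.
    network.qos.kubernetes.io/switches: |
      {
        "sw13": {"distance": 1},
        "sw22": {"distance": 2},
        "sw31": {"distance": 3}
      }
```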

- Gang-scheduling auto-scaler
- DRA scheduler plugin

### Goals
Member:

Are there existing topology formats or consumers that we want Kubernetes to integrate with? If so, are these integrations goals or non-goals?

Author:

To the best of my knowledge, only EKS exposes its own custom node labels for network layers.
The goal of this KEP is to create a standard way of describing switch network topology.
Ultimately, we want to use this standard in the development of a Kubernetes-native, network-aware scheduler plugin for multi-node workloads. We list that task as a non-goal, as it would be an independent effort.

Contributor:

> Ultimately, we want to use this standard in the development of a Kubernetes-native, network-aware scheduler plugin for multi-node workloads.

I'm (still) not seeing why that development work requires a KEP. An out-of-tree, out-of-project custom scheduler could use labels with keys outside of Kubernetes namespaces.

Member:

> Kubernetes-native network-aware scheduler plugin for multi-node workloads

Perhaps a reference implementation to show how the end-to-end flow works, or even use it as part of the integration tests.

@dmitsh force-pushed the ds-topology branch 3 times, most recently from 8d2ba06 to 6253878 on November 26, 2024 23:05.

## Summary

This document proposes a standard for declaring switch network topology in Kubernetes clusters, representing the hierarchy of nodes, switches, and interconnects. In this context, a switch can refer to a physical network device or a collection of such devices with close proximity and functionality.
Member:

Kubernetes has generally shied away from describing physical attributes vs user intent.

One of the challenges that this proposal needs to address is how this can be future-proofed against changes in underlying technology.


Beyond CSPs, certain on-premises clusters support network topology discovery, though this capability depends on the features of the underlying switch network vendors.

An open-source project, [Topograph](https://github.com/NVIDIA/topograph), has implemented these approaches and is successfully deployed in production environments.
Member:

It would be good to discuss what this proposal does differently from Topograph.

- Is it porting over the general concepts? What worked and what did not?
- Does it address the problems with the Topograph project?

Author:

This proposal establishes a standard for node labels.
Topograph will assign these labels to cluster nodes.

Having these labels available in Kubernetes clusters will help in designing cloud agnostic scheduling systems.
The scheduler will prioritize switches according to the order outlined above, providing a standardized approach for network-aware scheduling across a range of configurations.

### User Stories (Optional)
Contributor:

Suggested change: `### User Stories (Optional)` → `### User Stories`

Comment on lines 273 to 274
The scheduler plugin reconstructs the cluster network topology by interpreting `network.topology.kubernetes.io/...` node labels.
Using this topology information, it optimally binds pods to suitable nodes, reducing overall latency and improving performance.
@sftim (Contributor), Nov 27, 2024:

This is the solution more than part of the story.

Author:

addressed


bowei commented Dec 3, 2024

It would be good for this proposal to talk about how a provider would validate if their use of the labels met the requirements for the layer types.

In other words -- is it possible to write a conformance test for this feature? The various terms may not line up exactly with what is surfaced across providers --or-- the latency/bandwidth tradeoffs differ depending on technology stack. As a contract for the consuming systems, do we have a really good understanding of what the "standard" behavior is?

Also, is this required for providers to populate? If some levels are missing, is this a problem?

How do we future-proof the definitions -- how would a CSP add/remove a layer to the hierarchy? What are the implications for systems that consume the hierarchy in that case?

Also -- there are region/zone labels already:

https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesioregion

It would be good to discuss the relationship between those and this proposal.

Contributor:

Why is this file empty?


dmitsh commented Dec 5, 2024

> It would be good for this proposal to talk about how a provider would validate if their use of the labels met the requirements for the layer types. In other words -- is it possible to write a conformance test for this feature? [...]

Is the intent to check the accuracy of node labeling by CSPs?
I'm not sure we can write a conformance test before this standard is adopted and implemented by CSPs.
But after that, yes, we can use CSP topology APIs to reconstruct the cluster network topology and validate the correctness of the labels.
To do so, we would need to provision a cluster in each participating CSP and run the validation there.


dmitsh commented Dec 5, 2024

> Also -- there are region/zone labels already

To avoid confusion, we replaced the zone label.

Here are the updated proposed labels (an example node carrying them is sketched after the list):

  1. network.topology.kubernetes.io/accelerator: Network interconnect for direct accelerator communication (e.g., Multi-node NVLink interconnect between NVIDIA GPUs)
  2. network.topology.kubernetes.io/block: Rack-level switches connecting hosts in one or more racks as a block.
  3. network.topology.kubernetes.io/spine: Spine-level switches connecting multiple blocks inside a datacenter.
  4. network.topology.kubernetes.io/datacenter: Zonal switches connecting multiple datacenters inside an availability zone.
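
For illustration, a node in such a cluster might carry all four labels along these lines (a sketch only; the switch identifiers are hypothetical):

```yaml
metadata:
  labels:
    network.topology.kubernetes.io/accelerator: nvl10   # NVLink domain interconnect
    network.topology.kubernetes.io/block: sw11          # rack-level switch
    network.topology.kubernetes.io/spine: sw21          # spine-level switch
    network.topology.kubernetes.io/datacenter: sw31     # zonal switch
```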


sftim commented Dec 5, 2024

  • I (still) hope we don't change the built-in scheduler to formally pay attention to these labels.
  • I (still) prefer the idea that we build in hooks so that an external scheduler can do what's needed

Keeping the core of Kubernetes simple enough to test and maintain feels valuable, and building in a specific kind of complexity into a stable API (Pod) could end up with us having to keep a compatibility promise we hadn't planned to make.


dmitsh commented Dec 6, 2024

> - I (still) hope we don't change the built-in scheduler to formally pay attention to these labels.
> - I (still) prefer the idea that we build in hooks so that an external scheduler can do what's needed
>
> Keeping the core of Kubernetes simple enough to test and maintain feels valuable, and building in a specific kind of complexity into a stable API (Pod) could end up with us having to keep a compatibility promise we hadn't planned to make.

We are not suggesting changing the built-in scheduler. Adding extra node labels doesn't change any existing functionality.
Our goal is to develop scheduler plugins, which also don't affect the default scheduler.


sftim commented Dec 6, 2024

> We are not suggesting changing the built-in scheduler. Adding extra node labels doesn't change any existing functionality.
> Our goal is to develop scheduler plugins, which also don't affect the default scheduler.

OK, so: we need to make a clear case for why this needs a KEP. Imagine someone wants to make a new component, acceleratorsched.example, and base it on kube-scheduler but with custom plugins. Great, they can. But the extra labels could be in the domain label.acceleratorsched.example.

Right now I don't even see that approach listed as an alternative, @dmitsh, and that concerns me. There's an alternative about relying on cloud providers, but I'm suggesting something vendor-neutral. Vendor-neutral doesn't have to imply "part of Kubernetes".

- Development of Kubernetes-native scheduler plugins for network-topology-aware scheduling, such as:
  - Topology-aware gang-scheduling plugin.
  - Gang-scheduling auto-scaler.
  - Device resource allocation (DRA) scheduler plugin.


nit: DRA = Dynamic Resource Allocation

Author:

Thanks! Fixed.

operator: In
values:
- training
topologyKey: network.topology.kubernetes.io/accelerator
@ritazh (Member), Dec 10, 2024:

Does this proposal address this user story? If so, can you add an example?

> As a data scientist, I want to ensure the pods are always placed within the same nvlink domain first before choosing another nvlink domain, and I do not want to modify the job specification when migrating the job across different Kubernetes environments.

In this current example, the network.topology.kubernetes.io/accelerator value would be unique for each nvlink domain, right? If so, users would need to modify the job spec when moving across different Kubernetes environments.

Author:

Right, the network.topology.kubernetes.io/accelerator label value is unique per nvlink domain. But the user only provides the label name, i.e. network.topology.kubernetes.io/accelerator.
So the scheduler will try to find a set of nodes with the same label value, hence placing the jobs in the same nvlink domain.
There is no need to modify the job spec when migrating environments.
User Story 1 describes this exact scenario.

Member:

NVM, I mixed up the labelSelector used in the example with the topologyKey. It sounds like, when the same topologyKey is used, the scheduler is assumed to ensure pods are placed on nodes with the required resources AND the same topologyKey value.


Additionally, I do not want to modify the job specification when migrating the job across different Kubernetes environments.

To achieve this, I can leverage the pod affinity feature of the default Kubernetes scheduler with topology keys:


I think using the beta DRA APIs would be a better way to represent this. By definition, preferredDuringSchedulingIgnoredDuringExecution is not a hard constraint, and would enable the scheduler to ignore these requirements if they were unavailable on the cluster and instead schedule on a node that has a non-working or degraded topology.

Additionally, this would suggest that the scope of the KEP could expand beyond just node labels, and instead define a set of well-known standard namespaces that could be applied in any suitable API data model (e.g., node label, resourceslice [a DRA-provided resource], deviceclass [another DRA-provided resource], etc.).

wdyt?

Author:

You can always use requiredDuringSchedulingIgnoredDuringExecution for a hard constraint.
I used preferredDuringSchedulingIgnoredDuringExecution as an example.
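
For reference, a sketch showing both variants against the proposed topology keys (illustrative only; the label selector key and values are hypothetical, and the field structure is the standard pod affinity API):

```yaml
affinity:
  podAffinity:
    # Hard constraint: co-schedule only onto nodes whose accelerator label value matches
    # that of the nodes already running the selected pods.
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app          # hypothetical pod label key
          operator: In
          values:
          - training
      topologyKey: network.topology.kubernetes.io/accelerator
    # Soft preference: prefer the same block, but allow the scheduler to fall back.
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - training
        topologyKey: network.topology.kubernetes.io/block
```

Either way, the workload references only the well-known label names, never concrete switch identifiers.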


## Summary

This document proposes a standard for declaring network topology in Kubernetes clusters,


Is "proposes a standard set of terminologies for declaring network topology"... more clear?

Author:

thanks, updated!

scheduling pods in close network proximity is becoming essential.
Examples of such workloads include AI/ML training jobs or sets of interdependent, data-intensive services.

However, Kubernetes currently lacks a standard method to describe cluster network topology, which is a key area for improvement.


I'm not sure this KEP is proposing a standard method, but rather it is proposing a set of common terms. A method would be "label your nodes using label XYZ" or "implement a custom operator". We don't seem to be advocating for one method or the other, but rather we want to define a standard terminology so that any chosen method can render the underlying network topology using a language that we all understand. We anticipate that different environments may have good reason to prefer one method over another (device-plugin + node labels vs DRA vs custom scheduler vs custom operator).

Do other folks agree? I think "Kubernetes currently lacks a standard set of terms to describe network topology" would be a clearer expression of whatever we're doing, if so.

Author:

changed the wording, thank you!

@k8s-ci-robot (Contributor):

@dmitsh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-test | d5407f7 | link | true | /test pull-enhancements-test |
| pull-enhancements-verify | d5407f7 | link | true | /test pull-enhancements-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


### Goals

- Introduce a standard way of representing network topology in Kubernetes clusters
Contributor:

I know this section was bigger before, but now it's too small. There has to be some goal of using that topology representation for something. Otherwise why would we want it?

Comment on lines +214 to +218
We propose to use the following four network hierarchy layer types:
1. `accelerator`: Network interconnect for direct accelerator communication (e.g., Multi-node NVLink interconnect between NVIDIA GPUs)
2. `block`: Rack-level switches connecting hosts in one or more racks as a block.
3. `spine`: Spine-level switches connecting multiple blocks inside a datacenter.
4. `datacenter`: Zonal switches connecting multiple datacenters inside an availability zone.
Contributor:

This seems like a weird breakdown to me. I feel like every time people have talked about adding more topology labels beyond "region" and "zone", someone brings up "rack", but you just skip right over that layer...

In VM-based environments, you would want a layer for "hypervisor", but again, that doesn't correspond to anything here.

And you talk about people wanting to use these generically, but what if a given network is missing one of these layers? Eg, if my datacenter has no "block"s, but someone deploys a job that tries to use affinity by block, then what happens?

For that matter, if my application wants 50ms of latency between pods, then what level should I use for the pod affinity? Presumably that would be spine-level latency in some datacenters, block-level latency in others, and unachievable in yet others. With the existing "zone" and "region" levels, the assumption is that there's an order of magnitude of difference between the levels, so you can think about them in an agnostic way, but that doesn't really seem to work for these lower topology levels.

So, it seems to me like either:

  1. The labels have cluster-specific-meanings (in which case they don't need to be standardized), or
  2. The scheduling should be based not on topology per se, but on the actual latency/distance metrics that are computed from the topology (in which case the latency/distance information needs to be provided in a standardized way but the topology itself does not).

3. `spine`: Spine-level switches connecting multiple blocks inside a datacenter.
4. `datacenter`: Zonal switches connecting multiple datacenters inside an availability zone.

These types will accommodate the majority of common network hierarchies across different CSP and on-prem environments.
Contributor:

Anyone writing this KEP two years ago would not have included "accelerator" as a level... so I worry about how future-proof this approach is.

Author:

As technology advances, things change ...


As a data scientist running a data-intensive large-scale AI training job, I want to optimize the runtime
by binding pods to nodes that are in close network proximity.
This ensures better performance for my distributed workloads.
Contributor:

as others have noted, DRA is already addressing this use case

Author:

I'm not sure DRA can be used to represent the topology of the entire cluster.


aojea commented Dec 19, 2024

Discussed during SIG Architecture Meeting on 12 Dec 2024 https://docs.google.com/document/d/1BlmHq5uPyBUDlppYqAAzslVbAO8hilgjqZUTaNXUhKM/edit?tab=t.0

The overall agreement seems to be that we're going to hold off on implementing this, mainly because the ecosystem is still a bit wild-west and it's not clear what the "standard" will look like in the future. New patterns or technologies may succeed or become outdated quickly, and that would cause problems later on, because enforcing a specific way to represent network topology could limit that flexibility and might even break compatibility with other projects.

/hold

@k8s-ci-robot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command) label on Dec 19, 2024.

sftim commented Dec 19, 2024

@dmitsh if you want help building an out-of-project but vendor-neutral way to capture this - and maybe to provide hints to the scheduler - then I think we can find some help for you.

The underlying ambition is important, even if right now this solution doesn't feel right.
