DRA: Support scoring for devices and nodes in scheduling #4970
/wg device-management
cc @dom4ha
cc @catblade
cc @tardieu
Thanks for filing that, John. Let me just share my high-level thoughts about it. I obviously sympathize with the goal and use case, but I would like us to spend time thinking about the implementation.
I haven't yet put enough thought into it, but I would like to explore a very different model: a cleaner decision tree. In that model, we stack-rank the preferences, and at a given level we immediately reject all options whose score differs from the highest for that particular scoring function. Obviously we have a compatibility problem, but given that we're talking about a new feature, maybe it's the right moment to seriously consider it now. And once we prove it, maybe we will be able to somehow get rid of the old model in the future...
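To make the decision-tree model concrete, here is a minimal Go sketch (option names and scores are invented): at each level, any option whose score differs from that level's best is rejected immediately, and only ties advance to the next scoring function.

```go
package main

import "fmt"

// option is a candidate allocation; scores[i] is its score under the
// i-th scoring function, ordered from most to least important.
type option struct {
	name   string
	scores []int
}

// pick applies the stack-ranked model: at each level, keep only the
// options whose score matches that level's maximum, then move on.
func pick(options []option, levels int) []option {
	for level := 0; level < levels && len(options) > 1; level++ {
		best := options[0].scores[level]
		for _, o := range options[1:] {
			if o.scores[level] > best {
				best = o.scores[level]
			}
		}
		var kept []option
		for _, o := range options {
			if o.scores[level] == best {
				kept = append(kept, o)
			}
		}
		options = kept
	}
	return options
}

func main() {
	candidates := []option{
		{"gpu-a", []int{10, 3}},
		{"gpu-b", []int{10, 7}}, // ties at level 0, wins at level 1
		{"gpu-c", []int{5, 9}},  // rejected outright at level 0
	}
	fmt.Println(pick(candidates, 2)) // [{gpu-b [10 7]}]
}
```

One consequence of this model is that a lower-priority scoring function can never override a higher-priority one, which is what makes the outcome easy to predict.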
Thanks @wojtek-t, good points. A few thoughts. I would characterize your first point as "predictability" - as a user I have an idea of what's going to happen. We can try to make that a goal. One factor to consider is who needs to influence the decision, and how that affects predictability. I can think of (at least) four different roles that probably want some say in the scoring:
During the KEP process we will have to figure out which of these we want to address, what scope of control each should have, how to manage the weights given to each, what APIs they need to influence decisions, and how to make all that comprehensible and predictable. It should be fun :)

On your second point, were those issues present for purely local decisions (scoring nodes for a single Pod), or did they arise out of things like pod affinity? One thing I would put explicitly out of scope for this KEP is any kind of cross-pod affinity or gang scheduling. I think we need a different solution than our existing kube-scheduler for that. As a corollary, I also don't want to address optimizing for future workloads that may be coming along soon (a la Kueue). For example, if I know that the next workload will need a full GPU, and I can fit the current workload on a MIG on an in-use GPU, but with slightly less performance, ideally I would probably choose the MIG. But in this KEP, I don't want to try to do that. I think it's already hard enough when we only make decisions for the single Pod we're looking at. That said, I think that whatever we do should be useful to solutions like Kueue or a multi-Pod scheduler to leverage as needed.
cc @jingxu97
If we want to support both the end-user preference and the cluster-admin policy (to maintain high utilization), we probably cannot just stack-rank, as we may end up either honoring the end-user preference unconditionally or following the admin policy while ignoring the user preference (at least in the simplest implementation). Tuning weights might indeed be challenging, but on the other hand, higher predictability (sacrificing flexibility) could be achieved by picking more radical weights without changing the whole model. For instance, all kinds of DRA factors could have much stronger weights than all non-DRA factors, without stack-ranking all existing plugin types. I suspect the scheduler uses flat weights because it's hard to definitively say which factors are more important and which are less, so some level of unpredictability is probably inevitable.

Another small comment is that the Autoscaler skips scoring, so it may give suboptimal predictions compared to the real scheduler's future placements, but I'm not sure how much Autoscaler accuracy is a concern at this point.
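As a tiny worked example of the "radical weights" idea (all numbers invented): keeping the flat weighted-sum model but giving DRA factors a much larger weight makes them effectively dominate, approximating stack-ranking without changing the scoring framework.

```go
package main

import "fmt"

func main() {
	// Two options scored 0-100 by a DRA factor and a non-DRA factor.
	draA, nodeA := 90, 10
	draB, nodeB := 80, 100

	// Flat weights: the non-DRA factor can flip the outcome.
	fmt.Println(1*draA+1*nodeA, 1*draB+1*nodeB) // 100 180 -> B wins

	// "Radical" weights: the DRA factor dominates; A wins despite
	// its much worse non-DRA score.
	fmt.Println(1000*draA+1*nodeA, 1000*draB+1*nodeB) // 90010 80100
}
```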
Thanks for your thoughts; let me share my further thoughts (some of which I already had, some triggered by the above comments).
Yes - it's exactly that. And I fully agree that there are multiple personas that would potentially like to affect that. In my mental model I had only 3 (workload owner, cluster admin, and provider), but indeed we may need to think about whether the provider role should be split. My thinking about how those 3 affect the decisions is:
I'm happy to hear other thoughts on it, but this is how I was thinking about it.
It's a bit of both:
I have a bunch of thoughts about that too, but I'm fine with making it out of scope. If we build a reasonable building block here, it should be enough. And given we want to focus purely on the "intra-node" aspect, I think we're fine here.
Technically that's true, but still, see my performance point above: we may unnecessarily use an order of magnitude more resources in that model. If we want predictability, let's not hack around to achieve it, but rather let's do it properly.
CA is a bit different here, because it doesn't look only at a single node. It has an internal concept of scoring, but it's a bit different. I would like to achieve better unification here, but for the sake of making progress, I would also put it out of scope for now.
I don't disagree, but will share my thoughts nevertheless :) CA indeed has its own scoring mechanism. In fact, it has more than one. The most complex one was indeed score-based, and good luck reasoning about that one:
More recently we've introduced a way of composing different expanders (i.e. CA scoring functions) such that each one can filter out some options that are preferred from its perspective, eventually leading to a single option being chosen. This is very similar to the decision-tree proposal above, and I agree it is much simpler to reason about (not only as a user, but also as a CA maintainer & operator).

I think there might be some opportunity to bring the two scoring mechanisms closer, especially if we build some kind of multi-pod scheduler. One might argue CA is already a multi-pod scheduler that can also create nodes. If there's no capacity in the cluster, the nodes created by CA are essentially the only option for pods to be scheduled, so scheduler scoring doesn't matter much in an autoscaled cluster.
@44past4 this is relevant to our discussion on your concerns about fragmentation potential with #4874 and with multi-node use cases. It's not looking like we will have capacity to do much, if anything, with this in 1.33. Therefore, I think we should make sure DRA provides the hooks such that higher-level components like Kueue and/or CA can be prescriptive about where to locate the DRA resources. I think the current CEL-based attribute selectors enable this already. This allows those components to handle those issues, while not requiring those components for the base functionality.

I raised in SIG Scheduling last week that I am concerned we are trying to layer too many things on top of the current pod-by-pod scheduling framework. There are a few different efforts that are using this framework because it is what exists, not because it's necessarily the right fit for their scheduling algorithms. It would be helpful to be able to experiment with alternative schedulers without necessarily being confined to the existing framework. The case of having a more prescriptive scheduling plan from Kueue or CA is one of those ideas.

Ideally, multiple independent schedulers could operate on the same set of nodes, and we would have a control plane resource that can reserve resources on a node to do that (see for example this discussion). However, that's pretty far off. In the short term, @Huang-Wei said it should be pretty easy to allow individual nodes to be designated as being for particular schedulers, such that kube-scheduler does not touch them at all. This would allow completely alternate schedulers (not using the scheduling framework at all) to co-exist with kube-scheduler without us having to figure out a Reservation resource yet. It has the downside that those alternate schedulers may also need to handle basic things like the administrative DaemonSets that tend to be on every node. But I think it's a small, reasonable step to make progress while we figure out the long-term vision.

@dom4ha would you or someone from your team be able to put together a sig-scheduling KEP for 1.33 that allows this "bypass kube-scheduler" on a node-by-node basis? We would need something on the PodSpec to indicate if a workload should use the alternate scheduler. Maybe the current schedulerName is sufficient, with some changes to the scheduler config APIs? That would be ideal. But I don't know enough about the details of how the scheduler config works to be sure.
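As a hedged illustration of how the existing CEL-based attribute selectors might be used prescriptively, here is a Go sketch in which a higher-level component pins a claim to devices it has already planned for. The dra.example.com driver name and its uuid attribute are hypothetical; a real driver would have to publish such an attribute for this to work.

```go
package main

import (
	"fmt"
	"strings"
)

// pinToDevices builds a CEL selector expression that a component like
// Kueue or CA could attach to a claim's device request, narrowing the
// scheduler's choice to the exact devices it planned for.
func pinToDevices(uuids []string) string {
	quoted := make([]string, len(uuids))
	for i, u := range uuids {
		quoted[i] = fmt.Sprintf("%q", u)
	}
	// Evaluated by the scheduler against each candidate device.
	return fmt.Sprintf(`device.attributes["dra.example.com"].uuid in [%s]`,
		strings.Join(quoted, ", "))
}

func main() {
	fmt.Println(pinToDevices([]string{"gpu-0", "gpu-3"}))
	// device.attributes["dra.example.com"].uuid in ["gpu-0", "gpu-3"]
}
```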
cc @bsalamat
I like the idea of taking an incremental step forward by managing subsets of nodes exclusively with a dedicated scheduler. I wonder if this is something that should be indicated on the node object, then? If we have schedulerName(s) on the node object, it will be clear whether the binding is being done by the correct one.
I think @Huang-Wei had something specific in mind as to how to accomplish this.
Sure, let me think about it. I was also thinking about using persistent (shared) ResourceClaims for resource reservation, but reserving whole nodes sounds like a reasonable initial simplification. I guess that preventing the scheduler from placing pods within such a reservation is a relatively easy thing to do, but the node reservation itself needs to be allocated somehow, which is also a form of scheduling. Do you know who would specify which nodes to reserve?
I think just humans or an out-of-band controller would mark the node.
Controllers or humans doing this after node creation may lead to race conditions. Could this be specified at the kubelet level, so the node object is created with this information already available? Or is the intent to make it sub-node eventually, so that the kubelet path won't be flexible enough?
We have an existing mechanism that feels relevant: we can taint a node during startup and remove the taint once the Node is up to date. Maybe we want a way to make a taint be tolerated by default, or to have an effect we didn't previously support, but taints feel right here. This kind of discussion belongs in a pull request more than an issue (we don't use GitHub Discussions for KEPs).
Assuming what we want to achieve is simply the ability to reserve whole nodes, we could start treating them as resources and allocate them using a DRA driver. Allocation would just put a taint that was defined in the resource claim. Pods that want to schedule within such a node reservation would just need to define a toleration. Sorry for describing the idea here; I can elaborate and start the discussion elsewhere. Shall I create a short doc first?
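For illustration, a minimal sketch of that idea using the core/v1 taint and toleration types; the taint key and value are invented, and the real API shape for defining the taint in a resource claim would be up to the KEP.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Taint the DRA driver would apply when "allocating" a whole node
	// to a reservation; key and value here are illustrative.
	taint := corev1.Taint{
		Key:    "reservation.dra.example.com",
		Value:  "team-a",
		Effect: corev1.TaintEffectNoSchedule,
	}

	// Matching toleration a pod must carry to land on the reserved node.
	toleration := corev1.Toleration{
		Key:      "reservation.dra.example.com",
		Operator: corev1.TolerationOpEqual,
		Value:    "team-a",
		Effect:   corev1.TaintEffectNoSchedule,
	}
	fmt.Printf("%+v\n%+v\n", taint, toleration)
}
```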
Yes, please.
Here's the doc: https://docs.google.com/document/d/1LvA9_H4vyNbp173DUm3p9_fpsLYjjpujyRPO3TAOy7g/edit?tab=t.0
What would happen to DaemonSet pods?
The last section should cover that.
Thank you for creating the doc; I read the document. I assume we have no support for quota management in this model.
No, there's no quota management in this proposal.
Has the discussion stalled? I was wondering if there's any additional discussion planned.

Currently, it's hard to guarantee optimal device allocation within a node without exposing hardware attributes like CpuDomain, NUMABitMask, or SwitchBDF as matchAttributes in the ResourceClaim. However, rather than directly exposing these hardware details in the DRA API, I think we should introduce scoring to hide these complexities from users.

I'm still thinking about specific implementation details of the scoring mechanism. However, evaluating every possible combination, as hardware vendors (like NVIDIA) did in the Device Plugin model, would be impractical. Therefore, I'm considering an approach where the DRA driver advertises preferred device combinations through the DRA API, and the DRA scheduler plugin checks these advertised preferred combinations first before falling back to a first-fit strategy (see the sketch below).

I will share a more detailed document outlining various PCIe topologies and the related challenges soon. Please let me know if there is anything I may have missed or should look into further regarding this topic.
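A minimal Go sketch of that allocation strategy, assuming an invented shape for the advertised combinations (the real DRA API surface for this is exactly what would need to be designed):

```go
package main

import (
	"fmt"
	"sort"
)

// allocate tries the driver-advertised preferred combinations first and
// only falls back to first-fit when none applies. All names invented.
func allocate(n int, preferred [][]string, free map[string]bool) []string {
	for _, combo := range preferred {
		if len(combo) != n {
			continue
		}
		ok := true
		for _, d := range combo {
			if !free[d] {
				ok = false
				break
			}
		}
		if ok {
			return combo // a set the vendor says performs well together
		}
	}
	// Fallback: deterministic first-fit over the free devices.
	names := make([]string, 0, len(free))
	for d := range free {
		if free[d] {
			names = append(names, d)
		}
	}
	sort.Strings(names)
	if len(names) < n {
		return nil
	}
	return names[:n]
}

func main() {
	free := map[string]bool{"gpu-0": true, "gpu-1": false, "gpu-2": true, "gpu-3": true}
	// E.g. pairs sharing a PCIe switch or NUMA node, per the driver.
	preferred := [][]string{{"gpu-0", "gpu-1"}, {"gpu-2", "gpu-3"}}
	fmt.Println(allocate(2, preferred, free)) // [gpu-2 gpu-3]
}
```

The preferred combinations stand in for vendor knowledge such as "these two GPUs share a PCIe switch"; the fallback preserves today's first-fit behavior when no advertised combination applies.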
Enhancement Description
DRA supports the concept of "under specifying" a request. This gives the scheduler more flexibility to satisfy a request, increasing the likelihood of success in environments with scarce resources. For example, rather than asking for a specific model of a device, the user can ask for any one of a set of models, as long as it has some minimum specified amount of memory.
Currently, DRA uses a "first fit" algorithm during scheduling. This can lead to inefficient choices. Building on the example above: if the user asks for a device with at least 4GB of memory and the first device found has 80GB of memory, that device will be chosen, even if there is another option with exactly 4GB. If scoring were available, the scheduler could evaluate the "waste" associated with each possible option and make a more efficient choice.
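A minimal sketch of the "least waste" scoring this paragraph describes, with invented device names and capacities:

```go
package main

import "fmt"

type device struct {
	name   string
	memGiB int
}

// leastWaste picks the fitting device that minimizes unused memory,
// rather than taking the first device that fits.
func leastWaste(devices []device, needGiB int) (device, bool) {
	var best device
	found := false
	for _, d := range devices {
		if d.memGiB < needGiB {
			continue
		}
		if !found || d.memGiB < best.memGiB {
			best, found = d, true
		}
	}
	return best, found
}

func main() {
	devs := []device{{"gpu-80g", 80}, {"gpu-4g", 4}}
	d, _ := leastWaste(devs, 4)
	fmt.Println(d.name) // gpu-4g; first-fit would have taken gpu-80g
}
```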
Scoring is also critical in other situations where there is optionality in how to satisfy a request. For instance, in #4816 the user is allowed to provide a list of preferences. While that works to choose the "best" option on a given node, in reality most nodes have homogeneous selections of devices. So, in a cluster with some nodes that meet the first option in the list and others that meet the second, either could be chosen. If DRA could score the nodes based on whether they satisfy the first or second option, then preference could be given to the first option across nodes, not just within a node.
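A minimal sketch of turning the preference rank a node can satisfy into a node score (how each node determines its best rank is assumed to exist):

```go
package main

import "fmt"

// score turns "best preference index a node can satisfy" into a node
// score, so the first preference wins across nodes, not just within one.
func score(bestRank, numPrefs int) int {
	return numPrefs - bestRank
}

func main() {
	numPrefs := 2
	nodeA := score(1, numPrefs) // only satisfies the second preference
	nodeB := score(0, numPrefs) // satisfies the first preference
	fmt.Println(nodeA, nodeB)   // 1 2 -> nodeB wins
}
```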
The last important place for scoring is to help with fragmentation and bin packing. This is especially relevant with the implementation of #4815 pending. The ability to dynamically choose partitions of a device can lead to fragmentation; scoring can alleviate that to some extent. It is not a complete solution to that problem, but it can help.
- KEP (k/enhancements) update PR(s):
- Code (k/k) update PR(s):
- Docs (k/website) update PR(s):

/assign @johnbelamaric
/cc @pohly @klueska @mortent @alculquicondor @wojtek-t
/sig scheduling