DRA: Support scoring for devices and nodes in scheduling #4970
/wg device-management
cc @dom4ha
cc @catblade
cc @tardieu
Thanks for filing that, John. Let me just share my high-level thoughts about it. I obviously sympathize with the goal and use case, but I would like us to spend time thinking about the implementation.
I haven't yet put enough thought into it, but I would like to explore a very different model: a cleaner decision tree. In that model, we stack-rank the preferences, and at a given level we immediately reject all options whose score differs from the highest for that particular scoring function. Obviously we have a compatibility problem, but given that we're talking about a new feature, maybe it's the right moment to seriously consider it now. And once we prove it, maybe we will be able to somehow get rid of the old model in the future...
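To make the decision-tree model concrete, here is a minimal Go sketch (option names and scores are invented): at each level, any option whose score differs from that level's best is rejected immediately, and only ties advance to the next scoring function.

```go
package main

import "fmt"

// option is a candidate allocation; scores[i] is its score under the
// i-th scoring function, ordered from most to least important.
type option struct {
	name   string
	scores []int
}

// pick applies the stack-ranked model: at each level, keep only the
// options whose score matches that level's maximum, then move on.
func pick(options []option, levels int) []option {
	for level := 0; level < levels && len(options) > 1; level++ {
		best := options[0].scores[level]
		for _, o := range options[1:] {
			if o.scores[level] > best {
				best = o.scores[level]
			}
		}
		var kept []option
		for _, o := range options {
			if o.scores[level] == best {
				kept = append(kept, o)
			}
		}
		options = kept
	}
	return options
}

func main() {
	candidates := []option{
		{"gpu-a", []int{10, 3}},
		{"gpu-b", []int{10, 7}}, // ties at level 0, wins at level 1
		{"gpu-c", []int{5, 9}},  // rejected outright at level 0
	}
	fmt.Println(pick(candidates, 2)) // [{gpu-b [10 7]}]
}
```

One consequence of this model is that a lower-priority scoring function can never override a higher-priority one, which is what makes the outcome easy to predict.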
Thanks @wojtek-t, good points. A few thoughts. I would characterize your first point as "predictability" - as a user I have an idea of what's going to happen. We can try to make that a goal. One factor to consider is who needs to influence the decision, and how that affects predictability. I can think of (at least) four different roles that probably want some say in the scoring:
During the KEP process we will have to figure out which of these we want to address, what scope of control each should have, how to manage the weights given to each, what APIs they need to influence decisions, and how to make all that comprehensible and predictable. It should be fun :)

On your second point, were those issues present for purely local decisions (scoring nodes for a single Pod), or did they arise out of things like pod affinity? One thing I would put explicitly out of scope for this KEP is any kind of cross-pod affinity or gang scheduling. I think we need a different solution than our existing kube-scheduler for that. As a corollary, I also don't want to address optimizing for future workloads that may be coming along soon (a la Kueue). For example, if I know that the next workload will need a full GPU, and I can fit the current workload on a MIG on an in-use GPU, but with slightly less performance, ideally I would probably choose the MIG. But in this KEP, I don't want to try to do that. I think it's already hard enough when we only make decisions for the single Pod we're looking at. That said, I think that whatever we do should be useful to solutions like Kueue or a multi-Pod scheduler to leverage as needed.
cc @jingxu97
If we want to support both the end-user preference and the cluster-admin policy (to maintain high utilization), we probably cannot just stack-rank, as we may end up either honoring the end-user preference unconditionally or following the admin policy while ignoring the user preference (at least in the simplest implementation). Tuning weights might indeed be challenging, but on the other hand, higher predictability (sacrificing flexibility) could be achieved by picking more radical weights without changing the whole model. For instance, all kinds of DRA factors could have much stronger weights than all non-DRA factors, without stack-ranking all existing plugin types. I suspect the scheduler uses flat weights because it's hard to definitively say which factors are more important and which are less, so some level of unpredictability is probably inevitable.

Another small comment is that the Autoscaler skips scoring, so it may give suboptimal predictions compared to the real scheduler's future placements, but I'm not sure how much Autoscaler accuracy is a concern at this point.
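As a tiny worked example of the "radical weights" idea (all numbers invented): keeping the flat weighted-sum model but giving DRA factors a much larger weight makes them effectively dominate, approximating stack-ranking without changing the scoring framework.

```go
package main

import "fmt"

func main() {
	// Two options scored 0-100 by a DRA factor and a non-DRA factor.
	draA, nodeA := 90, 10
	draB, nodeB := 80, 100

	// Flat weights: the non-DRA factor can flip the outcome.
	fmt.Println(1*draA+1*nodeA, 1*draB+1*nodeB) // 100 180 -> B wins

	// "Radical" weights: the DRA factor dominates; A wins despite
	// its much worse non-DRA score.
	fmt.Println(1000*draA+1*nodeA, 1000*draB+1*nodeB) // 90010 80100
}
```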
Thanks for your thoughts; let me share my further thoughts (some of which I already had, some triggered by the above comments).
Yes - it's exactly that. And I fully agree that there are multiple personas that would potentially like to affect that. In my mental model I had only 3 (workload owner, cluster admin, and provider), but indeed we may need to think about whether the provider role should be split. My thinking about how those 3 affect the decisions is:
I'm happy to hear other thoughts on it, but this is how I was thinking about it.
It's a bit of both:
I have a bunch of thoughts about that too, but I'm fine with making it out of scope. If we build a reasonable building block here, it should be enough. And given we want to focus purely on the "intra-node" aspect, I think we're fine here.
Technically that's true, but still, see my performance point above: we may unnecessarily use an order of magnitude more resources in that model. If we want predictability, let's not hack around to achieve it, but rather let's do it properly.
CA is a bit different here, because it doesn't look only at a single node. It has an internal concept of scoring, but it's a bit different. I would like to achieve better unification here, but for the sake of making progress, I would also put it out of scope for now.
I don't disagree, but will share my thoughts nevertheless :) CA indeed has its own scoring mechanism. In fact, it has more than one. The most complex one was indeed score-based, and good luck reasoning about that one:
More recently we've introduced a way of composing different expanders (i.e. CA scoring functions) such that each one can filter out some options that are preferred from its perspective, eventually leading to a single option being chosen. This is very similar to the decision-tree proposal above, and I agree it is much simpler to reason about (not only as a user, but also as a CA maintainer & operator).

I think there might be some opportunity to bring the two scoring mechanisms closer, especially if we build some kind of multi-pod scheduler. One might argue CA is already a multi-pod scheduler that can also create nodes. If there's no capacity in the cluster, the nodes created by CA are essentially the only option for pods to be scheduled, so scheduler scoring doesn't matter much in an autoscaled cluster.
@44past4 this is relevant to our discussion on your concerns about fragmentation potential with #4874 and with multi-node use cases. It's not looking like we will have capacity to do much, if anything, with this in 1.33. Therefore, I think we should make sure DRA provides the hooks such that higher-level components like Kueue and/or CA can be prescriptive about where to locate the DRA resources. I think the current CEL-based attribute selectors enable this already. This allows those components to handle those issues, while not requiring those components for the base functionality.

I raised in SIG Scheduling last week that I am concerned we are trying to layer too many things on top of the current pod-by-pod scheduling framework. There are a few different efforts that are using this framework because it is what exists, not because it's necessarily the right fit for their scheduling algorithms. It would be helpful to be able to experiment with alternative schedulers without necessarily being confined to the existing framework. The case of having a more prescriptive scheduling plan from Kueue or CA is one of those ideas.

Ideally, multiple independent schedulers could operate on the same set of nodes, and we would have a control plane resource that can reserve resources on a node to do that (see for example this discussion). However, that's pretty far off. In the short term, @Huang-Wei said it should be pretty easy to allow individual nodes to be designated as being for particular schedulers, such that kube-scheduler does not touch them at all. This would allow completely alternate schedulers (not using the scheduling framework at all) to co-exist with kube-scheduler without us having to figure out a Reservation resource yet. It has the downside that those alternate schedulers may also need to handle basic things like the administrative DaemonSets that tend to be on every node. But I think it's a small, reasonable step to make progress while we figure out the long-term vision.

@dom4ha would you or someone from your team be able to put together a sig-scheduling KEP for 1.33 that allows this "bypass kube-scheduler" on a node-by-node basis? We would need something on the PodSpec to indicate if a workload should use the alternate scheduler. Maybe the current schedulerName is sufficient, with some changes to the scheduler config APIs? That would be ideal. But I don't know enough about the details of how the scheduler config works to be sure.
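As a hedged illustration of how the existing CEL-based attribute selectors might be used prescriptively, here is a Go sketch in which a higher-level component pins a claim to devices it has already planned for. The dra.example.com driver name and its uuid attribute are hypothetical; a real driver would have to publish such an attribute for this to work.

```go
package main

import (
	"fmt"
	"strings"
)

// pinToDevices builds a CEL selector expression that a component like
// Kueue or CA could attach to a claim's device request, narrowing the
// scheduler's choice to the exact devices it planned for.
func pinToDevices(uuids []string) string {
	quoted := make([]string, len(uuids))
	for i, u := range uuids {
		quoted[i] = fmt.Sprintf("%q", u)
	}
	// Evaluated by the scheduler against each candidate device.
	return fmt.Sprintf(`device.attributes["dra.example.com"].uuid in [%s]`,
		strings.Join(quoted, ", "))
}

func main() {
	fmt.Println(pinToDevices([]string{"gpu-0", "gpu-3"}))
	// device.attributes["dra.example.com"].uuid in ["gpu-0", "gpu-3"]
}
```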
cc @bsalamat
I like the idea of taking an incremental step forward by managing subsets of nodes exclusively with a dedicated scheduler. I wonder if this is something that should be indicated on the node object, then? If we have schedulerName(s) on the node object, it will be clear whether the binding is being done by the correct one.
I think @Huang-Wei had something specific in mind as to how to accomplish this.
Sure, let me think about it. I was also thinking about using persistent (shared) ResourceClaims for resource reservation, but reserving whole nodes sounds like a reasonable initial simplification. I guess that preventing the scheduler from placing pods within such a reservation is a relatively easy thing to do, but the node reservation itself needs to be allocated somehow, which is also a form of scheduling. Do you know who would specify which nodes to reserve?
I think just humans or an out-of-band controller would mark the node.
Controllers or humans doing this after node creation may lead to race conditions. Could this be specified at the kubelet level, so the node object is created with this information already available? Or is the intent to make it sub-node eventually, so that the kubelet path won't be flexible enough?
We have an existing mechanism that feels relevant: we can taint a node during startup and remove the taint once the Node is up to date. Maybe we want a way to make a taint be tolerated by default, or to have an effect we didn't previously support, but taints feel right here. This kind of discussion belongs in a pull request more than an issue (we don't use GitHub Discussions for KEPs).
Assuming what we want to achieve is simply the ability to reserve whole nodes, we could start treating them as resources and allocate them using a DRA driver. Allocation would just put a taint that was defined in the resource claim. Pods that want to schedule within such a node reservation would just need to define a toleration. Sorry for describing the idea here; I can elaborate and start the discussion elsewhere. Shall I create a short doc first?
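For illustration, a minimal sketch of that idea using the core/v1 taint and toleration types; the taint key and value are invented, and the real API shape for defining the taint in a resource claim would be up to the KEP.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Taint the DRA driver would apply when "allocating" a whole node
	// to a reservation; key and value here are illustrative.
	taint := corev1.Taint{
		Key:    "reservation.dra.example.com",
		Value:  "team-a",
		Effect: corev1.TaintEffectNoSchedule,
	}

	// Matching toleration a pod must carry to land on the reserved node.
	toleration := corev1.Toleration{
		Key:      "reservation.dra.example.com",
		Operator: corev1.TolerationOpEqual,
		Value:    "team-a",
		Effect:   corev1.TaintEffectNoSchedule,
	}
	fmt.Printf("%+v\n%+v\n", taint, toleration)
}
```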
Yes, please.
Here's the doc: https://docs.google.com/document/d/1LvA9_H4vyNbp173DUm3p9_fpsLYjjpujyRPO3TAOy7g/edit?tab=t.0
What would happen to DaemonSet pods?
The last section should cover that.
Thank you for creating the doc; I read the document. I assume we have no support for quota management in this model.
No, there's no quota management in this proposal.
Has the discussion stalled? I was wondering if there's any additional discussion planned.

Currently, it's hard to guarantee optimal device allocation within a node without exposing hardware attributes like CpuDomain, NUMABitMask, or SwitchBDF as matchAttributes in the ResourceClaim. However, rather than directly exposing these hardware details in the DRA API, I think we should introduce scoring to hide these complexities from users.

I'm still thinking about specific implementation details of the scoring mechanism. However, evaluating every possible combination, as hardware vendors (like NVIDIA) did in the Device Plugin model, would be impractical. Therefore, I'm considering an approach where the DRA driver advertises preferred device combinations through the DRA API, and the DRA scheduler plugin checks these advertised preferred combinations first before falling back to a first-fit strategy (see the sketch below).

I will share a more detailed document outlining various PCIe topologies and the related challenges soon. Please let me know if there is anything I may have missed or should look into further regarding this topic.
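A minimal Go sketch of that allocation strategy, assuming an invented shape for the advertised combinations (the real DRA API surface for this is exactly what would need to be designed):

```go
package main

import (
	"fmt"
	"sort"
)

// allocate tries the driver-advertised preferred combinations first and
// only falls back to first-fit when none applies. All names invented.
func allocate(n int, preferred [][]string, free map[string]bool) []string {
	for _, combo := range preferred {
		if len(combo) != n {
			continue
		}
		ok := true
		for _, d := range combo {
			if !free[d] {
				ok = false
				break
			}
		}
		if ok {
			return combo // a set the vendor says performs well together
		}
	}
	// Fallback: deterministic first-fit over the free devices.
	names := make([]string, 0, len(free))
	for d := range free {
		if free[d] {
			names = append(names, d)
		}
	}
	sort.Strings(names)
	if len(names) < n {
		return nil
	}
	return names[:n]
}

func main() {
	free := map[string]bool{"gpu-0": true, "gpu-1": false, "gpu-2": true, "gpu-3": true}
	// E.g. pairs sharing a PCIe switch or NUMA node, per the driver.
	preferred := [][]string{{"gpu-0", "gpu-1"}, {"gpu-2", "gpu-3"}}
	fmt.Println(allocate(2, preferred, free)) // [gpu-2 gpu-3]
}
```

The preferred combinations stand in for vendor knowledge such as "these two GPUs share a PCIe switch"; the fallback preserves today's first-fit behavior when no advertised combination applies.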
Enhancement Description
DRA supports the concept of "under specifying" a request. This gives the scheduler more flexibility to satisfy a request, increasing the likelihood of success in environments with scarce resources. For example, rather than asking for a specific model of a device, the user can ask for any one of a set of models, as long as it has some minimum specified amount of memory.
Currently, DRA uses a "first fit" algorithm during scheduling. This can lead to inefficient choices. Building on the example above: if the user asks for a device with at least 4GB of memory and the first device found has 80GB of memory, that device will be chosen, even if there is another option with exactly 4GB. If scoring were available, the scheduler could evaluate the "waste" associated with each possible option and make a more efficient choice.
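A minimal sketch of the "least waste" scoring this paragraph describes, with invented device names and capacities:

```go
package main

import "fmt"

type device struct {
	name   string
	memGiB int
}

// leastWaste picks the fitting device that minimizes unused memory,
// rather than taking the first device that fits.
func leastWaste(devices []device, needGiB int) (device, bool) {
	var best device
	found := false
	for _, d := range devices {
		if d.memGiB < needGiB {
			continue
		}
		if !found || d.memGiB < best.memGiB {
			best, found = d, true
		}
	}
	return best, found
}

func main() {
	devs := []device{{"gpu-80g", 80}, {"gpu-4g", 4}}
	d, _ := leastWaste(devs, 4)
	fmt.Println(d.name) // gpu-4g; first-fit would have taken gpu-80g
}
```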
Scoring is also critical in other situations where there is optionality in how to satisfy a request. For instance, in #4816 the user is allowed to provide a list of preferences. While that works to choose the "best" option on a given node, in reality most nodes have homogeneous selections of devices. So, in a cluster with some nodes that meet the first option in the list and others that meet the second, either could be chosen. If DRA could score the nodes based on whether they satisfy the first or second option, then preference could be given to the first option across nodes, not just within a node.
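A minimal sketch of turning the preference rank a node can satisfy into a node score (how each node determines its best rank is assumed to exist):

```go
package main

import "fmt"

// score turns "best preference index a node can satisfy" into a node
// score, so the first preference wins across nodes, not just within one.
func score(bestRank, numPrefs int) int {
	return numPrefs - bestRank
}

func main() {
	numPrefs := 2
	nodeA := score(1, numPrefs) // only satisfies the second preference
	nodeB := score(0, numPrefs) // satisfies the first preference
	fmt.Println(nodeA, nodeB)   // 1 2 -> nodeB wins
}
```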
The last important place for scoring is to help with fragmentation and bin packing. This is especially relevant with the implementation of #4815 pending. The ability to dynamically choose partitions of a device can lead to fragmentation; scoring can alleviate that to some extent. It is not a complete solution to that problem, but it can help.
- KEP (k/enhancements) update PR(s):
- Code (k/k) update PR(s):
- Docs (k/website) update PR(s):

/assign @johnbelamaric
/cc @pohly @klueska @mortent @alculquicondor @wojtek-t
/sig scheduling