Preliminary ReservedCoresPerGPU implementation #185
Closed
The primary reason we have our GPU cluster is to use the GPUs (surprise). This means it is highly undesirable for jobs to consume all the CPUs on a node but not all the GPUs, preventing other jobs from using those GPUs. Our cluster is highly heterogeneous, so setting something like MaxTasksPerNode doesn't really help. As a concrete example, consider a queue with two nodes, both with 8 GPUs but one with 16 CPUs and the other with 32 CPUs. If a user submits a 12-CPU, 1-GPU job (e.g. namd) and then another user submits fifteen 1-CPU, 1-GPU jobs (e.g. amber), whether or not all the jobs will be able to run depends on which node the 12-CPU job was initially scheduled on: if it lands on the 16-CPU node, only 4 CPUs remain alongside 7 idle GPUs, so three of the small jobs cannot start, whereas if it lands on the 32-CPU node, every job can run.
This pull request implements a ReservedCoresPerGPU partition option that prevents a job from running on a node if it does not leave at least "ReservedCoresPerGPU" cores free for each unused GPU on the node.
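To make the rule concrete, here is a minimal standalone sketch of the check the option implies. This is not the code from the patch, and the function and parameter names are made up for illustration; it just shows the inequality being enforced per node:

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical per-node admission check: the job may run only if the
 * cores it leaves free cover reserved_cores_per_gpu for every GPU it
 * leaves unused on the node.
 */
static bool job_fits(int free_cores, int free_gpus,
                     int job_cores, int job_gpus,
                     int reserved_cores_per_gpu)
{
    if (job_cores > free_cores || job_gpus > free_gpus)
        return false;                     /* not enough resources at all */

    int cores_left = free_cores - job_cores;
    int gpus_left  = free_gpus  - job_gpus;

    /* Reject if the leftover cores cannot serve the leftover GPUs. */
    return cores_left >= reserved_cores_per_gpu * gpus_left;
}

int main(void)
{
    /* With ReservedCoresPerGPU=1, the 12-core, 1-GPU job from the example
     * is rejected on the 16-core/8-GPU node (4 cores left for 7 idle GPUs)
     * but accepted on the 32-core/8-GPU node (20 cores left for 7 GPUs). */
    printf("16-core node: %s\n", job_fits(16, 8, 12, 1, 1) ? "fits" : "rejected");
    printf("32-core node: %s\n", job_fits(32, 8, 12, 1, 1) ? "fits" : "rejected");
    return 0;
}
```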
I would have preferred to reserve CPUs (hyperthreads) instead of cores, but it appears that even with CR_CPU, jobs aren't scheduled at this level of granularity.
I doubt the calculations are accurate for multi-node jobs. We (almost) never run multi-node GPU jobs, so it's difficult to justify the time it would take me to understand how those calculations work. It does at least work for simple cases.