Preliminary ReservedCoresPerGPU implementation #185
Closed
The primary reason we have our GPU cluster is to use the GPUs (surprise). This means it is highly undesirable for jobs to consume all the CPUs on a node but not all the GPUs, preventing other jobs from using those GPUs. Our cluster is highly heterogeneous, so setting something like MaxTasksPerNode doesn't really help. As a concrete example, consider a queue with two nodes, both with 8 GPUs but one with 16 CPUs and the other with 32 CPUs. If a user submits a 12-CPU, 1-GPU job (e.g. namd) and then another user submits fifteen 1-CPU, 1-GPU jobs (e.g. amber), whether or not all the jobs will be able to run depends on which node the 12-CPU job was initially scheduled on: if it lands on the 16-CPU node, only 4 CPUs remain alongside 7 idle GPUs, so three of the small jobs cannot start, whereas if it lands on the 32-CPU node, every job can run.
This pull request implements a ReservedCoresPerGPU partition option that prevents a job from running on a node if it does not leave at least "ReservedCoresPerGPU" cores free for each unused GPU on the node.
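To make the rule concrete, here is a minimal standalone sketch of the check the option implies. This is not the code from the patch, and the function and parameter names are made up for illustration; it just shows the inequality being enforced per node:

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical per-node admission check: the job may run only if the
 * cores it leaves free cover reserved_cores_per_gpu for every GPU it
 * leaves unused on the node.
 */
static bool job_fits(int free_cores, int free_gpus,
                     int job_cores, int job_gpus,
                     int reserved_cores_per_gpu)
{
    if (job_cores > free_cores || job_gpus > free_gpus)
        return false;                     /* not enough resources at all */

    int cores_left = free_cores - job_cores;
    int gpus_left  = free_gpus  - job_gpus;

    /* Reject if the leftover cores cannot serve the leftover GPUs. */
    return cores_left >= reserved_cores_per_gpu * gpus_left;
}

int main(void)
{
    /* With ReservedCoresPerGPU=1, the 12-core, 1-GPU job from the example
     * is rejected on the 16-core/8-GPU node (4 cores left for 7 idle GPUs)
     * but accepted on the 32-core/8-GPU node (20 cores left for 7 GPUs). */
    printf("16-core node: %s\n", job_fits(16, 8, 12, 1, 1) ? "fits" : "rejected");
    printf("32-core node: %s\n", job_fits(32, 8, 12, 1, 1) ? "fits" : "rejected");
    return 0;
}
```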
I would have preferred to reserve CPUs (hyperthreads) instead of cores, but it appears that even with CR_CPU, jobs aren't scheduled at this level of granularity.
I doubt the calculations are accurate for multi-node jobs. We (almost) never run multi-node GPU jobs, so it's difficult to justify the time it would take me to understand how those calculations work. It does at least work for simple cases.