Preliminary ReservedCoresPerGPU implementation #185

Closed
dkoes wants to merge 4 commits

@dkoes dkoes commented Feb 16, 2020

The primary reason we have our GPU cluster is to use the GPUs (surprise). This means it is highly undesirable for jobs to consume all the CPUs on a node but not all the GPUs, preventing other jobs from using the remaining GPUs. Our cluster is highly heterogeneous, so setting something like MaxTasksPerNode doesn't really help. As a concrete example, consider a queue with two nodes, both with 8 GPUs, but one with 16 CPUs and the other with 32 CPUs. If a user submits a 12-CPU, 1-GPU job (e.g. namd) and then another user submits fifteen 1-CPU, 1-GPU jobs (e.g. amber), whether all the jobs can run depends on which node the 12-CPU job was initially scheduled on.
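
To make the arithmetic in this example explicit, here is a rough Python sketch (not part of the patch). The `Node` class and the `small_jobs_that_fit` helper are hypothetical names made up for illustration, and the model ignores everything except CPU and GPU counts:

```python
from dataclasses import dataclass


@dataclass
class Node:
    cpus: int
    gpus: int


def small_jobs_that_fit(nodes, big_job_node, big_cpus=12, big_gpus=1):
    """Count the 1-CPU, 1-GPU jobs that still fit once the large job
    has been placed on nodes[big_job_node]."""
    total = 0
    for i, node in enumerate(nodes):
        cpus, gpus = node.cpus, node.gpus
        if i == big_job_node:
            cpus -= big_cpus
            gpus -= big_gpus
        # Each small job needs one CPU and one GPU, so the scarcer
        # resource limits how many fit on this node.
        total += min(cpus, gpus)
    return total


nodes = [Node(cpus=16, gpus=8), Node(cpus=32, gpus=8)]
print(small_jobs_that_fit(nodes, big_job_node=0))  # 12: three jobs are left waiting
print(small_jobs_that_fit(nodes, big_job_node=1))  # 15: everything runs
```

Placing the 12-CPU job on the 16-CPU node leaves only 4 CPUs for 7 free GPUs, so three GPUs sit idle.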

This pull request implements a ReservedCoresPerGPU partition option that prevents a job from running on a node if it does not leave at least "ReservedCoresPerGPU" cores free for each unused GPU on the node.
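
The rule can be sketched roughly as follows. This is a simplified Python illustration, not the C code in the patch: `leaves_enough_cores` and its parameters are hypothetical names, it only considers a single node, and it treats each CPU in the example above as one core.

```python
def leaves_enough_cores(free_cores, free_gpus, job_cores, job_gpus,
                        reserved_cores_per_gpu):
    """Return True if the job may run on the node, i.e. it fits and still
    leaves reserved_cores_per_gpu free cores for every GPU it does not use."""
    remaining_cores = free_cores - job_cores
    remaining_gpus = free_gpus - job_gpus
    if remaining_cores < 0 or remaining_gpus < 0:
        return False  # the job does not fit on this node at all
    return remaining_cores >= reserved_cores_per_gpu * remaining_gpus


# 12-core, 1-GPU job with one core reserved per unused GPU:
print(leaves_enough_cores(16, 8, 12, 1, 1))  # False: 4 cores left for 7 GPUs
print(leaves_enough_cores(32, 8, 12, 1, 1))  # True: 20 cores left for 7 GPUs
```

With one reserved core per GPU, the 12-core job is kept off the 16-core node and lands on the 32-core node, which is the placement that lets all the jobs run.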

I would have preferred to reserve CPUs (hyperthreads) instead of cores, but it appears even with CR_CPU that jobs aren't scheduled at this level of granularity.

I doubt the calculations are accurate for multi-node jobs. We (almost) never run multi-node GPU jobs so it's difficult to justify the time it would take me to understand how these work. But it at least works for simple cases.

Still need to implement this as an option, but this reserves a core for
each remaining gpu so no job can make GPUs unavailable by taking up all
the cores on a node.
I'll be darned if I can figure out how to get it into the api though...
Found the one place I was missing.
@wickberg (Member)

Hi -

As noted in CONTRIBUTING.md, we do not accept Pull Requests through Github at this time. Please submit patches as attachments to new bugs to https://bugs.schedmd.com under the "C - Contributions" severity level.

Thanks!

@wickberg wickberg closed this Feb 16, 2020