exposing job cpus and gpus #119
base: main
Conversation
What is the motivation for this change? Did we have any user requests?
Is the idea to simply hide Slurm env variables and add a wrapper?
@skalyan yeah, this is to enable: facebookresearch/matrix#105 (comment)
It gives the info for Slurm or local nodes.
```python
        return int(os.environ.get("SLURM_CPUS_ON_NODE", 1))
    return int(max(os.cpu_count() or 0, 1))

@lru_cache(maxsize=1)
def get_gpus(self) -> int:
    if self.is_slurm_job():
        return int(os.environ.get("SLURM_GPUS_ON_NODE", 1))
    return sum(
```
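For context, the excerpt above can be rounded out into a self-contained sketch. The `get_cpus`/`get_gpus` names and the env-var defaults come from the diff; the `JobInfo` class name, the `SLURM_JOB_ID`-based `is_slurm_job` check, and the glob-based GPU fallback are assumptions (the actual `return sum(...)` body is truncated in the review view):

```python
import glob
import os
from functools import lru_cache


class JobInfo:
    """Illustrative stand-in for the class under review (name assumed)."""

    def is_slurm_job(self) -> bool:
        # Assumption: treat the process as a Slurm job when the standard
        # job-id variable is present in the environment.
        return "SLURM_JOB_ID" in os.environ

    @lru_cache(maxsize=1)
    def get_cpus(self) -> int:
        if self.is_slurm_job():
            # CPUs allocated to this job on the current node.
            return int(os.environ.get("SLURM_CPUS_ON_NODE", 1))
        # Outside Slurm: all CPUs visible on the local host.
        return int(max(os.cpu_count() or 0, 1))

    @lru_cache(maxsize=1)
    def get_gpus(self) -> int:
        if self.is_slurm_job():
            # GPUs allocated to this job on the current node.
            return int(os.environ.get("SLURM_GPUS_ON_NODE", 1))
        # Outside Slurm: count NVIDIA device files on the local host.
        # This fallback is an assumption; the PR's real one is cut off above.
        return sum(1 for _ in glob.glob("/dev/nvidia[0-9]*"))
```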
What is the intent of the ask?
Is it to know how many GPUs/CPUs are "present" on the node or "allocated" to this job?
Allocated to the job if it's a Slurm job; otherwise, what's present on the node.
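The distinction matters because inside a Slurm job the allocation can be smaller than the node's physical CPU count. A minimal illustration of the two sources (the helper name is hypothetical; the env-var names are Slurm's documented job environment):

```python
import os


def describe_cpu_source() -> str:
    # Inside a Slurm job, SLURM_CPUS_ON_NODE reflects what was *allocated*
    # to the job, not necessarily all CPUs physically on the node.
    if "SLURM_JOB_ID" in os.environ:
        return f"allocated: {os.environ.get('SLURM_CPUS_ON_NODE', '1')}"
    # Outside Slurm there is no allocation, so report what is *present*.
    return f"present: {os.cpu_count()}"
```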
If I may, I think we should stick to a clear and limited contract for an API: if we want a given API to return the GPUs allocated to a job, let's stick to that. I don't see the benefit of both the allocated and the provisioned GPU counts coming via the same API.
@skalyan I think this is fine for now; it matches how the other methods of this class behave. I'm open to changing position here as we see how it gets used.
Summary
Exposes a way to get the number of CPUs and GPUs from job information. Defaults to the local host if not on a Slurm cluster.
Test Plan
Works on Slurm and locally:
Local node with GPUs: