@luccabb
Member

@luccabb luccabb commented Oct 8, 2025

Summary

Exposes a way to get the number of CPUs and GPUs from the job information. Defaults to the local host when not running on a Slurm cluster.

Test Plan

Works on Slurm and locally:

$ srun python -c "import clusterscope; print(clusterscope.get_job().get_gpus()); print(clusterscope.get_job().get_cpus())"
...
srun: job 1476042 has been allocated resources
1
2
$ python
Python 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:12:24) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import clusterscope
>>> clusterscope.get_job().get_gpus()
WARNING:root:No GPUs found or unable to retrieve GPU information
0
>>> clusterscope.get_job().get_cpus()
96

Local node with GPUs:

Type "help", "copyright", "credits" or "license" for more information.
>>> import clusterscope
>>> clusterscope.get_job().get_gpus()
2
>>> clusterscope.get_job().get_cpus()
80
>>> exit()

@meta-cla meta-cla bot added the CLA Signed label Oct 8, 2025
@luccabb luccabb marked this pull request as ready for review October 8, 2025 22:54
@luccabb luccabb requested review from gunchu and skalyan as code owners October 8, 2025 22:54
Contributor

@skalyan skalyan left a comment

What is the motivation for this change? Did we have any user requests?

Is the idea to simply hide Slurm env variables and add a wrapper?

@luccabb
Member Author

luccabb commented Oct 10, 2025

@skalyan yeah, this is to enable: facebookresearch/matrix#105 (comment)

> Is the idea to simply hide Slurm env variables and add a wrapper?

It returns the info for both Slurm jobs and local nodes.

Comment on lines +48 to +55
            return int(os.environ.get("SLURM_CPUS_ON_NODE", 1))
        return int(max(os.cpu_count() or 0, 1))

    @lru_cache(maxsize=1)
    def get_gpus(self) -> int:
        if self.is_slurm_job():
            return int(os.environ.get("SLURM_GPUS_ON_NODE", 1))
        return sum(
Contributor

What is the intent of the ask?

Is it to know how many GPUs/CPUs are "present" on the node or "allocated" to this job?

Member Author

Allocated to the job if it is a Slurm job; otherwise, what's present on the node.
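
A minimal sketch of the contract described in this reply, not the actual clusterscope implementation. The free-function layout, the `env` parameter, and the use of `SLURM_JOB_ID` to detect a Slurm job are assumptions made for illustration; the two `SLURM_*_ON_NODE` variables and their defaults of 1 come from the diff excerpt above. The local GPU count is passed in by the caller because the excerpt cuts off before showing how it is computed.

```python
import os
from typing import Mapping, Optional


def is_slurm_job(env: Mapping[str, str]) -> bool:
    # Assumption: a Slurm job is detected by the presence of SLURM_JOB_ID.
    return "SLURM_JOB_ID" in env


def get_cpus(env: Optional[Mapping[str, str]] = None) -> int:
    """CPUs allocated to the Slurm job, else CPUs present on the local host."""
    env = os.environ if env is None else env
    if is_slurm_job(env):
        # Falls back to 1 when the variable is unset, as in the excerpt.
        return int(env.get("SLURM_CPUS_ON_NODE", 1))
    return max(os.cpu_count() or 0, 1)


def get_gpus(
    local_gpu_count: int, env: Optional[Mapping[str, str]] = None
) -> int:
    """GPUs allocated to the Slurm job, else the caller-supplied local count."""
    env = os.environ if env is None else env
    if is_slurm_job(env):
        # Falls back to 1 when the variable is unset, as in the excerpt.
        return int(env.get("SLURM_GPUS_ON_NODE", 1))
    return local_gpu_count
```

Note that with these defaults a Slurm job whose environment lacks `SLURM_GPUS_ON_NODE` is reported as having one GPU, which is worth keeping in mind when reading the count.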

Contributor

If I may, I think we should stick to a clear and limited contract for an API. If a given API is meant to return the GPUs allocated to a job, let's stick to that. I don't see a benefit in returning either the allocated or the provisioned GPU count through the same API.
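
One way the narrower contract suggested here could look, as a hedged sketch only. Both function names are hypothetical and do not exist in clusterscope, and treating an unset `SLURM_GPUS_ON_NODE` as "no allocation information" is an assumption. The point is that the allocation query never silently falls back; any fallback to the node's GPU count becomes an explicit caller-side decision.

```python
from typing import Mapping, Optional


def allocated_gpus(env: Mapping[str, str]) -> Optional[int]:
    """GPUs allocated to the job on this node, or None if Slurm reports nothing."""
    # Assumption: an unset variable means there is no allocation to report.
    value = env.get("SLURM_GPUS_ON_NODE")
    return None if value is None else int(value)


def gpus_for_caller(env: Mapping[str, str], present: int) -> int:
    """Caller-side fallback: prefer the allocation, else the node's GPU count."""
    allocated = allocated_gpus(env)
    return present if allocated is None else allocated
```

A caller would pass `os.environ` as `env` and supply `present` from whatever local GPU probe it already uses.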

Member Author

@luccabb luccabb Oct 25, 2025

@skalyan I think this is fine for now; it matches how the other methods of this class behave. I'm open to changing my position here as we see how it gets used.

