Here is a list of some key limitations we're aware of:
You can't create a Slurm cluster with nodes that don't have GPUs.
However, supporting CPU-only clusters is straightforward, and we're already working on it.
We currently provide a Terraform recipe only for the Nebius cloud.
While you can easily change the number of worker nodes by tweaking a value in the YAML manifest, this only works smoothly
when scaling up. If you scale the cluster down, the deleted nodes remain in the Slurm
controller's memory (and show up in `sinfo` output, for example). You can remove them manually (using `scontrol`) if
they bother you.
This doesn't prevent users from launching their jobs, and it will be fixed soon.
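Until automatic cleanup lands, stale nodes can be cleaned up by hand. A sketch of what that could look like, assuming a scaled-away worker named `worker-3` (a made-up name; adjust to your cluster). Note that `scontrol delete` applies to dynamically created nodes and requires a reasonably recent Slurm; these commands need a live cluster to run:

```shell
# List nodes as the controller sees them; removed workers may still
# appear here, e.g. in a "down" state.
sinfo -N -l

# Optionally mark the stale node down with a reason, for bookkeeping.
scontrol update NodeName=worker-3 State=DOWN Reason="scaled down"

# Remove the node from the controller's memory entirely.
scontrol delete NodeName=worker-3
```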
Slurm's ability to split clusters into several partitions (i.e. job queues) isn't supported yet.
We'll implement it if there is demand. The idea is that nodes in different partitions could vary: be equipped with different GPU models, use different container images, have different storage mounted, etc.
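For reference, in vanilla Slurm such partitions are defined in `slurm.conf`. A hypothetical sketch of the kind of setup described above (all node and partition names are made up):

```
# Two worker groups with different GPU models, each in its own partition.
NodeName=gpu-a100-[0-3] Gres=gpu:a100:8
NodeName=gpu-h100-[0-1] Gres=gpu:h100:8

# Partitions act as separate job queues.
PartitionName=a100 Nodes=gpu-a100-[0-3] Default=YES MaxTime=INFINITE State=UP
PartitionName=h100 Nodes=gpu-h100-[0-1] MaxTime=INFINITE State=UP
```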
Our list of supported software versions is pretty short right now:
- Linux distribution: Ubuntu 22.04.
- Slurm: version 24.05.5.
- CUDA: version 12.4.1.
- Kubernetes: >= 1.29.
- Versions of some preinstalled software packages can't be changed.
Other versions may work as well, but we haven't tested them yet. It would be great if someone from the community tried launching Soperator on a different setup and shared their feedback.
Although users can install or modify software in the shared environment, this doesn't apply to some low-level packages bound directly to GPUs (CUDA, NVIDIA drivers, the NVIDIA container toolkit, enroot, etc.).
Versions of such software must be explicitly supported in the container images Soperator uses.
While some Slurm configuration options should indeed be set by Soperator, there are others that people may want to customize themselves. Not all of them are configurable yet.
For example, Soperator sets some sysctl parameters on its own, and they can't be changed by the user.
Try this solution as it is, and if something doesn't work, let us know so we can fix it.
Only Linux users & groups are supported for now.