Slurm backend #9

tomdurrant · 2025-10-20T07:05:25Z

This pull request adds comprehensive SLURM backend examples and supporting documentation to the examples/backends directory. The primary focus is on demonstrating how to configure and validate SLURM jobs for high-performance computing (HPC) clusters using ROMPY, alongside providing a basic model configuration for backend testing. The README is updated to reflect these new examples and clarify usage.

SLURM Backend Example Scripts:

Added 05_slurm_backend_run.py, a detailed tutorial script with five example functions covering basic SLURM execution, advanced configuration, custom commands, dictionary-based configuration, and validation of SLURM parameters. This script is well-commented and includes logging for clarity.
Added basic_model_run.py, which provides a minimal ModelRun configuration for testing with different backends (local, docker, SLURM). This script is intended for reuse in backend testing scenarios.

Documentation Updates:

Overhauled README.md to focus on SLURM backend usage, describing new and existing example scripts, configuration files, and key features of ROMPY's SLURM support. The documentation now includes usage instructions and validation details for SLURM jobs.

Key themes covered:

Backend Example Expansion:

Multiple SLURM usage scenarios are provided, showing both basic and advanced configurations, custom commands, and validation techniques.

Documentation and Guidance:

The README is rewritten to guide users through SLURM backend setup, usage, and validation, and to highlight the new examples and configuration files.

Testing Support:

The basic model run script facilitates backend testing by providing a consistent model configuration for all environments.

Co-authored-by: Copilot <[email protected]>

benjaminleighton

scontrol might be better than squeue
specifying a queue should not be compulsory

benjaminleighton · 2025-10-22T02:36:08Z

src/rompy/backends/config_slurm_fixed.py

Should this file exist?

No. I'll remove it.

benjaminleighton · 2025-10-22T02:37:37Z

src/rompy/backends/config.py

+        "slurm", 
+        description="The backend type."
+    )
+    queue: str = Field(


This should probably be optional. Typically in our Slurm configuration I do not specify the queue; Slurm automatically uses the preconfigured default queue instead.

Noted, I'll change that
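
For illustration, a minimal sketch of what an optional queue could look like when the batch header is built. `SlurmOptions` and `sbatch_header` are hypothetical stand-ins written with a plain dataclass, not the PR's actual `SlurmConfig`; the only Slurm detail relied on is the `--partition` option of `sbatch`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SlurmOptions:
    """Hypothetical stand-in for the PR's SlurmConfig (illustration only)."""

    job_name: str = "rompy"
    # None means "do not request a partition"; Slurm then applies the
    # cluster's preconfigured default.
    queue: Optional[str] = None


def sbatch_header(opts: SlurmOptions) -> List[str]:
    """Build the #SBATCH header, adding --partition only when a queue is set."""
    lines = ["#!/bin/bash", f"#SBATCH --job-name={opts.job_name}"]
    if opts.queue:
        lines.append(f"#SBATCH --partition={opts.queue}")
    return lines
```

With `SlurmOptions()` the header carries no partition directive at all, while `SlurmOptions(queue="compute")` adds `#SBATCH --partition=compute`.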

benjaminleighton · 2025-10-22T02:39:02Z

src/rompy/run/slurm.py

+            # Get job status
+            try:
+                result = subprocess.run(
+                    ['squeue', '-j', job_id, '-h', '-o', '%T'],


scontrol is a better approach here. Fast-completing or fast-failing jobs are not identified, because they disappear from the queue before this runs.

OK, I'm not familiar with this. From a quick look, it does seem like a better option. It also appears to expose additional functionality that we might want to think about. I'll have a look.
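
A rough sketch of the scontrol-based status check discussed here, assuming the usual `JobState=<STATE>` field in the output of `scontrol show job <id>`. The helper names `parse_job_state` and `get_job_state` are hypothetical and not part of the PR. Note that scontrol also only retains finished jobs for a limited, cluster-configured period, so a longer-lived check may need accounting data instead.

```python
import re
import subprocess
from typing import Optional

# scontrol prints fields as Key=Value pairs, e.g. "JobState=COMPLETED Reason=None".
_JOB_STATE = re.compile(r"\bJobState=(\S+)")


def parse_job_state(scontrol_output: str) -> Optional[str]:
    """Extract the JobState value from scontrol output, or None if absent."""
    match = _JOB_STATE.search(scontrol_output)
    return match.group(1) if match else None


def get_job_state(job_id: str) -> Optional[str]:
    """Ask scontrol for a job's state; None if the job is unknown or the call fails."""
    result = subprocess.run(
        ["scontrol", "show", "job", job_id],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return parse_job_state(result.stdout)
```

Keeping the parsing separate from the subprocess call means the parser can be unit-tested against captured output without a Slurm installation.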

tomdurrant · 2025-10-22T02:50:24Z

src/rompy/run/slurm.py

+                logger.error(f"Timeout waiting for job {job_id} after {config.timeout} seconds")
+
+                # Try to cancel the job
+                try:


On reflection, this is probably not a good idea. Python should not actively try to kill the job after the timeout; we should leave that to Slurm.
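
One possible shape for that behaviour, as a sketch only: the wait loop gives up when the client-side timeout expires and reports that it stopped waiting, without calling `scancel`. The job's own wall-time limit (for example `#SBATCH --time`) remains the thing Slurm enforces. `wait_for_job` and its parameters are hypothetical names, and the set of terminal states is an assumed subset of Slurm's job states.

```python
import time
from typing import Callable, Optional

# Assumed subset of Slurm job states after which the job will not run again.
TERMINAL_STATES = {
    "COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY",
}


def wait_for_job(
    get_state: Callable[[], Optional[str]],
    timeout: float,
    poll_interval: float = 10.0,
) -> Optional[str]:
    """Poll until the job reaches a terminal state or the timeout expires.

    Returns the terminal state, or None if we stopped waiting. The job is
    deliberately left alone on timeout: no cancellation is attempted.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_interval)
    return None
```

A caller can treat a `None` result as "still running as far as we know" and log the job ID so the user can follow up with Slurm's own tools.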

tomdurrant · 2025-10-22T02:51:49Z

Thanks Ben, I'll come back to you with some fixes

benjaminleighton

Nothing seems to happen when I try to get the Slurm backend to actually run a command; see the inline comments. A more complete test with a real model would be ideal.

benjaminleighton · 2025-11-05T00:18:16Z

src/rompy/run/slurm.py

+        for key, value in config.env_vars.items():
+            script_lines.append(f"export {key}={value}")
+
+        # Add the actual command to run the model\n        # First, check if there's a specific command in config, otherwise use the model's run method\n        if hasattr(config, 'command') and config.command:\n            script_lines.extend([\n                \"\",\n                \"# Execute custom command\",\n                config.command,\n            ])\n        else:\n            script_lines.extend([\n                \"\",\n                \"# Execute model using model_run.config.run() method\",\n                \"python -c \\\"\",\n                \"import sys\",\n                \"import os\",\n                \"sys.path.insert(0, os.getcwd())\",\n                \"from rompy.model import ModelRun\",\n                f\"model_run = ModelRun.from_dict({model_run.model_dump()})\",\n                \"model_run.config.run(model_run)\",\n                \"\\\"\",\n            ])


This does nothing, so specifying a command in SlurmConfig has no effect. Ideally there would be a unit test that actually runs a full, real model configuration with Slurm. If that isn't possible, the examples should demonstrate running a real model and a real command.
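
One reading of the hunk above is that the `\n` sequences are literal characters on a single `#` comment line, so everything after the first `#` is part of the comment and never executes. Under that assumption, a minimal sketch of the custom-command branch written as real statements, with a check that the command reaches the generated script. `command_lines` and `build_script` are hypothetical helpers, not the PR's functions.

```python
from typing import List, Optional


def command_lines(command: Optional[str]) -> List[str]:
    """Return the script lines that execute a custom command, if one is set."""
    if not command:
        return []
    return ["", "# Execute custom command", command]


def build_script(command: Optional[str] = None) -> str:
    """Assemble a minimal batch script (header plus optional custom command)."""
    script_lines = ["#!/bin/bash", "#SBATCH --job-name=rompy"]
    script_lines.extend(command_lines(command))
    return "\n".join(script_lines) + "\n"


# A regression check of the kind requested in the review: the configured
# command must appear as its own line in the generated script.
assert "echo hello" in build_script("echo hello").splitlines()
```

A check like this does not replace a run against a real scheduler, but it would catch the case where the configured command never makes it into the script.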

benjaminleighton · 2025-11-05T00:20:02Z

examples/backends/05_slurm_backend_run.py

+        logger.info("Running custom command on SLURM...")
+
+        try:
+            logger.info("✅ SlurmConfig with custom command validated successfully")


This example doesn't seem to do much; it just checks that the constructor doesn't raise an error.

benjaminleighton · 2025-11-05T00:20:53Z

examples/backends/05_slurm_backend_run.py

+        logger.error(f"❌ SLURM dictionary configuration failed: {e}")
+
+
+def example_slurm_validation():


Many of these examples should probably be tests

tomdurrant and others added 17 commits October 6, 2025 22:05

Initial implementation

e1bcd88

Polish and tests

6722b64

Replaced subprocess docker calls with docker python library

3537db0

Update tests/integration/test_docker_backend.py

dbdc2be

Co-authored-by: Copilot <[email protected]>

Fixed failing test

9d357b6

log_box imported twice

e953b15

Replace deprecated utcnow

7e691c1

Suppressing numpy incompatibility warnings

ac72904

Definitive fix for the numpy warning in the tests

7432dcc

Run ruff across the repo

2499cf0

Add to extra dependencies the remote dependencies for cloudpathlib

ccde632

Merge branch 'main' into slurm-backend

33c149d

Added slurm examples

5367959

Clened up example backends

516d1e8

Fixed loging in dockers

c0e1898

Added basic backed run examples

d2260e5

fixed testing

d6c9d93

benjaminleighton reviewed Oct 22, 2025

tomdurrant commented Oct 22, 2025

Address comments in PR

5bedde0

benjaminleighton reviewed Nov 5, 2025

4 participants