Slurm backend #9
base: main
Co-authored-by: Copilot <[email protected]>
benjaminleighton
left a comment
scontrol might be better than squeue
Specifying a queue should not be compulsory.
Should this file exist?
No. I'll remove
| "slurm", | ||
| description="The backend type." | ||
| ) | ||
| queue: str = Field( |
This should probably be optional. In our Slurm configuration I typically do not specify the queue; Slurm automatically uses the preconfigured default queue instead.
Noted, I'll change that
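One possible shape for the optional queue, as a minimal sketch rather than the PR's actual code: `SlurmOptions` and `sbatch_header` are hypothetical names, and a plain dataclass stands in for the pydantic model so the snippet runs with the standard library alone. The idea is that `--partition` is only written when a queue is given, so Slurm falls back to its default partition otherwise.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SlurmOptions:
    """Hypothetical stand-in for the PR's SlurmConfig; only the fields used here."""

    job_name: str = "rompy"
    queue: Optional[str] = None  # None means: let Slurm choose its default partition


def sbatch_header(opts: SlurmOptions) -> List[str]:
    """Build the #SBATCH header, emitting --partition only when a queue is set."""
    lines = ["#!/bin/bash", f"#SBATCH --job-name={opts.job_name}"]
    if opts.queue:
        lines.append(f"#SBATCH --partition={opts.queue}")
    return lines
```

With `SlurmOptions()` the header contains no partition line at all; with `SlurmOptions(queue="work")` it gains `#SBATCH --partition=work`.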
src/rompy/run/slurm.py
Outdated
| # Get job status | ||
| try: | ||
| result = subprocess.run( | ||
| ['squeue', '-j', job_id, '-h', '-o', '%T'], |
scontrol is a better approach here. Jobs that complete or fail quickly are not identified, because they disappear from the queue before this runs.
OK, I'm not familiar with this. From a quick look, it does look like a better option. It also seems to expose additional functionality that we might want to think about. I'll have a look.
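A rough sketch of what the scontrol-based check could look like, assuming the usual `key=value` layout of `scontrol show job` output with a `JobState=` field. The function names are made up for illustration, and this is not the PR's implementation. Note that scontrol only remembers finished jobs for a limited retention window, so a job that ended long ago may still come back as unknown.

```python
import re
import subprocess
from typing import Optional

# `scontrol show job <id>` prints key=value pairs, one of which is JobState=...
_STATE_RE = re.compile(r"\bJobState=(\S+)")


def parse_job_state(scontrol_output: str) -> Optional[str]:
    """Extract the JobState value from scontrol output, or None if absent."""
    match = _STATE_RE.search(scontrol_output)
    return match.group(1) if match else None


def get_job_state(job_id: str) -> Optional[str]:
    """Ask scontrol for the job state; returns None if the job is unknown."""
    result = subprocess.run(
        ["scontrol", "show", "job", job_id],
        capture_output=True,
        text=True,
        timeout=30,
    )
    if result.returncode != 0:
        return None
    return parse_job_state(result.stdout)
```

Keeping the parsing separate from the subprocess call means the parser can be unit tested against captured output without a Slurm installation.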
src/rompy/run/slurm.py
Outdated
| logger.error(f"Timeout waiting for job {job_id} after {config.timeout} seconds") | ||
|
|
||
| # Try to cancel the job | ||
| try: |
Rethinking this, it is probably not a good idea. Python should not actively try to kill the job after the timeout; we should leave that to Slurm.
Thanks Ben, I'll come back to you with some fixes.
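If the time limit is left to Slurm, one option is to hand the configured timeout to the scheduler as a `--time` directive and let the Python side simply stop waiting. A small sketch under that assumption follows; `slurm_time_limit` and `time_directive` are hypothetical helpers, and the `days-hours:minutes:seconds` form is one of the time formats sbatch accepts.

```python
def slurm_time_limit(seconds: int) -> str:
    """Format a duration in seconds as days-hours:minutes:seconds for sbatch."""
    if seconds <= 0:
        raise ValueError("time limit must be positive")
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return f"{days}-{hours:02d}:{minutes:02d}:{secs:02d}"


def time_directive(seconds: int) -> str:
    """Return the #SBATCH line that makes Slurm enforce the limit itself."""
    return f"#SBATCH --time={slurm_time_limit(seconds)}"
```

For example, `time_directive(3600)` yields `#SBATCH --time=0-01:00:00`. With this in the job script, no cancel call is needed from Python when the wait times out.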
benjaminleighton
left a comment
Nothing seems to happen when I try to get SlurmBackend to actually run a command; see the inline comments. A more complete test with a real model would be ideal.
| for key, value in config.env_vars.items(): | ||
| script_lines.append(f"export {key}={value}") | ||
|
|
||
| # Add the actual command to run the model\n # First, check if there's a specific command in config, otherwise use the model's run method\n if hasattr(config, 'command') and config.command:\n script_lines.extend([\n \"\",\n \"# Execute custom command\",\n config.command,\n ])\n else:\n script_lines.extend([\n \"\",\n \"# Execute model using model_run.config.run() method\",\n \"python -c \\\"\",\n \"import sys\",\n \"import os\",\n \"sys.path.insert(0, os.getcwd())\",\n \"from rompy.model import ModelRun\",\n f\"model_run = ModelRun.from_dict({model_run.model_dump()})\",\n \"model_run.config.run(model_run)\",\n \"\\\"\",\n ]) |
This does nothing, so specifying a command in SlurmConfig has no effect. Ideally there would be a unit test that actually runs a full, real model configuration with Slurm. If that isn't possible, the examples should demonstrate running a real model and a real command.
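For illustration, a sketch of script generation in which the command ends up as an executable line rather than inside a comment. `build_script_body` is a hypothetical helper, not the PR's function, and the fallback that runs the model when no command is set is deliberately left out. Environment values are passed through `shlex.quote`, since an unquoted `export KEY=value` breaks on values containing spaces.

```python
import shlex
from typing import Dict, List, Optional


def build_script_body(command: Optional[str], env_vars: Dict[str, str]) -> List[str]:
    """Return script lines that export the environment and then run `command`."""
    lines: List[str] = []
    for key, value in env_vars.items():
        # Quote the value so spaces and shell metacharacters survive intact.
        lines.append(f"export {key}={shlex.quote(value)}")
    if not command:
        # Assumption for this sketch: a command is required; the model-run
        # fallback from the PR is not reproduced here.
        raise ValueError("no command configured for the Slurm job")
    lines.extend(["", "# Execute custom command", command])
    return lines
```

A unit test can then assert that the command appears among the non-comment lines, which would have caught the problem described above.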
| logger.info("Running custom command on SLURM...") | ||
|
|
||
| try: | ||
| logger.info("✅ SlurmConfig with custom command validated successfully") |
This example doesn't seem to do much; it just checks that the constructor doesn't error out.
| logger.error(f"❌ SLURM dictionary configuration failed: {e}") | ||
|
|
||
|
|
||
| def example_slurm_validation(): |
Many of these examples should probably be tests
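As a sketch of how one of these examples might become a test that runs without a cluster, the snippet below mocks `subprocess.run` and checks both the sbatch invocation and the parsed job ID. `submit_job` is a hypothetical helper and not the PR's API; it relies on `sbatch --parsable`, which prints the job ID, optionally followed by `;cluster`.

```python
import subprocess
from unittest import mock


def submit_job(script_path: str) -> str:
    """Submit a script with sbatch and return the job ID as a string."""
    result = subprocess.run(
        ["sbatch", "--parsable", script_path],
        capture_output=True,
        text=True,
        check=True,
    )
    # --parsable output is "jobid" or "jobid;cluster"; keep only the ID.
    return result.stdout.strip().split(";")[0]


def test_submit_job_returns_job_id():
    fake = subprocess.CompletedProcess(
        args=[], returncode=0, stdout="12345\n", stderr=""
    )
    with mock.patch("subprocess.run", return_value=fake) as run:
        assert submit_job("job.sh") == "12345"
    # The first positional argument is the command list passed to sbatch.
    assert run.call_args[0][0] == ["sbatch", "--parsable", "job.sh"]
```

A mocked test like this only covers the plumbing; it does not replace the end-to-end run with a real model requested above.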
This pull request adds comprehensive SLURM backend examples and supporting documentation to the examples/backends directory. The primary focus is on demonstrating how to configure and validate SLURM jobs for high-performance computing (HPC) clusters using ROMPY, alongside providing a basic model configuration for backend testing. The README is updated to reflect these new examples and clarify usage.

SLURM Backend Example Scripts:
- 05_slurm_backend_run.py, a detailed tutorial script with five example functions covering basic SLURM execution, advanced configuration, custom commands, dictionary-based configuration, and validation of SLURM parameters. This script is well-commented and includes logging for clarity.
- basic_model_run.py, which provides a minimal ModelRun configuration for testing with different backends (local, docker, SLURM). This script is intended for reuse in backend testing scenarios.

Documentation Updates:
- README.md, updated to focus on SLURM backend usage, describing new and existing example scripts, configuration files, and key features of ROMPY's SLURM support. The documentation now includes usage instructions and validation details for SLURM jobs.

Key themes covered:
- Backend Example Expansion
- Documentation and Guidance
- Testing Support