@tomdurrant
Contributor

This pull request adds comprehensive SLURM backend examples and supporting documentation to the examples/backends directory. The primary focus is on demonstrating how to configure and validate SLURM jobs for high-performance computing (HPC) clusters using ROMPY, alongside providing a basic model configuration for backend testing. The README is updated to reflect these new examples and clarify usage.

SLURM Backend Example Scripts:

  • Added 05_slurm_backend_run.py, a detailed tutorial script with five example functions covering basic SLURM execution, advanced configuration, custom commands, dictionary-based configuration, and validation of SLURM parameters. This script is well-commented and includes logging for clarity.
  • Added basic_model_run.py, which provides a minimal ModelRun configuration for testing with different backends (local, docker, SLURM). This script is intended for reuse in backend testing scenarios.

Documentation Updates:

  • Overhauled README.md to focus on SLURM backend usage, describing new and existing example scripts, configuration files, and key features of ROMPY's SLURM support. The documentation now includes usage instructions and validation details for SLURM jobs.

Key themes covered:

Backend Example Expansion:

  • Multiple SLURM usage scenarios are provided, showing both basic and advanced configurations, custom commands, and validation techniques.

Documentation and Guidance:

  • The README is rewritten to guide users through SLURM backend setup, usage, and validation, and to highlight the new examples and configuration files.

Testing Support:

  • The basic model run script facilitates backend testing by providing a consistent model configuration for all environments.

Contributor

@benjaminleighton left a comment

scontrol might be better than squeue.
Specifying a queue should not be compulsory.

Contributor

Should this file exist?

Contributor Author

No, I'll remove it.

        "slurm",
        description="The backend type."
    )
    queue: str = Field(
Contributor

This should probably be optional. In our SLURM configuration I typically do not specify the queue; SLURM automatically uses the preconfigured default queue.

Contributor Author

Noted, I'll change that.
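
A minimal sketch of the optional-queue behaviour discussed in this thread. It uses a plain dataclass rather than the PR's pydantic model, and `SlurmOptionsSketch` and `sbatch_directives` are hypothetical names for illustration only. The idea is that the `--partition` directive is simply left out when no queue is given, so SLURM can fall back to the cluster's default.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SlurmOptionsSketch:
    """Hypothetical stand-in for the PR's SlurmConfig (not the real class)."""

    queue: Optional[str] = None  # None means "let SLURM use its default partition"
    time_limit: str = "01:00:00"


def sbatch_directives(opts: SlurmOptionsSketch) -> List[str]:
    """Build the header lines of a batch script from the options."""
    lines = ["#!/bin/bash"]
    # Only emit a partition directive when the user asked for a specific queue.
    if opts.queue is not None:
        lines.append(f"#SBATCH --partition={opts.queue}")
    lines.append(f"#SBATCH --time={opts.time_limit}")
    return lines
```

With this shape, `SlurmOptionsSketch()` yields a script with no partition line at all, while `SlurmOptionsSketch(queue="debug")` adds `#SBATCH --partition=debug`.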

# Get job status
try:
    result = subprocess.run(
        ['squeue', '-j', job_id, '-h', '-o', '%T'],
Contributor

scontrol is a better approach here. Jobs that complete or fail quickly are not identified, because they disappear from the queue before this runs.

Contributor Author

OK, I'm not familiar with this. From a quick look, it does seem like a better option. It also appears to expose additional functionality that we might want to think about. I'll have a look.
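
A rough sketch of what a `scontrol`-based status check could look like, offered as an illustration and not as the PR's implementation. `scontrol show job <id>` prints `Key=Value` fields that include `JobState`; the helper names `parse_job_state` and `get_job_state` are hypothetical. Parsing is kept separate from the subprocess call so it can be tested without a cluster.

```python
import re
import subprocess
from typing import Optional


def parse_job_state(scontrol_output: str) -> Optional[str]:
    """Extract the JobState value from `scontrol show job` output, if present."""
    match = re.search(r"\bJobState=(\S+)", scontrol_output)
    return match.group(1) if match else None


def get_job_state(job_id: str) -> Optional[str]:
    """Ask scontrol for a job's state; returns None if the job is unknown."""
    result = subprocess.run(
        ["scontrol", "show", "job", job_id],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is handled below instead of raising
    )
    if result.returncode != 0:
        return None
    return parse_job_state(result.stdout)
```

One caveat worth checking on the target cluster: finished jobs stay visible to `scontrol` only for a limited period set by the site configuration, so a long polling interval may still need an accounting query such as `sacct` as a fallback.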

logger.error(f"Timeout waiting for job {job_id} after {config.timeout} seconds")

# Try to cancel the job
try:
Contributor Author

On reflection, this is probably not a good idea. Python should not actively try to kill the job after the timeout; we should leave that to SLURM.
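
A sketch of the behaviour proposed here, under the assumption that the backend polls for a job state: on timeout the waiter simply stops and reports that no terminal state was seen, without issuing a cancel. `wait_for_job` and its parameters are hypothetical; the state names are standard SLURM job states.

```python
import time
from typing import Callable, Optional

# States after which SLURM will not run the job any further.
TERMINAL_STATES = {
    "COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY",
}


def wait_for_job(
    get_state: Callable[[], Optional[str]],
    timeout: float,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the job reaches a terminal state or the timeout elapses.

    Returns the terminal state, or None on timeout. The job is never
    cancelled here; enforcing limits is left to SLURM (e.g. --time).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_interval)
    return None  # timed out: stop waiting, but leave the job alone
```

A caller can treat `None` as "still running as far as we know" and log the job ID so the user can follow up with their usual SLURM tools.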

@tomdurrant
Contributor Author

Thanks Ben, I'll come back to you with some fixes

Contributor

@benjaminleighton left a comment

Nothing seems to happen when I try to get the SLURM backend to actually run a command; see the inline comments. A more complete test with a real model would be ideal.

for key, value in config.env_vars.items():
    script_lines.append(f"export {key}={value}")

# Add the actual command to run the model\n # First, check if there's a specific command in config, otherwise use the model's run method\n if hasattr(config, 'command') and config.command:\n script_lines.extend([\n \"\",\n \"# Execute custom command\",\n config.command,\n ])\n else:\n script_lines.extend([\n \"\",\n \"# Execute model using model_run.config.run() method\",\n \"python -c \\\"\",\n \"import sys\",\n \"import os\",\n \"sys.path.insert(0, os.getcwd())\",\n \"from rompy.model import ModelRun\",\n f\"model_run = ModelRun.from_dict({model_run.model_dump()})\",\n \"model_run.config.run(model_run)\",\n \"\\\"\",\n ])
Contributor

This does nothing, so specifying a command in SlurmConfig has no effect. Ideally there would be a unit test that actually runs a full, real model configuration with SLURM. If that isn't possible, the examples should demonstrate running a real model and a real command.
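
In the quoted hunk, everything after the export loop appears to sit on a single comment line containing literal `\n` sequences, which would explain why none of it executes. A minimal sketch of the branching that the comment seems to describe is given here; `command_lines` is a hypothetical helper and `python run_model.py` is a placeholder for whatever the default execution path should be.

```python
from typing import List, Optional


def command_lines(command: Optional[str], default_lines: List[str]) -> List[str]:
    """Return the script lines that actually launch the work.

    A user-supplied command takes priority; otherwise the default
    launch lines are used.
    """
    if command:
        return ["", "# Execute custom command", command]
    return ["", "# Execute model (default path)", *default_lines]


# Assemble a tiny script to show the custom command ending up in the output.
script_lines = ["#!/bin/bash"]
script_lines.extend(command_lines("echo hello", default_lines=["python run_model.py"]))
script = "\n".join(script_lines)
```

Asserting that the configured command appears in the generated script text is a cheap unit test that does not need a cluster, and it would have caught the no-op described in this comment.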

logger.info("Running custom command on SLURM...")

try:
    logger.info("✅ SlurmConfig with custom command validated successfully")
Contributor

This example doesn't seem to do much; it just checks that the constructor doesn't raise an error.

logger.error(f"❌ SLURM dictionary configuration failed: {e}")


def example_slurm_validation():
Contributor

Many of these examples should probably be tests
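
As an illustration of turning a validation example into a test, here is a small sketch in pytest style. `validate_time_limit` is a hypothetical stand-in for the PR's SlurmConfig validation, and it assumes only the `HH:MM:SS` form is accepted, which is narrower than what SLURM itself allows.

```python
import re

# Assumption for this sketch: only HH:MM:SS is treated as valid.
_TIME_RE = re.compile(r"^\d{1,2}:\d{2}:\d{2}$")


def validate_time_limit(value: str) -> str:
    """Hypothetical validator standing in for the PR's SlurmConfig checks."""
    if not _TIME_RE.match(value):
        raise ValueError(f"time limit must look like HH:MM:SS, got {value!r}")
    return value


def test_accepts_valid_time_limit():
    assert validate_time_limit("01:30:00") == "01:30:00"


def test_rejects_invalid_time_limit():
    try:
        validate_time_limit("ninety minutes")
    except ValueError:
        return  # expected: the invalid value was rejected
    raise AssertionError("expected ValueError for an invalid time limit")
```

Unlike a logging-based example, a test like this fails loudly when validation regresses, and it runs in CI without access to a scheduler.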
