This repository facilitates deploying and running distributed machine learning training sweeps on the Horeka cluster, using SLURM for job scheduling and Weights & Biases (WandB) for experiment tracking. It handles multiple parallel training runs, making use of multiple GPUs across different nodes.
This bash script (slurm_sweep.sh) automates the submission of multiple SLURM jobs, where each job corresponds to a training run defined within a WandB sweep. It allows specifying the number of GPUs per node.
Usage:
./slurm_sweep.sh <num_nodes> <sweep_id> <zip_file_path> [sbatch_options...]
Arguments:
<num_nodes>
: Number of nodes to deploy for the sweep.

<sweep_id>
: Identifier for the WandB sweep.

<zip_file_path>
: Path to the zip file containing the data or resources for training.

[sbatch_options...]
: Additional SBATCH options as needed.
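
For example, a hypothetical invocation (the sweep ID, zip path, and sbatch option below are placeholders, not values from this repository):

```bash
# Launch the sweep on 4 nodes with an additional time limit passed to sbatch.
./slurm_sweep.sh 4 myteam/myproject/abc123xy ./data/training_data.zip --time=02:00:00
```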
Executed by slurm_sweep.sh for each SLURM job submission, this script sets up the environment, unzips the data, and launches a WandB agent for each specified GPU.
Usage: TODO
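
Until the usage is documented, here is a minimal sketch of the per-job logic described above, assuming the sweep ID, zip path, and GPU count are passed as positional arguments; the actual interface and variable names may differ:

```bash
#!/bin/bash
# Sketch only: argument order, variable names, and defaults are assumptions.
SWEEP_ID=$1
ZIP_FILE=$2
GPUS_PER_NODE=${3:-4}

# Extract the input data to node-local scratch ($TMPDIR) for fast access.
unzip -q "$ZIP_FILE" -d "$TMPDIR/data"

# Launch one WandB agent per GPU, pinning each agent to a single device.
for ((gpu=0; gpu<GPUS_PER_NODE; gpu++)); do
    CUDA_VISIBLE_DEVICES=$gpu wandb agent "$SWEEP_ID" &
done

# Keep the SLURM job alive until all agents have finished.
wait
```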
Configured for the Horeka cluster:

- Initial data is contained within a zip file and is extracted to the temporary directory (`$TMPDIR`) on each node for faster processing during the sweep.
- Data persistence and output management after processing must be handled by the executed Python script included in the sweep, which should ensure that data is written back to a persistent storage location (`$HOME` / `$PROJECT`).
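
The repository leaves this copy-back step to the sweep's Python script; purely as an illustration of the pattern, a shell equivalent might look like the following (the directory names are assumptions):

```bash
# Sketch only: paths are assumptions. Copies results from node-local scratch
# back to persistent storage before the job's $TMPDIR is cleaned up.
RESULTS_DIR="$TMPDIR/results"
DEST_DIR="$HOME/sweep_outputs/$SLURM_JOB_ID"
mkdir -p "$DEST_DIR"
rsync -a "$RESULTS_DIR/" "$DEST_DIR/"
```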