Dannce on Slurm HPC
We offer several modules to assist with launching dannce on slurm-equipped high-performance clusters. They support parallel inference for center of mass (COM) and dannce keypoints, as well as parallel grid search for dannce training.
Differences between slurm clusters are accounted for using a slurm configuration file. The file consists of sbatch command-line arguments that specify the resources required for dannce operations on a slurm system. It also specifies the setup script that activates the appropriate dannce environment on your HPC. For example, this is the configuration file for the Harvard Cannon cluster (`cluster/holyoke.yaml`):
```yaml
# Dannce slurm configuration
dannce_train: "--job-name=trainDannce -p olveczkygpu,gpu --mem=80000 -t 3-00:00 --gres=gpu:1 -N 1 -n 8 --constraint=cc5.2 --exclude=holygpu7c1726"
dannce_train_grid: "--job-name=trainDannce -p olveczkygpu,gpu --mem=80000 -t 3-00:00 --gres=gpu:1 -N 1 -n 8 --constraint=cc5.2 --exclude=holygpu7c1726"
dannce_predict: "--job-name=predictDannce -p olveczkygpu,gpu,cox,gpu_requeue --mem=30000 -t 1-00:00 --gres=gpu:1 -N 1 -n 8 --constraint=cc5.2 --exclude=holygpu7c1726"
dannce_multi_predict: "--job-name=predictDannce -p olveczkygpu,gpu,cox,gpu_requeue --mem=30000 -t 0-03:00 --gres=gpu:1 -N 1 -n 8 --constraint=cc5.2 --exclude=holygpu7c1726"
# Com slurm configuration
com_train: "--job-name=trainCom -p olveczkygpu,gpu --mem=30000 -t 3-00:00 --gres=gpu:1 -N 1 -n 8 --constraint=cc5.2 --exclude=holygpu7c1726"
com_predict: "--job-name=predictCom -p olveczkygpu,gpu,cox,gpu_requeue --mem=10000 -t 1-00:00 --gres=gpu:1 -N 1 -n 8 --constraint=cc5.2 --exclude=holygpu7c1726"
com_multi_predict: "--job-name=predictCom -p olveczkygpu,gpu,cox,gpu_requeue --mem=10000 -t 0-03:00 --gres=gpu:1 -N 1 -n 8 --constraint=cc5.2 --exclude=holygpu7c1726"
# Inference
inference: '--job-name=inference -p olveczky,shared --mem=30000 -t 3-00:00 -N 1 -n 8 --constraint="intel&avx2"'
# Setup functions (optional, set to "" if no setup is required. Trailing ; is required)
setup: "module load Anaconda3/2020.11; source activate dannce;"
```
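The Harvard-specific partition names, constraints, and node exclusions above will not transfer to other clusters. As a sketch only, a minimal configuration for a generic cluster might look like the following; the partition name `gpu` and the memory/time limits are placeholders, not recommendations, and should be replaced with values appropriate to your system:

```yaml
# Hypothetical minimal slurm configuration (partition names and limits are placeholders)
dannce_train: "--job-name=trainDannce -p gpu --mem=80000 -t 3-00:00 --gres=gpu:1 -N 1 -n 8"
dannce_train_grid: "--job-name=trainDannce -p gpu --mem=80000 -t 3-00:00 --gres=gpu:1 -N 1 -n 8"
dannce_predict: "--job-name=predictDannce -p gpu --mem=30000 -t 1-00:00 --gres=gpu:1 -N 1 -n 8"
dannce_multi_predict: "--job-name=predictDannce -p gpu --mem=30000 -t 0-03:00 --gres=gpu:1 -N 1 -n 8"
com_train: "--job-name=trainCom -p gpu --mem=30000 -t 3-00:00 --gres=gpu:1 -N 1 -n 8"
com_predict: "--job-name=predictCom -p gpu --mem=10000 -t 1-00:00 --gres=gpu:1 -N 1 -n 8"
com_multi_predict: "--job-name=predictCom -p gpu --mem=10000 -t 0-03:00 --gres=gpu:1 -N 1 -n 8"
inference: "--job-name=inference -p shared --mem=30000 -t 3-00:00 -N 1 -n 8"
# No environment setup needed on this hypothetical cluster
setup: ""
```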
Your slurm configuration file should be included as a key-value pair in your COM and dannce configuration files as follows:

```yaml
slurm_config: /path/to/slurm_configuration.yaml
```
There are several command-line functions that facilitate deployment on a slurm HPC. They all accept the path to model configuration files as inputs, denoted here by `$com_config` and `$dannce_config`. For each command, the list of command-line arguments is available through the `--help` argument.
`com-train-sbatch $com_config`
Submit a COM training job to the cluster.

`com-predict-sbatch $com_config`
Submit a single-process COM prediction job to the cluster.

`com-predict-multi-gpu $com_config`
Submit a multi-process COM prediction job to the cluster. (Results can be merged with `com-merge $com_config`.)

`dannce-train-sbatch $dannce_config`
Submit a dannce training job to the cluster.

`dannce-train-grid $dannce_config /path/to/training_params.yaml`
Submit multiple training jobs to the cluster, specified by the `training_params.yaml` file (described below).

`dannce-predict-sbatch $dannce_config`
Submit a single-instance, single-process dannce prediction job to the cluster.

`dannce-predict-multi-gpu $dannce_config`
Submit a single-instance, multi-process dannce prediction job to the cluster. (Results can be merged with `dannce-merge $dannce_config`.)
`dannce-inference-sbatch $com_config $dannce_config`
For each instance:
- Submit a multi-process COM prediction job to the cluster
- Merge the results
- Submit a multi-process dannce prediction job to the cluster
- Merge the results

Requires that the COM and dannce networks are already trained and specified in `io.yaml`.
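Because each step above depends on the previous one finishing, this kind of pipeline is typically enforced on slurm with job dependencies (`--dependency=afterok`). The following Python sketch illustrates that general pattern; the `submit` helper and the job script names are hypothetical illustrations, not part of the dannce package:

```python
import re
import subprocess


def build_sbatch_cmd(args, after=None):
    """Build an sbatch command list, optionally adding an afterok dependency."""
    cmd = ["sbatch"]
    if after is not None:
        # The job will only start once job `after` completes successfully.
        cmd.append(f"--dependency=afterok:{after}")
    return cmd + list(args)


def submit(args, after=None):
    """Submit a job and return its ID (sbatch prints 'Submitted batch job <id>')."""
    out = subprocess.run(
        build_sbatch_cmd(args, after), capture_output=True, text=True, check=True
    ).stdout
    return re.search(r"\d+", out).group()


# Hypothetical chain mirroring the four steps for a single instance:
# com_id    = submit(["com_predict.sh"])
# merged    = submit(["com_merge.sh"], after=com_id)
# dannce_id = submit(["dannce_predict.sh"], after=merged)
# submit(["dannce_merge.sh"], after=dannce_id)
```

Each submission returns immediately; slurm itself holds the dependent jobs until their predecessors succeed.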
Training parameters for grid search can be defined in a `training_params.yaml` file. These incorporate changes to the base dannce configuration file used in `dannce-train-grid`. The `training_params.yaml` file consists of a list of dictionaries specifying the desired parameters to change. For example, the following parameters file will launch two training jobs: one using the default loss function, and the other using an L1 loss.
```yaml
batch_params:
  - dannce_finetune_weights: /n/holylfs02/LABS/olveczky_lab/Diego/code/dannce/demo/markerless_mouse_1/DANNCE/weights/weights.rat.MAX/
    data_split_seed: 42
    dannce_train_dir: ./DANNCE/FT_MAX
  - dannce_finetune_weights: /n/holylfs02/LABS/olveczky_lab/Diego/code/dannce/demo/markerless_mouse_1/DANNCE/weights/weights.rat.MAX/
    data_split_seed: 42
    loss: mask_nan_l1_loss
    dannce_train_dir: ./DANNCE/FT_MAX_L1
```
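Conceptually, each entry in `batch_params` is overlaid on the base dannce configuration to produce one training job, with the entry's keys taking precedence. A minimal Python sketch of that overlay (the config keys and values here are illustrative, and the actual dannce implementation may differ):

```python
def expand_grid(base_config, batch_params):
    """Produce one full config per batch_params entry by overlaying it on the base."""
    # Later keys win in a dict merge, so each entry overrides the base values it names.
    return [{**base_config, **params} for params in batch_params]


# Illustrative base config values (not the real dannce defaults)
base = {"loss": "mask_nan_loss", "epochs": 100}
batch = [
    {"data_split_seed": 42, "dannce_train_dir": "./DANNCE/FT_MAX"},
    {"data_split_seed": 42, "loss": "mask_nan_l1_loss", "dannce_train_dir": "./DANNCE/FT_MAX_L1"},
]
jobs = expand_grid(base, batch)
# jobs[0] keeps the base loss; jobs[1] overrides it with the L1 loss
```

Each resulting dictionary corresponds to one submitted training job, which is why the two entries above write to separate `dannce_train_dir` directories.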