Pausing jobs on Slurm systems #29

alchem0x2A · 2022-09-14T15:47:35Z

Related to #25 and #26. When VASP is run via srun in a Slurm environment, the MPI interface may or may not be exposed to the end user, so sending signals directly to srun does not work. According to the Slurm manual, the preferred way to pause / resume an srun step is as follows:

NOTE: A suspended job releases its CPUs for allocation to other jobs. Resuming a previously suspended job may result in multiple jobs being allocated the same CPUs, which could trigger gang scheduling with some configurations or severe degradation in performance with other configurations. Use of the scancel command to send SIGSTOP and SIGCONT signals would stop a job without releasing its CPUs for allocation to other jobs and would be a preferable mechanism in many cases. If performing system maintenance you may want to use suspend/resume in the following way. Before suspending set all nodes to draining or set all partitions to down so that no new jobs can be scheduled. Then suspend jobs. Once maintenance is done resume jobs then resume nodes and/or set all partitions back to up. Use with caution.

A simple test shows that it can be done with the following steps (thanks to Mark Glines for the hint):

1. Determine whether the current VASP_COMMAND contains an srun directive and whether we are actually in a Slurm environment by checking the environment variables.
2. Find the SLURM_JOB_ID of the current job (i.e. the one submitted by sbatch or salloc).
3. Find the VASP step of the current job with squeue -s --job <jobid>, which may show something like the following:

STEPID     NAME PARTITION     USER      TIME NODELIST
62793872.1 vasp_std interacti  ttian20     12:06 nid02338
62793872.intera interact interacti  ttian20     15:13 nid02338
62793872.extern   extern interacti  ttian20     15:13 nid02338

Step ID 62793872.1 is the VASP job step we want to pause / resume.
4. Send the TSTP signal to this step with scancel -s SIGTSTP 62793872.1. top should then show CPU usage dropping to 0.
5. Send the CONT signal to this step with scancel -s SIGCONT 62793872.1 to resume.

We may want to repeat steps 3-5 every time a pause / resume is involved, as the step ID may change.
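The steps above could be sketched in Python roughly as follows. This is only a minimal sketch, not the implementation that was eventually merged: all function names are hypothetical, and it assumes the VASP step can be recognised by a NAME column starting with "vasp" (as in the squeue output above) and that squeue and scancel are on the PATH. It uses -j, the short form of squeue's job filter.

```python
import os
import subprocess


def uses_srun_in_slurm(env=None):
    """Step 1: True if VASP_COMMAND invokes srun and SLURM_JOB_ID is set."""
    env = os.environ if env is None else env
    tokens = env.get("VASP_COMMAND", "").split()
    has_srun = any(os.path.basename(tok) == "srun" for tok in tokens)
    return has_srun and "SLURM_JOB_ID" in env


def find_step_id(squeue_output, name_prefix="vasp"):
    """Step 3: return the first STEPID whose NAME starts with name_prefix.

    The "vasp" prefix is an assumption based on the sample output above.
    """
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] == "STEPID":
            continue  # skip blank lines and the header row
        if fields[1].startswith(name_prefix):
            return fields[0]
    return None


def query_steps(job_id):
    """Run `squeue -s -j <jobid>` and return its text output."""
    result = subprocess.run(
        ["squeue", "-s", "-j", str(job_id)],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def signal_vasp_step(pause):
    """Steps 2-5: look up the step ID afresh, then send SIGTSTP or SIGCONT."""
    if not uses_srun_in_slurm():
        raise RuntimeError("VASP_COMMAND does not use srun inside a Slurm job")
    job_id = os.environ["SLURM_JOB_ID"]
    step_id = find_step_id(query_steps(job_id))
    if step_id is None:
        raise RuntimeError(f"no VASP step found for job {job_id}")
    signal_name = "SIGTSTP" if pause else "SIGCONT"
    subprocess.run(["scancel", "-s", signal_name, step_id], check=True)
    return step_id
```

Calling signal_vasp_step(True) would pause the step and signal_vasp_step(False) would resume it. The step ID is looked up on every call because, as noted above, it may change between runs.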

alchem0x2A · 2022-09-24T02:40:18Z

Closing, as all related PRs have been merged.

alchem0x2A added the bug and enhancement labels Sep 14, 2022

alchem0x2A closed this as completed Sep 24, 2022
