Related to #25 and #26. When VASP is launched with srun in a Slurm environment, the MPI interface may or may not be exposed to the end user, so sending signals directly to srun does not work. According to the Slurm manual, the preferred way to pause / resume an srun step is as follows:
NOTE: A suspended job releases its CPUs for allocation to other jobs. Resuming a previously suspended job may result in multiple jobs being allocated the same CPUs, which could trigger gang scheduling with some configurations or severe degradation in performance with other configurations. Use of the scancel command to send SIGSTOP and SIGCONT signals would stop a job without releasing its CPUs for allocation to other jobs and would be a preferable mechanism in many cases. If performing system maintenance you may want to use suspend/resume in the following way. Before suspending set all nodes to draining or set all partitions to down so that no new jobs can be scheduled. Then suspend jobs. Once maintenance is done resume jobs then resume nodes and/or set all partitions back to up. Use with caution.
A simple test shows that it can be done with the following steps (thanks to Mark Glines for the hint):
1. Determine whether the current VASP_COMMAND contains an srun directive and whether we are actually in a Slurm environment, by checking the environment variables.
2. Find the SLURM_JOB_ID of the current job (i.e. the one submitted by sbatch or salloc).
3. Find the VASP step of the current job with `squeue -s --job <jobid>`, which may show something like the following:
STEPID NAME PARTITION USER TIME NODELIST
62793872.1 vasp_std interacti ttian20 12:06 nid02338
62793872.intera interact interacti ttian20 15:13 nid02338
62793872.extern extern interacti ttian20 15:13 nid02338
Step ID 62793872.1 is the VASP job step we want to pause / resume.
4. Send the TSTP signal to this step with `scancel -s SIGTSTP 62793872.1`. `top` should then show the CPU usage dropping to 0.
5. Send the CONT signal to this step with `scancel -s SIGCONT 62793872.1` to resume.
We may want to repeat steps 3-5 every time a pause / resume is involved, as the step ID may change.
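The steps above could be sketched in Python roughly as follows. This is only a minimal illustration, not the project's implementation. The function names (`in_slurm_srun`, `find_vasp_stepid`, `signal_vasp_step`), the `name_prefix` default and the injectable `run` argument are hypothetical. It also assumes the step NAME column starts with `vasp`, as in the sample output above.

```python
import os
import subprocess


def in_slurm_srun(vasp_command, env=None):
    """Step 1: True if the command invokes srun and SLURM_JOB_ID is set."""
    env = os.environ if env is None else env
    uses_srun = any(os.path.basename(tok) == "srun" for tok in vasp_command.split())
    return uses_srun and "SLURM_JOB_ID" in env


def find_vasp_stepid(squeue_output, name_prefix="vasp"):
    """Step 3: return the first STEPID whose NAME starts with name_prefix.

    Assumes the whitespace-separated layout printed by `squeue -s`
    (STEPID in the first column, NAME in the second). Returns None if
    no matching step is found.
    """
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] == "STEPID":
            continue  # skip blank lines and the header row
        if fields[1].startswith(name_prefix):
            return fields[0]
    return None


def signal_vasp_step(signal_name, jobid=None, run=subprocess.run):
    """Steps 2-5: look up the VASP step again, then signal it via scancel.

    The step ID is queried on every call because it may change.
    `run` is injectable only so the sketch can be exercised without Slurm.
    """
    jobid = jobid or os.environ["SLURM_JOB_ID"]
    result = run(
        ["squeue", "-s", "--job", str(jobid)],
        capture_output=True, text=True, check=True,
    )
    stepid = find_vasp_stepid(result.stdout)
    if stepid is None:
        raise RuntimeError(f"no VASP step found for job {jobid}")
    run(["scancel", "-s", signal_name, stepid], check=True)
    return stepid
```

Pausing would then be `signal_vasp_step("SIGTSTP")` and resuming `signal_vasp_step("SIGCONT")`, both called from inside the sbatch / salloc allocation.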