Pausing jobs on Slurm systems #29

Closed
alchem0x2A opened this issue Sep 14, 2022 · 1 comment
Labels: bug, enhancement

Comments


alchem0x2A commented Sep 14, 2022

Related to #25 and #26. When running VASP through srun in a Slurm environment, the MPI interface may or may not be exposed to the end user, so sending signals directly to srun does not work. From the Slurm manual, the preferred way to pause / resume an srun step is as follows:

NOTE: A suspended job releases its CPUs for allocation to other jobs. Resuming a previously suspended job may result in multiple jobs being allocated the same CPUs, which could trigger gang scheduling with some configurations or severe degradation in performance with other configurations. Use of the scancel command to send SIGSTOP and SIGCONT signals would stop a job without releasing its CPUs for allocation to other jobs and would be a preferable mechanism in many cases. If performing system maintenance you may want to use suspend/resume in the following way. Before suspending set all nodes to draining or set all partitions to down so that no new jobs can be scheduled. Then suspend jobs. Once maintenance is done resume jobs then resume nodes and/or set all partitions back to up. Use with caution.

A simple test shows that it can be done with the following steps (thanks to Mark Glines for the hint):

  1. Determine whether the current VASP_COMMAND contains an srun directive and whether we are actually in a Slurm environment, by checking the environment variables.
  2. Find the SLURM_JOB_ID of the current job (i.e. the one submitted by sbatch or salloc).
  3. Find the VASP step of the current job with squeue -s --job <jobid>, which may show something like the following:
STEPID           NAME      PARTITION  USER     TIME   NODELIST
62793872.1       vasp_std  interacti  ttian20  12:06  nid02338
62793872.intera  interact  interacti  ttian20  15:13  nid02338
62793872.extern  extern    interacti  ttian20  15:13  nid02338

Step ID 62793872.1 is the VASP job step we want to pause / resume.
  4. Send the TSTP signal to this step with scancel -s SIGTSTP 62793872.1. top should then show CPU usage dropping to 0.
  5. Send the CONT signal to this step with scancel -s SIGCONT 62793872.1 to resume.
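
The detection and lookup in steps 1-3 could be sketched roughly as below. This is only an illustration, not the code that was merged: the function names (uses_srun, find_step_id) and the "vasp" name prefix are assumptions made for the example, and the parser simply assumes the default squeue -s column layout shown above (STEPID first, NAME second).

```python
import os
import shlex


def uses_srun(command, env):
    """Return True if `command` invokes srun and `env` looks like a Slurm job.

    `command` is the VASP launch command string; `env` is a mapping such as
    os.environ. Both checks are heuristics assumed for this sketch.
    """
    tokens = shlex.split(command or "")
    has_srun = any(os.path.basename(tok) == "srun" for tok in tokens)
    return has_srun and "SLURM_JOB_ID" in env


def find_step_id(squeue_output, name_prefix="vasp"):
    """Return the first STEPID whose NAME starts with `name_prefix`, else None.

    `squeue_output` is the text printed by `squeue -s -j <jobid>` with the
    default columns, where STEPID is column 1 and NAME is column 2.
    """
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] == "STEPID":
            continue  # skip blank lines and the header row
        step_id, name = fields[0], fields[1]
        if name.startswith(name_prefix):
            return step_id
    return None
```

In a real job the text passed to find_step_id would come from running squeue -s -j with the value of SLURM_JOB_ID; matching on the step name is a guess that works for the output above but may need adjusting if the VASP binary is named differently.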

We may want to repeat steps 3-5 every time a pause / resume is involved, as the step ID may change.
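
One way to make sure the step ID is never reused across pause / resume calls is to resolve it again right before each scancel. The sketch below assumes a hypothetical lookup_step callable that maps a job ID to the current VASP step ID (or None); the function names and the injectable run and env parameters are illustrative choices made for the example, not part of any existing API.

```python
import os
import subprocess


def signal_vasp_step(signal_name, lookup_step, run=subprocess.run, env=os.environ):
    """Look up the current VASP step and send it `signal_name` via scancel.

    `lookup_step` is a hypothetical callable: job id -> step id or None.
    The step id is resolved on every call because it may change.
    """
    job_id = env["SLURM_JOB_ID"]
    step_id = lookup_step(job_id)
    if step_id is None:
        raise RuntimeError(f"no VASP step found for job {job_id}")
    cmd = ["scancel", "-s", signal_name, step_id]
    run(cmd, check=True)
    return cmd


def pause(lookup_step, **kwargs):
    """Stop the VASP step without releasing its CPUs (SIGTSTP)."""
    return signal_vasp_step("SIGTSTP", lookup_step, **kwargs)


def resume(lookup_step, **kwargs):
    """Continue a previously stopped VASP step (SIGCONT)."""
    return signal_vasp_step("SIGCONT", lookup_step, **kwargs)
```

Passing run and env as parameters keeps the sketch testable outside a Slurm allocation; in production the defaults (subprocess.run and os.environ) would be used.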

alchem0x2A added the bug and enhancement labels Sep 14, 2022

alchem0x2A commented

Closing, as all related PRs have been merged.
