Slurm Cluster Management Scripts

Companion scripts for day-to-day operations of a Slurm HPC cluster. They cover node recovery, health monitoring, filesystem repair, service restoration, rolling upgrades, and job throughput reporting.

Prerequisites

| Tool | Purpose |
|------|---------|
| `slurm` (`sinfo`, `scontrol`, `sacct`, `scancel`) | Node state queries and job management |
| `clush` | Parallel remote command execution across nodes |
| `nodeset` | Nodeset expansion (ClusterShell) |
| `sudo` | Most scripts require passwordless sudo on target nodes |
| `ssh` | Passwordless key-based access from the head node |

The target partition, cpubase_bycore_b1, is hardcoded throughout. Update the scripts if your partition name differs.

Scripts

`alloc_sinfo`

Quick per-node status table for the partition.

./alloc_sinfo

Displays: NodeName, Count, Partition, State, CPUs, S:C:T topology, Memory, TmpDisk, Weight, Features, Reason, CPU allocation (A/I/O/T).
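The columns listed above correspond to standard `sinfo` format specifiers. A minimal sketch of an equivalent query, assuming the script is a thin wrapper around `sinfo` (the function name `alloc_sinfo_sketch` and the `PARTITION` variable are illustrative and not taken from the script):

```shell
# Hypothetical re-creation of the status table; the real alloc_sinfo may differ.
PARTITION="${PARTITION:-cpubase_bycore_b1}"

alloc_sinfo_sketch() {
    # %N node name, %D node count, %P partition, %T state, %c CPUs,
    # %z sockets:cores:threads, %m memory, %d tmp disk, %w weight,
    # %f features, %E reason, %C CPUs as allocated/idle/other/total
    sinfo -N --partition "$PARTITION" \
        -o "%N %D %P %T %c %z %m %d %w %f %E %C"
}
```

Calling `alloc_sinfo_sketch` on the head node prints one row per node; set `PARTITION` first to target a different partition.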

`node_check.sh`

Self-draining health probe to run as a cron job on each compute node.

# Example crontab entry on compute nodes:
*/5 * * * * /path/to/node_check.sh >> /var/log/node_check.log 2>&1

Reads /project/testfile to verify that the shared filesystem is mounted. If the read fails, the node drains itself via scontrol with the reason "/project FS is gone", which prevents new jobs from being scheduled on it until the issue is resolved.
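The probe logic described above might look roughly like the following. This is a sketch under assumptions, not the contents of node_check.sh: the `PROBE_FILE`, `SCONTROL`, and `PROBE_TIMEOUT` variables and both function names are hypothetical, and the `timeout` wrapper is an addition meant to bound a read against a hung mount.

```shell
# Hypothetical sketch of a self-draining probe; the real node_check.sh may differ.
PROBE_FILE="${PROBE_FILE:-/project/testfile}"
SCONTROL="${SCONTROL:-scontrol}"       # assumption: set to "echo" for a dry run
PROBE_TIMEOUT="${PROBE_TIMEOUT:-10}"   # assumption: seconds allowed for the read

probe_shared_fs() {
    # Read a single byte; succeed only if the file is readable in time.
    timeout "$PROBE_TIMEOUT" head -c 1 "$PROBE_FILE" >/dev/null 2>&1
}

node_check() {
    if probe_shared_fs; then
        echo "ok: $PROBE_FILE is readable"
        return 0
    fi
    # Short hostname, assumed to match the Slurm NodeName.
    node=$(uname -n | cut -d. -f1)
    "$SCONTROL" update NodeName="$node" State=DRAIN Reason="/project FS is gone"
}
```

Running `SCONTROL=echo node_check` prints the `scontrol` arguments instead of draining, which is a convenient way to try the probe on a healthy node.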

`revive_drained_nodes.sh`

Interactive recovery of error-state nodes. Prompts before undraining.

./revive_drained_nodes.sh

Steps performed:

1. Lists error-state nodes.
2. Kills any running jobs on those nodes (sacct + scancel).
3. Reboots nodes via clush.
4. Polls SSH per-node until each is reachable.
5. Remounts /localscratch and cleans stale data.
6. Restarts munge, iptables, ip6tables, and puppet.
7. Asks for confirmation, then undrains (scontrol State=UNDRAIN).
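The SSH polling step in the list above can be sketched as a bounded retry loop. The function name `wait_for_ssh`, its argument order, and the default retry counts are assumptions made for illustration; the script's own loop may be structured differently.

```shell
# wait_for_ssh NODE [MAX_TRIES] [DELAY_SECONDS]
# Hypothetical helper: returns 0 once NODE accepts an SSH connection.
wait_for_ssh() {
    node=$1
    max_tries=${2:-60}
    delay=${3:-10}
    try=1
    while [ "$try" -le "$max_tries" ]; do
        # BatchMode avoids hanging on a password prompt if key auth fails.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true 2>/dev/null; then
            echo "$node is reachable"
            return 0
        fi
        try=$((try + 1))
        sleep "$delay"
    done
    echo "$node did not come back after $max_tries tries" >&2
    return 1
}
```

Capping the number of attempts keeps one dead node from blocking recovery of the others; a caller can skip nodes for which `wait_for_ssh` returns non-zero.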

`automated_undrain.sh`

Non-interactive version of revive_drained_nodes.sh: it undrains automatically without prompting. Suitable for cron or automation pipelines.

./automated_undrain.sh

Same steps as revive_drained_nodes.sh but skips all user prompts; exits cleanly with a message if no error-state nodes are found.
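The only behavioural difference between the two scripts is the prompt. One way to express that in a single helper is sketched below; `confirm` and the `ASSUME_YES` variable are hypothetical names and are not defined by either script.

```shell
# Hypothetical helper; ASSUME_YES is illustrative, not a variable the scripts define.
confirm() {
    if [ "${ASSUME_YES:-0}" = "1" ]; then
        return 0                   # non-interactive mode: never wait on stdin
    fi
    printf '%s [y/N] ' "$1" >&2
    read -r answer || return 1     # EOF (for example under cron) counts as "no"
    case "$answer" in
        y|Y|yes|YES) return 0 ;;
        *) return 1 ;;
    esac
}
```

With this shape, `confirm "Undrain nodes?" && undrain_step` prompts interactively, while `ASSUME_YES=1` makes the same line proceed unattended (`undrain_step` stands in for whatever performs the undrain).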

`select_with_regex_revive_drained_nodes.sh`

Like revive_drained_nodes.sh but accepts a custom state-pattern to target nodes, and uses clush for the post-reboot readiness check instead of per-node SSH polling.

./select_with_regex_revive_drained_nodes.sh [state-pattern]

| Argument | Default | Description |
|----------|---------|-------------|
| `state-pattern` | `error` | `grep` pattern matched against `sinfo` state column |

Examples:

./select_with_regex_revive_drained_nodes.sh           # targets 'error' nodes
./select_with_regex_revive_drained_nodes.sh drain     # targets nodes whose state matches 'drain' (e.g. drained, draining)
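Selecting nodes by state pattern can be sketched as a small filter over `sinfo` output. The function name `nodes_matching_state` is hypothetical, and this version matches with `awk` (extended regular expressions) rather than `grep`, so patterns with special characters may behave slightly differently from the script.

```shell
# Reads "<node> <state>" lines on stdin and prints the unique node names
# whose state matches the pattern (default: error). Illustrative only.
nodes_matching_state() {
    pattern=${1:-error}
    awk -v pat="$pattern" '$2 ~ pat { print $1 }' | sort -u
}

# Usage on the head node:
#   sinfo -hN --partition cpubase_bycore_b1 -o '%N %T' | nodes_matching_state drain
```

Matching against the state column only, instead of the whole line, avoids accidental hits on node names or reason strings that happen to contain the pattern.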

`upgrade_compute_nodes.sh`

Rolling parallel kernel patch for all compute nodes.

./upgrade_compute_nodes.sh

For every node currently in the idle, alloc, mix, drained, or draining state, the script runs the following steps, with one background job per node so that nodes are processed in parallel:

1. Drains the node (skips if already drained/draining).
2. Waits until fully drained.
3. Runs dnf upgrade --disablerepo=cernvm and reboots via clush.
4. Waits for the node to come back online.
5. Undrains the node.

The CVE reference in the drain reason (CVE-2026-31431 kernel patch) should be updated to match the actual patch being applied.
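The per-node background-job pattern used for the upgrade can be sketched as a generic driver that starts one job per node and then waits for all of them. `run_per_node` and the worker name in the usage comment are hypothetical; the actual script may inline this logic.

```shell
# run_per_node WORKER NODE...
# Starts WORKER NODE in the background for every node, waits for all of
# them, and reports how many failed. Illustrative sketch only.
run_per_node() {
    worker=$1
    shift
    pids=""
    for node in "$@"; do
        "$worker" "$node" &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed node(s) failed"
    [ "$failed" -eq 0 ]
}

# Usage, where upgrade_one_node is a placeholder for a function that
# performs the drain, upgrade, reboot, and undrain sequence for one node:
#   run_per_node upgrade_one_node node1 node2 node3
```

Collecting each job's exit status through `wait` makes it possible to report nodes that failed mid-upgrade instead of silently leaving them drained.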

`fix_localscratch.sh`

Fixes the /localscratch bind-mount on a single compute node. Run as root directly on the target node (e.g. after imaging or an fstab corruption).

sudo ./fix_localscratch.sh

Removes any existing ephemeral0 fstab entries and writes the canonical config:

/dev/vdb → /mnt/ephemeral0 (auto, nofail, cloud-init ordered)
/mnt/ephemeral0 → /localscratch (bind mount)
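A minimal sketch of the fstab rewrite follows. The exact mount options are assumptions: `x-systemd.requires=cloud-init.service` is one plausible reading of "cloud-init ordered", and the bind-mount options are a guess, so compare against the script before reusing this. The function name `rewrite_ephemeral_fstab` is hypothetical.

```shell
# rewrite_ephemeral_fstab [FSTAB]
# Drops every line mentioning ephemeral0, then appends two canonical entries.
# Option strings are assumptions, not copied from fix_localscratch.sh.
rewrite_ephemeral_fstab() {
    fstab=${1:-/etc/fstab}
    tmp=$(mktemp) || return 1
    sed '/ephemeral0/d' "$fstab" > "$tmp" || return 1
    cat >> "$tmp" <<'EOF'
/dev/vdb /mnt/ephemeral0 auto defaults,nofail,x-systemd.requires=cloud-init.service 0 2
/mnt/ephemeral0 /localscratch none bind,nofail 0 0
EOF
    # Overwrite in place so the file keeps its ownership and permissions.
    cat "$tmp" > "$fstab" && rm -f "$tmp"
}
```

Because both appended lines contain `ephemeral0`, running the function again removes and rewrites them, so repeated runs leave the file unchanged.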

`fix_netdev.sh`

Removes the _netdev option from the ephemeral0 fstab entry on a single node. Run as root on the target node.

sudo ./fix_netdev.sh

_netdev delays the mount until the network is available, which is unnecessary for a local block device and causes boot-time failures on some cloud images.
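Removing the option amounts to a targeted `sed` substitution on the ephemeral0 line. The sketch below is an assumption about how that could be done, with a hypothetical function name `strip_netdev`; it does not handle the edge case where `_netdev` is the only option on the line.

```shell
# strip_netdev [FSTAB]
# Removes _netdev (and its separating comma) from lines mentioning ephemeral0.
# Other fstab entries, such as NFS mounts that need _netdev, are left alone.
strip_netdev() {
    fstab=${1:-/etc/fstab}
    tmp=$(mktemp) || return 1
    sed -e '/ephemeral0/s/,_netdev//' \
        -e '/ephemeral0/s/_netdev,//' "$fstab" > "$tmp" || return 1
    cat "$tmp" > "$fstab" && rm -f "$tmp"
}
```

For example, an options field of `defaults,nofail,_netdev,comment=cloudconfig` becomes `defaults,nofail,comment=cloudconfig`.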

`fix_munge_and_iptable.sh`

Restores munge, iptables, ip6tables, and puppet on all nodes in the partition at once. Useful after mass reboots where these services failed to start due to missing runtime directories.

./fix_munge_and_iptable.sh

Creates /var/lock/subsys and /var/run/munge with correct ownership on every node, then starts/restarts the relevant services via clush.
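The remote command sent to every node could be assembled as in the sketch below. The `munge:munge` ownership and the use of `systemctl restart` are assumptions about what "correct ownership" and "starts/restarts" mean here, and `remote_fix_cmd` is a hypothetical name.

```shell
# Prints the command line to run on each node. Ownership and the service
# manager are assumptions, not copied from fix_munge_and_iptable.sh.
remote_fix_cmd() {
    printf '%s' \
        'sudo mkdir -p /var/lock/subsys /var/run/munge && ' \
        'sudo chown munge:munge /var/run/munge && ' \
        'sudo systemctl restart munge iptables ip6tables puppet'
}

# Usage from the head node (bash, because of the process substitution):
#   clush -v --machinefile <(sinfo -hN --partition cpubase_bycore_b1 -o '%N' | sort -u) "$(remote_fix_cmd)"
```

Chaining the steps with `&&` means the services are only restarted once the runtime directories exist.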

`get_all_mds_denied.sh`

Gathers and chronologically sorts mds0 reconnect denied messages from /var/log/messages across all nodes.

./get_all_mds_denied.sh

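These messages mean that a metadata server (MDS) refused a client's reconnect attempt. They are described here as Lustre MDS connection rejections, typically caused by clock skew or Kerberos/GSS credential failures; the same string is also logged by the Linux kernel CephFS client when an MDS closes a reconnecting session (for example after a client eviction), so confirm which filesystem is involved before diagnosing. Output is sorted by timestamp.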
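Chronological sorting of gathered log lines can be sketched with `sort` alone, assuming each line has the form `node: Mon DD HH:MM:SS ...` (the `clush` node prefix followed by a traditional syslog timestamp). The function name `sort_by_syslog_time` is hypothetical, and because syslog timestamps carry no year, ordering across a year boundary would be wrong.

```shell
# Sorts "node: Mon DD HH:MM:SS message" lines by month, day, then time.
# Field 1 is the node prefix added by clush; fields 2-4 are the timestamp.
sort_by_syslog_time() {
    LC_ALL=C sort -b -k2,2M -k3,3n -k4,4
}

# Usage (hypothetical pipeline on the head node):
#   clush -w node[1-4] "sudo grep 'mds0 reconnect denied' /var/log/messages" | sort_by_syslog_time
```

`LC_ALL=C` pins month-name parsing to English abbreviations regardless of the caller's locale.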

`completed_run.sh`

Reports pipeline sample throughput over rolling time windows.

./completed_run.sh

Counts sacct jobs named report_pcgr or linx_plot in the COMPLETED state over these windows:

- Last 24 hours and last 7 days
- 1, 2, and 3 weeks ago
- Last 30 days, and 1 and 2 months ago
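Counting completed jobs for one window can be sketched as a single `sacct` query. The function name `count_completed` and the window boundaries in the usage comment are assumptions; in particular, how the script defines "1 week ago" is not stated, so the second example is only one possible reading.

```shell
# count_completed START END
# Prints the number of COMPLETED allocations named report_pcgr or linx_plot
# between START and END (any time format accepted by sacct).
count_completed() {
    sacct -a -X -n --state=COMPLETED --name=report_pcgr,linx_plot \
        -S "$1" -E "$2" -o JobID |
        awk 'NF { n++ } END { print n + 0 }'
}

# Usage:
#   count_completed now-24hours now
#   count_completed now-14days now-7days
```

The `-X` flag restricts output to one line per job allocation, so job steps are not counted as separate samples.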

Common Patterns

Find error nodes manually:

sinfo -hlN --partition cpubase_bycore_b1 | grep error

Undrain a single node:

sudo scontrol update State=UNDRAIN NodeName=<nodename>

Run a command on all nodes:

clush -v --machinefile <(sinfo -hlN --partition cpubase_bycore_b1 | awk '{print $1}') "<command>"

Notes

- Scripts that operate on individual nodes (fix_localscratch.sh, fix_netdev.sh, node_check.sh) must be run on the target node itself unless otherwise noted.
- Scripts that operate cluster-wide are intended to run from the head/management node.
- The partition name cpubase_bycore_b1 and Slurm binary path /opt/software/slurm/bin/ are hardcoded. Adjust them to match your environment.
