Companion scripts for day-to-day operations of a Slurm HPC cluster. They cover node recovery, health monitoring, filesystem repair, service restoration, rolling upgrades, and job throughput reporting.

| Tool | Purpose |
|---|---|
| slurm (`sinfo`, `scontrol`, `sacct`, `scancel`) | Node state queries and job management |
| `clush` | Parallel remote command execution across nodes |
| `nodeset` | Nodeset expansion (ClusterShell) |
| `sudo` | Most scripts require passwordless sudo on target nodes |
| `ssh` | Passwordless key-based access from the head node |

The target partition, `cpubase_bycore_b1`, is hardcoded throughout. Update the scripts if your partition name differs.

Quick per-node status table for the partition.

    ./alloc_sinfo

Displays: NodeName, Count, Partition, State, CPUs, S:C:T topology, Memory, TmpDisk, Weight, Features, Reason, CPU allocation (A/I/O/T).

Self-draining health probe to run as a cron job on each compute node.

    # Example crontab entry on compute nodes:
    */5 * * * * /path/to/node_check.sh >> /var/log/node_check.log 2>&1

Reads `/project/testfile` to verify the shared filesystem is mounted. If the read fails, the node drains itself via `scontrol` with the reason `/project FS is gone`, preventing new jobs from being scheduled until the issue is resolved.

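
The probe logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual `node_check.sh`: the function name and the `PROBE_FILE` and `SCONTROL` overrides are hypothetical, and it assumes the Slurm node name equals the short hostname.

```shell
#!/bin/sh
# Hypothetical sketch of a self-draining probe; the real node_check.sh may differ.
# PROBE_FILE and SCONTROL are illustrative overrides, not options of the original script.
PROBE_FILE="${PROBE_FILE:-/project/testfile}"
SCONTROL="${SCONTROL:-scontrol}"

probe_and_drain() {
    # Assumption: the Slurm node name is the short hostname.
    node="$(uname -n)"
    node="${node%%.*}"

    # Reading one byte forces real I/O against the shared filesystem.
    if head -c 1 "$PROBE_FILE" >/dev/null 2>&1; then
        echo "ok: $PROBE_FILE is readable"
        return 0
    fi

    # Read failed: take this node out of scheduling with an explanatory reason.
    "$SCONTROL" update NodeName="$node" State=DRAIN Reason="/project FS is gone"
}
```

Calling `probe_and_drain` from the cron-driven script would reproduce the behaviour described: a healthy node only logs a line, while a node that cannot read the probe file drains itself.
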
Interactive recovery of error-state nodes. Prompts before undraining.

    ./revive_drained_nodes.sh

Steps performed:

- Lists error-state nodes.
- Kills any running jobs on those nodes (`sacct` + `scancel`).
- Reboots nodes via `clush`.
- Polls SSH per-node until each is reachable.
- Remounts `/localscratch` and cleans stale data.
- Restarts `munge`, `iptables`, `ip6tables`, and `puppet`.
- Asks for confirmation, then undrains (`scontrol State=UNDRAIN`).

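
The SSH polling step is the one most likely to need tuning, so here is a minimal sketch of how such a loop could look. The function name, retry defaults, and the `SSH_CMD` override are assumptions for illustration; the real script may poll differently.

```shell
#!/bin/sh
# Hypothetical sketch of per-node SSH polling; names and defaults are illustrative.
# SSH_CMD is intentionally expanded unquoted below so its options split into words.
SSH_CMD="${SSH_CMD:-ssh -o BatchMode=yes -o ConnectTimeout=5}"

wait_for_ssh() {
    host="$1"
    max_tries="${2:-60}"   # give up after this many attempts
    delay="${3:-10}"       # seconds to sleep after a failed attempt

    try=1
    while [ "$try" -le "$max_tries" ]; do
        # A trivial remote command succeeds only once sshd accepts logins again.
        if $SSH_CMD "$host" true >/dev/null 2>&1; then
            echo "$host reachable after $try attempt(s)"
            return 0
        fi
        try=$((try + 1))
        sleep "$delay"
    done

    echo "$host still unreachable after $max_tries attempt(s)" >&2
    return 1
}
```

`BatchMode=yes` keeps `ssh` from hanging on a password prompt, which matters when the loop runs unattended. A call such as `wait_for_ssh node042 30 10` would wait roughly five minutes before giving up (`node042` is a placeholder name).
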

Non-interactive version of `revive_drained_nodes.sh`: it undrains automatically without prompting. Suitable for cron or automation pipelines.

    ./automated_undrain.sh

Same steps as `revive_drained_nodes.sh` but skips all user prompts; exits cleanly with a message if no error-state nodes are found.


Like `revive_drained_nodes.sh`, but accepts a custom state-pattern to target nodes, and uses `clush` for the post-reboot readiness check instead of per-node SSH polling.

    ./select_with_regex_revive_drained_nodes.sh [state-pattern]

| Argument | Default | Description |
|---|---|---|
| `state-pattern` | `error` | grep pattern matched against `sinfo` state column |

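
One way to match a pattern against the state column only is sketched below. It is an illustration rather than the script's actual code: the function name is hypothetical, it uses `awk` (extended regex) instead of `grep`, and it assumes the `sinfo -hlN` column order NODELIST, NODES, PARTITION, STATE, so that the state is the fourth field.

```shell
#!/bin/sh
# Hypothetical sketch: pick node names whose state matches a pattern.
# Assumes the `sinfo -hlN` column order NODELIST NODES PARTITION STATE ...,
# i.e. the state is the fourth whitespace-separated field.
select_nodes_by_state() {
    pattern="${1:-error}"   # same default as the script's state-pattern argument
    # Match only field 4 so that "error" in the Reason column is ignored.
    awk -v pat="$pattern" '$4 ~ pat { print $1 }' | sort -u
}
```

For example, `sinfo -hlN --partition cpubase_bycore_b1 | select_nodes_by_state drain` would list each node in a `drain*` state once. Restricting the match to one field avoids picking up nodes whose Reason text merely contains the pattern.
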

Examples:

    ./select_with_regex_revive_drained_nodes.sh          # targets 'error' nodes
    ./select_with_regex_revive_drained_nodes.sh drain    # targets 'drain*' nodes

Rolling parallel kernel patch for all compute nodes.

    ./upgrade_compute_nodes.sh

For every node currently in `idle`, `alloc`, `mix`, `drained`, or `draining` state, the script (in parallel background jobs):

- Drains the node (skips if already drained/draining).
- Waits until fully drained.
- Runs `dnf upgrade --disablerepo=cernvm` and reboots via `clush`.
- Waits for the node to come back online.
- Undrains the node.

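
The "parallel background jobs" pattern can be sketched as below. This shows only the fan-out and result collection, under the assumption that the per-node work (drain, upgrade, reboot, undrain) lives in a separate worker function; both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of the fan-out pattern; upgrade_compute_nodes.sh may differ.
# `worker` is any command or shell function that handles a single node.
run_on_nodes_in_parallel() {
    worker="$1"
    shift

    pids=""
    for node in "$@"; do
        "$worker" "$node" &      # one background job per node
        pids="$pids $!"
    done

    failed=0
    for pid in $pids; do
        # wait returns the exit status of that specific background job.
        wait "$pid" || failed=$((failed + 1))
    done

    echo "$failed of $# node(s) failed"
    [ "$failed" -eq 0 ]
}
```

Waiting on each PID individually, rather than using a bare `wait`, lets the caller see how many nodes failed instead of losing the exit statuses.
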
The CVE reference in the drain reason (CVE-2026-31431 kernel patch) should be updated to match the actual patch being applied.
Fixes the /localscratch bind-mount on a single compute node. Run as root directly on the target node (e.g. after imaging or an fstab corruption).

    sudo ./fix_localscratch.sh

Removes any existing `ephemeral0` fstab entries and writes the canonical config:

- `/dev/vdb` → `/mnt/ephemeral0` (auto, nofail, cloud-init ordered)
- `/mnt/ephemeral0` → `/localscratch` (bind mount)

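
A rough sketch of that rewrite follows. The exact mount options are an assumption (in particular, `x-systemd.requires=cloud-init.service` is one plausible way to express "cloud-init ordered"); check them against `fix_localscratch.sh` before relying on this. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch; the mount options below are assumptions,
# not copied from fix_localscratch.sh.
rewrite_localscratch_fstab() {
    fstab="${1:-/etc/fstab}"
    tmp="$(mktemp)" || return 1

    # Drop every existing line that mentions ephemeral0, keep the rest untouched.
    sed '/ephemeral0/d' "$fstab" > "$tmp" || { rm -f "$tmp"; return 1; }

    # Append the two canonical entries: the block device, then the bind mount.
    cat >> "$tmp" <<'EOF'
/dev/vdb /mnt/ephemeral0 auto defaults,nofail,x-systemd.requires=cloud-init.service 0 2
/mnt/ephemeral0 /localscratch none bind 0 0
EOF

    # Overwrite in place so the original file keeps its owner and mode.
    cat "$tmp" > "$fstab"
    rm -f "$tmp"
}
```

Passing a scratch copy first, for example `rewrite_localscratch_fstab /tmp/fstab.test`, is a cheap way to inspect the result before touching `/etc/fstab`.
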
Removes the _netdev option from the ephemeral0 fstab entry on a single node. Run as root on the target node.

    sudo ./fix_netdev.sh

`_netdev` delays the mount until the network is available, which is unnecessary for a local block device and causes boot-time failures on some cloud images.

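
Removing a single option from a comma-separated fstab field is easy to get wrong with a plain substitution, so here is a hedged sketch using `awk`. It is not the script's actual implementation, and the function name is hypothetical. Note that touched lines are rewritten with single spaces between fields.

```shell
#!/bin/sh
# Hypothetical sketch; fix_netdev.sh may edit the file differently (e.g. with sed).
strip_netdev() {
    fstab="${1:-/etc/fstab}"
    tmp="$(mktemp)" || return 1

    awk '
        # Only touch non-comment fstab entries that mention ephemeral0.
        /ephemeral0/ && $1 !~ /^#/ && NF >= 4 {
            n = split($4, opts, ",")
            kept = ""
            for (i = 1; i <= n; i++)
                if (opts[i] != "_netdev")
                    kept = (kept == "" ? opts[i] : kept "," opts[i])
            # An empty option field is invalid, so fall back to "defaults".
            $4 = (kept == "" ? "defaults" : kept)
        }
        { print }
    ' "$fstab" > "$tmp" || { rm -f "$tmp"; return 1; }

    # Overwrite in place so the original file keeps its owner and mode.
    cat "$tmp" > "$fstab"
    rm -f "$tmp"
}
```

Entries that do not mention `ephemeral0`, such as genuine network mounts, keep their `_netdev` option.
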
Restores munge, iptables, ip6tables, and puppet on all nodes in the partition at once. Useful after mass reboots where these services failed to start due to missing runtime directories.

    ./fix_munge_and_iptable.sh

Creates `/var/lock/subsys` and `/var/run/munge` with correct ownership on every node, then starts/restarts the relevant services via `clush`.

Gathers and chronologically sorts mds0 reconnect denied messages from /var/log/messages across all nodes.

    ./get_all_mds_denied.sh

These messages indicate Lustre MDS connection rejections, typically caused by clock skew or Kerberos/GSS credential failures. Output is sorted by timestamp.

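
Chronological sorting across nodes could look like the sketch below. It assumes each gathered line carries a `clush`-style `node: ` prefix followed by a classic syslog timestamp (`May  3 12:00:05`), and that `sort` supports the month key `-M` (GNU coreutils does). The real script may parse timestamps differently, and the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch; assumes clush-style "node: " prefixes and classic
# syslog timestamps ("May  3 12:00:05"), which carry no year.
sort_denied_by_time() {
    # Field 1 is the node prefix; fields 2-4 are month, day, and HH:MM:SS.
    # LC_ALL=C makes sort -M recognise the English month abbreviations.
    grep 'reconnect denied' | LC_ALL=C sort -k2,2M -k3,3n -k4b,4
}
```

Because these timestamps have no year, logs that span a year boundary will not sort correctly with this approach.
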
Reports pipeline sample throughput over rolling time windows.

    ./completed_run.sh

Counts `sacct` jobs named `report_pcgr` or `linx_plot` in `COMPLETED` state for:

- Last 24 hours, last 7 days
- 1 / 2 / 3 weeks ago
- Last 30 days, 1 / 2 months ago
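
A single window count could be sketched as follows. The `sacct` options shown are standard, but whether `completed_run.sh` uses exactly these is an assumption, as are the function name and the `SACCT` override.

```shell
#!/bin/sh
# Hypothetical sketch; completed_run.sh may use different sacct options.
# SACCT is an illustrative override so the function can be pointed at a stub.
SACCT="${SACCT:-sacct}"

count_completed() {
    start="$1"   # e.g. "now-24hours" or "2025-01-01"
    end="$2"     # e.g. "now"
    # -X: one row per job allocation (no steps), -a: all users, -n: no header.
    $SACCT -X -a -n --name=report_pcgr,linx_plot --state=COMPLETED \
        --starttime="$start" --endtime="$end" --format=JobID |
        awk 'NF { n++ } END { print n + 0 }'
}
```

With this helper, `count_completed now-24hours now` would cover the last 24 hours and `count_completed now-2weeks now-1weeks` the window "1 week ago". How `sacct` combines `--state` with a time window has some subtleties, so spot-check the counts against known runs.
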

Find error nodes manually:

    sinfo -hlN --partition cpubase_bycore_b1 | grep error

Undrain a single node:

    sudo scontrol update State=UNDRAIN NodeName=<nodename>

Run a command on all nodes:

    clush -v --machinefile <(sinfo -hlN --partition cpubase_bycore_b1 | awk '{print $1}') "<command>"

- Scripts that operate on individual nodes (`fix_localscratch.sh`, `fix_netdev.sh`, `node_check.sh`) must be run on the target node itself unless otherwise noted.
- Scripts that operate cluster-wide are intended to run from the head/management node.
- The partition name `cpubase_bycore_b1` and Slurm binary path `/opt/software/slurm/bin/` are hardcoded. Adjust them to match your environment.