c3g/slurm_cluster_management

Slurm Cluster Management Scripts

Companion scripts for day-to-day operations of a Slurm HPC cluster. They cover node recovery, health monitoring, filesystem repair, service restoration, rolling upgrades, and job throughput reporting.

Prerequisites

Tool                                     Purpose
slurm (sinfo, scontrol, sacct, scancel)  Node state queries and job management
clush                                    Parallel remote command execution across nodes
nodeset                                  Nodeset expansion (ClusterShell)
sudo                                     Most scripts require passwordless sudo on target nodes
ssh                                      Passwordless key-based access from the head node

The target partition is cpubase_bycore_b1 — hardcoded throughout. Update the scripts if your partition name differs.


Scripts

alloc_sinfo

Quick per-node status table for the partition.

./alloc_sinfo

Displays: NodeName, Count, Partition, State, CPUs, S:C:T topology, Memory, TmpDisk, Weight, Features, Reason, CPU allocation (A/I/O/T).


node_check.sh

Self-draining health probe to run as a cron job on each compute node.

# Example crontab entry on compute nodes:
*/5 * * * * /path/to/node_check.sh >> /var/log/node_check.log 2>&1

Reads /project/testfile to verify that the shared filesystem is mounted. If the read fails, the node drains itself via scontrol with the reason "/project FS is gone", which prevents new jobs from being scheduled there until the issue is resolved.
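
The probe logic can be sketched roughly as follows. This is not the actual source of node_check.sh: the function names, the SCONTROL override variable, and the use of hostname -s to obtain the node name are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of a self-draining health probe (not the real node_check.sh).
# SCONTROL can be overridden (for example SCONTROL=echo) to get a dry run.
SCONTROL="${SCONTROL:-/opt/software/slurm/bin/scontrol}"

# Succeeds only if the probe file can actually be read.
fs_is_readable() {
    cat "$1" >/dev/null 2>&1
}

# Drains the named node with the given reason unless the probe file is readable.
probe_and_drain() {
    probe_file="$1"; node="$2"; reason="$3"
    if fs_is_readable "$probe_file"; then
        return 0
    fi
    "$SCONTROL" update NodeName="$node" State=DRAIN Reason="$reason"
    return 1
}

# Example invocation (commented out so the sketch has no side effects):
# probe_and_drain /project/testfile "$(hostname -s)" "/project FS is gone"
```

A hung mount would block the read indefinitely, so wrapping the probe in timeout may be worth considering when it runs from cron.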


revive_drained_nodes.sh

Interactive recovery of error-state nodes. Prompts before undraining.

./revive_drained_nodes.sh

Steps performed:

  1. Lists error-state nodes.
  2. Kills any running jobs on those nodes (sacct + scancel).
  3. Reboots nodes via clush.
  4. Polls SSH per-node until each is reachable.
  5. Remounts /localscratch and cleans stale data.
  6. Restarts munge, iptables, ip6tables, and puppet.
  7. Asks for confirmation, then undrains (scontrol update State=UNDRAIN).
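
Step 4 (polling SSH until a rebooted node is reachable) could look something like the sketch below. The helper names wait_until and ssh_ready, the attempt count, and the sleep interval are all hypothetical; the real script may poll differently.

```shell
# Hypothetical sketch: poll a probe command until it succeeds or attempts run out.
# Usage: wait_until <max_attempts> <sleep_seconds> <command> [args...]
wait_until() {
    attempts="$1"; delay="$2"; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Probe for one node: a non-interactive SSH login with a short connect timeout.
ssh_ready() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Example (commented out): wait up to 60 attempts, 10 s apart, for node1.
# wait_until 60 10 ssh_ready node1
```

BatchMode=yes keeps ssh from prompting for a password, so a node without working key-based access fails fast instead of hanging the loop.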

automated_undrain.sh

Non-interactive version of revive_drained_nodes.sh — undrains automatically without prompting. Suitable for cron or automation pipelines.

./automated_undrain.sh

Same steps as revive_drained_nodes.sh but skips all user prompts; exits cleanly with a message if no error-state nodes are found.


select_with_regex_revive_drained_nodes.sh

Like revive_drained_nodes.sh but accepts a custom state-pattern to target nodes, and uses clush for the post-reboot readiness check instead of per-node SSH polling.

./select_with_regex_revive_drained_nodes.sh [state-pattern]

Argument       Default  Description
state-pattern  error    grep pattern matched against the sinfo state column

Examples:

./select_with_regex_revive_drained_nodes.sh           # targets 'error' nodes
./select_with_regex_revive_drained_nodes.sh drain     # targets 'drain*' nodes
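
As an illustration of the node selection, the sketch below picks node names by state from sinfo -hlN style output. It uses awk to restrict the match to the fourth (state) column instead of grep, which is a substitution made here for clarity; the function name nodes_matching_state is hypothetical, and awk patterns are extended regular expressions rather than grep's basic ones.

```shell
# Hypothetical sketch: print the names of nodes whose state matches a pattern.
# Reads `sinfo -hlN` style lines on stdin, where column 1 is the node name
# and column 4 is the state. The pattern defaults to "error".
nodes_matching_state() {
    awk -v pat="${1:-error}" '$4 ~ pat { print $1 }' | sort -u
}

# Example (commented out so the sketch runs without Slurm installed):
# sinfo -hlN --partition cpubase_bycore_b1 | nodes_matching_state "${1:-error}"
```

Matching on the state column only avoids picking up a node whose Reason text happens to contain the pattern.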

upgrade_compute_nodes.sh

Drains, patches, reboots, and undrains every compute node, handling the nodes in parallel.

./upgrade_compute_nodes.sh

For every node currently in the idle, alloc, mix, drained, or draining state, the script runs the following steps, with one background job per node:

  1. Drains the node (skips if already drained/draining).
  2. Waits until fully drained.
  3. Runs dnf upgrade --disablerepo=cernvm and reboots via clush.
  4. Waits for the node to come back online.
  5. Undrains the node.
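
The per-node fan-out can be sketched as below. Here patch_node is a placeholder standing in for the five steps above, and run_in_parallel is a hypothetical name; neither is taken from upgrade_compute_nodes.sh.

```shell
# Hypothetical sketch of the fan-out: one background job per node, then wait.
# patch_node is a placeholder for the drain/upgrade/reboot/undrain sequence.
patch_node() {
    echo "patching $1"
}

# Starts patch_node for every argument in the background and waits for all of
# them. Returns non-zero if any per-node job failed.
run_in_parallel() {
    pids=""
    for node in "$@"; do
        patch_node "$node" &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    [ "$failed" -eq 0 ]
}

# Example (commented out):
# run_in_parallel node1 node2 node3
```

Waiting on each PID individually, rather than calling a bare wait, makes it possible to notice that a particular node's job failed.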

The CVE reference in the drain reason (CVE-2026-31431 kernel patch) should be updated to match the actual patch being applied.


fix_localscratch.sh

Fixes the /localscratch bind-mount on a single compute node. Run as root directly on the target node (e.g. after imaging or an fstab corruption).

sudo ./fix_localscratch.sh

Removes any existing ephemeral0 fstab entries and writes the canonical config:

  • /dev/vdb → /mnt/ephemeral0 (auto, nofail, cloud-init ordered)
  • /mnt/ephemeral0 → /localscratch (bind mount)
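
A minimal sketch of that rewrite is shown below, operating on whichever fstab-style file is passed in. The exact mount options (in particular x-systemd.requires=cloud-init.service for the cloud-init ordering, and the dump/pass fields) are assumptions and are not copied from fix_localscratch.sh; the function name is hypothetical as well.

```shell
# Hypothetical sketch: replace the ephemeral0 entries in an fstab-style file.
# The mount options below are assumptions, not the script's actual values.
write_ephemeral0_entries() {
    fstab="$1"
    tmp="$(mktemp)"
    # Drop every existing line that mentions ephemeral0.
    grep -v 'ephemeral0' "$fstab" > "$tmp" || true
    # Local disk first, then the bind mount that exposes it as /localscratch.
    cat >> "$tmp" <<'EOF'
/dev/vdb /mnt/ephemeral0 auto defaults,nofail,x-systemd.requires=cloud-init.service 0 2
/mnt/ephemeral0 /localscratch none bind 0 0
EOF
    # Overwrite in place so the original file keeps its owner and mode.
    cat "$tmp" > "$fstab"
    rm -f "$tmp"
}

# Example (commented out; try it on a copy before touching /etc/fstab):
# write_ephemeral0_entries /etc/fstab
```

Because old entries are removed before the new ones are appended, running the function twice leaves the file unchanged.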

fix_netdev.sh

Removes the _netdev option from the ephemeral0 fstab entry on a single node. Run as root on the target node.

sudo ./fix_netdev.sh

_netdev delays the mount until the network is available, which is unnecessary for a local block device and causes boot-time failures on some cloud images.
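
One way to express the edit is the sed sketch below. It is an illustration under assumptions, not the contents of fix_netdev.sh: the function name strip_netdev is hypothetical, and the sketch only handles _netdev appearing alongside other options in the comma-separated list.

```shell
# Hypothetical sketch: remove the _netdev option from ephemeral0 lines only.
# Lines that do not mention ephemeral0 (for example network mounts) are left alone.
strip_netdev() {
    fstab="$1"
    tmp="$(mktemp)"
    sed -e '/ephemeral0/ s/,_netdev//' \
        -e '/ephemeral0/ s/_netdev,//' "$fstab" > "$tmp"
    # Overwrite in place so the original file keeps its owner and mode.
    cat "$tmp" > "$fstab"
    rm -f "$tmp"
}

# Example (commented out; try it on a copy before touching /etc/fstab):
# strip_netdev /etc/fstab
```

If _netdev were the only option on the line, neither expression would match and the entry would be left as is.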


fix_munge_and_iptable.sh

Restores munge, iptables, ip6tables, and puppet on all nodes in the partition at once. Useful after mass reboots where these services failed to start due to missing runtime directories.

./fix_munge_and_iptable.sh

Creates /var/lock/subsys and /var/run/munge with correct ownership on every node, then starts/restarts the relevant services via clush.


get_all_mds_denied.sh

Gathers and chronologically sorts "mds0 reconnect denied" messages from /var/log/messages across all nodes.

./get_all_mds_denied.sh

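The Linux kernel CephFS client logs this message when a metadata server (MDS) refuses a client's attempt to reconnect its session, typically after the client has been evicted or has missed the reconnect window following an MDS restart. Output is sorted by timestamp.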
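
The sorting step might be sketched as follows. It assumes classic syslog timestamps (month, day, time, no year) and clush's "node: " prefix on every line, so that the month, day, and time land in fields 2, 3, and 4; the function name is hypothetical, and the -M month sort is a GNU/BSD sort feature.

```shell
# Hypothetical sketch: sort "node: Mon DD HH:MM:SS ..." lines chronologically.
# Field 2 is the month name, field 3 the day, field 4 the time of day.
sort_by_syslog_time() {
    LC_ALL=C sort -k2,2M -k3,3n -k4b,4
}

# Example (commented out so the sketch runs without clush installed):
# clush -v --machinefile <(sinfo -hlN --partition cpubase_bycore_b1 | awk '{print $1}') \
#     "sudo grep 'mds0 reconnect denied' /var/log/messages" | sort_by_syslog_time
```

Since these timestamps carry no year, entries that span a year boundary would sort incorrectly with this approach.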


completed_run.sh

Reports pipeline sample throughput over rolling time windows.

./completed_run.sh

Counts sacct jobs named report_pcgr or linx_plot in COMPLETED state for:

  • Last 24 hours, last 7 days
  • 1 / 2 / 3 weeks ago
  • Last 30 days, 1 / 2 months ago
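
Counting for a single window could be sketched like this. The function name count_completed, the choice of sacct fields, and the decision to match job names by prefix (so that names such as report_pcgr.<sample> are included) are assumptions; the real script may match names differently.

```shell
# Hypothetical sketch: count completed pipeline jobs in one time window.
# Reads `sacct -X -n -P --format=JobName,State` style lines ("name|state") on stdin.
count_completed() {
    awk -F'|' '$1 ~ /^(report_pcgr|linx_plot)/ && $2 == "COMPLETED" { n++ }
               END { print n + 0 }'
}

# Example (commented out; GNU date is assumed for the relative timestamp):
# start=$(date -d '7 days ago' +%Y-%m-%dT%H:%M:%S)
# sacct -a -X -n -P --format=JobName,State --starttime "$start" --endtime now | count_completed
```

Using -P gives pipe-delimited, untruncated fields, which keeps long job names from being cut off before they are matched.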

Common Patterns

Find error nodes manually:

sinfo -hlN --partition cpubase_bycore_b1 | grep error

Undrain a single node:

sudo scontrol update State=UNDRAIN NodeName=<nodename>

Run a command on all nodes:

clush -v --machinefile <(sinfo -hlN --partition cpubase_bycore_b1 | awk '{print $1}') "<command>"

Notes

  • Scripts that operate on individual nodes (fix_localscratch.sh, fix_netdev.sh, node_check.sh) must be run on the target node itself unless otherwise noted.
  • Scripts that operate cluster-wide are intended to run from the head/management node.
  • The partition name cpubase_bycore_b1 and Slurm binary path /opt/software/slurm/bin/ are hardcoded. Adjust them to match your environment.
