Fix slurmd fails to autostart by systemd on boot #191

gengwg · 2023-02-01T08:03:13Z

slurmd is not started automatically by systemd on CentOS 8 machines after a reboot; the service has to be started manually.

A transient network problem (possibly DNS) during boot seems to cause slurmd to fail on its first start, and systemd does not try to restart it afterwards. This change adds a retry (Restart=on-failure with RestartSec=5) so that slurmd can start once the transient error has cleared.

Tests

Before:

$ systemctl cat slurmd
# /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity
[Install]
WantedBy=multi-user.target

# reboot and slurmd fails to start

$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2023-01-31 19:17:19 PST; 37s ago
  Process: 3877 ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 3877 (code=exited, status=1/FAILURE)
$ tail /var/log/slurmd.log
[2023-01-31T19:17:18.974] cred/munge: init: Munge credential signature plugin loaded
[2023-01-31T19:17:18.975] slurmd version 20.11.9 started
[2023-01-31T19:17:18.976] debug:  jobacct_gather/linux: init: Job accounting gather LINUX plugin loaded
[2023-01-31T19:17:18.977] debug:  job_container/none: init: job_container none plugin loaded
[2023-01-31T19:17:18.980] debug:  switch/none: init: switch NONE plugin loaded
[2023-01-31T19:17:19.102] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2023-01-31T19:17:19.102] error: slurm_set_addr: Unable to resolve "(null)"
[2023-01-31T19:17:19.102] error: slurm_set_port: attempting to set port without address family
[2023-01-31T19:17:19.104] error: Error creating slurm stream socket: Address family not supported by protocol
[2023-01-31T19:17:19.104] error: Unable to bind listen port (6818): Address family not supported by protocol

After:

$ cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#After=munge.service network-online.target remote-fs.target
#ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target

# reboot and slurmd is able to start

$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2023-01-31 19:06:06 PST; 1min 10s ago
 Main PID: 5982 (slurmd)
   Memory: 2.5M
   CGroup: /system.slice/slurmd.service
           └─5982 /usr/sbin/slurmd -D

gengwg · 2023-02-01T08:04:35Z

Adding a delay before the start also works, e.g.:

ExecStartPre=/bin/sleep 30

However, I think the fix above is better.
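
For reference, the same retry can be applied without editing the packaged unit under /usr/lib/systemd/system, which a package update may overwrite. Below is a minimal sketch as a systemd drop-in. The path and file name (override.conf) are an assumption and not part of this patch. The commented-out lines are an untested alternative, based on the network-online.target line left commented out in the unit above.

```ini
# /etc/systemd/system/slurmd.service.d/override.conf
# (assumed location; systemd reads any *.conf file in this directory)

[Unit]
# Optional and untested here: order slurmd after the network is reported up.
# network-online.target is only activated if some unit Wants= it, and it
# relies on a wait-online service being enabled on the machine.
#Wants=network-online.target
#After=network-online.target

[Service]
# Retry after a failed start, e.g. when name resolution is not ready yet.
Restart=on-failure
# Wait 5 seconds between attempts.
RestartSec=5
```

After creating the file, run systemctl daemon-reload; systemctl cat slurmd should then list the drop-in below the packaged unit.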

wickberg · 2023-02-01T15:58:00Z

Hi -

As noted in CONTRIBUTING.md, we do not accept Pull Requests through GitHub at this time. Please submit patches as attachments to new bugs to https://bugs.schedmd.com/ under the "C - Contributions" severity level.

Thanks!

wickberg closed this Feb 1, 2023
