Fix slurmd fails to autostart by systemd on boot #191

Closed

@gengwg gengwg commented Feb 1, 2023

On CentOS 8 machines, systemd fails to start slurmd automatically after a reboot; the service has to be started manually.

It seems that a transient problem during boot, network-related or otherwise (DNS?), causes slurmd to fail at startup. Because the unit has no restart policy, systemd does not try to start it again after the failure. This change adds a retry (Restart=on-failure with RestartSec=5) so that slurmd can start once the transient error has cleared.

Tests

Before:

$ systemctl cat  slurmd
# /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity
[Install]
WantedBy=multi-user.target

# reboot and slurmd fails to start

$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2023-01-31 19:17:19 PST; 37s ago
  Process: 3877 ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 3877 (code=exited, status=1/FAILURE)
$ tail /var/log/slurmd.log
[2023-01-31T19:17:18.974] cred/munge: init: Munge credential signature plugin loaded
[2023-01-31T19:17:18.975] slurmd version 20.11.9 started
[2023-01-31T19:17:18.976] debug:  jobacct_gather/linux: init: Job accounting gather LINUX plugin loaded
[2023-01-31T19:17:18.977] debug:  job_container/none: init: job_container none plugin loaded
[2023-01-31T19:17:18.980] debug:  switch/none: init: switch NONE plugin loaded
[2023-01-31T19:17:19.102] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2023-01-31T19:17:19.102] error: slurm_set_addr: Unable to resolve "(null)"
[2023-01-31T19:17:19.102] error: slurm_set_port: attempting to set port without address family
[2023-01-31T19:17:19.104] error: Error creating slurm stream socket: Address family not supported by protocol
[2023-01-31T19:17:19.104] error: Unable to bind listen port (6818): Address family not supported by protocol

After:

$ cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#After=munge.service network-online.target remote-fs.target
#ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target

# reboot and slurmd is able to start

$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2023-01-31 19:06:06 PST; 1min 10s ago
 Main PID: 5982 (slurmd)
   Memory: 2.5M
   CGroup: /system.slice/slurmd.service
           └─5982 /usr/sbin/slurmd -D
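
For sites that want the same retry behaviour without editing the packaged unit under /usr/lib/systemd/system (which a package update may overwrite), the two added settings can also live in a drop-in. This is a minimal sketch, not part of this PR: it assumes the conventional drop-in directory for the unit, and the file name restart.conf is hypothetical.

```ini
# /etc/systemd/system/slurmd.service.d/restart.conf  (hypothetical file name)
[Service]
# Restart slurmd when it exits uncleanly, e.g. the status=1/FAILURE
# shown in the "Before" output above.
Restart=on-failure
# Wait 5 seconds between a failed start and the next attempt.
RestartSec=5
```

After creating the file, `systemctl daemon-reload` makes systemd pick it up, and `systemctl cat slurmd` should then list the drop-in below the packaged unit.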

gengwg (Author) commented Feb 1, 2023

Adding a fixed startup delay also works, e.g.:

ExecStartPre=/bin/sleep 30

However, I think the fix above is better.
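
The commented-out After= line in the "After" unit hints at a third option: ordering slurmd after network-online.target instead of sleeping for a fixed time. The fragment below is a sketch that was not tested in this PR. It assumes a wait-online service (such as NetworkManager-wait-online.service) is enabled on the node, and the drop-in file name is hypothetical.

```ini
# /etc/systemd/system/slurmd.service.d/network-online.conf  (hypothetical file name)
[Unit]
# Pull network-online.target into the boot transaction...
Wants=network-online.target
# ...and do not start slurmd until that target has been reached.
After=network-online.target
```

After= on its own only orders units, so without Wants= the target might never be pulled in. Even with both, reaching "online" does not guarantee that DNS already resolves, so this would likely complement Restart=on-failure rather than replace it.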

wickberg (Member) commented Feb 1, 2023

Hi -

As noted in CONTRIBUTING.md, we do not accept Pull Requests through GitHub at this time. Please submit patches as attachments to new bugs to https://bugs.schedmd.com/ under the "C - Contributions" severity level.

Thanks!

@wickberg wickberg closed this Feb 1, 2023