Some pods are stuck in init on one of our clusters #487

Open · Alwinator opened this issue Feb 9, 2023 · 18 comments

@Alwinator commented Feb 9, 2023

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node? No, I am running Red Hat Enterprise Linux 8.6 (Ootpa)
  • Are you running Kubernetes v1.13+? Yes, I am running OpenShift 4.11.35 with Kubernetes 1.23
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? Yes, CRI-O
  • GPU Operator version: 22.9.2

2. Issue or feature description

On one of our clusters, many NVIDIA pods are stuck in init. I checked the logs and could not find anything suspicious. Are there other logs that would tell me more?

I suspect the problem appeared after the migration to OpenShift 4.11.

3. Steps to reproduce the issue

Since this is a production cluster used by hundreds of customers, it is quite hard to find a reliable way to reproduce; however, these are the closest reproduction steps:

  1. Set up OpenShift version 4.11.20
  2. Install NFD Operator 4.11.0-202212070335
  3. Install GPU Operator version 22.9.2
  4. Use NVIDIA A30 GPUs

4. Information to attach

Logs of one of the GPU Feature Discovery pods stuck in init (gpu-feature-discovery-ddw67):
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
waiting for nvidia container stack to be setup
...
Logs of one of the MIG Manager pods stuck in init (nvidia-mig-manager-b4675):
waiting for nvidia container toolkit to be setup
waiting for nvidia container toolkit to be setup
waiting for nvidia container toolkit to be setup
waiting for nvidia container toolkit to be setup
waiting for nvidia container toolkit to be setup
...
Logs of one of the Toolkit DaemonSet pods stuck in init (nvidia-container-toolkit-daemonset-664n2):
time="2023-02-07T13:08:04Z" level=info msg="Driver is not pre-installed on the host. Checking driver container status."
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
...
Logs of one of the Driver DaemonSet pods stuck in init (nvidia-driver-daemonset-411.86.202212072103-0-8k4vr):
Running nv-ctr-run-with-dtk
...
WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.
WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path.
...
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 4.18.0-372.36.1.el8_6.x86_64
...
Starting NVIDIA persistence daemon...
ls: cannot access '/proc/driver/nvidia-nvswitch/devices/*': No such file or directory
Mounting NVIDIA driver rootfs...
Change device files security context for selinux compatibility
Done, now waiting for signal

oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS             RESTARTS        AGE
gpu-feature-discovery-2x45b                           1/1     Running            0               7d8h
gpu-feature-discovery-ddw67                           0/1     Init:0/1           0               42h
gpu-feature-discovery-hxcpm                           1/1     Running            0               7d8h
gpu-operator-66c69d4d8b-7ll7f                         1/1     Running            0               42h
nvidia-container-toolkit-daemonset-62b62              1/1     Running            0               7d19h
nvidia-container-toolkit-daemonset-664n2              0/1     Init:0/1           0               42h
nvidia-container-toolkit-daemonset-hwbpw              0/1     Init:0/1           0               42h
nvidia-cuda-validator-9j7mq                           0/1     Completed          0               7d19h
nvidia-cuda-validator-czqfk                           0/1     Completed          0               7d19h
nvidia-dcgm-exporter-4rnxn                            1/1     Running            0               7d8h
nvidia-dcgm-exporter-79mqk                            0/1     Init:0/2           0               42h
nvidia-dcgm-exporter-cv6rg                            0/1     CrashLoopBackOff   495 (32s ago)   42h
nvidia-dcgm-jlznx                                     0/1     Init:0/1           0               42h
nvidia-dcgm-klpt5                                     1/1     Running            0               7d8h
nvidia-dcgm-pjsqb                                     0/1     CrashLoopBackOff   503 (25s ago)   42h
nvidia-device-plugin-daemonset-g42hg                  1/1     Running            0               7d19h
nvidia-device-plugin-daemonset-jsg8j                  0/1     Init:0/1           0               42h
nvidia-device-plugin-daemonset-rkhgx                  0/1     Init:0/1           0               42h
nvidia-device-plugin-validator-f5p4w                  0/1     Completed          0               7d19h
nvidia-device-plugin-validator-tmszz                  0/1     Completed          0               7d19h
nvidia-driver-daemonset-411.86.202212072103-0-8k4vr   2/2     Running            2               7d17h
nvidia-driver-daemonset-411.86.202212072103-0-9jdnc   2/2     Running            0               7d19h
nvidia-driver-daemonset-411.86.202212072103-0-cvtc4   2/2     Running            0               7d19h
nvidia-mig-manager-6qdfb                              1/1     Running            0               7d8h
nvidia-mig-manager-b4675                              0/1     Init:0/1           0               42h
nvidia-mig-manager-glb7z                              0/1     Init:0/1           0               42h
nvidia-node-status-exporter-dvfdg                     1/1     Running            0               7d8h
nvidia-node-status-exporter-jl9x5                     1/1     Running            2               7d8h
nvidia-node-status-exporter-jmbvp                     1/1     Running            0               7d8h
nvidia-operator-validator-cghvm                       1/1     Running            0               7d19h
nvidia-operator-validator-gq5g2                       0/1     Init:0/4           0               42h
nvidia-operator-validator-lwd4c                       0/1     Init:0/4           0               42h
oc get ds -n nvidia-gpu-operator
NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                         AGE
gpu-feature-discovery                           3         3         2       3            2           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                      107d
nvidia-container-toolkit-daemonset              3         3         1       2            1           nvidia.com/gpu.deploy.container-toolkit=true                                                                          107d
nvidia-dcgm                                     3         3         1       3            1           nvidia.com/gpu.deploy.dcgm=true                                                                                       107d
nvidia-dcgm-exporter                            3         3         1       3            1           nvidia.com/gpu.deploy.dcgm-exporter=true                                                                              107d
nvidia-device-plugin-daemonset                  3         3         1       2            1           nvidia.com/gpu.deploy.device-plugin=true                                                                              107d
nvidia-driver-daemonset-411.86.202212072103-0   3         3         3       0            3           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202212072103-0,nvidia.com/gpu.deploy.driver=true   7d21h
nvidia-mig-manager                              3         3         1       3            1           nvidia.com/gpu.deploy.mig-manager=true                                                                                107d
nvidia-node-status-exporter                     3         3         3       3            3           nvidia.com/gpu.deploy.node-status-exporter=true                                                                       107d
nvidia-operator-validator                       3         3         1       2            1           nvidia.com/gpu.deploy.operator-validator=true                                                                         107d
NVIDIA shared directory: `ls -la /run/nvidia`
total 4
drwxr-xr-x.  4 root root  100 Feb  7 13:09 .
drwxr-xr-x. 48 root root 1260 Feb  7 13:07 ..
dr-xr-xr-x.  1 root root  103 Feb  7 13:08 driver
-rw-r--r--.  1 root root    6 Feb  7 13:09 nvidia-driver.pid
drwxr-xr-x.  2 root root   40 Feb  7 13:07 validations
NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
total 12916
drwxr-xr-x. 3 root root    4096 Feb  1 13:45 .
drwxr-xr-x. 3 root root      21 Feb  1 13:45 ..
drwxr-xr-x. 3 root root      38 Feb  1 13:45 .config
lrwxrwxrwx. 1 root root      32 Feb  1 13:45 libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.11.0
-rwxr-xr-x. 1 root root 2959400 Feb  1 13:45 libnvidia-container-go.so.1.11.0
lrwxrwxrwx. 1 root root      29 Feb  1 13:45 libnvidia-container.so.1 -> libnvidia-container.so.1.11.0
-rwxr-xr-x. 1 root root  191784 Feb  1 13:45 libnvidia-container.so.1.11.0
-rwxr-xr-x. 1 root root     154 Feb  1 13:45 nvidia-container-cli
-rwxr-xr-x. 1 root root   48072 Feb  1 13:45 nvidia-container-cli.real
-rwxr-xr-x. 1 root root     342 Feb  1 13:45 nvidia-container-runtime
-rwxr-xr-x. 1 root root     414 Feb  1 13:45 nvidia-container-runtime-experimental
-rwxr-xr-x. 1 root root     203 Feb  1 13:45 nvidia-container-runtime-hook
-rwxr-xr-x. 1 root root 2142816 Feb  1 13:45 nvidia-container-runtime-hook.real
-rwxr-xr-x. 1 root root 3771792 Feb  1 13:45 nvidia-container-runtime.experimental
-rwxr-xr-x. 1 root root 4079768 Feb  1 13:45 nvidia-container-runtime.real
lrwxrwxrwx. 1 root root      29 Feb  1 13:45 nvidia-container-toolkit -> nvidia-container-runtime-hook
NVIDIA driver directory: `ls -la /run/nvidia/driver`
total 0
dr-xr-xr-x.    1 root root  103 Feb  7 13:08 .
drwxr-xr-x.    4 root root  100 Feb  7 13:09 ..
lrwxrwxrwx.    1 root root    7 Jun 21  2021 bin -> usr/bin
dr-xr-xr-x.    2 root root    6 Jun 21  2021 boot
drwxr-xr-x.   16 root root 3100 Feb  7 13:09 dev
drwxr-xr-x.    1 root root   43 Feb  7 13:08 drivers
drwxr-xr-x.    1 root root   68 Feb  7 13:09 etc
drwxr-xr-x.    2 root root    6 Jun 21  2021 home
drwxr-xr-x.    2 root root   24 Feb  7 13:08 host-etc
lrwxrwxrwx.    1 root root    7 Jun 21  2021 lib -> usr/lib
lrwxrwxrwx.    1 root root    9 Jun 21  2021 lib64 -> usr/lib64
drwxr-xr-x.    2 root root   38 Dec  6 19:28 licenses
drwx------.    2 root root    6 Oct 19 04:46 lost+found
drwxr-xr-x.    2 root root    6 Jun 21  2021 media
drwxr-xr-x.    1 root root   42 Feb  7 13:08 mnt
drwxr-xr-x.    2 root root    6 Jun 21  2021 opt
dr-xr-xr-x. 2895 root root    0 Feb  7 13:06 proc
dr-xr-x---.    3 root root  213 Oct 19 04:57 root
drwxr-xr-x.    1 root root  136 Feb  7 13:09 run
lrwxrwxrwx.    1 root root    8 Jun 21  2021 sbin -> usr/sbin
drwxr-xr-x.    2 root root    6 Jun 21  2021 srv
dr-xr-xr-x.   13 root root    0 Feb  7 13:07 sys
drwxrwxrwx.    1 root root   18 Feb  7 13:09 tmp
drwxr-xr-x.    1 root root   65 Oct 19 04:47 usr
drwxr-xr-x.    1 root root   30 Oct 19 04:47 var
@shivamerla (Contributor)

@Alwinator From the driver pod logs you posted, it looks like the driver install was successful. Can you exec into that container and run nvidia-smi?

oc exec -n nvidia-gpu-operator nvidia-driver-daemonset-411.86.202212072103-0-8k4vr -- nvidia-smi

@Alwinator (Author)

@shivamerla

Thu Feb  9 13:25:22 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          On   | 00000000:21:00.0 Off |                   On |
| N/A   31C    P0    27W / 165W |      0MiB / 24576MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A30          On   | 00000000:81:00.0 Off |                   On |
| N/A   31C    P0    29W / 165W |      0MiB / 24576MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A30          On   | 00000000:E2:00.0 Off |                   On |
| N/A   31C    P0    29W / 165W |      0MiB / 24576MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@shivamerla (Contributor) commented Feb 12, 2023

@Alwinator If nvidia-smi is successful, then the driver DaemonSet will create the file /run/nvidia/validations/.driver-ctr-ready from its startup probe here. Is it possible to double-check whether this status file got created on the worker node but the toolkit pod is not seeing it for some reason? If that is the case, you can try restarting the container-toolkit pods so that the driver readiness checks pass.
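
For reference, a rough way to check this from a client machine (a sketch; the node name is a placeholder and the label selector is an assumption, so verify it against your deployment):

# Check whether the status file exists on the worker node (node name is a placeholder)
oc debug node/<gpu-node-name> -- chroot /host ls -la /run/nvidia/validations/

# If the file is there but the toolkit pod keeps waiting, restart the toolkit pods
# (label selector is an assumption; confirm with `oc get pods --show-labels -n nvidia-gpu-operator`)
oc delete pod -n nvidia-gpu-operator -l app=nvidia-container-toolkit-daemonset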

@Alwinator (Author)

@shivamerla The file does not exist. There is not even an nvidia directory inside /run.

# cat /run/nvidia/validations/.driver-ctr-ready
cat: /run/nvidia/validations/.driver-ctr-ready: No such file or directory
# cd /run
# ls    
blkid  console  cryptsetup  faillock  lock  log  rhsm  secrets  sepermit  setrans  systemd  user

@shivamerla (Contributor)

From the logs attached to this issue earlier, it looks like the driver directory is mounted. Maybe the driver container restarted for some reason while you were checking and unmounted the /run/nvidia directory?

ls -la /run/nvidia/driver

total 0
dr-xr-xr-x.    1 root root  103 Feb  7 13:08 .
drwxr-xr-x.    4 root root  100 Feb  7 13:09 ..
lrwxrwxrwx.    1 root root    7 Jun 21  2021 bin -> usr/bin
dr-xr-xr-x.    2 root root    6 Jun 21  2021 boot
drwxr-xr-x.   16 root root 3100 Feb  7 13:09 dev
drwxr-xr-x.    1 root root   43 Feb  7 13:08 drivers
drwxr-xr-x.    1 root root   68 Feb  7 13:09 etc
drwxr-xr-x.    2 root root    6 Jun 21  2021 home
drwxr-xr-x.    2 root root   24 Feb  7 13:08 host-etc
lrwxrwxrwx.    1 root root    7 Jun 21  2021 lib -> usr/lib
lrwxrwxrwx.    1 root root    9 Jun 21  2021 lib64 -> usr/lib64
drwxr-xr-x.    2 root root   38 Dec  6 19:28 licenses
drwx------.    2 root root    6 Oct 19 04:46 lost+found
drwxr-xr-x.    2 root root    6 Jun 21  2021 media
drwxr-xr-x.    1 root root   42 Feb  7 13:08 mnt
drwxr-xr-x.    2 root root    6 Jun 21  2021 opt
dr-xr-xr-x. 2895 root root    0 Feb  7 13:06 proc
dr-xr-x---.    3 root root  213 Oct 19 04:57 root
drwxr-xr-x.    1 root root  136 Feb  7 13:09 run
lrwxrwxrwx.    1 root root    8 Jun 21  2021 sbin -> usr/sbin
drwxr-xr-x.    2 root root    6 Jun 21  2021 srv
dr-xr-xr-x.   13 root root    0 Feb  7 13:07 sys
drwxrwxrwx.    1 root root   18 Feb  7 13:09 tmp
drwxr-xr-x.    1 root root   65 Oct 19 04:47 usr
drwxr-xr-x.    1 root root   30 Oct 19 04:47 var

Is the driver container constantly restarting?
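
One way to check (a sketch; the pod name is taken from the listing earlier in this issue):

oc get pods -n nvidia-gpu-operator | grep nvidia-driver-daemonset
oc describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-411.86.202212072103-0-8k4vr | grep -A5 'Last State'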

@Alwinator (Author)

I have executed ls -la /run/nvidia/driver on all GPU nodes of this cluster, and every node shows that this directory does not exist. Additionally, the driver pod has not restarted for a week. No pods are restarting except nvidia-dcgm, which is in CrashLoopBackOff because many other NVIDIA pods are stuck in the init state.

@warroyo commented May 1, 2023

I am seeing this exact same issue, except that it is not on OpenShift.

Additionally, I don't see the startupProbe set on the pods.

@likku123

+1

1 similar comment
@dgabrysch

+1

@FanKang2021

I have encountered the same problem. Have you solved it?

@likku123

Did a recent Linux kernel update happen on the node, or did the node restart?
If yes, please check the kernel version and its compatibility with the GPU Operator version you are trying to install; a quick check is sketched below.
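
(A sketch, assuming the chart was installed into the gpu-operator namespace; the driver pod name is a placeholder.)

# Kernel version on the GPU node
uname -r

# Compare with the kernel version the driver container reports during installation
# (pod name is a placeholder; list pods with `kubectl get pods -n gpu-operator`)
kubectl logs -n gpu-operator <nvidia-driver-daemonset-pod> | grep -i 'kernel version'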

@FanKang2021

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set toolkit.enabled=false

I read on the official website that you can deploy this way without a driver, but my physical machine does not have the NVIDIA driver installed, and an error is reported. The nvidia-operator-validator pod logs:

running command chroot with args [/run/nvidia/driver nvidia-smi]
chroot: failed to run command 'nvidia-smi': No such file or directory
command failed, retrying after 5 seconds

Is my understanding wrong? Must the driver be installed on the host?

@likku123 commented Apr 15, 2024

I believe you are trying to install the latest gpu-operator. Can you please provide the output of this command: uname -sra

Also, you should install with toolkit.enabled=true. I believe your node does not already have the toolkit installed.
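
For example, something like this (a sketch based on the install command above; toolkit.enabled defaults to true, so the flag is only spelled out for clarity):

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set toolkit.enabled=true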

@FanKang2021

> I believe you are trying to install the latest gpu-operator. Can you please provide the output of this command: uname -sra
>
> Also, you should install with toolkit.enabled=true. I believe your node does not already have the toolkit installed.

Thanks, I have solved it.

@sunwuyan commented Apr 25, 2024

> I believe you are trying to install the latest gpu-operator. Can you please provide the output of this command: uname -sra
>
> Also, you should install with toolkit.enabled=true. I believe your node does not already have the toolkit installed.

Hi, I have manually installed the NVIDIA GPU driver on the workstation. When I helm-installed the latest gpu-operator (v23.9.2), I set driver.enabled to false and left toolkit.enabled at its default of true. The installation succeeded and the GPU test pod also ran successfully, but after restarting the GPU node I ran into a similar problem.
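
(An install along those lines would look roughly like this; a sketch, with the chart version and namespace as assumptions.)

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version v23.9.2 \
  --set driver.enabled=false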

@umesh211

Hi all, we have also encountered a similar issue, where nvidia-gpu-operator pods were stuck in the init state during an OpenShift cluster upgrade. Please find the details below:

1. OpenShift version: 4.15.23
2. nvidia-gpu-operator version: 24.6.0

We were getting the errors below from the nvidia-gpu-operator pods:

time="2024-08-02T06:36:18Z" level=info msg="Driver is not pre-installed on the host. Checking driver container status."
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds


Also, the GPU operator was producing the following error log:

2024-08-02T08:00:24.341532262Z E0802 08:00:24.341521 1 reflector.go:150] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:106: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:nvidia-gpu-operator:gpu-operator" cannot list resource "configmaps" in API grou

Everything was working fine before the upgrade, but we started seeing these errors during the OpenShift cluster upgrade.

We added a ClusterRole and ClusterRoleBinding for the gpu-operator service account and that worked, but we are not sure about the root cause of the issue or what the actual, proper resolution for this kind of issue would be. Does anyone have any idea?
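
For reference, the workaround amounted to something along these lines (a sketch; the role and binding names are made up, and the verbs/resources should match whatever the operator logs complain about):

# Grant the operator service account permission to read ConfigMaps
# (names are illustrative; the service account comes from the error message above)
oc create clusterrole gpu-operator-configmaps --verb=get,list,watch --resource=configmaps
oc create clusterrolebinding gpu-operator-configmaps \
  --clusterrole=gpu-operator-configmaps \
  --serviceaccount=nvidia-gpu-operator:gpu-operator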

@tariq1890 (Contributor) commented Aug 12, 2024

> 2024-08-02T08:00:24.341532262Z E0802 08:00:24.341521 1 reflector.go:150] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:106: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:nvidia-gpu-operator:gpu-operator" cannot list resource "configmaps" in API grou

@umesh211 We are working on publishing v24.6.1, which will have the fix for this particular issue.

@study-eq-eat-drink commented Aug 14, 2024

I encountered the same issue and resolved it as follows.

Cause

First, the .driver-ctr-ready file is created by the startupProbe of the NVIDIA GPU driver Pod. (Reference)

In my case, the following error message was found in the init container of the NVIDIA GPU driver Pod:

...(skip)
Could not unload NVIDIA driver kernel modules, driver is in use
Auto drain of the node ml-dev-01 is disabled by the upgrade policy
Failed to uninstall nvidia driver components
Auto eviction of GPU pods on node ml-dev-01 is disabled by the upgrade policy
Auto drain of the node ml-dev-01 is disabled by the upgrade policy
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/ml-dev-01 labeled

These messages indicated that the node targeted for the upgrade could not be drained, because auto drain and auto eviction of GPU pods were disabled by the upgrade policy.

To address this issue, the following steps were taken:

Solution

The following settings were added to the values of the gpu-operator Helm chart:

driver:
  upgradePolicy:
    autoUpgrade: true
    drain:
      enable: true
      force: true
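
The same values can also be applied to an existing release with --set flags, for example (a sketch; the release name and namespace are placeholders):

helm upgrade <release-name> nvidia/gpu-operator -n gpu-operator \
  --reuse-values \
  --set driver.upgradePolicy.autoUpgrade=true \
  --set driver.upgradePolicy.drain.enable=true \
  --set driver.upgradePolicy.drain.force=true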
