Some pods are stuck in init on one of our clusters #487
Comments
@Alwinator From the driver pod logs you posted, it looks like the driver install is successful. Can you exec into that container and run `nvidia-smi`?
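For reference, a minimal sketch of that check on OpenShift (the namespace and pod name are taken from the logs attached below; the actual pod name will differ per node):

```sh
# Open a shell in the driver DaemonSet pod and confirm the driver is loaded.
# Look up the actual pod name first with: oc get pods -n nvidia-gpu-operator
oc exec -it -n nvidia-gpu-operator \
  nvidia-driver-daemonset-411.86.202212072103-0-8k4vr -- nvidia-smi
```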
@Alwinator If
@shivamerla The file does not exist. There is not even an Nvidia folder in the run folder.
From the logs attached to this issue earlier, it looks like the driver directory is mounted. Maybe the driver container restarted for some reason while you were checking and unmounted the /run/nvidia directory?
Is the driver container constantly restarting?
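A quick way to check for restarts (a sketch; the `app=nvidia-driver-daemonset` label and the namespace are assumptions based on a default operator deployment):

```sh
# A growing RESTARTS count with a recent AGE suggests the driver container is crash-looping.
oc get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset -o wide
# Pod events show why it restarted (probe failures, OOM kills, etc.).
oc describe pod -n nvidia-gpu-operator -l app=nvidia-driver-daemonset
```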
I have executed the
I am seeing this exact same issue, except that this is not on OpenShift. Additionally, I don't see the startupProbe set on the pods.
+1
+1
I have encountered the same problem. Have you solved it?
Was there any recent Linux kernel update on the node? Or did the node restart?
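Both can be checked quickly from the node itself, for example (a sketch; `<node-name>` is a placeholder for the affected GPU node):

```sh
# Compare the running kernel version and the node uptime to spot a recent
# kernel update or reboot.
oc debug node/<node-name> -- chroot /host sh -c 'uname -r && uptime'
```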
`helm install --wait --generate-name`
I believe you are trying to install the latest gpu-operator. Can you please provide the output of this command? Also, you should install with `toolkit.enabled=true`; I believe your node does not already have the toolkit installed.
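For illustration only, such an install could look roughly like this (the chart repository and namespace are assumptions, not the exact command used in this thread):

```sh
# Install the GPU operator and let it manage the container toolkit.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set toolkit.enabled=true
```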
Thanks, I have solved it.
Hi, I have manually installed the NVIDIA GPU driver on the workstation. When I installed the latest gpu-operator (v23.9.2) with Helm, I set driver.enabled to false and left toolkit.enabled at its default of true. After the installation succeeded, the pod for testing the GPU also ran fine, but when I restarted the GPU node I encountered a similar problem.
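For context, with a driver pre-installed on the host the chart is typically installed with only the driver container disabled, roughly like this (a sketch, not the poster's exact command):

```sh
# Driver is pre-installed on the host; the operator manages everything else.
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false
```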
Hi all, we have also encountered a similar issue where the nvidia-gpu-operator pods were stuck in the init state during an OpenShift cluster upgrade. Details below:
1. OpenShift version: 4.15.23
We were getting the following error from nvidia-gpu-operator: `command failed, retrying after 5 seconds`
Everything was working fine before the upgrade, but we received the above error message during the OpenShift cluster upgrade. We added a ClusterRole and ClusterRoleBinding for nvidia-gpu-operator and that worked somehow, but we are not sure about the root cause of the issue or what the actual resolution for this kind of issue should be. Does anyone have any idea about this issue?
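The exact RBAC rules added are not shown in the report; a hypothetical reconstruction of that kind of workaround (the role name, verbs, and the `gpu-operator` service account are assumptions) might look like:

```sh
# Hypothetical workaround sketch: grant the operator's service account extra
# node permissions. The rules actually added in this report were not posted.
oc create clusterrole nvidia-gpu-operator-workaround \
  --verb=get,list,watch,update,patch --resource=nodes
oc create clusterrolebinding nvidia-gpu-operator-workaround \
  --clusterrole=nvidia-gpu-operator-workaround \
  --serviceaccount=nvidia-gpu-operator:gpu-operator
```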
@umesh211 We are working on publishing v24.6.1, which will have the fix for this particular issue.
I encountered the same issue and resolved it as follows.
Cause
In my case, the following error message was found in the init container of the NVIDIA GPU driver Pod:
This message indicated that the Pod on the node targeted for the upgrade could not be drained. To address this issue, the following steps were taken:
Solution
The following settings were added to the
1. Quick Debug Checklist
1. Issue or feature description
On one of our clusters, many NVIDIA pods are stuck in init. I checked the logs and could not find anything suspicious. Maybe there are other logs that would tell more?
I suspect the problem appeared after the migration to OpenShift 4.11.
2. Steps to reproduce the issue
Since this is a production cluster influenced by hundreds of customers, it is quite hard to find a way to reproduce; however, here are the possible ways to reproduce:
3. Information to attach
Logs of one of the GPU Feature Discovery pods stuck in init (gpu-feature-discovery-ddw67)
Logs of one of the MIG Manager pods stuck in init (nvidia-mig-manager-b4675)
Logs of one of the Toolkit DaemonSet pods stuck in init (nvidia-container-toolkit-daemonset-664n2)
Logs of one of the Driver DaemonSet pods stuck in init (nvidia-driver-daemonset-411.86.202212072103-0-8k4vr)
`oc get pods -n nvidia-gpu-operator`
`oc get ds -n nvidia-gpu-operator`
NVIDIA shared directory: `ls -la /run/nvidia`
NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
NVIDIA driver directory: `ls -la /run/nvidia/driver`
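For anyone reproducing this report, the information above can be collected roughly as follows (a sketch; pod and node names are placeholders and will differ per cluster):

```sh
# Pod and DaemonSet state in the operator namespace.
oc get pods -n nvidia-gpu-operator
oc get ds -n nvidia-gpu-operator

# Logs and events of a pod stuck in init (pod name is a placeholder).
oc logs -n nvidia-gpu-operator gpu-feature-discovery-ddw67 --all-containers=true
oc describe pod -n nvidia-gpu-operator gpu-feature-discovery-ddw67

# Driver and toolkit directories on the affected node.
oc debug node/<node-name> -- chroot /host sh -c \
  'ls -la /run/nvidia; ls -la /usr/local/nvidia/toolkit; ls -la /run/nvidia/driver'
```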