docker: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error" #857
Comments
@henryli001 As called out in #48, one other option is to use the workaround described there. Note that using CDI to request devices should also address this problem, as the cgroups are then updated by the container runtime itself. Which Docker version are you using, and how do you typically launch containers?
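For reference, a minimal sketch of the CDI flow being suggested here, assuming a recent nvidia-container-toolkit and a Docker version with CDI support (which may need to be enabled in /etc/docker/daemon.json first); the spec path is the conventional one and may differ on your system:

    # Generate a CDI specification describing the NVIDIA devices on the host
    sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

    # Confirm which device names the spec exposes
    nvidia-ctk cdi list

    # Request the devices by CDI name instead of relying on the legacy injection path
    docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi

Because the devices are declared in the container's OCI spec, a later systemctl daemon-reload should no longer strip the device cgroup rules.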
I am experiencing a similar issue on Kubernetes. I have followed #48 to reproduce the issue, but the workarounds are not effective. In particular, I tried adding the udev rule and changing the containerd configuration (details below, with a sketch of the documented workaround after them). Still, when I launch the nvidia-smi loop pod, the GPUs become inaccessible as soon as I run systemctl daemon-reload on the node.
[Versions, the GPU Operator deployment and NVIDIA device plugin configurations, the udev rule created under the udev rules directory (with the commands used to reload and trigger it, plus their logs), and the containerd configuration were attached as collapsed details.]
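For comparison, here is a sketch of the udev-rule workaround documented around #48, assuming nvidia-ctk is installed at /usr/bin and that 71-nvidia-dev-char.rules is used as the rule name; exact paths may differ from the commenter's setup:

    # /lib/udev/rules.d/71-nvidia-dev-char.rules
    # Recreate the /dev/char symlinks for the NVIDIA device nodes when the driver is bound
    ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"

    # The same command can be run once by hand to verify it works on the node
    sudo nvidia-ctk system create-dev-char-symlinks --create-all

If the /dev/char symlinks are present and GPU access is still lost after systemctl daemon-reload, the cgroup driver used by containerd/runc (systemd vs. cgroupfs) is usually the next thing to check, since the known issue is tied to systemd-managed device cgroups.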
Add code to the Function Executor health check that verifies whether the NVIDIA drivers are working. This allows detecting whether a Function Executor is affected by the known issue NVIDIA/nvidia-container-toolkit#857. Verified the health check manually on a GPU-enabled machine.
It turns out that nvidia-smi doesn't detect the known issue NVIDIA/nvidia-container-toolkit#857, so in addition I'm adding a simple PyTorch CUDA computation to the health check, which does detect it. Also refactored the code to extract the health check handler into a separate class, because it is quite complex now.
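A minimal sketch of such a check, assuming PyTorch with CUDA support is available inside the container; this is an illustration of the approach, not the PR's actual code:

    # Force a real CUDA allocation and computation; per the PR description,
    # nvidia-smi alone does not reliably catch this failure mode.
    python3 -c "import torch; x = torch.ones(8, device='cuda'); assert float((x + x).sum()) == 16.0" \
      && echo healthy || echo unhealthy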
Running a standalone Docker container in an Azure Linux 2.0 VM with the NVIDIA Container Toolkit installed will lose access to the GPU and throw the error "Failed to initialize NVML: Unknown Error" after the container has been running for a while. The symptom is similar to the one described in the known issue #48 and can be reproduced by running systemctl daemon-reload.

The issue does not show up if I explicitly pass --device= for each NVIDIA device node in my system on the docker command line. However, this is not a sustainable solution, as the number of NVIDIA device nodes in the system may change depending on the configuration. Is there a better way to let the container automatically access all the NVIDIA devices without explicitly setting --device= for each device node?
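For context, a hedged sketch of the failure mode and the explicit --device workaround described above; the image name and device paths are illustrative and will vary by system:

    # Start a GPU container the usual way
    docker run -d --name gputest --gpus all ubuntu sleep infinity
    docker exec gputest nvidia-smi        # works initially

    # After a systemd reload on the host, device cgroup access can be lost
    sudo systemctl daemon-reload
    docker exec gputest nvidia-smi        # Failed to initialize NVML: Unknown Error

    # Explicitly passing every device node keeps access, but must be kept in sync by hand
    docker run -d --name gputest2 --gpus all \
      --device /dev/nvidia0 --device /dev/nvidiactl \
      --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
      ubuntu sleep infinity

The CDI-based invocation sketched earlier in the thread avoids enumerating the device nodes by hand.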