docker: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error" #857

Open
henryli001 opened this issue Jan 11, 2025 · 2 comments

A standalone Docker container running in an Azure Linux 2.0 VM with the NVIDIA Container Toolkit installed loses access to the GPUs and throws the error "Failed to initialize NVML: Unknown Error" after running for a while. The symptom is similar to the one described in known issue #48 and can be reproduced by running systemctl daemon-reload.

The issue does not show up if I explicitly set --device= for each NVIDIA device node on the system in the docker command. However, this is not a sustainable solution, since the number of NVIDIA device nodes on the system can change with the configuration. Is there a better way to let the container automatically access all NVIDIA devices without explicitly setting --device= for each device node?
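
For reference, a minimal sketch of the explicit-device workaround described above (the enumeration loop and the CUDA image tag are illustrative, not part of the original report):

# Enumerate whatever NVIDIA device nodes exist on the host right now and pass
# each one explicitly; the set of nodes varies with the host configuration.
DEVICE_FLAGS=""
for dev in /dev/nvidia*; do
    [ -c "$dev" ] && DEVICE_FLAGS+=" --device=${dev}"
done
docker run --rm --runtime=nvidia --gpus all ${DEVICE_FLAGS} \
    nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi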

elezar commented Jan 14, 2025

@henryli001 as called out in #48, one other option is to use cgroupfs as the cgroup driver instead of systemd.
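
A sketch of switching Docker to the cgroupfs driver (assumes Docker reads /etc/docker/daemon.json; merge by hand if a daemon.json already exists on the host):

# Tell Docker to use the cgroupfs cgroup driver instead of systemd, then restart.
# NOTE: this overwrites any existing daemon.json.
sudo tee /etc/docker/daemon.json > /dev/null << 'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
EOF
sudo systemctl restart docker
docker info | grep -i 'cgroup driver'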

Note that using CDI to request devices should also address this problem, as the cgroups are updated by runc (or another low-level runtime) instead of the nvidia-container-runtime-hook.
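
A minimal sketch of the CDI flow (assumes a Docker version with CDI support enabled and the default spec directory /etc/cdi):

# Generate a CDI specification for the NVIDIA devices present on this host.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the device names the spec exposes (e.g. nvidia.com/gpu=0, nvidia.com/gpu=all).
nvidia-ctk cdi list

# Request all GPUs via CDI; the cgroup device rules are then written by the
# low-level runtime rather than by the nvidia-container-runtime-hook.
docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi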

Which Docker version are you using, and how do you typically launch containers?

elezar self-assigned this Jan 14, 2025

santurini commented Jan 17, 2025

I am experiencing a similar issue on Kubernetes. I followed #48 to reproduce the issue, but the workarounds are not effective: in particular, I tried adding the udev rule and changing the containerd configuration.

As a result, when I launch the nvidia-smi loop pod, the command fails as soon as I run systemctl daemon-reload with:

Failed to initialize NVML: Unknown Error
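
(The nvidia-smi loop pod is roughly along these lines; a representative sketch modeled on the reproducer in #48, with an illustrative image and GPU resource request.)

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-loop
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["bash", "-c", "while true; do nvidia-smi; sleep 5; done"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF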

Versions

  • GPU Operator: v24.9.1
  • NVIDIA device plugin (with MPS configuration): 0.17.0
  • containerd: 1.7.12
  • nvidia-ctk: 1.17.3

Configurations

GPU Operator deployment:

helm upgrade \
    gpu-operator \
    nvidia/gpu-operator \
    -n gpu-operator \
    --set devicePlugin.enabled=false \
    --set gfd.enabled=false

Nvidia device plugin:

cat << EOF > dp-mps-config.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy:
    - envvar
    deviceIDStrategy: uuid
sharing:
  mps:
    renameByDefault: false
    resources:
    - name: nvidia.com/gpu
      replicas: 3
EOF
kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \
    --from-file=config=dp-mps-config.yaml
helm upgrade nvdp -i nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --set runtimeClassName=nvidia \
    --set config.name=nvidia-plugin-configs \
    --set nvidiaDriverRoot=/ \
    --set gfd.enabled=true

Udev rule

Created the following rule under /lib/udev/rules.d/71-nvidia-dev-char.rules:

# This will create /dev/char symlinks to all device nodes
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system 	create-dev-char-symlinks --create-all"

Reloaded and triggered the rule:

udevadm trigger
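
(A fuller reload-and-verify sequence would be along these lines; the rule reload and the verification step are illustrative additions, not part of the original report.)

# Reload the udev rules, re-trigger them, and check whether the /dev/char
# symlinks for the NVIDIA device nodes were created.
sudo udevadm control --reload-rules
sudo udevadm trigger
ls -l /dev/char | grep nvidia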

Logs:

Jan 17 15:30:29 node-4b snapd[1149]: udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data

And if I run nvidia-ctk system create-dev-char-symlinks --create-all directly:

time="2025-01-17T15:33:05Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"
time="2025-01-17T15:33:05Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"
time="2025-01-17T15:33:05Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"
time="2025-01-17T15:33:05Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"
time="2025-01-17T15:33:06Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"
time="2025-01-17T15:33:06Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"
time="2025-01-17T15:33:06Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"
time="2025-01-17T15:33:06Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"

Containerd configuration

My containerd configuration is as follows:

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = false

Whereas if I run crictl info, I see:

      "runtimes": {
        "nvidia": {
          "runtimeType": "io.containerd.runc.v2",
          "runtimePath": "",
          "runtimeEngine": "",
          "PodAnnotations": null,
          "ContainerAnnotations": null,
          "runtimeRoot": "",
          "options": {
            "BinaryName": "/usr/local/nvidia/toolkit/nvidia-container-runtime",
            "SystemdCgroup": true
          },
          "privileged_without_host_devices": false,
          "privileged_without_host_devices_all_devices_allowed": false,
          "baseRuntimeSpec": "",
          "cniConfDir": "",
          "cniMaxConfNum": 0,
          "snapshotter": "",
          "sandboxMode": "podsandbox"
        },
        "runc": {
          "runtimeType": "io.containerd.runc.v2",
          "runtimePath": "",
          "runtimeEngine": "",
          "PodAnnotations": null,
          "ContainerAnnotations": null,
          "runtimeRoot": "",
          "options": {
            "SystemdCgroup": true
          },

eabatalov added a commit to tensorlakeai/tensorlake that referenced this issue Feb 23, 2025
Add code to the Function Executor health check that verifies whether the
NVIDIA drivers are working. This allows detecting whether a Function
Executor is affected by the known issue
NVIDIA/nvidia-container-toolkit#857

Verified the health check manually on a GPU-enabled machine.
eabatalov added a commit to tensorlakeai/tensorlake that referenced this issue Feb 27, 2025
It turns out that nvidia-smi doesn't detect the known issue
NVIDIA/nvidia-container-toolkit#857

So in addition to that, I'm adding a simple PyTorch CUDA computation
as a health check, which is known to detect the issue.

Also refactored the code to extract the health check handler into
a separate class, because it is quite complex now.
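
(For reference, a minimal sketch of such a CUDA-exercising health check, assuming PyTorch with CUDA support is installed in the container; this is not the tensorlake implementation.)

# Run a tiny CUDA computation; the command exits non-zero if the GPU is unusable,
# even in cases where nvidia-smi alone does not surface the problem.
python3 -c "import torch; x = torch.rand(1024, device='cuda'); print(float(x.sum()))"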