
docker: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error" #857


Description

@henryli001

A standalone Docker container running in an Azure Linux 2.0 VM with the NVIDIA Container Toolkit installed loses access to the GPU after running for a while and throws the error: "Failed to initialize NVML: Unknown Error". The symptom is similar to the one described in known issue #48 and can be reproduced by running systemctl daemon-reload.

The issue does not occur if I explicitly pass --device= for each NVIDIA device node on my system in the docker command. However, this is not a sustainable solution, because the number of NVIDIA device nodes on the system can change with the configuration. Is there a better way to let the container automatically access all NVIDIA devices without explicitly setting --device= for each device node?
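For reference, here is a minimal sketch of how I automate the explicit --device workaround today, assuming the usual device node layout created by the NVIDIA driver (/dev/nvidia0 ... /dev/nvidiaN, /dev/nvidiactl, /dev/nvidia-uvm, and optionally /dev/nvidia-caps/*); the image name is just an example:

```bash
#!/usr/bin/env bash
# Workaround sketch: enumerate NVIDIA device nodes at launch time and pass
# each one to docker explicitly, so the list does not have to be hard-coded.

device_args=()

# Top-level device nodes (/dev/nvidia0..N, /dev/nvidiactl, /dev/nvidia-uvm, ...).
for dev in /dev/nvidia*; do
    # Only pass character devices; this also skips the /dev/nvidia-caps directory.
    [ -c "$dev" ] && device_args+=(--device="$dev")
done

# Capability device nodes, if present (e.g. /dev/nvidia-caps/nvidia-cap1).
if [ -d /dev/nvidia-caps ]; then
    for dev in /dev/nvidia-caps/*; do
        [ -c "$dev" ] && device_args+=(--device="$dev")
    done
fi

# Example image/command; substitute your own workload.
docker run --rm --gpus all "${device_args[@]}" \
    nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```

This only automates the explicit --device workaround described above; it does not address the underlying problem from #48, where the container loses the device references after systemctl daemon-reload.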
