docker: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error" #857

Open
henryli001 opened this issue Jan 11, 2025 · 2 comments

A standalone Docker container running in an Azure Linux 2.0 VM with the NVIDIA Container Toolkit installed loses access to the GPUs and throws the error "Failed to initialize NVML: Unknown Error" after running for a while. The symptom is similar to the one described in known issue #48 and can be reproduced by running systemctl daemon-reload.

The issue does not show up if I explicitly set --device= for each NVIDIA device node on the system in the docker command. However, this is not a sustainable solution, since the number of NVIDIA device nodes on the system can change with the configuration. Is there a better way to let the container automatically access all NVIDIA devices without explicitly setting --device= for each device node?
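
For reference, a minimal sketch of the explicit-device workaround described above (the enumeration loop and the CUDA image tag are illustrative, not part of the original report):

# Enumerate whatever NVIDIA device nodes exist on the host right now and pass
# each one explicitly; the set of nodes varies with the host configuration.
DEVICE_FLAGS=""
for dev in /dev/nvidia*; do
    [ -c "$dev" ] && DEVICE_FLAGS+=" --device=${dev}"
done
docker run --rm --runtime=nvidia --gpus all ${DEVICE_FLAGS} \
    nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi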

elezar commented Jan 14, 2025

@henryli001 as called out in #48, one other option is to use cgroupfs as the cgroup driver instead of systemd.
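
A sketch of switching Docker to the cgroupfs driver (assumes Docker reads /etc/docker/daemon.json; merge by hand if a daemon.json already exists on the host):

# Tell Docker to use the cgroupfs cgroup driver instead of systemd, then restart.
# NOTE: this overwrites any existing daemon.json.
sudo tee /etc/docker/daemon.json > /dev/null << 'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
EOF
sudo systemctl restart docker
docker info | grep -i 'cgroup driver'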

Note that using CDI to request devices should also address this problem, as the cgroups are updated by runc (or another low-level runtime) instead of the nvidia-container-runtime-hook.
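
A minimal sketch of the CDI flow (assumes a Docker version with CDI support enabled and the default spec directory /etc/cdi):

# Generate a CDI specification for the NVIDIA devices present on this host.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the device names the spec exposes (e.g. nvidia.com/gpu=0, nvidia.com/gpu=all).
nvidia-ctk cdi list

# Request all GPUs via CDI; the cgroup device rules are then written by the
# low-level runtime rather than by the nvidia-container-runtime-hook.
docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi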

Which Docker version are you using, and how do you typically launch containers?

elezar self-assigned this Jan 14, 2025

santurini commented Jan 17, 2025

I am experiencing a similar issue on Kubernetes. I followed #48 to reproduce the issue, but the workarounds are not effective: in particular, I tried adding the udev rule and changing the containerd configuration.

As a result, when I launch the nvidia-smi loop pod, the command fails as soon as I run systemctl daemon-reload with:

Failed to initialize NVML: Unknown Error
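
(The nvidia-smi loop pod is roughly along these lines; a representative sketch modeled on the reproducer in #48, with an illustrative image and GPU resource request.)

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-loop
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["bash", "-c", "while true; do nvidia-smi; sleep 5; done"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF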

Versions

  • GPU Operator: v24.9.1
  • NVIDIA device plugin (with MPS configuration): 0.17.0
  • containerd: 1.7.12
  • nvidia-ctk: 1.17.3

Configurations

GPU Operator deployment:

helm upgrade \
    gpu-operator \
    nvidia/gpu-operator \
    -n gpu-operator \
    --set devicePlugin.enabled=false \
    --set gfd.enabled=false

Nvidia device plugin:

cat << EOF > dp-mps-config.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy:
    - envvar
    deviceIDStrategy: uuid
sharing:
  mps:
    renameByDefault: false
    resources:
    - name: nvidia.com/gpu
      replicas: 3
EOF
kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \
    --from-file=config=dp-mps-config.yaml
helm upgrade nvdp -i nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --set runtimeClassName=nvidia \
    --set config.name=nvidia-plugin-configs \
    --set nvidiaDriverRoot=/ \
    --set gfd.enabled=true

Udev rule

Created the following rule under /lib/udev/rules.d/71-nvidia-dev-char.rules:

# This will create /dev/char symlinks to all device nodes
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system 	create-dev-char-symlinks --create-all"

Reloaded and triggered the rule:

udevadm trigger
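
(A fuller reload-and-verify sequence would be along these lines; the rule reload and the verification step are illustrative additions, not part of the original report.)

# Reload the udev rules, re-trigger them, and check whether the /dev/char
# symlinks for the NVIDIA device nodes were created.
sudo udevadm control --reload-rules
sudo udevadm trigger
ls -l /dev/char | grep nvidia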

Logs:

Jan 17 15:30:29 node-4b snapd[1149]: udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data

And if I run nvidia-ctk system create-dev-char-symlinks --create-all directly:

time="2025-01-17T15:33:05Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"
time="2025-01-17T15:33:05Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"
time="2025-01-17T15:33:05Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"
time="2025-01-17T15:33:05Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"
time="2025-01-17T15:33:06Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"
time="2025-01-17T15:33:06Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"
time="2025-01-17T15:33:06Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"
time="2025-01-17T15:33:06Z" level=warning msg="unable to get class name for device: failed to find class with id 'c8000'\n"

Containerd configuration

My containerd configuration is as follows:

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = false

Whereas if I run crictl info, I see:

      "runtimes": {
        "nvidia": {
          "runtimeType": "io.containerd.runc.v2",
          "runtimePath": "",
          "runtimeEngine": "",
          "PodAnnotations": null,
          "ContainerAnnotations": null,
          "runtimeRoot": "",
          "options": {
            "BinaryName": "/usr/local/nvidia/toolkit/nvidia-container-runtime",
            "SystemdCgroup": true
          },
          "privileged_without_host_devices": false,
          "privileged_without_host_devices_all_devices_allowed": false,
          "baseRuntimeSpec": "",
          "cniConfDir": "",
          "cniMaxConfNum": 0,
          "snapshotter": "",
          "sandboxMode": "podsandbox"
        },
        "runc": {
          "runtimeType": "io.containerd.runc.v2",
          "runtimePath": "",
          "runtimeEngine": "",
          "PodAnnotations": null,
          "ContainerAnnotations": null,
          "runtimeRoot": "",
          "options": {
            "SystemdCgroup": true
          },

eabatalov added a commit to tensorlakeai/tensorlake that referenced this issue Feb 23, 2025
Add code to the Function Executor health check that verifies whether the
NVIDIA drivers are working. This allows detecting whether a Function
Executor is affected by the known issue
NVIDIA/nvidia-container-toolkit#857

Verified the health check manually on a GPU-enabled machine.
eabatalov added a commit to tensorlakeai/tensorlake that referenced this issue Feb 27, 2025
It turns out that nvidia-smi doesn't detect the known issue
NVIDIA/nvidia-container-toolkit#857

So in addition to that, I'm adding a simple PyTorch CUDA computation
as a health check, which is known to detect the issue.

Also refactored the code to extract the health check handler into
a separate class, because it is quite complex now.
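
(For reference, a minimal sketch of such a CUDA-exercising health check, assuming PyTorch with CUDA support is installed in the container; this is not the tensorlake implementation.)

# Run a tiny CUDA computation; the command exits non-zero if the GPU is unusable,
# even in cases where nvidia-smi alone does not surface the problem.
python3 -c "import torch; x = torch.rand(1024, device='cuda'); print(float(x.sum()))"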