
nvidia-device-plugin-validator fails after node reboot (with MIG enabled) #403

Open
dasantonym opened this issue Sep 9, 2022 · 16 comments
@dasantonym commented Sep 9, 2022

System

Running on bare-metal

  • Ubuntu 20.04.4
  • Kubernetes v1.24.3
  • Containerd 1.6.7
  • GPU-Operator v1.11.1

Setup

GPU-Operator is installed with:

helm install --wait --debug --generate-name --create-namespace \
      nvidia/gpu-operator \
      -n gpu-operator \
      --set migManager.config.name=mig-config \
      --set mig.strategy=mixed \
      --set driver.enabled=false \
      --set toolkit.enabled=false

MIG-Config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      node-standard:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "2g.20gb": 3
        - devices: [1]
          mig-enabled: true
          mig-devices:
            "2g.20gb": 3
        - devices: [2]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
        - devices: [3]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
        - devices: [4,5,6,7]
          mig-enabled: false
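
Nodes are switched between these layouts with the standard mig-manager node label; for example, to select the node-standard config for one node (node name is a placeholder):

# Select the node-standard layout above for one node; mig-manager watches this
# label and reconfigures that node's GPUs accordingly.
kubectl label node <node-name> nvidia.com/mig.config=node-standard --overwrite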

Issue

I am running a cluster with multiple GPU nodes; some of the nodes use MIG, others do not. As long as all nodes have their MIG config set to all-disabled, everything is fine. As soon as I set one node to a mixed MIG config, the nvidia-device-plugin-validator fails with the message:

spec: failed to generate spec: lstat /run/nvidia/driver/dev/nvidiactl: no such file or directory

Once I switch back the MIG config to all-disabled, the validation succeeds again.

Edit: To clarify further: the validator only fails with the wrong driver-root value on the node where I activate MIG. The other validator pods are unaffected, even after further node restarts, as long as MIG stays off.
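
For reference, the failing pod can be located and inspected with something along these lines (exact pod names will differ):

# List the validator workload pods and show why the failing one cannot start.
kubectl get pods -n gpu-operator -o wide | grep validator
kubectl describe pod -n gpu-operator <nvidia-device-plugin-validator-pod>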

@shivamerla (Contributor)

@dasantonym to confirm: was the wrong driver root set in the node's nvidia-container-runtime config file, or did you see the wrong NVIDIA_DRIVER_ROOT set within the device-plugin pod?

@dasantonym (Author)

Thanks for the prompt reply. I did not make any config changes, but where should I look for the nvidia-container-runtime file? Do you mean the containerd config file, or is it located in a specific pod?

The host's containerd config looks like this:

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

The env var NVIDIA_DRIVER_ROOT is set to / for the failing pod.

@dasantonym (Author)

I noticed something else: I just added a new node, and that one switches fine between the MIG and non-MIG setup. The other nodes are in use, so I can't check them right now. This might very well be related to this one node only...

I'd be glad to hear any suggestions for further debugging. I will try a few more variations and wait for the other nodes to be free again to check whether they are also affected.

@shivamerla (Contributor)

From the error you posted, it looks like the device-plugin is getting started with the wrong NVIDIA_DRIVER_ROOT=/run/nvidia/driver. This is set based on whether the file /run/nvidia/validations/host-driver-ready is in place.

spec: failed to generate spec: lstat /run/nvidia/driver/dev/nvidiactl: no such file or directory
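
A quick way to check both candidate driver roots and the validation flag directly on the affected node (plain shell, nothing operator-specific):

# With host-installed drivers, /dev/nvidiactl exists on the host directly;
# /run/nvidia/driver is only populated when the driver container is used.
ls -l /dev/nvidiactl
ls -l /run/nvidia/driver/dev/nvidiactl
ls -l /run/nvidia/validations/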

@dasantonym (Author)

I'm afraid I don't understand this correctly. The value of NVIDIA_DRIVER_ROOT within the device-plugin-daemonset pod on the failing node is /. The file /run/nvidia/validations/host-driver-ready exists (both in the pod and on the host). Only the ClusterPolicy nvidia-device-plugin-validator continuously fails for the node until I disable MIG again. The device plugin itself looks fine.

Is there any other place where the wrong driver root could be coming from?

@shivamerla (Contributor)

Can you restart the device-plugin on the failing node to confirm whether the issue persists across MIG changes? The other place the driver root is set is /etc/nvidia-container-runtime/config.toml. The reason I ask for the device-plugin restart is that we set this during startup of the plugin, so I'm wondering if there is a race happening there.

        name: nvidia-device-plugin
        command: [bash, -c]
        args: ["[[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-device-plugin;"]

@dasantonym (Author)

I did a restart but got the same failure. After more testing, it seems to me that this really is related to only one or a few nodes. I am going to check this again next week and try to find the reason; maybe resetting, removing, and re-adding the node(s) will do the trick.

I'll report back with more info!

@dasantonym (Author)

Sorry for the long delay. I have not yet had time to investigate this more closely.

So far, I have had no luck. I first reset the node and removed it from Kubernetes, but after re-adding it, the problem persisted. Then I reset, removed, and re-installed the entire NVIDIA software stack, but no luck there either.

Unfortunately, the error also persists on at least one other node. One node handles the change from non-MIG to MIG correctly (at least it did; I cannot verify now since the cluster is in production).

Where else could this wrong driver-root setting be persisted? And why does it only take effect when MIG mode is on? This is confusing. Do you have other ideas on where to look?

One more question: you mentioned /etc/nvidia-container-runtime/config.toml – where do I find this file?

@dasantonym (Author)

@shivamerla I finally got around to properly debugging this and you were right, there must be a race happening here.

This only affects nodes after a reboot, and at first everything runs fine except the device-plugin-validator. Even the cuda-validation works, and once I restart the device-plugin-validator, it validates properly.

So the file /run/nvidia/validations/host-driver-ready must be appearing too late for the validator, and the validation fails.

Can you advise how to work around this? It prevents me from automatically scaling the physical nodes, as I always have to manually restart the validator.
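
The manual restart I do boils down to deleting the validator pod on the rebooted node so its checks run again; roughly (label and namespace are assumptions based on a default install):

# Delete the operator-validator pod on that node; the daemonset recreates it
# and the plugin validation re-runs once host-driver-ready exists.
kubectl delete pod -n gpu-operator \
  -l app=nvidia-operator-validator \
  --field-selector spec.nodeName=<node-name>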

@dasantonym dasantonym changed the title nvidia-device-plugin-validator fails when MIG is enabled for a node nvidia-device-plugin-validator fails after node reboot Jan 30, 2023
@dasantonym (Author) commented Jan 30, 2023

Just to add: this does not seem to have anything to do with MIG.

@cdesiniotis (Contributor)

@dasantonym thanks for reporting this issue. During MIG reconfigurations, there is a race condition between when the host-driver-ready file is created and when it is read by some of our gpu-operator components that depend on it. I am working on a fix for this in mig-manager: https://gitlab.com/nvidia/cloud-native/mig-parted/-/merge_requests/115

> Just to add: this does not seem to have anything to do with MIG.

How did you come to this conclusion? Did you also reproduce this issue on nodes without any MIG configuration?

@cdesiniotis (Contributor)

cc @klueska

@dasantonym (Author)

Thanks for looking into this! I guess I just got confused from trying out so many variations. At first I thought the error only occurs if I change the MIG config, then I saw that it happens even when the config is unchanged on boot, but in fact MIG was always enabled. I just tested again on a different setup with mig.strategy=none and MIG disabled on all GPUs, and it starts up properly.

I retract my additional comment!

@dasantonym dasantonym changed the title nvidia-device-plugin-validator fails after node reboot nvidia-device-plugin-validator fails after node reboot (with MIG enabled) Feb 2, 2023
@wwj-2017-1117

I am also running into the same problem.

@accelide commented Jan 30, 2024

Running into the same issue with MIG enabled or disabled. It occurs intermittently when the NVIDIA drivers are installed at the OS level. To mitigate it, I created a custom gpu-operator image by adding retry logic (a while loop) to https://github.com/NVIDIA/gpu-operator/blob/master/assets/state-container-toolkit/0400_configmap.yaml and building the image. Attached are a sample 0400_configmap.yaml with the while loop and a Dockerfile to create the custom image:
0400_configmap.txt
Dockerfile.txt
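
The attached files are not reproduced here; as a rough sketch of the idea (assuming the retry is around the host-driver-ready check discussed above, and host-installed drivers so the flag is expected to appear eventually):

# Hypothetical retry: instead of a single check at startup, wait for the
# host driver validation flag before exporting the driver root.
attempts=0
until [[ -f /run/nvidia/validations/host-driver-ready ]]; do
  attempts=$((attempts + 1))
  if [[ $attempts -gt 60 ]]; then
    echo "timed out waiting for host-driver-ready" >&2
    exit 1
  fi
  sleep 5
done
export NVIDIA_DRIVER_ROOT=/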

Helm command to install with custom image:

helm install --wait --debug --generate-name --create-namespace \
      nvidia/gpu-operator \
      -n gpu-operator \
      --set driver.enabled=false \
      --set operator.version=<replace with custom image tag>

@cdesiniotis

@cdesiniotis cdesiniotis added the bug Issue/PR to expose/discuss/fix a bug label Jan 31, 2024
@cdesiniotis cdesiniotis self-assigned this Jan 31, 2024
@cdesiniotis (Contributor)

@accelide thanks for providing details on what resolved the issue for you. It looks like a race condition still exists. I will have to get back to you once I investigate a bit further.
