
nvidia-device-plugin-validator fails after node reboot (with MIG enabled) #403

Open
dasantonym opened this issue Sep 9, 2022 · 16 comments
@dasantonym commented Sep 9, 2022

System

Running on bare-metal

  • Ubuntu 20.04.4
  • Kubernetes v1.24.3
  • Containerd 1.6.7
  • GPU-Operator v1.11.1

Setup

GPU-Operator is installed with:

helm install --wait --debug --generate-name --create-namespace \
      nvidia/gpu-operator \
      -n gpu-operator \
      --set migManager.config.name=mig-config \
      --set mig.strategy=mixed \
      --set driver.enabled=false \
      --set toolkit.enabled=false

MIG-Config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      node-standard:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "2g.20gb": 3
        - devices: [1]
          mig-enabled: true
          mig-devices:
            "2g.20gb": 3
        - devices: [2]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
        - devices: [3]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
        - devices: [4,5,6,7]
          mig-enabled: false
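
Nodes are switched between these layouts with the standard mig-manager node label; for example, to select the node-standard config for one node (node name is a placeholder):

# Select the node-standard layout above for one node; mig-manager watches this
# label and reconfigures that node's GPUs accordingly.
kubectl label node <node-name> nvidia.com/mig.config=node-standard --overwrite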

Issue

I am running a cluster with multiple GPU nodes; some of the nodes use MIG, others do not. As long as all nodes have their MIG config set to all-disabled, everything is fine. As soon as I set one node to a mixed MIG config, the nvidia-device-plugin-validator fails with the message:

spec: failed to generate spec: lstat /run/nvidia/driver/dev/nvidiactl: no such file or directory

Once I switch back the MIG config to all-disabled, the validation succeeds again.

Edit: To clarify further: the validator only fails with the wrong driver-root value on the node where I activate MIG. The other validator pods are unaffected, even after further node restarts, as long as MIG stays off.
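
For reference, the failing pod can be located and inspected with something along these lines (exact pod names will differ):

# List the validator workload pods and show why the failing one cannot start.
kubectl get pods -n gpu-operator -o wide | grep validator
kubectl describe pod -n gpu-operator <nvidia-device-plugin-validator-pod>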

@shivamerla (Contributor)

@dasantonym to confirm: was the wrong driver root set in the node's nvidia-container-runtime config file, or did you see the wrong NVIDIA_DRIVER_ROOT set within the device-plugin pod?

@dasantonym (Author)

Thanks for the prompt reply. I did not make any config changes, but where should I look for the nvidia-container-runtime file? Do you mean the containerd config file, or is it located in a specific pod?

The host's containerd config looks like this:

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

The env var NVIDIA_DRIVER_ROOT is set to / for the failing pod.

@dasantonym (Author)

I noticed something else: I just added a new node, and that one switches fine between the MIG and non-MIG setup. The other nodes are in use, so I can't check them right now. This might very well be related to this one node only...

I'd be glad to hear any suggestions for further debugging. I will try a few more variations and wait for the other nodes to be free again to check whether they are also affected.

@shivamerla (Contributor)

From the error you posted, it looks like the device-plugin is getting started with the wrong NVIDIA_DRIVER_ROOT=/run/nvidia/driver. This is set based on whether the file /run/nvidia/validations/host-driver-ready is in place.

spec: failed to generate spec: lstat /run/nvidia/driver/dev/nvidiactl: no such file or directory
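
A quick way to check both candidate driver roots and the validation flag directly on the affected node (plain shell, nothing operator-specific):

# With host-installed drivers, /dev/nvidiactl exists on the host directly;
# /run/nvidia/driver is only populated when the driver container is used.
ls -l /dev/nvidiactl
ls -l /run/nvidia/driver/dev/nvidiactl
ls -l /run/nvidia/validations/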

@dasantonym (Author)

I'm afraid I don't understand this correctly. The value of NVIDIA_DRIVER_ROOT within the device-plugin-daemonset pod on the failing node is /. The file /run/nvidia/validations/host-driver-ready exists (both in the pod and on the host). Only the ClusterPolicy nvidia-device-plugin-validator continuously fails for the node until I disable MIG again. The device plugin itself looks fine.

Is there any other place where the wrong driver root could be coming from?

@shivamerla (Contributor)

Can you restart the device-plugin on the failing node to confirm whether the issue persists across MIG changes? The other place the driver root is set is /etc/nvidia-container-runtime/config.toml. The reason I ask for the device-plugin restart is that we set this during startup of the plugin, so I'm wondering if there is a race happening there.

        name: nvidia-device-plugin
        command: [bash, -c]
        args: ["[[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-device-plugin;"]

@dasantonym (Author)

I did a restart but got the same failure. After more testing, it seems to me that this really is related to only one or a few nodes. I am going to check this again next week and try to find the reason; maybe resetting, removing, and re-adding the node(s) will do the trick.

I'll report back with more info!

@dasantonym (Author)

Sorry for the long delay. I have not yet had time to investigate this more closely.

So far, I have had no luck. I first reset the node and removed it from Kubernetes, but after re-adding it, the problem persisted. Then I reset, removed, and re-installed the entire NVIDIA software stack, but no luck there either.

Unfortunately, the error also persists on at least one other node. One node handles the change from non-MIG to MIG correctly (at least it did; I cannot verify now since the cluster is in production).

Where else could this wrong driver-root setting be persisted? And why does it only take effect when MIG mode is on? This is confusing. Do you have other ideas on where to look?

One more question: you mentioned /etc/nvidia-container-runtime/config.toml – where do I find this file?

@dasantonym (Author)

@shivamerla I finally got around to properly debugging this and you were right, there must be a race happening here.

This only affects nodes after a reboot, and at first everything runs fine except the device-plugin-validator. Even the cuda-validation works, and once I restart the device-plugin-validator, it validates properly.

So the file /run/nvidia/validations/host-driver-ready must be appearing too late for the validator, and the validation fails.

Can you advise how to work around this? It prevents me from automatically scaling the physical nodes, as I always have to manually restart the validator.
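
The manual restart I do boils down to deleting the validator pod on the rebooted node so its checks run again; roughly (label and namespace are assumptions based on a default install):

# Delete the operator-validator pod on that node; the daemonset recreates it
# and the plugin validation re-runs once host-driver-ready exists.
kubectl delete pod -n gpu-operator \
  -l app=nvidia-operator-validator \
  --field-selector spec.nodeName=<node-name>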

@dasantonym dasantonym changed the title nvidia-device-plugin-validator fails when MIG is enabled for a node nvidia-device-plugin-validator fails after node reboot Jan 30, 2023
@dasantonym (Author) commented Jan 30, 2023

Just to add: this does not seem to have anything to do with MIG.

@cdesiniotis (Contributor)

@dasantonym thanks for reporting this issue. During MIG reconfigurations, there is a race condition between when the host-driver-ready file is created and when it is read by some of our gpu-operator components that depend on it. I am working on a fix for this in mig-manager: https://gitlab.com/nvidia/cloud-native/mig-parted/-/merge_requests/115

> Just to add: this does not seem to have anything to do with MIG.

How did you come to this conclusion? Did you also reproduce this issue on nodes without any MIG configuration?

@cdesiniotis (Contributor)

cc @klueska

@dasantonym (Author)

Thanks for looking into this! I guess I just got confused from trying out so many variations. At first I thought the error only occurs if I change the MIG config, then I saw that it happens even when the config is unchanged on boot, but in fact MIG was always enabled. I just tested again on a different setup with mig.strategy=none and MIG disabled on all GPUs, and it starts up properly.

I retract my additional comment!

@dasantonym dasantonym changed the title nvidia-device-plugin-validator fails after node reboot nvidia-device-plugin-validator fails after node reboot (with MIG enabled) Feb 2, 2023
@wwj-2017-1117

I am also running into the same problem.

@accelide commented Jan 30, 2024

Running into the same issue with MIG enabled or disabled. It occurs intermittently when the NVIDIA drivers are installed at the OS level. To mitigate it, I created a custom gpu-operator image by adding retry logic (a while loop) to https://github.com/NVIDIA/gpu-operator/blob/master/assets/state-container-toolkit/0400_configmap.yaml and building the image. Attached are a sample 0400_configmap.yaml with the while loop and a Dockerfile to create the custom image:
0400_configmap.txt
Dockerfile.txt
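
The attached files are not reproduced here; as a rough sketch of the idea (assuming the retry is around the host-driver-ready check discussed above, and host-installed drivers so the flag is expected to appear eventually):

# Hypothetical retry: instead of a single check at startup, wait for the
# host driver validation flag before exporting the driver root.
attempts=0
until [[ -f /run/nvidia/validations/host-driver-ready ]]; do
  attempts=$((attempts + 1))
  if [[ $attempts -gt 60 ]]; then
    echo "timed out waiting for host-driver-ready" >&2
    exit 1
  fi
  sleep 5
done
export NVIDIA_DRIVER_ROOT=/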

Helm command to install with custom image:

helm install --wait --debug --generate-name --create-namespace \
      nvidia/gpu-operator \
      -n gpu-operator \
      --set driver.enabled=false \
      --set operator.version=<replace with custom image tag>

@cdesiniotis

@cdesiniotis cdesiniotis added the bug Issue/PR to expose/discuss/fix a bug label Jan 31, 2024
@cdesiniotis cdesiniotis self-assigned this Jan 31, 2024
@cdesiniotis (Contributor)

@accelide thanks for providing details on what resolved the issue for you. It looks like a race condition still exists. I will have to get back to you once I investigate a bit further.
