nvidia-device-plugin-validator fails after node reboot (with MIG enabled) #403
Comments
@dasantonym to confirm, was the wrong driver root set in the node's nvidia-container-runtime config file, or did you see the wrong NVIDIA_DRIVER_ROOT set within the device-plugin pod?
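For reference, a minimal sketch of how to check the value inside the device-plugin pod; the namespace and label selector are assumptions based on a default gpu-operator install and may differ in your deployment:

```bash
# Print NVIDIA_DRIVER_ROOT from the device-plugin pod running on the node in question
# (namespace and label selector are assumptions; <node> is a placeholder)
POD=$(kubectl -n gpu-operator get pods -l app=nvidia-device-plugin-daemonset \
        --field-selector spec.nodeName=<node> -o jsonpath='{.items[0].metadata.name}')
kubectl -n gpu-operator exec "$POD" -- printenv NVIDIA_DRIVER_ROOT
```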
Thanks for the prompt reply. I did not make any config changes, but where should I look for the nvidia-container-runtime file? Do you mean the containerd config file, or is this also located in a specific pod? The host's containerd config looks like this:

```toml
version = 2

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
```

The env var NVIDIA_DRIVER_ROOT is set to […]
I just noticed something else: I added a new node and that one switches fine between the MIG and non-MIG setup. The other nodes are in use, so I can't check them right now; this might very well be related to this one node only. If you have any suggestions for further debugging, I'd be delighted. I will make a few more tries now and wait for the other nodes to be out of use again to check whether they are also affected.
From the error you posted, it looks like the device-plugin is getting started with the wrong NVIDIA_DRIVER_ROOT=/run/nvidia/driver. This value is set based on the file /run/nvidia/validations/host-driver-ready being in place.
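A minimal sketch of how that marker can be checked on the failing node; the path comes from the comment above, everything else is an assumption:

```bash
# On the failing node: if the host driver validation marker exists, the
# device-plugin should be started with the host driver root rather than
# NVIDIA_DRIVER_ROOT=/run/nvidia/driver
ls -l /run/nvidia/validations/
test -f /run/nvidia/validations/host-driver-ready \
  && echo "host-driver-ready present" \
  || echo "host-driver-ready missing"
```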
I'm afraid I don't understand this correctly. The value of NVIDIA_DRIVER_ROOT within the […] Is there any other place where the wrong driver root could be coming from?
Can you restart the device-plugin on the failing node to confirm whether the issue persists across MIG changes? The other place where the driver root is set is /etc/nvidia-container-runtime/config.toml. The reason I asked for the device-plugin restart is that we set this value during startup of the plugin, so I am wondering whether a race is happening there.
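A sketch of the two checks suggested here; the namespace and label selector are assumptions from a default gpu-operator install:

```bash
# The other location for the driver root: the NVIDIA container runtime config on the host
cat /etc/nvidia-container-runtime/config.toml

# Restart the device-plugin pod on the failing node; its DaemonSet recreates it
# (<failing-node> is a placeholder)
kubectl -n gpu-operator delete pod -l app=nvidia-device-plugin-daemonset \
  --field-selector spec.nodeName=<failing-node>
```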
I did a restart but got the same failure. I have now done more testing and it seems to me that this really is related to only one or a few nodes. I am going to check this again next week and try to find the reason, but maybe resetting, removing and re-adding the node(s) will do the trick. I'll report back with more info!
Sorry for the long delay. I have not yet had time to investigate this more closely, and so far I have had no luck. I first reset and removed the node from Kubernetes, but after re-adding it, the problem persisted. Then I reset, removed and re-installed the entire NVIDIA software, but had no luck there either. Unfortunately, the error also persists on at least one other node. One node handles the change from non-MIG to MIG state correctly (at least it did; I cannot verify now since the cluster is in production). Where else could this wrong setting for the driver root be persisted? And why is it only active when MIG mode is on? This is confusing. Do you maybe have other ideas on where to look? One more question: you mentioned […]
@shivamerla I finally got around to properly debugging this, and you were right, there must be a race happening here. It only affects the nodes after a reboot: at first everything runs OK, only the device-plugin-validator fails. Even the cuda-validation works, and once I restart the device-plugin-validator, it validates properly. So the file […] Can you advise on how to work around this? It prevents me from automatically scaling the physical nodes, as I always have to be there to manually restart the validator.
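For anyone hitting the same symptom, a sketch of the manual restart described here; the namespace and label selector are assumptions from a default gpu-operator deployment:

```bash
# Delete the operator-validator pod on the rebooted node so its DaemonSet
# recreates it and the plugin validation re-runs (<rebooted-node> is a placeholder)
kubectl -n gpu-operator delete pod -l app=nvidia-operator-validator \
  --field-selector spec.nodeName=<rebooted-node>
```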
Just to add: this does not seem to have anything to do with MIG.
@dasantonym thanks for reporting this issue. During MIG reconfigurations, there is a race condition between when the […]
How did you come to this conclusion? Did you also reproduce this issue on nodes without any MIG configuration?
cc @klueska
Thanks for looking into this! I guess I just got confused by trying out so many variations. At first I thought the error only occurs if I change the MIG config, then I saw that it happens even when the config is unchanged on boot, but in fact MIG was always enabled. I just tested again on a different setup with […] I retract my additional comment!
I am also running into the same problem.
Running into the same issue with MIG enabled or disabled. The issue occurs intermittently when the NVIDIA drivers are installed at the OS level. To mitigate it, I created a custom gpu-operator image by adding retry logic (a while loop) to https://github.com/NVIDIA/gpu-operator/blob/master/assets/state-container-toolkit/0400_configmap.yaml and building the image. Attached are a sample 0400_configmap.yaml with the while loop and a Dockerfile to create the custom image.
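The attachments are not reproduced here. As a rough illustration only, a retry loop of this kind could look like the sketch below; the marker path is taken from earlier comments, and everything else is an assumption about what the configmap script does rather than the actual patch:

```bash
# Retry for up to ~5 minutes waiting for the host driver validation marker,
# instead of reading it once and racing the validator after a node reboot.
for i in $(seq 1 60); do
  if [ -f /run/nvidia/validations/host-driver-ready ]; then
    echo "host driver validation marker found"
    break
  fi
  echo "waiting for /run/nvidia/validations/host-driver-ready ($i/60)..."
  sleep 5
done
```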
@accelide thanks for providing details on what resolved the issue for you. It looks like a race condition still exists. I will have to get back to you once I investigate a bit further.
System
Running on bare-metal
Setup
GPU-Operator is installed with:
MIG-Config:
Issue
I am running a cluster with multiple GPU nodes; some of the nodes are using MIG, others are not. As long as all nodes have their MIG config set to `all-disabled`, everything is fine. As soon as I set one node to a mixed MIG config, the `nvidia-device-plugin-validator` fails with an error about a wrong driver root. Once I switch the MIG config back to `all-disabled`, the validation succeeds again.

Edit: To further clarify: the validator only fails with the wrong driver root value on the node where I activate MIG. The other validator pods are unaffected, even after further (node) restarts, as long as MIG remains off.
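For context, the per-node MIG config switch described above is driven by the mig-manager node label; a sketch, where the node name is a placeholder and the profile names assume the default mig-parted config shipped with the gpu-operator:

```bash
# Switch one node to a mixed MIG profile defined in the mig-parted config
kubectl label node <gpu-node> nvidia.com/mig.config=all-balanced --overwrite

# Switch it back to non-MIG
kubectl label node <gpu-node> nvidia.com/mig.config=all-disabled --overwrite
```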