GPU-operator install fails - NFD master pod crash, probes are failing #619
1. Quick Debug Information
2. Issue or feature description
The gpu-operator-1701120700-node-feature-discovery-master pod is crashing with the error below:
Liveness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
3. Steps to reproduce the issue
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/microsoft-aks.html
Approach 1: Create Nodepool without GPU Driver
az aks nodepool add --resource-group az-sre-germanywestcentral --cluster-name cx-aiml-sre-germanywestcentral --name gpuskipdri --node-count 1 --node-vm-size Standard_NC4as_T4_v3 --node-taints sku=gpu:NoSchedule --labels sku=gpu --node-osdisk-type Ephemeral --enable-cluster-autoscaler --tags SkipGPUDriverInstall=true --os-type Linux --min-count 1 --max-count 1
az aks nodepool show --resource-group az-sre-germanywestcentral --cluster-name cx-aiml-sre-germanywestcentral --name gpuskipdri --query tags
Install NVIDIA GPU Operator
helm version
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
helm list --namespace gpu-operator
4. Information to attach (optional if deemed irrelevant)
kubernetes pods status: kubectl get pods -n gpu-operator
kubernetes daemonset status: kubectl get ds -n gpu-operator
kubectl describe pod gpu-operator-1701120700-node-feature-discovery-master-d46872jxf -n gpu-operator
Warning Unhealthy 31m (x69 over 91m) kubelet Liveness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
kubectl logs -n gpu-operator gpu-operator-1701120700-node-feature-discovery-master-d46872jxf --all-containers
I1127 23:03:42.257487 1 nfd-master.go:1338] "starting the nfd api controller"
Collecting full debug bundle (optional):
NOTE: please refer to the must-gather script for debug data collected. This bundle can be submitted to us via email: [email protected]
@ArangoGutierrez any thoughts?
gpu-operator-issue1.txt