GPU-operator install fails - NFD master pod crashes, probes are failing #619

Open
PrachiMittal2016 opened this issue Nov 27, 2023 · 2 comments


Attachment: gpu-operator-issue1.txt


PrachiMittal2016 commented Nov 27, 2023

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
  • Kernel Version: 5.15.0-1051-azure
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd://1.7.5-1
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): AKS kubernetes version: 1.25.11
  • GPU Operator Version: gpu-operator-v23.9.0

2. Issue or feature description

The gpu-operator-1701120700-node-feature-discovery-master pod is crashing with the following errors:

Liveness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
Readiness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
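
As a quick check (a debugging sketch, not from the original report), the same probe can be run by hand from inside the container with a more generous deadline while it is up between restarts; -connect-timeout and -rpc-timeout are standard grpc_health_probe flags, whereas the kubelet probe runs with the default 1s timeout (see the pod description below):

kubectl exec -n gpu-operator gpu-operator-1701120700-node-feature-discovery-master-d46872jxf -- /usr/bin/grpc_health_probe -addr=:8080 -connect-timeout 5s -rpc-timeout 5s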

3. Steps to reproduce the issue

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/microsoft-aks.html

Approach 1:
Create AKS Cluster with Node Pool Tags to Prevent Driver installation

Create a node pool without the GPU driver:

az aks nodepool add --resource-group az-sre-germanywestcentral --cluster-name cx-aiml-sre-germanywestcentral --name gpuskipdri --node-count 1 --node-vm-size Standard_NC4as_T4_v3 --node-taints sku=gpu:NoSchedule --labels sku=gpu --node-osdisk-type Ephemeral --enable-cluster-autoscaler --tags SkipGPUDriverInstall=true --os-type Linux --min-count 1 --max-count 1

az aks nodepool show --resource-group az-sre-germanywestcentral --cluster-name cx-aiml-sre-germanywestcentral --name gpuskipdri --query tags
{
"SkipGPUDriverInstall": "true"
}

Install NVIDIA GPU Operator

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#install-nvidia-gpu-operator

helm version
version.BuildInfo{Version:"v3.10.3", GitCommit:"835b7334cfe2e5e27870ab3ed4135f136eecc704", GitTreeState:"clean", GoVersion:"go1.18.9"}

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

helm list --namespace gpu-operator
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
gpu-operator-1701120700 gpu-operator 1 2023-11-27 16:31:43.779255 -0500 EST failed gpu-operator-v23.9.0 v23.9.0
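
The failed status here is most likely a side effect of installing with --wait: Helm marks the release failed when the NFD master pod never becomes Ready within the wait timeout. The release details can be double-checked with a standard Helm command, e.g.:

helm status gpu-operator-1701120700 -n gpu-operator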

4. Information to attach (optional if deemed irrelevant)

kubernetes pods status:

kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-operator-1701120700-node-feature-discovery-gc-779cf9cfjsmfc 1/1 Running 0 90m
gpu-operator-1701120700-node-feature-discovery-master-d46872jxf 0/1 CrashLoopBackOff 31 (4m16s ago) 90m
gpu-operator-1701120700-node-feature-discovery-worker-4w9tm 1/1 Running 0 90m
gpu-operator-1701120700-node-feature-discovery-worker-76b58 1/1 Running 0 90m
gpu-operator-1701120700-node-feature-discovery-worker-8zxk6 1/1 Running 0 90m
gpu-operator-1701120700-node-feature-discovery-worker-c6xnb 1/1 Running 0 90m
gpu-operator-1701120700-node-feature-discovery-worker-l5js7 1/1 Running 0 90m
gpu-operator-1701120700-node-feature-discovery-worker-wfq95 1/1 Running 0 90m
gpu-operator-1701120700-node-feature-discovery-worker-zkvct 1/1 Running 0 90m
gpu-operator-75dc4c6dd6-prjdj 1/1 Running 0 90m

kubernetes daemonset status:

kubectl get ds -n gpu-operator
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
gpu-operator-1701120700-node-feature-discovery-worker 7 7 7 7 7 91m

  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME

kubectl describe pod gpu-operator-1701120700-node-feature-discovery-master-d46872jxf -n gpu-operator
Name: gpu-operator-1701120700-node-feature-discovery-master-d46872jxf
Namespace: gpu-operator
Priority: 0
Service Account: node-feature-discovery
Node: aks-micservices-74078279-vmss000008/10.240.0.11
Start Time: Mon, 27 Nov 2023 16:31:49 -0500
Labels: app.kubernetes.io/instance=gpu-operator-1701120700
app.kubernetes.io/name=node-feature-discovery
pod-template-hash=d468b9bc
role=master
Annotations:
Status: Running
IP: 10.244.12.231
IPs:
IP: 10.244.12.231
Controlled By: ReplicaSet/gpu-operator-1701120700-node-feature-discovery-master-d468b9bc
Containers:
master:
Container ID: containerd://596c5399b024736718c29776c9f6f10b9927eb7bbfdefe004d9e7e479c3acd34
Image: registry.k8s.io/nfd/node-feature-discovery:v0.14.2
Image ID: registry.k8s.io/nfd/node-feature-discovery@sha256:2a56d172c48b76531eb719780224ef278daa68b9088f592f16df2519bed08de4
Ports: 8080/TCP, 8081/TCP
Host Ports: 0/TCP, 0/TCP
Command:
nfd-master
Args:
-port=8080
-crd-controller=true
-metrics=8081
State: Running
Started: Mon, 27 Nov 2023 18:03:08 -0500
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Mon, 27 Nov 2023 17:57:23 -0500
Finished: Mon, 27 Nov 2023 17:58:01 -0500
Ready: False
Restart Count: 32
Liveness: exec [/usr/bin/grpc_health_probe -addr=:8080] delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [/usr/bin/grpc_health_probe -addr=:8080] delay=5s timeout=1s period=10s #success=1 #failure=10
Environment:
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/kubernetes/node-feature-discovery from nfd-master-conf (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v8tws (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nfd-master-conf:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: gpu-operator-1701120700-node-feature-discovery-master-conf
Optional: false
kube-api-access-v8tws:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors:
Tolerations: node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message


Warning Unhealthy 31m (x69 over 91m) kubelet Liveness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
Warning Unhealthy 6m37s (x118 over 91m) kubelet Readiness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
Warning BackOff 103s (x335 over 87m) kubelet Back-off restarting failed container
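
The probes above run with timeout=1s. As a temporary experiment (a sketch, not an official fix; the deployment name is inferred from the ReplicaSet shown above, and a later helm upgrade would restore the original spec), the exec timeout can be raised directly on the deployment to see whether the pod then stabilizes:

kubectl patch deployment gpu-operator-1701120700-node-feature-discovery-master -n gpu-operator --type=json -p='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/timeoutSeconds","value":5},{"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/timeoutSeconds","value":5}]'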

  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers

kubectl logs -n gpu-operator gpu-operator-1701120700-node-feature-discovery-master-d46872jxf --all-containers
I1127 23:03:42.256769 1 main.go:83] "-port is deprecated, will be removed in a future release along with the deprecated gRPC API"
I1127 23:03:42.256879 1 nfd-master.go:213] "Node Feature Discovery Master" version="v0.14.2" nodeName="aks-micservices-74078279-vmss000008" namespace="gpu-operator"
I1127 23:03:42.257112 1 nfd-master.go:1214] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I1127 23:03:42.257462 1 nfd-master.go:1274] "configuration successfully updated" configuration=<
DenyLabelNs: {}
EnableTaints: false
ExtraLabelNs:
nvidia.com: {}
Klog: {}
LabelWhiteList: {}
LeaderElection:
LeaseDuration:
Duration: 15000000000
RenewDeadline:
Duration: 10000000000
RetryPeriod:
Duration: 2000000000
NfdApiParallelism: 10
NoPublish: false
ResourceLabels: {}
ResyncPeriod:
Duration: 3600000000000

I1127 23:03:42.257487 1 nfd-master.go:1338] "starting the nfd api controller"
I1127 23:03:42.257716 1 node-updater-pool.go:79] "starting the NFD master node updater pool" parallelism=10
I1127 23:03:42.286242 1 metrics.go:115] "metrics server starting" port=8081
I1127 23:03:42.286360 1 component.go:36] [core][Server #1] Server created
I1127 23:03:42.286399 1 nfd-master.go:347] "gRPC server serving" port=8080
I1127 23:03:42.286465 1 component.go:36] [core][Server #1 ListenSocket #2] ListenSocket created
I1127 23:03:43.286714 1 nfd-master.go:694] "will process all nodes in the cluster"
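
The log shows the gRPC server does come up ("gRPC server serving" port=8080), so the probe failures look more like slow responses than a server that never starts; the pod also runs with BestEffort QoS (no CPU/memory requests, per the description above), so contention on the node could push the health check past its 1s deadline. If the bundled node-feature-discovery subchart exposes master.resources the way the upstream NFD chart does (an assumption, not verified here), requests could be set at install time, for example:

helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set node-feature-discovery.master.resources.requests.cpu=100m --set node-feature-discovery.master.resources.requests.memory=128Mi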

  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
    I am not sure which pod it is
  • containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

shivamerla (Contributor) commented:

@ArangoGutierrez any thoughts?
