Use a fraction gpu resource, fail to get response from manager #164

Open
weixiujuan opened this issue Jun 7, 2022 · 5 comments

@weixiujuan

Please help me solve this problem. The details are as follows. Thank you.

The fractional GPU resources are configured as follows:

    resources:
      limits:
        tencent.com/vcuda-core: "20"
        tencent.com/vcuda-memory: "20"
      requests:
        tencent.com/vcuda-core: "20"
        tencent.com/vcuda-memory: "20"
    env:
      - name: LOGGER_LEVEL
        value: "5"

The algorithm program running in the container reports the following error:

/tmp/cuda-control/src/loader.c:1056 config file: /etc/vcuda/kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice/vcuda.config
/tmp/cuda-control/src/loader.c:1057 pid file: /etc/vcuda/kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice/pids.config
/tmp/cuda-control/src/loader.c:1061 register to remote: pod uid: tainerd.service, cont id: kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice
F0607 15:56:33.572429     158 client.go:78] fail to get response from manager, error rpc error: code = Unknown desc = can't find kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice from docker
/tmp/cuda-control/src/register.c:87 rpc client exit with 255

gpu-manager.INFO log contents are as follows:

I0607 15:56:33.571262  626706 manager.go:369] UID: tainerd.service, cont: kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice want to registration
I0607 15:56:33.571439  626706 manager.go:455] Write /etc/gpu-manager/vm/tainerd.service/kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice/pids.config
I0607 15:56:33.573392  626706 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

gpu-manager.WARNING log contents are as follows:
W0607 15:56:44.887813 626706 manager.go:290] Find orphaned pod tainerd.service

gpu-manager.ERROR and gpu-manager.FATAL contain no error logs.

My gpu-manager.yaml is as follows:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-manager
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager-role
subjects:
- kind: ServiceAccount
  name: gpu-manager
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: tencent.com/vcuda-core
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # only run on nodes that have a GPU device
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
        - image: thomassong/gpu-manager:1.1.5
          imagePullPolicy: IfNotPresent
          name: gpu-manager
          securityContext:
            privileged: true
          ports:
            - containerPort: 5678
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vdriver
              mountPath: /etc/gpu-manager/vdriver
            - name: vmdata
              mountPath: /etc/gpu-manager/vm
            - name: log
              mountPath: /var/log/gpu-manager
            - name: checkpoint
              mountPath: /etc/gpu-manager/checkpoint
            - name: run-dir
              mountPath: /var/run
            - name: cgroup
              mountPath: /sys/fs/cgroup
              readOnly: true
            - name: usr-directory
              mountPath: /usr/local/host
              readOnly: true
            - name: kube-root
              mountPath: /root/.kube
              readOnly: true
          env:
            - name: LOG_LEVEL
              value: "5"
            - name: EXTRA_FLAGS
              value: "--logtostderr=false --container-runtime-endpoint=/var/run/containerd/containerd.sock --cgroup-driver=systemd"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            type: Directory
            path: /var/lib/kubelet/device-plugins
        - name: vmdata
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vm
        - name: vdriver
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vdriver
        - name: log
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/log
        - name: checkpoint
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/checkpoint
        # We have to mount the whole /var/run directory into the container because the
        # bind-mounted docker.sock inode changes after the host docker is restarted
        - name: run-dir
          hostPath:
            type: Directory
            path: /var/run
        - name: cgroup
          hostPath:
            type: Directory
            path: /sys/fs/cgroup
        # We have to mount the whole /usr directory instead of a specific library path
        # because the library location differs across distros
        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr
        - name: kube-root
          hostPath:
            type: Directory
            path: /root/.kube
@DennisYoung96

Same here. Did you solve it?

@weixiujuan (Author)

Hi, I downgraded Kubernetes to v1.20 and it works fine. Did you solve it?

@zhichenghe

We have the same issue on Kubernetes v1.18.6.

@lynnfi

lynnfi commented Sep 9, 2023

I hit the same error. I think the cause is "--container-runtime-endpoint=/var/run/containerd/containerd.sock --cgroup-driver=systemd"; using containerd as the container runtime triggers this problem. I will try to solve it.

@lynnfi

lynnfi commented Sep 11, 2023

> Same here. Did you solve it?

I changed the k8s cgroup driver from systemd to cgroupfs, and it works well. Do not use --cgroup-driver=systemd.
The config looks like this:

env:
  - name: LOG_LEVEL
    value: "5"
  - name: EXTRA_FLAGS
    value: "--logtostderr=false --container-runtime-endpoint=/var/run/containerd/containerd.sock"
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
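
If you take this route, note that the cgroup driver has to be changed on the node itself, not only in the gpu-manager flags: the kubelet and the container runtime must agree on the driver. A minimal sketch of the kubelet side, assuming the node reads its KubeletConfiguration from /var/lib/kubelet/config.yaml (path and current values are assumptions; adjust for your setup):

# /var/lib/kubelet/config.yaml (assumed location)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: cgroupfs   # previously "systemd"; must match the runtime's setting
# For containerd, also set SystemdCgroup = false under
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
# in /etc/containerd/config.toml, then restart containerd and the kubelet.

Whichever driver you end up with, keep the kubelet and containerd consistent; newer kubeadm setups default to the systemd driver.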
