Use a fraction gpu resource, fail to get response from manager #164

Open
weixiujuan opened this issue Jun 7, 2022 · 5 comments

@weixiujuan

Please help me solve this problem. The details are as follows. Thank you.

The fractional GPU resources are configured as follows:

    resources:
      limits:
        tencent.com/vcuda-core: "20"
        tencent.com/vcuda-memory: "20"
      requests:
        tencent.com/vcuda-core: "20"
        tencent.com/vcuda-memory: "20"
    env:
      - name: LOGGER_LEVEL
        value: "5"

The algorithm program running in the container reports the following error:

/tmp/cuda-control/src/loader.c:1056 config file: /etc/vcuda/kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice/vcuda.config
/tmp/cuda-control/src/loader.c:1057 pid file: /etc/vcuda/kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice/pids.config
/tmp/cuda-control/src/loader.c:1061 register to remote: pod uid: tainerd.service, cont id: kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice
F0607 15:56:33.572429     158 client.go:78] fail to get response from manager, error rpc error: code = Unknown desc = can't find kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice from docker
/tmp/cuda-control/src/register.c:87 rpc client exit with 255

gpu-manager.INFO log contents are as follows:

I0607 15:56:33.571262  626706 manager.go:369] UID: tainerd.service, cont: kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice want to registration
I0607 15:56:33.571439  626706 manager.go:455] Write /etc/gpu-manager/vm/tainerd.service/kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice/pids.config
I0607 15:56:33.573392  626706 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

gpu-manager.WARNING log contents are as follows:
W0607 15:56:44.887813 626706 manager.go:290] Find orphaned pod tainerd.service

gpu-manager.ERROR and gpu-manager.FATAL contain no error logs.

My gpu-manager.yaml is as follows:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-manager
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager-role
subjects:
- kind: ServiceAccount
  name: gpu-manager
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: tencent.com/vcuda-core
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # only run on nodes that have a GPU device
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
        - image: thomassong/gpu-manager:1.1.5
          imagePullPolicy: IfNotPresent
          name: gpu-manager
          securityContext:
            privileged: true
          ports:
            - containerPort: 5678
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vdriver
              mountPath: /etc/gpu-manager/vdriver
            - name: vmdata
              mountPath: /etc/gpu-manager/vm
            - name: log
              mountPath: /var/log/gpu-manager
            - name: checkpoint
              mountPath: /etc/gpu-manager/checkpoint
            - name: run-dir
              mountPath: /var/run
            - name: cgroup
              mountPath: /sys/fs/cgroup
              readOnly: true
            - name: usr-directory
              mountPath: /usr/local/host
              readOnly: true
            - name: kube-root
              mountPath: /root/.kube
              readOnly: true
          env:
            - name: LOG_LEVEL
              value: "5"
            - name: EXTRA_FLAGS
              value: "--logtostderr=false --container-runtime-endpoint=/var/run/containerd/containerd.sock --cgroup-driver=systemd"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            type: Directory
            path: /var/lib/kubelet/device-plugins
        - name: vmdata
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vm
        - name: vdriver
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vdriver
        - name: log
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/log
        - name: checkpoint
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/checkpoint
        # We have to mount the whole /var/run directory into the container because the
        # bind-mounted docker.sock inode changes after the host docker is restarted
        - name: run-dir
          hostPath:
            type: Directory
            path: /var/run
        - name: cgroup
          hostPath:
            type: Directory
            path: /sys/fs/cgroup
        # We have to mount the whole /usr directory instead of a specific library path
        # because the library location differs across distros
        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr
        - name: kube-root
          hostPath:
            type: Directory
            path: /root/.kube
@DennisYoung96

Same here. Did you solve it?

@weixiujuan (Author)

Hi, I downgraded Kubernetes to v1.20 and it works fine. Did you solve it?

@zhichenghe

We have the same issue on Kubernetes v1.18.6.

@lynnfi

lynnfi commented Sep 9, 2023

I hit the same error. I think the cause is "--container-runtime-endpoint=/var/run/containerd/containerd.sock --cgroup-driver=systemd"; using containerd as the container runtime triggers this problem. I will try to solve it.

@lynnfi

lynnfi commented Sep 11, 2023

> Same here. Did you solve it?

I changed the k8s cgroup driver from systemd to cgroupfs, and it works well. Do not use --cgroup-driver=systemd.
The config looks like this:

env:
  - name: LOG_LEVEL
    value: "5"
  - name: EXTRA_FLAGS
    value: "--logtostderr=false --container-runtime-endpoint=/var/run/containerd/containerd.sock"
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
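
If you take this route, note that the cgroup driver has to be changed on the node itself, not only in the gpu-manager flags: the kubelet and the container runtime must agree on the driver. A minimal sketch of the kubelet side, assuming the node reads its KubeletConfiguration from /var/lib/kubelet/config.yaml (path and current values are assumptions; adjust for your setup):

# /var/lib/kubelet/config.yaml (assumed location)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: cgroupfs   # previously "systemd"; must match the runtime's setting
# For containerd, also set SystemdCgroup = false under
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
# in /etc/containerd/config.toml, then restart containerd and the kubelet.

Whichever driver you end up with, keep the kubelet and containerd consistent; newer kubeadm setups default to the systemd driver.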
