Unable to run deep learning tasks from multiple pods on the same GPU #44

Open
Lorenz5622 opened this issue Jan 15, 2025 · 0 comments

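I created two pods, each limited to 45% of GPU core utilization and 45% of GPU memory. Running deep learning training jobs in both pods at the same time fails with an NCCL error, which I traced to insufficient resource allocation. When only one pod runs the same job, nvidia-smi shows GPU utilization reaching 100%, so the GPU core utilization limit is not being enforced.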

YAML used to create the pods:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod1
spec:
  containers:
    - name: my-gpu-container
      image: nvcr.io/nvidia/pytorch:23.08-py3
      command: ["/bin/bash", "-c", "sleep 86400"]
      env:
        - name: OUT_DIR
          value: "./"
        - name: NCCL_DEBUG
          value: "INFO"
      resources:
        limits:
          memory: "20Gi"  # memory limit: 20 GiB
          cpu: 15  # CPU limit: 15 cores
          nvidia.com/gpu: 4  # request 4 GPUs
          nvidia.com/gpumem-percentage: 45  # each vGPU gets 45% of device memory (optional, integer)
          nvidia.com/gpucores: 45  # each vGPU uses 45% of the GPU's cores (optional, integer)
      volumeMounts:
        - mountPath: /datadrive
          name: datadrive-volume
  volumes:
    - name: datadrive-volume
      hostPath:  # bind mount via hostPath
        path: /  # mount the host's root directory
  hostIPC: true  # use the host's IPC namespace
  hostNetwork: true  # use the host's network namespace
  hostPID: true  # optionally share the host's PID namespace

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod2
spec:
  containers:
    - name: my-gpu-container
      image: nvcr.io/nvidia/pytorch:23.08-py3
      command: ["/bin/bash", "-c", "sleep 86400"]
      env:
        - name: OUT_DIR
          value: "./"
        - name: NCCL_DEBUG
          value: "INFO"
      resources:
        limits:
          memory: "20Gi"  # memory limit: 20 GiB
          cpu: 15  # CPU limit: 15 cores
          nvidia.com/gpu: 4  # request 4 GPUs
          nvidia.com/gpumem-percentage: 45  # each vGPU gets 45% of device memory (optional, integer)
          nvidia.com/gpucores: 45  # each vGPU uses 45% of the GPU's cores (optional, integer)
      volumeMounts:
        - mountPath: /datadrive
          name: datadrive-volume
  volumes:
    - name: datadrive-volume
      hostPath:  # bind mount via hostPath
        path: /  # mount the host's root directory
  hostIPC: true  # use the host's IPC namespace
  hostNetwork: true  # use the host's network namespace
  hostPID: true  # optionally share the host's PID namespace
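
For anyone trying to reproduce the utilization observation, the following is a minimal sketch of how the nvidia-smi readings could be sampled and compared against the 45% limits. It is not part of the original report: the helper names (`parse_gpu_stats`, `over_limit`, `sample`) are hypothetical, and it assumes `nvidia-smi` is on the PATH and reports numeric values for every queried field. Note that `utilization.gpu` is a device-wide figure, so with several pods sharing a GPU it reflects their combined load, and a single sample may not be representative.

```python
import subprocess
import time

# Fields requested from nvidia-smi; memory values are reported in MiB.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output, one GPU per line."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (float(field) for field in line.split(","))
        stats.append({"util_pct": util, "mem_pct": 100.0 * used / total})
    return stats


def over_limit(stats, core_limit_pct, mem_limit_pct):
    """Return the indices of GPUs whose usage exceeds either limit."""
    return [
        i
        for i, s in enumerate(stats)
        if s["util_pct"] > core_limit_pct or s["mem_pct"] > mem_limit_pct
    ]


def sample():
    """Run nvidia-smi once and return the parsed per-GPU readings."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_stats(out)


if __name__ == "__main__":
    # Take several samples while the training job is running.
    for _ in range(10):
        stats = sample()
        print(stats, "over limit:", over_limit(stats, 45, 45))
        time.sleep(1)
```

Running this on the host while a single pod trains would show whether utilization stays above the configured 45% across samples, which is the behavior described above.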