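I created two pods, each limited to 45% GPU core utilization and 45% GPU memory. Running deep-learning training jobs on both pods at the same time failed with NCCL errors, which we traced to insufficient resource allocation. When only one pod ran the same job, nvidia-smi showed GPU utilization reaching 100%, so the GPU core utilization limit was not being enforced.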
The YAML used to create the pods is attached:
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod1
spec:
  containers:
    - image: nvcr.io/nvidia/pytorch:23.08-py3
      command: ["/bin/bash", "-c", "sleep 86400"]
      env:
        - value: "./"
        - value: "INFO"
      resources:
        limits:
          memory: "20Gi" # memory limit: 20 GiB
          cpu: 15 # CPU limit: 15 cores
          nvidia.com/gpu: 4 # request 4 GPUs
          nvidia.com/gpumem-percentage: 45 # each vGPU gets 45% of device memory (Optional, Integer)
          nvidia.com/gpucores: 45 # each vGPU uses 45% of the entire GPU (Optional, Integer)
      volumeMounts:
        - name: datadrive-volume
  volumes:
    - hostPath: # bind mount via hostPath
        path: / # mount the host's root directory
  hostIPC: true # use the host's IPC namespace
  hostNetwork: true # use the host's network namespace
  hostPID: true # optionally enable the host's PID namespace
---
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod2
spec:
  containers:
    - image: nvcr.io/nvidia/pytorch:23.08-py3
      command: ["/bin/bash", "-c", "sleep 86400"]
      env:
        - value: "./"
        - value: "INFO"
      resources:
        limits:
          memory: "20Gi" # memory limit: 20 GiB
          cpu: 15 # CPU limit: 15 cores
          nvidia.com/gpu: 4 # request 4 GPUs
          nvidia.com/gpumem-percentage: 45 # each vGPU gets 45% of device memory (Optional, Integer)
          nvidia.com/gpucores: 45 # each vGPU uses 45% of the entire GPU (Optional, Integer)
      volumeMounts:
        - name: datadrive-volume
  volumes:
    - hostPath: # bind mount via hostPath
        path: / # mount the host's root directory
  hostIPC: true # use the host's IPC namespace
  hostNetwork: true # use the host's network namespace
  hostPID: true # optionally enable the host's PID namespace
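
For anyone trying to reproduce the single-pod observation, here is a rough sketch of how the nvidia-smi reading could be compared against the 45% limits from the YAML above. It is not part of the original report: the function names (`read_nvidia_smi`, `parse_gpu_stats`, `gpus_over_limit`) are hypothetical helpers, and it assumes nvidia-smi is on the PATH. Note that `utilization.gpu` is a device-wide, time-sampled figure, so it only approximates one pod's share when that pod is the sole workload on the GPU.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; these are standard --query-gpu properties.
QUERY = "index,utilization.gpu,memory.used,memory.total"


def read_nvidia_smi():
    """Return raw CSV text from nvidia-smi (requires the NVIDIA driver)."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpu_stats(text):
    """Parse CSV rows into dicts with core and memory utilization percentages."""
    stats = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        index, util, used, total = (field.strip() for field in row)
        stats.append({
            "index": int(index),
            "core_pct": float(util),
            "mem_pct": 100.0 * float(used) / float(total),
        })
    return stats


def gpus_over_limit(stats, core_limit=45, mem_limit=45):
    """Return indices of GPUs whose usage exceeds either limit (in percent)."""
    return [s["index"] for s in stats
            if s["core_pct"] > core_limit or s["mem_pct"] > mem_limit]


if __name__ == "__main__":
    # Defaults mirror the 45% gpucores / gpumem-percentage values in the pod spec.
    print(gpus_over_limit(parse_gpu_stats(read_nvidia_smi())))
```

Running this while the training job is active in a single pod lists the GPU indices whose device-level reading exceeds the configured limits. A brief spike above 45% is not conclusive on its own; sampling repeatedly over the course of the job gives a fairer picture.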