GPU sharing on cuda compute capability >=7.5 #231

guptaNswati · 2025-01-24T00:12:18Z

This PR adds a check that allows GPU sharing only on GPUs with a CUDA compute capability of 7.5 or higher. On GPUs below that threshold, both time-slicing and MPS are skipped. Referencing these two issues and the related MR:

#41
https://github.com/NVIDIA/cloud-native-team/issues/97
https://github.com/NVIDIA/cloud-native-team/issues/96
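
For reviewers who want to reason about the threshold without reading the diff, here is a minimal sketch of the comparison described above. It is not the code in this PR: `computeCapability`, `atLeast`, and `checkSharingSupported` are hypothetical names, the UUIDs are placeholders, and the sketch assumes the major and minor numbers have already been read from the device (for example through NVML). The error text is modeled on the kubelet event shown later in this description.

```go
package main

import "fmt"

// computeCapability is a hypothetical helper type, not taken from the PR.
type computeCapability struct {
	Major int
	Minor int
}

// minSharingCapability mirrors the 7.5 threshold from the PR title.
var minSharingCapability = computeCapability{Major: 7, Minor: 5}

// atLeast reports whether c is greater than or equal to other,
// comparing the major number first and the minor number second.
func (c computeCapability) atLeast(other computeCapability) bool {
	if c.Major != other.Major {
		return c.Major > other.Major
	}
	return c.Minor >= other.Minor
}

// checkSharingSupported returns an error for GPUs below the threshold.
// The message is modeled on the kubelet event quoted in this PR.
func checkSharingSupported(uuid string, c computeCapability) error {
	if !c.atLeast(minSharingCapability) {
		return fmt.Errorf("GPU sharing is not available on this device UUID=%s", uuid)
	}
	return nil
}

func main() {
	// Placeholder devices for illustration only.
	devices := []struct {
		uuid string
		cc   computeCapability
	}{
		{"GPU-example-0", computeCapability{Major: 5, Minor: 2}},
		{"GPU-example-1", computeCapability{Major: 7, Minor: 5}},
		{"GPU-example-2", computeCapability{Major: 10, Minor: 0}},
	}
	for _, d := range devices {
		if err := checkSharingSupported(d.uuid, d.cc); err != nil {
			fmt.Println("skip sharing:", err)
			continue
		}
		fmt.Printf("sharing allowed on %s (%d.%d)\n", d.uuid, d.cc.Major, d.cc.Minor)
	}
}
```

Comparing the two integers separately avoids the pitfalls of comparing the capability as a string or a float, where a value such as 10.0 could sort below 7.5.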

Tested on a GeForce 980 and a Titan.

Logs when called on incompatible GPUs:
$ kubectl logs nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-xbnr2 -n nvidia

I0130 23:08:07.073619       1 driver.go:108] NodeUnprepareResource is called: number of claims: 1
E0130 23:08:07.123606       1 nvlib.go:534] 
Failed to set timeslice policy with value Default for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported

No MPS server running:
$ kubectl apply -f demo/specs/quickstart/gpu-test-mps.yaml

$ kubectl get pods -A
NAMESPACE            NAME                                                           READY   STATUS              RESTARTS   AGE
gpu-test-mps         test-pod                                                       0/2     ContainerCreating   0          31m
kube-system          coredns-668d6bf9bc-hwhxl                                       1/1     Running             0          34m
kube-system          coredns-668d6bf9bc-rb964                                       1/1     Running             0          34m
kube-system          etcd-k8s-dra-driver-cluster-control-plane                      1/1     Running             0          34m
kube-system          kindnet-gxfdc                                                  1/1     Running             0          34m
kube-system          kindnet-r88xt                                                  1/1     Running             0          34m
kube-system          kube-apiserver-k8s-dra-driver-cluster-control-plane            1/1     Running             0          34m
kube-system          kube-controller-manager-k8s-dra-driver-cluster-control-plane   1/1     Running             0          34m
kube-system          kube-proxy-m7m4t                                               1/1     Running             0          34m
kube-system          kube-proxy-tx7bp                                               1/1     Running             0          34m
kube-system          kube-scheduler-k8s-dra-driver-cluster-control-plane            1/1     Running             0          34m
local-path-storage   local-path-provisioner-58cc7856b6-x77dz                        1/1     Running             0          34m
nvidia               nvidia-dra-driver-k8s-dra-driver-controller-844fcb94b-66wkq    1/1     Running             0          32m
nvidia               nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-vqhfg          1/1     Running             0          32m

$ kubectl logs nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-vqhfg  -n nvidia
I0131 00:51:41.457384       1 device_state.go:73] using devRoot=/driver-root
I0131 00:52:26.105473       1 driver.go:97] NodePrepareResource is called: number of claims: 1
I0131 00:53:34.078698       1 driver.go:97] NodePrepareResource is called: number of claims: 1

$ kubectl get pods -n gpu-test-mps
NAME       READY   STATUS              RESTARTS   AGE
test-pod   0/2     ContainerCreating   0          29m

$ kubectl describe pod test-pod -n gpu-test-mps
Warning  FailedPrepareDynamicResources  31s (x25 over 30m)  kubelet            Failed to prepare dynamic resources: NodePrepareResources failed for claim gpu-test-mps/test-pod-shared-gpu-wfk6r: error preparing devices for claim 84b5789b-1f09-4d93-a3d3-a9fb61542cf9: prepare devices failed: error applying GPU config: GPU sharing is not available on this device UUID=GPU-34e8d7ba-0e4d-ac00-6852-695d5d404f51

Signed-off-by: Swati Gupta <[email protected]>

guptaNswati · 2025-01-31T01:30:36Z

cc @elezar PTAL as you also reviewed #58

guptaNswati changed the title ~~Draft:MPS on cuda compute capability >3.5~~ Draft: MPS on cuda compute capability >3.5 Jan 24, 2025

GPU sharing on cuda compute capability >=7.5

86de1cb

Signed-off-by: Swati Gupta <[email protected]>

guptaNswati force-pushed the when-to-startMPS branch from 58f6bfa to 86de1cb Compare January 31, 2025 01:14

guptaNswati requested a review from klueska January 31, 2025 01:28

guptaNswati changed the title ~~Draft: MPS on cuda compute capability >3.5~~ GPU sharing on cuda compute capability >=7.5 Jan 31, 2025

guptaNswati requested a review from elezar January 31, 2025 01:29
