GPU sharing on cuda compute capability >=7.5 #231

Open. Wants to merge 1 commit into base: main

@guptaNswati (Contributor) commented on Jan 24, 2025

This adds a check that allows GPU sharing only on GPUs with a CUDA compute capability of 7.5 or higher. On GPUs below that threshold, both time-slicing and MPS are skipped. It references these 2 issues and the related MR:

#41
https://github.com/NVIDIA/cloud-native-team/issues/97
https://github.com/NVIDIA/cloud-native-team/issues/96

Tested on GeForce 980 and Titan.
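
The threshold comparison described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `supportsSharing` and the two constants are hypothetical names, and in a real driver the major and minor values would come from the device itself (NVML exposes them through `nvmlDeviceGetCudaComputeCapability`) rather than being passed in by hand.

```go
package main

import "fmt"

// Minimum CUDA compute capability for GPU sharing, per the PR description (7.5).
// The constant names are assumptions made for this sketch.
const (
	minSharingMajor = 7
	minSharingMinor = 5
)

// supportsSharing reports whether a GPU with the given compute capability
// meets the 7.5 minimum. Hypothetical helper, not taken from the driver.
func supportsSharing(major, minor int) bool {
	if major != minSharingMajor {
		// A different major version decides the result on its own.
		return major > minSharingMajor
	}
	// Same major version: compare the minor version.
	return minor >= minSharingMinor
}

func main() {
	// 5.2 is the compute capability of a GeForce GTX 980.
	fmt.Println(supportsSharing(5, 2)) // false
	fmt.Println(supportsSharing(7, 0)) // false
	fmt.Println(supportsSharing(7, 5)) // true
	fmt.Println(supportsSharing(8, 0)) // true
}
```

Comparing major and minor separately avoids the pitfalls of parsing the capability as a floating-point number.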

Logs when called on incompatible GPUs:
$ kubectl logs nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-xbnr2 -n nvidia

I0130 23:08:07.073619       1 driver.go:108] NodeUnprepareResource is called: number of claims: 1
E0130 23:08:07.123606       1 nvlib.go:534] 
Failed to set timeslice policy with value Default for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported

No MPS server running:
$ kubectl apply -f demo/specs/quickstart/gpu-test-mps.yaml

$ kubectl get pods -A
NAMESPACE            NAME                                                           READY   STATUS              RESTARTS   AGE
gpu-test-mps         test-pod                                                       0/2     ContainerCreating   0          31m
kube-system          coredns-668d6bf9bc-hwhxl                                       1/1     Running             0          34m
kube-system          coredns-668d6bf9bc-rb964                                       1/1     Running             0          34m
kube-system          etcd-k8s-dra-driver-cluster-control-plane                      1/1     Running             0          34m
kube-system          kindnet-gxfdc                                                  1/1     Running             0          34m
kube-system          kindnet-r88xt                                                  1/1     Running             0          34m
kube-system          kube-apiserver-k8s-dra-driver-cluster-control-plane            1/1     Running             0          34m
kube-system          kube-controller-manager-k8s-dra-driver-cluster-control-plane   1/1     Running             0          34m
kube-system          kube-proxy-m7m4t                                               1/1     Running             0          34m
kube-system          kube-proxy-tx7bp                                               1/1     Running             0          34m
kube-system          kube-scheduler-k8s-dra-driver-cluster-control-plane            1/1     Running             0          34m
local-path-storage   local-path-provisioner-58cc7856b6-x77dz                        1/1     Running             0          34m
nvidia               nvidia-dra-driver-k8s-dra-driver-controller-844fcb94b-66wkq    1/1     Running             0          32m
nvidia               nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-vqhfg          1/1     Running             0          32m

$ kubectl logs nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-vqhfg  -n nvidia
I0131 00:51:41.457384       1 device_state.go:73] using devRoot=/driver-root
I0131 00:52:26.105473       1 driver.go:97] NodePrepareResource is called: number of claims: 1
I0131 00:53:34.078698       1 driver.go:97] NodePrepareResource is called: number of claims: 1

$ kubectl get pods -n gpu-test-mps
NAME       READY   STATUS              RESTARTS   AGE
test-pod   0/2     ContainerCreating   0          29m

$ kubectl describe pod test-pod -n gpu-test-mps
Warning  FailedPrepareDynamicResources  31s (x25 over 30m)  kubelet            Failed to prepare dynamic resources: NodePrepareResources failed for claim gpu-test-mps/test-pod-shared-gpu-wfk6r: error preparing devices for claim 84b5789b-1f09-4d93-a3d3-a9fb61542cf9: prepare devices failed: error applying GPU config: GPU sharing is not available on this device UUID=GPU-34e8d7ba-0e4d-ac00-6852-695d5d404f51

@guptaNswati changed the title from "Draft:MPS on cuda compute capability >3.5" to "Draft: MPS on cuda compute capability >3.5" on Jan 24, 2025
@guptaNswati requested a review from klueska on January 31, 2025 01:28
@guptaNswati changed the title from "Draft: MPS on cuda compute capability >3.5" to "GPU sharing on cuda compute capability >=7.5" on Jan 31, 2025
@guptaNswati requested a review from elezar on January 31, 2025 01:29
@guptaNswati (Contributor, Author) commented

cc @elezar PTAL as you also reviewed #58
