
Add rbac for deployment #230

Merged · 2 commits into NVIDIA:main · Jan 31, 2025

Conversation

@guptaNswati (Contributor) commented:

Fixes #229. cc @dekonnection

$ kubectl auth can-i get deployments --as=system:serviceaccount:nvidia-dra-driver:nvidia-dra-driver-k8s-dra-driver-service-account -n nvidia-dra-driver
yes

Signed-off-by: Swati Gupta <[email protected]>
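
For context, the regression in #229 comes down to the driver's service account lacking read access to Deployments. A minimal sketch of the kind of namespaced Role rule involved, matching the names used later in this thread (the name is illustrative and the exact verbs beyond get are an assumption, not the verbatim chart template):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  # Illustrative name, taken from the role created later in this thread
  name: nvidia-dra-driver-k8s-dra-driver-app-role
  namespace: nvidia-dra-driver
rules:
  # The error in #229 is a forbidden "get" on deployments in the "apps" group
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get"]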
@dekonnection commented:

Hi, I tried your branch; it fixed the initial error, but now I get this one:

Failed to prepare dynamic resources: NodePrepareResources failed for claim lab/mps-gpu-7c9db8954b-mbwj5-mps-gpus-682s2: error preparing devices for claim 37925188-b216-4aa3-8ca7-f32cf28476ae: prepare devices failed: error applying GPU config: MPS control daemon is not yet ready: error listing pods from deployment

@guptaNswati (Contributor, Author) commented:

@dekonnection I added permissions for jobs and pods. Hopefully that fixes the errors.
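
A sketch of the two additional entries, extending the rules list in the Role above (pods live in the core API group, jobs in batch; the verbs are an assumption inferred from the "error listing pods from deployment" message, not the exact chart change):

  # "error listing pods from deployment" points at list access on pods
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list"]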

@guptaNswati (Contributor, Author) commented:

A quick test with the added permissions.

I updated the demo/specs/quickstart/gpu-test-mps.yaml spec to set:

restartPolicy: "Never"
  image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
  command: ["bash", "-c"]
  args: ["/tmp/sample -benchmark -i=5000"]
 
$ kubectl apply -f demo/specs/quickstart/gpu-test-mps.yaml 
 
$ kubectl get pods -n gpu-test-mps 
NAME       READY   STATUS              RESTARTS   AGE
test-pod   0/2     ContainerCreating   0          13m
 
$ kubectl describe pod test-pod -n gpu-test-mps
   Warning  FailedPrepareDynamicResources  64s (x2 over 2m13s)  kubelet            Failed to prepare dynamic resources: NodePrepareResources failed for claim gpu-test-mps/test-pod-shared-gpu-stwn2: error preparing devices for claim 7d607266-069f-4162-90d6-6a6ed85ea459: prepare devices failed: error applying GPU config: error starting MPS control daemon: error checking if control daemon already started: failed to get deployment: deployments.apps "mps-control-daemon-7d607266-069f-4162-90d6-6a6ed85ea459-77bf4" is forbidden: User "system:serviceaccount:nvidia:nvidia-dra-driver-k8s-dra-driver-service-account" cannot get resource "deployments" in API group "apps" in the namespace "nvidia"
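
The forbidden error names the nvidia namespace, so the same can-i probe used elsewhere in this thread can confirm the gap there before applying the RBAC fix (at this point it should print no):

$ kubectl auth can-i get deployments --as=system:serviceaccount:nvidia:nvidia-dra-driver-k8s-dra-driver-service-account -n nvidia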

$ kubectl apply -f role.yaml
role.rbac.authorization.k8s.io/nvidia-dra-driver-k8s-dra-driver-app-role created

$ kubectl apply -f rolebinding.yaml
rolebinding.rbac.authorization.k8s.io/nvidia-dra-driver-k8s-dra-driver-app-role-binding created
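
A sketch of what that rolebinding.yaml plausibly contains, with the object names taken from the apply output above and the subject taken from the can-i checks below (an assumption about the file's contents, not a verbatim copy):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nvidia-dra-driver-k8s-dra-driver-app-role-binding
  namespace: nvidia-dra-driver
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: nvidia-dra-driver-k8s-dra-driver-app-role
subjects:
  # Service account probed in the kubectl auth can-i checks below
  - kind: ServiceAccount
    name: nvidia-dra-driver-k8s-dra-driver-service-account
    namespace: nvidia-dra-driver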

$ kubectl auth can-i get pods --as=system:serviceaccount:nvidia-dra-driver:nvidia-dra-driver-k8s-dra-driver-service-account -n nvidia-dra-driver
yes

$ kubectl auth can-i get job --as=system:serviceaccount:nvidia-dra-driver:nvidia-dra-driver-k8s-dra-driver-service-account -n nvidia-dra-driver
yes

$ kubectl auth can-i get deployments --as=system:serviceaccount:nvidia-dra-driver:nvidia-dra-driver-k8s-dra-driver-service-account -n nvidia-dra-driver
yes

$ kubectl get pods -n gpu-test-mps
NAME       READY   STATUS      RESTARTS   AGE
test-pod   0/2     Completed   0          6m38s

$ kubectl logs test-pod -n gpu-test-mps
Defaulted container "mps-ctr0" out of: mps-ctr0, mps-ctr1
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
	-fullscreen       (run n-body simulation in fullscreen mode)
	-fp64             (use double precision floating point values for simulation)
	-hostmem          (stores simulation data in host memory)
	-benchmark        (run benchmark to measure performance) 
	-numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
	-device=<d>       (where d=0,1,2.... for the CUDA device to use)
	-numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
	-compare          (compares simulation results running once on the default GPU and once on the CPU)
	-cpu              (run n-body simulation on the CPU)
	-tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 9.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 9.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 9.0

> Compute 9.0 CUDA device: [NVIDIA GH200 96GB HBM3]
67584 bodies, total time for 5000 iterations: 26196.221 ms
= 871.805 billion interactions per second
= 17436.092 single-precision GFLOP/s at 20 flops per interaction

$ kubectl logs test-pod -c mps-ctr1 -n gpu-test-mps
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
	-fullscreen       (run n-body simulation in fullscreen mode)
	-fp64             (use double precision floating point values for simulation)
	-hostmem          (stores simulation data in host memory)
	-benchmark        (run benchmark to measure performance) 
	-numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
	-device=<d>       (where d=0,1,2.... for the CUDA device to use)
	-numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
	-compare          (compares simulation results running once on the default GPU and once on the CPU)
	-cpu              (run n-body simulation on the CPU)
	-tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 9.0 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 9.0 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 9.0

> Compute 9.0 CUDA device: [NVIDIA GH200 96GB HBM3]
67584 bodies, total time for 5000 iterations: 26197.629 ms
= 871.758 billion interactions per second
= 17435.154 single-precision GFLOP/s at 20 flops per interaction

@guptaNswati merged commit fe64609 into NVIDIA:main on Jan 31, 2025. 6 checks passed.
Closes #229: Regression on MPS setups following changes in clusterrole