GPU Driver Container Won't Start #54

Closed
BHSDuncan opened this issue Mar 26, 2024 · 4 comments

@BHSDuncan

I'm essentially seeing the problem described in NVIDIA/gpu-operator#564: the GPU driver container won't start when I boot my machine, which runs a cluster with an older version of CNS installed (something like 9.x).

Because I'm using one of the playbooks from this repo, I'm not sure how to resolve this issue.

I'm also unsure why the issue is happening only now. I've been running this on a machine since last fall, but the issue linked above predates that.

Will updating to the latest CNS version solve this issue, or will it still be a problem, given that the install.sh and Dockerfile(s) look pretty much the same? (I'll probably try this on a test box anyway, but I wanted to ask here as well.)

Thank you.

@angudadevops
Contributor

@BHSDuncan I would recommend trying CNS 10.4 or CNS 11.1: set the cns_nvidia_driver: yes flag in the cns_values_10.4.yaml or cns_values_11.1.yaml file and trigger the installation. This installs the native TRD driver on the host, which works with the latest kernel.
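
As a rough sketch of the setting mentioned above, the line in the values file would look something like the following. This assumes the flag is a top-level key; the rest of the file's layout is not shown in this thread, so every other key should be left as it is.

```yaml
# Excerpt of cns_values_11.1.yaml (or cns_values_10.4.yaml).
# Assumption: cns_nvidia_driver is a top-level key; all other keys stay unchanged.
cns_nvidia_driver: yes   # install the NVIDIA driver on the host
```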

If you want the driver deployed as part of the GPU Operator, I would recommend waiting to hear from the GPU Operator team.

@BHSDuncan
Author

But that will install a driver on the host itself, right? I'd prefer to avoid installing anything on the machine and keep the driver in the cluster. For that, you're saying I'll need to wait for the GPU Operator team? If so, they've made it known they're working on a fix. Once the fix is in place, will the CNS playbooks need updating?

@angudadevops
Contributor

Yes, see the comment at NVIDIA/gpu-operator#564 (comment).

So the current Operator is fixed for the latest kernel. We will validate it with CNS, make any changes CNS requires, and let you know.

@angudadevops
Contributor

@BHSDuncan CNS has been updated with the new Operator version. Please check CNS version 11.3 and let us know.
