Nvidia driver issue with Amazon EKS optimized accelerated Amazon Linux AMIs #666

Closed
xyfleet opened this issue Feb 1, 2024 · 3 comments

xyfleet commented Feb 1, 2024

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Amazon Linux 2
  • Kernel Version: 5.4.254-170.358.amzn2.x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): EKS
  • GPU Operator Version: v23.3.1

2. Issue or feature description

I am using the Amazon EKS optimized accelerated Amazon Linux AMIs to build a managed node group in an EKS cluster to support GPUs.
(https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html#gpu-ami)

Based on the AWS docs, all of the optimized accelerated Amazon Linux AMIs have the NVIDIA driver pre-installed, so I disabled driver installation when installing the GPU Operator. (Even when I enabled driver installation in the GPU Operator, it did not overwrite the existing driver on the host.)
The issue is that there is no way to upgrade the NVIDIA driver on the host when I upgrade the GPU Operator.
For example, the NVIDIA driver on my existing GPU host is 470.182.03, but the newer driver that ships with the GPU Operator is 535.129.03.
After upgrading the GPU Operator to the latest version, v23.9.1, the driver on the host is still 470.182.03.

I thought the GPU Operator could help me manage the NVIDIA driver on the host in this situation, but it did not.
Is it possible to upgrade the NVIDIA driver by upgrading the GPU Operator? How?
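
For reference, the host driver version can be confirmed directly on the node with a standard nvidia-smi query (a minimal sketch; this is the same node shown in section 4):

[root@ip-10-10-20-111 /]# nvidia-smi --query-gpu=driver_version --format=csv,noheader
470.182.03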

3. Steps to reproduce the issue

1: Build an EKS cluster and a managed node group with the optimized accelerated Amazon Linux AMIs.
2: Install the GPU Operator with its Helm chart and disable driver installation (see the command sketch after these steps):

driver:
  enabled: false

3: Upgrade the GPU Operator and check the NVIDIA driver.
The driver version does not change after upgrading the GPU Operator.
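
A rough sketch of the Helm commands used for steps 2 and 3 (release name and namespace are placeholders; the values snippet above sets driver.enabled to false):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

# Step 2: install the operator and leave the host driver in place
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --version v23.3.1 \
  --set driver.enabled=false

# Step 3: upgrade the operator; the host driver stays at 470.182.03
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --version v23.9.1 \
  --set driver.enabled=false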

4. Information to attach (optional if deemed irrelevant)

[root@ip-10-10-20-111 /]# nvidia-smi -q | head

==============NVSMI LOG==============

Timestamp                                 : Fri Jan 26 23:11:04 2024
Driver Version                            : 470.182.03
CUDA Version                              : 11.4

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Product Name                          : Tesla T4

xyfleet commented Feb 12, 2024

Can anyone help? Thanks.

cdesiniotis (Contributor) commented

@xyfleet the GPU Operator does not manage the driver if it is installed on the host.

One option is to create a self-managed node group and pick an AMI that does not have any NVIDIA software installed. That way, the GPU Operator can manage the lifecycle of the driver and you can get the latest version. See our documentation here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/amazon-eks.html
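
For example, a rough sketch of an eksctl config for a self-managed node group (cluster name, region, instance type, and AMI ID below are placeholders; pick an AMI without NVIDIA components, then install the operator with driver.enabled left at its default of true):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster           # placeholder
  region: us-west-2          # placeholder
nodeGroups:                  # self-managed node group (not managedNodeGroups)
  - name: gpu-nodes
    instanceType: g4dn.xlarge
    desiredCapacity: 1
    amiFamily: AmazonLinux2
    ami: ami-xxxxxxxxxxxxxxxxx   # a non-accelerated Amazon Linux 2 AMI (placeholder)
    overrideBootstrapCommand: |
      #!/bin/bash
      /etc/eks/bootstrap.sh my-cluster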

mikemckiernan (Member) commented

Thank you @xyfleet for using the NVIDIA GPU Operator and reporting the challenge. Thank you to Chris for providing the answer. Closing this issue.
