Nvidia driver issue with Amazon EKS optimized accelerated Amazon Linux AMIs #666

Closed
xyfleet opened this issue Feb 1, 2024 · 3 comments

xyfleet commented Feb 1, 2024

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Amazon Linux 2
  • Kernel Version: 5.4.254-170.358.amzn2.x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): EKS
  • GPU Operator Version: v23.3.1

2. Issue or feature description

I am using the Amazon EKS optimized accelerated Amazon Linux AMIs to build a managed node group in an EKS cluster to support GPUs.
(https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html#gpu-ami)

Based on the AWS docs, all of the optimized accelerated Amazon Linux AMIs have the NVIDIA driver pre-installed, so I disabled driver installation when installing the GPU Operator. (Even when I enabled driver installation in the GPU Operator, it did not overwrite the existing driver on the host.)
The issue is that there is no way to upgrade the NVIDIA driver on the host when I upgrade the GPU Operator.
For example, the NVIDIA driver on my existing GPU host is 470.182.03, but the newer driver that ships with the GPU Operator is 535.129.03.
After upgrading the GPU Operator to the latest version, v23.9.1, the driver on the host is still 470.182.03.

I thought the GPU Operator could help me manage the NVIDIA driver on the host in this situation, but it did not.
Is it possible to upgrade the NVIDIA driver by upgrading the GPU Operator? How?
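
For reference, the host driver version can be confirmed directly on the node with a standard nvidia-smi query (a minimal sketch; this is the same node shown in section 4):

[root@ip-10-10-20-111 /]# nvidia-smi --query-gpu=driver_version --format=csv,noheader
470.182.03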

3. Steps to reproduce the issue

1: Build an EKS cluster and a managed node group with the optimized accelerated Amazon Linux AMIs.
2: Install the GPU Operator with its Helm chart and disable driver installation (see the command sketch after these steps):

driver:
  enabled: false

3: Upgrade the GPU Operator and check the NVIDIA driver.
The driver version does not change after upgrading the GPU Operator.
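
A rough sketch of the Helm commands used for steps 2 and 3 (release name and namespace are placeholders; the values snippet above sets driver.enabled to false):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

# Step 2: install the operator and leave the host driver in place
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --version v23.3.1 \
  --set driver.enabled=false

# Step 3: upgrade the operator; the host driver stays at 470.182.03
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --version v23.9.1 \
  --set driver.enabled=false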

4. Information to attach (optional if deemed irrelevant)

[root@ip-10-10-20-111 /]# nvidia-smi -q | head

==============NVSMI LOG==============

Timestamp                                 : Fri Jan 26 23:11:04 2024
Driver Version                            : 470.182.03
CUDA Version                              : 11.4

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Product Name                          : Tesla T4

xyfleet commented Feb 12, 2024

Can anyone help? Thanks.

cdesiniotis (Contributor) commented

@xyfleet the GPU Operator does not manage the driver if it is installed on the host.

One option is to create a self-managed node group and pick an AMI that does not have any NVIDIA software installed. That way, the GPU Operator can manage the lifecycle of the driver and you can get the latest version. See our documentation here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/amazon-eks.html
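
For example, a rough sketch of an eksctl config for a self-managed node group (cluster name, region, instance type, and AMI ID below are placeholders; pick an AMI without NVIDIA components, then install the operator with driver.enabled left at its default of true):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster           # placeholder
  region: us-west-2          # placeholder
nodeGroups:                  # self-managed node group (not managedNodeGroups)
  - name: gpu-nodes
    instanceType: g4dn.xlarge
    desiredCapacity: 1
    amiFamily: AmazonLinux2
    ami: ami-xxxxxxxxxxxxxxxxx   # a non-accelerated Amazon Linux 2 AMI (placeholder)
    overrideBootstrapCommand: |
      #!/bin/bash
      /etc/eks/bootstrap.sh my-cluster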

mikemckiernan (Member) commented

Thank you @xyfleet for using the NVIDIA GPU Operator and reporting the challenge. Thank you to Chris for providing the answer. Closing this issue.
