Add CUDA forward compatibility hook #948

elezar · 2025-02-27T15:30:24Z

With #877 the default behaviour of the NVIDIA Container Runtime / NVIDIA Container Runtime Hook was changed so that compat libraries from the container are no longer mounted into the container. This removed "automatic" support for CUDA Forward Compatibility.

This change attempts to address this by adding a createContainer hook that creates a file in /etc/ld.so.conf.d/ in the container, so that the /usr/local/cuda/compat libraries are added to the ldcache ahead of the libraries mounted from the host. The provided host driver version is compared to the version of the compat libraries in the container, and the config update is only performed if the compat libraries are newer than the host driver.

Note that the hook only creates a file in the container's file system and does not perform any mount operations. This means that this mechanism does not present the same vulnerabilities that caused CVE-2024-0132 and CVE-2025-23359.

In the case of the legacy runtime, this behaviour is only triggered if the allow-cuda-compat-libs-from-container feature flag is not enabled. The CDI spec generation has also been extended to include this hook.

This backports #906

This change adds an nvidia-cdi-hook enable-cuda-compat hook that checks the container for CUDA compat libs and updates /etc/ld.so.conf.d to include their parent folder if their driver major version is sufficient. This allows CUDA Forward Compatibility to be used when it is not available through libnvidia-container.

Signed-off-by: Evan Lezar <[email protected]>

This change adds the enable-cuda-compat hook to the incoming OCI runtime spec if the allow-cuda-compat-libs-from-container feature flag is not enabled. An update-ldcache hook is also injected to ensure that the required folders are processed.

Signed-off-by: Evan Lezar <[email protected]>

Signed-off-by: Evan Lezar <[email protected]>
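As a rough illustration of the conditional injection described in the commit messages above, the sketch below appends the two hooks to a simplified spec only when the feature flag is off. The `Hook` and `Spec` types are stand-ins invented for this example rather than the real OCI runtime-spec types, `addCompatHooks` is a hypothetical name, and the argument list for `update-ldcache` is an assumption (the real hook may take additional flags). The hook path and the `--host-driver-version` flag are taken from the CDI example quoted later in this thread.

```go
package main

import "fmt"

// Hook is a simplified stand-in for an OCI runtime hook (hypothetical type).
type Hook struct {
	Path string
	Args []string
}

// Spec is a simplified stand-in for the OCI runtime spec; only the
// createContainer hooks are modelled (hypothetical type).
type Spec struct {
	CreateContainerHooks []Hook
}

const hookPath = "/usr/bin/nvidia-cdi-hook"

// addCompatHooks appends an enable-cuda-compat hook followed by an
// update-ldcache hook, unless the allow-cuda-compat-libs-from-container
// feature flag is enabled.
func addCompatHooks(spec *Spec, allowCompatLibsFromContainer bool, hostDriverVersion string) {
	if allowCompatLibsFromContainer {
		// The feature flag is set, so the new hooks are not injected.
		return
	}
	spec.CreateContainerHooks = append(spec.CreateContainerHooks,
		Hook{
			Path: hookPath,
			Args: []string{"nvidia-cdi-hook", "enable-cuda-compat", "--host-driver-version=" + hostDriverVersion},
		},
		// Assumption: update-ldcache is shown without any flags.
		Hook{
			Path: hookPath,
			Args: []string{"nvidia-cdi-hook", "update-ldcache"},
		},
	)
}

func main() {
	spec := &Spec{}
	addCompatHooks(spec, false, "570.153.02")
	for _, h := range spec.CreateContainerHooks {
		fmt.Println(h.Path, h.Args)
	}
}
```

Appending in this order keeps the ldcache update after the config file has been written, which matches the intent stated in the commit message.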

luodw · 2025-05-24T14:56:15Z

internal/runtime/runtime_factory.go

 	default:
-		return []string{"mode", "graphics", "feature-gated"}
+		return []string{"feature-gated", "graphics", "mode"}


@elezar Hi, I have a question here. With this modifier order, the createContainer hooks are created in the order ["enable-cuda-compat", "update-ldcache", "create-symlinks"]. Since "update-ldcache" runs before "create-symlinks", if the "create-symlinks" hook bind-mounts shared libraries (.so files) into the container, will those libraries not be added to the ldcache?

NVIDIA Container Toolkit 1.17.5 requires Go >= 1.22 [1], and starts using enable-cuda-compat hooks in the Container Device Interface specification generated by it [2]. For example:

    "hookName": "createContainer",
    "path": "/usr/bin/nvidia-cdi-hook",
    "args": [
      "nvidia-cdi-hook",
      "enable-cuda-compat",
      "--host-driver-version=570.153.02"
    ]

The new hook makes it possible to have containers with a /usr/local/cuda/compat/libcuda.so.* that's newer than the proprietary NVIDIA driver on the host operating system, so that applications can use a newer CUDA without having to update the driver [3].

Even though this sounds useful, the hook has been disabled until it's handled by the 'init-container' command and there's a clear way to test it.

The src/go.sum file was updated with 'go mod tidy'.

[1] NVIDIA Container Toolkit commit 5bdf14b1e7c24763
    NVIDIA/nvidia-container-toolkit@5bdf14b1e7c24763
    NVIDIA/nvidia-container-toolkit#941
    NVIDIA/nvidia-container-toolkit#950

[2] NVIDIA Container Toolkit commit 76040ff2ad63fb82
    NVIDIA/nvidia-container-toolkit@76040ff2ad63fb82
    NVIDIA/nvidia-container-toolkit#906
    NVIDIA/nvidia-container-toolkit#948

[3] https://docs.nvidia.com/deploy/cuda-compatibility/

containers#1662

elezar added 3 commits February 27, 2025 17:26

Add ldconfig hook in legacy mode

f445d4b

Signed-off-by: Evan Lezar <[email protected]>

elezar added this to the v1.17.5 milestone Feb 27, 2025

elezar added the backport label Feb 27, 2025

elezar self-assigned this Feb 27, 2025

elezar requested review from cdesiniotis, tariq1890, klueska and ArangoGutierrez February 27, 2025 15:30

elezar added 3 commits February 27, 2025 17:34

Add enable-cuda-compat hook to CDI spec generation

e330a93

Signed-off-by: Evan Lezar <[email protected]>

Ensure that mode hook is executed last

9f611a5

Signed-off-by: Evan Lezar <[email protected]>

Add disable-cuda-compat-lib-hook feature flag

c1bac28

Signed-off-by: Evan Lezar <[email protected]>

elezar force-pushed the add-compat-lib-hook branch from 3307cb1 to c1bac28 Compare February 27, 2025 15:35

ArangoGutierrez approved these changes Feb 28, 2025

elezar merged commit f5680dd into NVIDIA:release-1.17 Feb 28, 2025
10 checks passed

elezar deleted the add-compat-lib-hook branch February 28, 2025 15:10

KCSesh mentioned this pull request Apr 17, 2025

Third party package updates bottlerocket-os/bottlerocket-core-kit#472

Merged

luodw reviewed May 24, 2025
