Description
NOTE: This issue is specific to container-toolkit when run as an OCI container via gpu-operator
To enable the nvidia-specific container runtime handlers, the toolkit must overlay config changes on the existing containerd configuration. For the toolkit to do this, it must retrieve the existing containerd config. Currently, it executes the following steps to achieve this:
i) Run containerd config dump (after chrooting into the host system)
ii) If i) fails, fall back to retrieving the config TOML from the file specified in CONTAINERD_CONFIG
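The two-step lookup can be sketched as follows. This is a minimal illustration of the behaviour described above, not the toolkit's actual code: the names load_containerd_config, dump_via_cli and run_dump are hypothetical, and the chroot into the host is omitted.

```python
import subprocess
from pathlib import Path
from typing import Callable, Tuple


def dump_via_cli() -> str:
    """Step i): ask whichever `containerd` is first on PATH for its config."""
    result = subprocess.run(
        ["containerd", "config", "dump"],
        capture_output=True,
        text=True,
        check=True,  # a non-zero exit status raises CalledProcessError
    )
    return result.stdout


def load_containerd_config(
    config_path: str,
    run_dump: Callable[[], str] = dump_via_cli,  # injectable for illustration
) -> Tuple[str, str]:
    """Return (source, toml_text) following the two-step lookup."""
    try:
        return "cli", run_dump()
    except (OSError, subprocess.CalledProcessError):
        # Step ii): reached only when the CLI dump fails, e.g. because the
        # binary is missing or exits with an error.
        return "file", Path(config_path).read_text()
```

Note that the file named by CONTAINERD_CONFIG (passed here as config_path) is consulted only when the dump raises. If any containerd binary answers successfully, its output is used.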
This algorithm falls short when multiple containerd instances run on the same host.
Consider a k0s-based node, which runs on a containerd embedded within the k0s system.
The same system also has a vanilla containerd installed, so there are containerd binaries in two locations:
- /usr/bin/containerd
- /var/lib/k0s/bin/containerd
In this case, we expect the toolkit to modify the config of the k0s-embedded containerd. Instead, step i) of the algorithm runs the containerd binary located in /usr/bin (this binary is chosen because it is resolved via the PATH env var).
Here we would want the toolkit to fall back to step ii), which would retrieve the desired config from the k0s-managed containerd, but the fallback is never triggered because step i) succeeds.
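The PATH resolution at the heart of this can be illustrated with a small self-contained sketch. It assumes a POSIX host and uses throwaway directories that mirror the two locations above. The helper names make_fake_binary and resolve_on_path are hypothetical, and the assumption that the k0s bin directory is absent from PATH is made for illustration.

```python
import os
import shutil
import stat
import tempfile


def make_fake_binary(directory: str, name: str = "containerd") -> str:
    """Create a stub executable so that a PATH lookup can find it."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def resolve_on_path(search_path: str, name: str = "containerd"):
    """Resolve `name` the way a PATH lookup would, using only `search_path`."""
    return shutil.which(name, path=search_path)


# Throwaway tree mirroring /usr/bin and /var/lib/k0s/bin.
root = tempfile.mkdtemp()
usr_bin = os.path.join(root, "usr/bin")
k0s_bin = os.path.join(root, "var/lib/k0s/bin")
vanilla = make_fake_binary(usr_bin)
embedded = make_fake_binary(k0s_bin)

# Assumed PATH: contains usr/bin but not the k0s directory.
resolved = resolve_on_path(usr_bin)
print(resolved)
```

With such a PATH the lookup returns the vanilla binary, and the k0s-embedded binary is never considered, regardless of what CONTAINERD_CONFIG or CONTAINERD_SOCKET point to.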
Reproduction
To reproduce this issue:
i. Install a vanilla containerd package on the host and ensure that it's running
ii. Install k0s and set up a k0s cluster.
iii. Install gpu-operator (which includes the toolkit) with the necessary config overrides to point to k0s:
helm install gpu-operator -n gpu-operator --create-namespace \
nvidia/gpu-operator $HELM_OPTIONS \
--version=v24.9.2 \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/etc/k0s/containerd.d/nvidia.toml \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k0s/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia