
container-toolkit does not modify the containerd config correctly when there are multiple instances of the containerd binary #982

Open
tariq1890 opened this issue Mar 11, 2025 · 5 comments


tariq1890 commented Mar 11, 2025

NOTE: This issue is specific to container-toolkit when run as an OCI container via gpu-operator

To enable the NVIDIA-specific container runtime handlers, the toolkit must overlay its config changes on the existing containerd configuration, which means it first has to retrieve that configuration. It currently does so in two steps (sketched below):

i) Run containerd config dump (chroot into the host system)
ii) If i) fails, fall back to retrieving the config TOML from the file specified in CONTAINERD_CONFIG
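
For illustration, a minimal shell sketch of that two-step retrieval (the toolkit implements this in Go; the /host mount point and the error handling shown here are assumptions, not the actual implementation):

# Minimal sketch, assuming the host root filesystem is mounted at /host and
# CONTAINERD_CONFIG names the config file to fall back to.
if config=$(chroot /host containerd config dump 2>/dev/null); then
  echo "$config"                    # step i) succeeded: use the CLI's view of the config
else
  cat "/host${CONTAINERD_CONFIG}"   # step ii): fall back to reading the config TOML from disk
fi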

This algorithm falls short in the scenario of multiple containerd instances running on the same host.

Consider the example of a k0s-based node that runs off the containerd embedded within the k0s system.

The same host also has a vanilla containerd installed, so there are containerd binaries in two locations:

  1. /usr/bin/containerd
  2. /var/lib/k0s/bin/containerd

In this case, we expect the toolkit to modify the config of the k0s-embedded containerd. What actually happens is that step i) of the algorithm runs and executes the containerd binary located in /usr/bin, since that is the binary resolved via the PATH environment variable.

Here, we would have wanted the toolkit to fall back to step ii), which would then retrieve the desired config from the k0s-managed containerd; but the fallback is never triggered because step i) succeeds.
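
To make the failure mode concrete, this is roughly what step i) sees on such a host (a hedged illustration; the /host mount point is an assumption, and the output shown is simply the scenario described above):

chroot /host which containerd
# -> /usr/bin/containerd            (PATH resolution picks the vanilla binary)
chroot /host containerd config dump
# -> succeeds with the vanilla containerd's config, so the CONTAINERD_CONFIG fallback never runs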

Reproduction

To reproduce this issue:

i. Install a vanilla containerd package on the host and ensure that it is running.
ii. Install k0s and set up a k0s cluster.
iii. Install gpu-operator (which includes the toolkit) with the necessary config overrides to point it at k0s:

helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator $HELM_OPTIONS \
    --version=v24.9.2 \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/etc/k0s/containerd.d/nvidia.toml \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k0s/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia
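
Once the toolkit container has run, one informal way to observe the problem (not an official diagnostic; the paths are the ones used in this reproduction and elsewhere in this thread) is to compare the drop-in the toolkit wrote with the config the k0s-managed containerd actually uses:

cat /etc/k0s/containerd.d/nvidia.toml                                 # overlay written by the toolkit
/var/lib/k0s/bin/containerd -c /etc/k0s/containerd.toml config dump   # config seen by the k0s containerd

On an affected node the overlay is derived from the vanilla containerd's config dump rather than from the k0s configuration.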

diamonwiggins commented Mar 11, 2025

Closed a duplicate issue I had created in the gpu-operator repo - NVIDIA/gpu-operator#1323. Some additional context and a temporary workaround are below for others running into this.

Additional Context

GPU Operator v24.9.x switched to fetching the container runtime configuration via the containerd CLI, causing failures specifically in Kubernetes distributions like k0s that ship their own statically compiled containerd binaries. Although #777 added a fallback for when the containerd CLI doesn't exist, the problem persists here because there is (to my knowledge) no explicit way to force the toolkit to always use the configuration file instead of the containerd CLI.

References:
  • Original related issue (#1109)
  • PR implementing the fallback to file-based retrieval (#777)

Temporary Workaround

Either:

  • Downgrade GPU Operator to v24.6.2 and override the driver version to 550.127.05, or
  • Use GPU Operator v24.9.2 and downgrade the NVIDIA Container Toolkit to version 1.16.2, e.g.:
helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v24.9.2 \
    --set toolkit.version=v1.16.2-ubuntu20.04 \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/etc/k0s/containerd.d/nvidia.toml \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k0s/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia


elezar commented Mar 12, 2025

@diamonwiggins since you have an environment that reproduces this issue, would you be able to verify:

  • that a socket can be passed to the containerd config dump command
  • that containerd config dump returns the "correct" config when that socket is specified


elezar commented Mar 12, 2025

@tariq1890 is the following really the desired functionality:

Here, we would have wanted the toolkit to fall back to step ii), which would then retrieve the desired config from the k0s-managed containerd; but the fallback is never triggered because step i) succeeds.

Would retrieving the config from the k0s containerd binary not also be sufficient? Note that we appear to be generating an NVIDIA-specific config in the containerd.d folder rather than a full config, which indicates that the behaviour may need to be more complex in this case.

@diamonwiggins

since you have an environment that reproduces this issue, would you be able to verify:

  • that a socket can be passed to the containerd config dump command
  • that containerd config dump returns the "correct" config when that socket is specified

@elezar The following two commands produce the same config for me:

containerd --address=/run/k0s/containerd.sock config dump

containerd --address=/run/containerd/containerd.sock config dump

No matter what address I pass in, the grpc address in the dumped config never seems to change. It's always:

[grpc]
  address = "/run/containerd/containerd.sock"
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216
  tcp_address = ""
  tcp_tls_ca = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0

If I uncomment and modify the grpc address via /etc/containerd/config.toml, then the correct address is shown in the config dump, but it still doesn't look like the "correct" config when /run/k0s/containerd.sock is set.

Running containerd -c /etc/k0s/containerd.toml config dump generates a config that has NVIDIA-specific information, so it feels like the CLI gives precedence to the config file over the selected socket.


elezar commented Mar 17, 2025

Thanks for checking the behaviour @diamonwiggins. This means that we can't "simply" specify the socket when running an arbitrary containerd binary and expect the config to be consistent. One option we would have is to allow the path to the containerd binary to be specified as an argument to the toolkit container instead of looking it up on the PATH. (This is what I was referring to in #982 (comment), but I did not express it clearly.)

Could you confirm that running the config dump command against the k0s containerd binary renders the expected config?
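
For reference, the check being asked for would look something like the following on the host (binary and config paths taken from earlier comments in this thread; a suggested verification, not a documented procedure):

chroot /host /var/lib/k0s/bin/containerd config dump
# or, pointing it explicitly at the k0s config file:
chroot /host /var/lib/k0s/bin/containerd -c /etc/k0s/containerd.toml config dump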
