container-toolkit does not modify the containerd config correctly when there are multiple instances of the containerd binary #982

Open
@tariq1890

Description

NOTE: This issue is specific to container-toolkit when it runs as an OCI container via gpu-operator.

To enable the nvidia-specific container runtime handlers, the toolkit must overlay config changes on the existing containerd configuration. For the toolkit to do this, it must retrieve the existing containerd config. Currently, it executes the following steps to achieve this:

i) Run containerd config dump after chrooting into the host system.
ii) If i) fails, fall back to reading the config TOML from the file specified in CONTAINERD_CONFIG.
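The two steps above can be sketched as follows. This is a minimal illustration in Python, not the toolkit's actual code: the function names `dump_config_via_cli` and `load_current_config` and the `host_root` parameter are hypothetical, and the only external commands assumed are `chroot` and `containerd config dump`.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional


def dump_config_via_cli(host_root: str) -> Optional[str]:
    """Step i): run `containerd config dump` chrooted into the host.

    Returns the dumped TOML, or None if the command cannot be run or fails.
    `host_root` is a hypothetical parameter naming the host filesystem mount.
    """
    try:
        result = subprocess.run(
            ["chroot", host_root, "containerd", "config", "dump"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout


def load_current_config(
    config_path: str, cli_dump: Callable[[], Optional[str]]
) -> str:
    """Return the existing containerd config using the two-step order."""
    dumped = cli_dump()
    if dumped is not None:
        # Step i) succeeded, so the file in CONTAINERD_CONFIG is never read.
        return dumped
    # Step ii): fall back to the TOML file named by CONTAINERD_CONFIG.
    return Path(config_path).read_text()
```

The point of the sketch is that the fallback depends only on whether step i) fails, not on whether it queried the intended containerd instance.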

This algorithm breaks down when multiple containerd instances are present on the same host.

Consider a k0s-based node that runs on the containerd embedded within the k0s system.

The same system also has a vanilla containerd installed, so there are containerd binaries in two locations:

  1. /usr/bin/containerd
  2. /var/lib/k0s/bin/containerd

In this case, we expect the toolkit to modify the config of the k0s-embedded containerd. Instead, step i) of the algorithm runs and executes the containerd binary located in /usr/bin, because that is the path resolved via the PATH env var.

We would have wanted the toolkit to fall back to step ii), since it would then retrieve the desired config for the k0s-managed containerd. The fallback is never triggered, however, because step i) succeeds.
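The PATH-based resolution described above can be illustrated with a small, self-contained sketch. It uses temporary directories as stand-ins for /usr/bin and /var/lib/k0s/bin; the helper names `resolve_binary` and `make_fake_binary` are hypothetical, and the actual PATH seen by the toolkit is not stated in this issue.

```python
import os
import shutil
from typing import Optional


def resolve_binary(name: str, search_path: str) -> Optional[str]:
    """Return the first executable called `name` found along `search_path`."""
    return shutil.which(name, path=search_path)


def make_fake_binary(directory: str, name: str = "containerd") -> str:
    """Create a trivial executable stand-in so the lookup has something to find."""
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write("#!/bin/sh\nexit 0\n")
    os.chmod(path, 0o755)
    return path
```

When two directories both contain a containerd executable, `resolve_binary` returns the one from whichever directory appears first in the search path. Nothing in the lookup takes CONTAINERD_CONFIG or CONTAINERD_SOCKET into account.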

Reproduction

To reproduce this issue:

i. Install a vanilla containerd package on the host and ensure that it is running.
ii. Install k0s and set up a k0s cluster.
iii. Install gpu-operator (which includes the toolkit) with the config overrides needed to point it at the k0s containerd:

helm install gpu-operator -n gpu-operator --create-namespace \
  nvidia/gpu-operator $HELM_OPTIONS \
    --version=v24.9.2 \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/etc/k0s/containerd.d/nvidia.toml \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k0s/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia
