container-toolkit does not modify the containerd config correctly when there are multiple instances of the containerd binary
#982
Comments
Closed a duplicate issue I had created in the gpu-operator repo - NVIDIA/gpu-operator#1323. Some additional context and a temporary workaround below for others running into this.

Additional Context

GPU Operator v24.9.x switched to fetching the container runtime configuration via the CLI, causing failures specifically in Kubernetes distributions like k0s that statically compile their containerd binaries. Although issue #777 added fallback support for when the containerd CLI doesn't exist, the problem persists because there is no explicit way to always enforce using the configuration file instead of the containerd CLI (to my knowledge).

References:

Temporary Workaround

Either:
@diamonwiggins since you have an environment that reproduces this issue, would you be able to verify that:
@tariq1890 is the following really the desired functionality:

Would retrieving the config from the
@elezar The following two commands produce the same config for me:
No matter what address I pass in, the

If I uncomment and modify the grpc address via

Running
Thanks for checking the behaviour @diamonwiggins. This means that we can't "simply" specify the socket when running an arbitrary containerd binary and expect the config to be consistent. One option we would have is to allow the path to the containerd binary to be specified as an argument to the toolkit container instead of looking for this in the path. (This is what I was referring to in #982 (comment) but I did not express it clearly.) Could you confirm that running the
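A minimal sketch of the option described above: pass the containerd binary path explicitly instead of resolving it via `PATH`. The `--containerd-binary` flag name is hypothetical, not an existing toolkit option; this only illustrates the shape of the proposed change.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Hypothetical flag illustrating the proposal: the caller names the
	// exact containerd binary (e.g. the k0s-embedded one) rather than
	// relying on whatever PATH resolution happens to find first.
	bin := flag.String("containerd-binary", "containerd",
		"path to the containerd binary whose config should be dumped")
	flag.Parse()

	out, err := exec.Command(*bin, "config", "dump").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "config dump failed:", err)
		os.Exit(1)
	}
	os.Stdout.Write(out)
}
```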
NOTE: This issue is specific to container-toolkit when run as an OCI container via gpu-operator
To enable the nvidia-specific container runtime handlers, the toolkit must overlay config changes on the existing containerd configuration. For the toolkit to do this, it must retrieve the existing containerd config. Currently, it executes the following steps to achieve this:
i) Run `containerd config dump` (chrooted into the host system)
ii) If i) fails, fall back to retrieving the config TOML from the file specified in `CONTAINERD_CONFIG`
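A minimal Go sketch of this two-step retrieval (an illustration of the described behaviour, not the toolkit's actual code):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// loadContainerdConfig mirrors the two steps above: try the CLI first,
// then fall back to the file named by CONTAINERD_CONFIG.
func loadContainerdConfig() ([]byte, error) {
	// Step i): whichever `containerd` binary PATH resolves to wins,
	// even if it is not the instance the cluster actually uses.
	if out, err := exec.Command("containerd", "config", "dump").Output(); err == nil {
		return out, nil
	}
	// Step ii): reached only when step i) fails outright.
	path := os.Getenv("CONTAINERD_CONFIG")
	if path == "" {
		return nil, fmt.Errorf("containerd CLI failed and CONTAINERD_CONFIG is unset")
	}
	return os.ReadFile(path)
}

func main() {
	cfg, err := loadContainerdConfig()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("loaded %d bytes of containerd config\n", len(cfg))
}
```

Because step i) succeeds whenever any `containerd` binary is on `PATH`, the fallback in step ii) is effectively unreachable on hosts like the one described below.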
This algorithm falls short in the scenario of multiple containerd instances running on the same host.
Consider the example of a k0s-based node which runs off a containerd embedded within the `k0s` system. The same system also has a vanilla containerd installed, so we have containerd binaries in two locations.
In this case, we expect the toolkit to modify the config of the k0s-embedded containerd. What ends up happening is that step i) of the algorithm runs and executes the containerd binary located in `/usr/bin` (this binary path is chosen since it is resolved via the `PATH` env var). We would have wanted the toolkit to fall back to step ii), as it would then retrieve the desired config from the k0s-managed containerd; but the fallback is never triggered because step i) succeeds.
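The root cause is plain `PATH` resolution, which a few lines of Go can demonstrate (a standalone illustration, not toolkit code):

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// On a host with both a vanilla containerd and a k0s-embedded one,
	// LookPath returns the PATH hit (e.g. /usr/bin/containerd); the
	// k0s-embedded binary lives outside PATH and is never considered.
	path, err := exec.LookPath("containerd")
	if err != nil {
		fmt.Println("no containerd on PATH; only then would the fallback trigger")
		return
	}
	fmt.Println("step i) will execute:", path)
}
```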
Reproduction
To reproduce this issue:
i. Install a vanilla `containerd` package on the host and ensure that it's running
ii. Install k0s and set up a k0s cluster
iii. Install gpu-operator (which includes the toolkit) with the necessary config overrides to point to k0s