GPU Operator fails with containerd runtime error on k0s #1323
Comments
Hi @diamonwiggins, do you have a
@tariq1890 yup we are describing the same issue, thanks! If you want to close this one as a duplicate that's fine with me. I think it would be helpful to have the workaround steps on your issue as well though if you do that. thanks again!
Yes, let's close this issue and use the container-toolkit GH issue instead to discuss this
Closing in favor of NVIDIA/nvidia-container-toolkit#982
Description
Recent changes in the NVIDIA Container Toolkit, specifically the switch from file-based retrieval to CLI-based retrieval (containerd config dump) of the container runtime configuration, have introduced compatibility issues for Kubernetes distributions such as k0s that statically compile their own containerd. As a result, GPU Operator v24.9.x encounters runtime configuration errors in such environments because it ends up with a broken containerd config.
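For illustration, a rough sketch of the difference on a k0s node follows; the paths below are assumptions based on k0s defaults (they are not taken from this issue), so verify them on your system.
```sh
# Sketch only: file-based vs CLI-based retrieval on a k0s node.
# Paths are assumed k0s defaults, not confirmed by this issue.

# File-based retrieval would read the config that k0s itself manages:
cat /etc/k0s/containerd.toml

# CLI-based retrieval instead shells out to a `containerd` binary:
containerd config dump

# With k0s, the embedded containerd binary lives outside the usual PATH:
ls /var/lib/k0s/bin/containerd
```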
Versions
550.127.05
Container Toolkit Configurations
Working configuration:
Broken configuration:
Error
The deployment fails with the following error:
Additional Context
GPU Operator v24.9.x switched to fetching the container runtime configuration via the containerd CLI, causing failures specifically in Kubernetes distributions like k0s that statically compile their containerd binaries. Although issue #777 added fallback support for when the containerd CLI doesn't exist, the problem persists because there is (to my knowledge) no explicit way to always enforce using the configuration file instead of the containerd CLI.
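As a quick way to see which code path is likely to apply on a given node, one can check whether a containerd CLI is visible at all; this is only a rough check run on the node, not the toolkit's exact detection logic.
```sh
# Rough check: per the issue, the file-based fallback from #777 only applies
# when no `containerd` CLI can be found, so if this prints a path,
# CLI-based retrieval is what will run.
command -v containerd || echo "no containerd CLI on PATH"
```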
References:
Original related issue (#1109)
PR implementing fallback to file-based retrieval (#777)
Temporary Workaround
Either (see the Helm sketch below):
550.127.05, or
1.16.2.
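Assuming the two versions above refer to the driver and the NVIDIA Container Toolkit respectively (an interpretation, not stated explicitly here), pinning them through the GPU Operator Helm chart could look roughly like this; the chart's driver.version and toolkit.version parameters are used, and the exact toolkit image tag suffix should be checked against what is published.
```sh
# Sketch: pinning component versions via the GPU Operator Helm chart.
# Mapping 550.127.05 to the driver and 1.16.2 to the toolkit is an
# assumption, and the toolkit tag suffix (e.g. -ubuntu20.04) may differ.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.version=550.127.05 \
  --set toolkit.version=v1.16.2-ubuntu20.04
```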
Environment
Reproduction Steps
Expected Behavior
GPU Operator deploys successfully without errors, correctly recognizing the containerd runtime.
Actual Behavior
GPU Operator pods fail to start with container runtime errors as described.