Ensure that libcuda.so is in the ldcache #947
/ok to test |
PR Overview
This PR adds an end-to-end test to verify that the libcuda symlink chain is properly present in the ldcache when running via Docker with NVIDIA runtime. Key changes include:
- Addition of the "strings" import to support string splitting.
- New test block that pulls an Ubuntu image and runs a container to inspect the ldcache output for libcuda.
- Parsing and validation of the ldcache output to ensure both "libcuda.so" and "libcuda.so.1" are present.
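The parsing and validation step described above could look roughly like the following. This is a minimal sketch, not the test from the PR: it assumes the container prints `ldconfig -p` style output, and the helper names `ldcacheLibraries` and `missingLibraries` as well as the sample output are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// ldcacheLibraries extracts library names from `ldconfig -p` style output.
// Each entry line is assumed to look like:
//
//	libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
func ldcacheLibraries(output string) map[string]bool {
	libs := make(map[string]bool)
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(line)
		// Skip blank lines and the summary header: only entries contain "=>".
		if len(fields) == 0 || !strings.Contains(line, "=>") {
			continue
		}
		libs[fields[0]] = true
	}
	return libs
}

// missingLibraries returns the names in want that do not appear in the output.
func missingLibraries(output string, want ...string) []string {
	libs := ldcacheLibraries(output)
	var missing []string
	for _, name := range want {
		if !libs[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Illustrative sample only; not captured from a real container.
	sample := "2 libs found in cache `/etc/ld.so.cache'\n" +
		"\tlibcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1\n" +
		"\tlibcuda.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so\n"
	fmt.Println(missingLibraries(sample, "libcuda.so", "libcuda.so.1"))
}
```

Matching on the first whitespace-separated field keeps `libcuda.so` and `libcuda.so.1` distinct, which a plain substring search would not.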
Reviewed Changes
| File | Description |
|---|---|
| tests/e2e/nvidia-container-toolkit_test.go | Added an end-to-end test to validate the presence of libcuda symlink entries in the ldcache |
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
6e7e1a8 to 44e6630
// Create the 'create-soname-symlinks' command
c := cli.Command{
	Name: "create-soname-symlinks",
@klueska I elected to add a new hook entirely instead of modifying the existing `update-ldcache`. This is in keeping with "purpose-built hooks" and also means that the hook name can be used to indicate the intent.
lgtm, just one nit
44e6630 to e43da10
daadba9 to 9daa179
Left one clarifying question, but not a blocker.
	"-N",
)
// Explicitly specify the directories to add.
args = append(args, dirs...)
Question -- does this create `.so` symlinks for all libraries present in the specified directories? Does this differ from the behavior of the legacy libnvidia-container implementation, which IIRC would only create the `.so` symlinks for a small list of libraries (like libcuda.so)?
This is not the `.so` symlinks, but the `SONAME` symlinks, i.e. `libcuda.so.1` -> `libcuda.so.RM_VERSION` in the case of `libcuda`. The `.so` symlinks are created using the "standard" `create-symlinks` hook.
9daa179 to 035abe0
@cdesiniotis I am removing the must-backport label for this. Although this was triggered by a customer request, there is a workaround available and I would rather not backport another new hook to the |
Sounds good. |
 if hostDriverVersion == "" {
 	m.logger.Debugf("Host driver version not specified")
 	return "", nil
 }
-if !containerRoot.hasPath(cudaCompatPath) {
+if !containerRoot.HasPath(cudaCompatPath) {
I have picked a more or less arbitrary place in the code to put the following comment. More expressive variable names would make review easier for me. You can postpone looking at this comment until after merge, that's fine with me! In the spirit of learning what we do here, however, I'd appreciate an answer at some point :)
In my software engineering life I have always deeply cared about file system terminology and operations, and about making related code readable and self-expressive. For example,
- I like to distinguish a "relative path" from an "absolute path" via variable name (if possible)
- I like to distinguish a file object from a file path via variable name
- I like to distinguish a path to a file from a path to a directory (expressing intent, of course these are technically the same)
- I like to call things a "base name" to express intent as well (when there are no path separators, and this is supposed to be a "file name" or "directory name").
What is `containerRoot` in canonical Unix file path terminology?
- Is it guaranteed to point to a directory?
- Is it an absolute path?
- Is it always just `/`?
What are some properties of `cudaCompatPath` that we know/guarantee, that we could also express in the variable name or in the type?
- Is it always a relative path (not starting with `/`)?
- Is it always a path to a directory?
Thanks for calling this out. I think we could do a pass through this and improve readability greatly. There is no rush on this particular PR (I've elected to hold the changes back from the v1.17.5 release) and as such, I think we can address these concerns here instead of as a follow-up.
The `containerRootDir` is the absolute path on the host filesystem to the root (`/`) of the container filesystem. It is specified in the OCI Runtime Specification as `Root`, which informs the slightly ambiguous name. The `containerRoot` variable is the typed representation of this directory that allows us to attach helper functions such as `HasPath` / `Resolve` / `Glob` to it.
The `cudaCompatPath` is the absolute path to the directory containing the CUDA compat libraries in the container (if it exists). It is defined as a constant with value `/usr/local/cuda/compat`. That is to say, it is always an absolute path to a directory in the container, and we're confirming that this exists, but have to calculate the path to this folder on the host filesystem to do this check.
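To make the host-versus-container distinction concrete, here is a minimal sketch of a string-backed root type. It is not the toolkit's implementation: `Resolve` here is a plain path join and does not handle symlinks that escape the root, and the temporary directory standing in for a container root is an assumption for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ContainerRoot is the absolute path on the host to the root (/) of the
// container filesystem.
type ContainerRoot string

// cudaCompatPath is an absolute path inside the container.
const cudaCompatPath = "/usr/local/cuda/compat"

// Resolve maps an absolute in-container path to the matching host path.
// Simplification: a plain join, with no symlink handling.
func (r ContainerRoot) Resolve(pathInContainer string) string {
	return filepath.Join(string(r), pathInContainer)
}

// HasPath reports whether the in-container path exists, checked from the host.
func (r ContainerRoot) HasPath(pathInContainer string) bool {
	_, err := os.Stat(r.Resolve(pathInContainer))
	return err == nil
}

func main() {
	// A temporary directory stands in for a real container root.
	rootDirPathOnHost, err := os.MkdirTemp("", "container-root-")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer os.RemoveAll(rootDirPathOnHost)

	root := ContainerRoot(rootDirPathOnHost)
	fmt.Println("before:", root.HasPath(cudaCompatPath))

	if err := os.MkdirAll(root.Resolve(cudaCompatPath), 0o755); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("after:", root.HasPath(cudaCompatPath))
}
```

The check runs on the host, so the in-container path has to be translated through the root before calling `os.Stat`.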
> I think we can address these concerns here instead of as a follow-up.

Thank you for that!
> The containerRootDir is the absolute path on the host filesystem to the root (/) of the container filesystem

Is the container filesystem always accessible from within the host filesystem?

> It is specified in the OCI Runtime Specification as Root which informs the slightly ambiguous name.

gotcha. Thanks for that background.

> The cudaCompatPath is the absolute path to the directory containing the CUDA compat libraries in the container (if it exists).

Thank you for that precision. Now, that is quickly and easily understandable.

> but have to calculate the path to this folder on the host filesystem to do this check

meaning that it's always mounted into the container, and never baked into it?
)

// A ContainerRoot represents the root filesystem of a container.
type ContainerRoot string
Oh, it's just a string.
Relates to my previous comment: I'd love to see the intent expressed in the variable name: path to a directory?
And this is probably really just my lack of knowledge: I wonder: when is the container's root file system not located at just `/`?
I have added some context above. Let me know if it's not clear and I can add more information.
Thanks for all the work! I understand very little but you have my full emotional support! And I know you will repair everything you destroy. In that sense: approved with honor. 🚀
035abe0 to 305c2e2
This change moves the ContainerRoot type to the oci package and updates state.GetContainerRootDirPath to return a variable of type ContainerRoot. This enables better reuse between hooks. Signed-off-by: Evan Lezar <[email protected]>
This change updates the enable-cuda-compat implementation to also use oci.ContainerRoot. Signed-off-by: Evan Lezar <[email protected]>
Since the creation of a .conf file in /etc/ld.so.conf.d is shared by both the update-ldcache and enable-cuda-compat hooks, this is moved to the ContainerRoot type. Signed-off-by: Evan Lezar <[email protected]>
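As a rough illustration of the shared step, the sketch below writes a `.conf` file listing library directories under `/etc/ld.so.conf.d` inside a container root. The function names (`ldsoConfContents`, `writeLdsoConf`), the file name, and the example directory are hypothetical and not taken from the PR; the one-directory-per-line layout follows the usual `ld.so.conf` format.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// ldsoconfdDir is an absolute path inside the container.
const ldsoconfdDir = "/etc/ld.so.conf.d"

// ldsoConfContents renders one directory per line.
func ldsoConfContents(dirs []string) string {
	var b strings.Builder
	for _, dir := range dirs {
		b.WriteString(dir)
		b.WriteString("\n")
	}
	return b.String()
}

// writeLdsoConf writes fileName under /etc/ld.so.conf.d of the container
// rooted at containerRootDirPathOnHost and returns the host path of the file.
func writeLdsoConf(containerRootDirPathOnHost, fileName string, dirs []string) (string, error) {
	confDir := filepath.Join(containerRootDirPathOnHost, ldsoconfdDir)
	if err := os.MkdirAll(confDir, 0o755); err != nil {
		return "", fmt.Errorf("failed to create %s: %w", confDir, err)
	}
	confPath := filepath.Join(confDir, fileName)
	if err := os.WriteFile(confPath, []byte(ldsoConfContents(dirs)), 0o644); err != nil {
		return "", fmt.Errorf("failed to write %s: %w", confPath, err)
	}
	return confPath, nil
}

func main() {
	// A temporary directory stands in for a real container root.
	root, err := os.MkdirTemp("", "container-root-")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer os.RemoveAll(root)

	confPath, err := writeLdsoConf(root, "00-example.conf", []string{"/usr/local/cuda/compat"})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("wrote", confPath)
}
```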
Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
305c2e2 to 7229201
This change adds a create-soname-symlinks hook that can be used to ensure that the soname symlinks for injected libraries exist in a container. This is done by calling ldconfig -n -N for the folders containing the injected libraries. This also ensures that libcuda.so is present in the ldcache when the update-ldcache hook is run. Signed-off-by: Evan Lezar <[email protected]>
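The `ldconfig -n -N` invocation described in this commit message can be sketched as follows. This is an illustration rather than the hook's code: `sonameSymlinkArgs` and the example directory are hypothetical, and any additional flags the real hook passes are omitted. Per the ldconfig man page, `-n` restricts processing to the directories given on the command line and `-N` skips rebuilding the cache while still updating links.

```go
package main

import (
	"fmt"
	"os/exec"
)

// sonameSymlinkArgs builds the ldconfig arguments for creating SONAME
// symlinks in the given directories only.
func sonameSymlinkArgs(dirs []string) []string {
	args := []string{
		"-n", // only process the directories named on the command line
		"-N", // do not rebuild the cache; links are still updated
	}
	// Explicitly specify the directories to process.
	return append(args, dirs...)
}

func main() {
	// Example directory only; a real hook would pass the folders that
	// contain the injected libraries.
	dirs := []string{"/usr/lib/x86_64-linux-gnu"}
	cmd := exec.Command("ldconfig", sonameSymlinkArgs(dirs)...)
	// Print the command instead of running it, since running ldconfig
	// modifies the filesystem.
	fmt.Println(cmd.Args)
}
```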
7229201 to bdfaea4
@@ -113,15 +113,15 @@ func (m command) run(c *cli.Context, cfg *config) error {
 		return fmt.Errorf("failed to load container state: %v", err)
 	}

-	containerRoot, err := s.GetContainerRoot()
+	containerRoot, err := s.GetContainerRootDirPath()
❤️ Thanks for adding clarity here.
And now that I have read your explanation.. this is a path that is valid in the host filesystem, and points to the container filesystem's root..? :)
@@ -140,6 +140,7 @@ func (m command) run(c *cli.Context, cfg *config) error {
 }

 // getPaths updates the specified paths relative to the root.
+// TODO(elezar): This function should be updated to make use of the oci.ContainerRoot type.
:)
	return fmt.Errorf("failed to load container state: %v", err)
}

containerRoot, err := s.GetContainerRootDirPath()
Below I see we sometimes have a nice `OnHost` suffix. Could this be applicable here, too?
	return fmt.Errorf("failed to determine container root: %v", err)
}
if containerRoot == "" {
	m.logger.Warningf("No container root detected")
when can/does this happen in practice?
(for my own curiosity/understanding)
 if err != nil {
 	return fmt.Errorf("failed to determine container root: %w", err)
 }

-containerForwardCompatDir, err := m.getContainerForwardCompatDir(containerRoot(containerRootDir), cfg.hostDriverVersion)
+containerForwardCompatDir, err := m.getContainerForwardCompatDirPathInContainer(containerRootDirPath, cfg.hostDriverVersion)
Thanks for having added the `InContainer` suffix here. I know these things get lengthy but the information density is really good and this makes stuff no-brainer readable, which was the goal.
And again, the `containerRootDirPath` can be thought of as a `containerRootDirPathOnHost`, right?
Thank you for all the iteration. Let's land this. I have left a few more questions while browsing this, but please don't slow down landing this because of my remarks.
These changes add a hook to create soname symlinks (e.g. `libcuda.so.1` -> `libcuda.so.RM_VERSION`) to ensure that the `libcuda.so` -> `libcuda.so.1` -> `libcuda.so.RM_VERSION` symlink chain exists when the ldcache is updated. This allows `libcuda.so` to be present in the ldcache when `ldconfig` is run.
Note that since we're adding a new hook, a generating client such as the k8s-device-plugin or the k8s-dra-driver-gpu must be used with an `nvidia-cdi-hook` binary of a sufficiently recent version.
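The symlink chain itself can be reproduced in a scratch directory to see how it resolves. The sketch below creates the links by hand rather than through `ldconfig` or the hook, so it only illustrates the chain; the function names and the version string `550.54.15` are placeholders, and it assumes a platform that supports symlinks.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// createSymlinkChain creates, inside dir:
//
//	libcuda.so -> libcuda.so.1 -> libcuda.so.<rmVersion>
//
// using an empty file as a stand-in for the real library.
func createSymlinkChain(dir, rmVersion string) error {
	lib := "libcuda.so." + rmVersion
	if err := os.WriteFile(filepath.Join(dir, lib), nil, 0o644); err != nil {
		return err
	}
	// The SONAME symlink, which the new hook is responsible for.
	if err := os.Symlink(lib, filepath.Join(dir, "libcuda.so.1")); err != nil {
		return err
	}
	// The .so symlink, which points at the SONAME symlink.
	return os.Symlink("libcuda.so.1", filepath.Join(dir, "libcuda.so"))
}

// resolveInTempDir builds the chain in a temporary directory and returns the
// base name of the file that name ultimately resolves to.
func resolveInTempDir(name, rmVersion string) (string, error) {
	dir, err := os.MkdirTemp("", "libcuda-chain-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	if err := createSymlinkChain(dir, rmVersion); err != nil {
		return "", err
	}
	resolved, err := filepath.EvalSymlinks(filepath.Join(dir, name))
	if err != nil {
		return "", err
	}
	return filepath.Base(resolved), nil
}

func main() {
	for _, name := range []string{"libcuda.so", "libcuda.so.1"} {
		target, err := resolveInTempDir(name, "550.54.15")
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("%s resolves to %s\n", name, target)
	}
}
```

If the middle link (`libcuda.so.1`) is missing, `libcuda.so` is a dangling symlink, which is the situation the new hook is meant to prevent.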