fix: Make libcudart.so loading more robust for Docker environments #152

Open: chan-yuu wants to merge 1 commit into nvidia-cosmos:main from chan-yuu:fix/robust-libcudart-loading

chan-yuu commented Jan 29, 2026

Problem

Distributed training initialization fails with OSError: libcudart.so: cannot open shared object file when running in Docker containers that have the CUDA runtime installed but not the full toolkit.
The current code calls ctypes.CDLL("libcudart.so") directly, which can fail if:

  • LD_LIBRARY_PATH is not set correctly
  • Only versioned .so files exist (e.g., libcudart.so.x)
  • CUDA libraries are in non-standard paths (common in Docker images, where nvcc is not available as a command either)
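
To check which of these conditions applies inside a given container, a small diagnostic along the following lines may help. This is only a sketch and not part of the PR: diagnose_libcudart is a hypothetical name, and the default directories are common install locations assumed here for illustration.

```python
import ctypes.util
import glob
import os


def diagnose_libcudart(search_dirs=("/usr/local/cuda/lib64", "/usr/lib/x86_64-linux-gnu")):
    """Collect facts relevant to the three failure modes (hypothetical helper)."""
    files_found = []
    for directory in search_dirs:
        # Matches both the unversioned name and versioned files in this directory.
        files_found.extend(sorted(glob.glob(os.path.join(directory, "libcudart.so*"))))
    return {
        # Empty string when LD_LIBRARY_PATH is not set at all.
        "ld_library_path": os.environ.get("LD_LIBRARY_PATH", ""),
        # What the standard library's own lookup resolves "cudart" to, or None.
        "find_library": ctypes.util.find_library("cudart"),
        # Candidate files that exist on disk in the assumed directories.
        "files_found": files_found,
    }
```

If files_found lists only versioned names while find_library returns None, the unversioned ctypes.CDLL("libcudart.so") call is the likely culprit.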
[Screenshot: PixPin_2026-01-30_01-43-02]

Solution

  • Add _load_libcudart() helper function that searches multiple paths:
    • Standard CUDA paths (/usr/local/cuda/lib64)
    • PyTorch bundled CUDA libraries
    • System library paths (/usr/lib/x86_64-linux-gnu)
    • Versioned .so files via glob pattern
  • Handle a missing library gracefully by emitting a warning instead of crashing
  • L2 cache optimization is optional, so training can proceed without it
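
The search order described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the function names, the loader parameter (added so the logic can be exercised on a machine without CUDA), and the PyTorch-related directories (torch/lib and a sibling nvidia/cuda_runtime/lib, guessed from how pip wheels are commonly laid out) are all hypothetical.

```python
import ctypes
import glob
import importlib.util
import os
import warnings


def _torch_lib_dirs():
    """Guess where a pip-installed PyTorch may bundle libcudart (assumption)."""
    try:
        spec = importlib.util.find_spec("torch")  # locates torch without importing it
    except (ImportError, ValueError):
        return []
    if spec is None or not spec.origin:
        return []
    torch_dir = os.path.dirname(spec.origin)
    site_packages = os.path.dirname(torch_dir)
    return [
        os.path.join(torch_dir, "lib"),
        os.path.join(site_packages, "nvidia", "cuda_runtime", "lib"),
    ]


def candidate_libcudart_paths(extra_dirs=()):
    """Return names and paths to try, starting with the plain soname."""
    candidates = ["libcudart.so"]  # let the dynamic loader try its own search first
    search_dirs = [
        "/usr/local/cuda/lib64",
        *_torch_lib_dirs(),
        "/usr/lib/x86_64-linux-gnu",
        *extra_dirs,
    ]
    for directory in search_dirs:
        # The glob picks up both libcudart.so and versioned files next to it.
        candidates.extend(sorted(glob.glob(os.path.join(directory, "libcudart.so*"))))
    return candidates


def load_libcudart(extra_dirs=(), loader=ctypes.CDLL):
    """Return a handle from the first candidate that loads, or None with a warning."""
    for candidate in candidate_libcudart_paths(extra_dirs):
        try:
            return loader(candidate)
        except OSError:
            continue  # try the next candidate
    warnings.warn(
        "libcudart not found; skipping the optional L2 cache optimization",
        RuntimeWarning,
    )
    return None
```

A caller would then guard the optional step with a check such as `if handle is not None:` so that initialization continues when no library is found.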

Testing

Tested with Docker containers using NVIDIA CUDA runtime-only images.
[Screenshot: PixPin_2026-01-30_01-42-22]

Problem:

The distributed training initialization fails with OSError when running in Docker
containers that have CUDA runtime but not the full toolkit, because the code
directly calls ctypes.CDLL('libcudart.so') which may fail if:

- LD_LIBRARY_PATH is not set correctly
- Only versioned .so files exist (e.g., libcudart.so.12.x)
- CUDA libraries are in non-standard paths

Solution:

- Add _load_libcudart() helper that searches multiple paths:
  - Standard CUDA paths (/usr/local/cuda/lib64)
  - PyTorch bundled CUDA libraries
  - System library paths
  - Versioned .so files
- Gracefully handle missing library with a warning instead of crash
- L2 cache optimization is optional, so training can proceed without it