Goal
Make damacy installable via pip install damacy on a host with only the NVIDIA driver — no CUDA toolkit, no nvcomp, no flake.nix, no Docker. Today the only paths are source build (full toolchain required) or the dev image; both are heavy lifts for a coworker who just wants to try it.
Scope: one platform
- Linux x86_64, manylinux_2_28_x86_64
- CUDA 13 (single major; CUDA 12 only if asked)
- Python 3.11 / 3.12 / 3.13 (cp wheels; pyproject.toml already sets wheel.py-api = "")
Skip aarch64, Windows, conda-forge for v1. conda-forge is a separate recipe and isn't gated on the PyPI wheel.
Bundling strategy
Bundle directly into the wheel (no Requires-Dist: nvidia-*-cu13):
- libcudart.so.13
- libnvcomp.so + any nvCOMP plugins that get dlopen'd at runtime
- libcufile.so.0 + the matching cufile.json (compat-mode enabled, same as the devShell ships at ${cudaPkgs.libcufile}/etc/cufile.json)
Wheel is large (~150-200 MB) but self-contained. Rationale: avoids version-pin friction with PyTorch's nvidia-* wheels, and PyTorch doesn't ship nvcomp / cufile anyway so there's nothing to share. libcuda.so.1 stays a host dep (driver lib, not bundleable).
libmount.so.1 / libudev.so.1 (transitive dlopens from cufile init) stay host deps too — both are present on any modern Linux.
Python shim
damacy/__init__.py patches the loader path before from damacy import _native:
- Prepend the bundled lib dirs to the loader search path so nested dlopens resolve (libnvcomp finding its plugins, libcufile finding libcudart, etc.).
- Set CUFILE_ENV_PATH_JSON to the bundled cufile.json unless the caller has already set it.
The pattern is cribbed from PyTorch's torch/__init__.py lib-path injection.
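A minimal sketch of what the shim could look like, under stated assumptions. The _libs/ subdirectory, the configure_bundled_libs name, and the preload order are all hypothetical; nothing in the repo fixes them yet. One swap worth naming: glibc typically reads LD_LIBRARY_PATH once at process start, so editing os.environ from inside Python generally does not change how later dlopen calls resolve in the same process. The sketch therefore preloads the bundled libraries by absolute path with RTLD_GLOBAL instead of prepending a directory. Plugins that nvCOMP looks up by bare name may still need RUNPATH help.

```python
import ctypes
import os
from pathlib import Path

# Assumption: the wheel places shared libraries and cufile.json in
# damacy/_libs/. Dependencies come first so later loads can resolve them.
_PRELOAD_ORDER = ("libcudart.so.13", "libcufile.so.0", "libnvcomp.so")


def configure_bundled_libs(pkg_dir, environ=os.environ, load=ctypes.CDLL):
    """Point cuFile at the bundled config and preload bundled libraries.

    Returns the paths that were preloaded, in load order.
    """
    lib_dir = Path(pkg_dir) / "_libs"

    # Only fill in a default; a value set by the caller wins.
    cufile_json = lib_dir / "cufile.json"
    if cufile_json.is_file():
        environ.setdefault("CUFILE_ENV_PATH_JSON", str(cufile_json))

    loaded = []
    for name in _PRELOAD_ORDER:
        path = lib_dir / name
        if path.is_file():
            # RTLD_GLOBAL makes the symbols visible to libraries loaded later.
            load(str(path), mode=ctypes.RTLD_GLOBAL)
            loaded.append(str(path))
    return loaded
```

In python/damacy/__init__.py this would run as configure_bundled_libs(Path(__file__).parent) before from damacy import _native. The environ and load parameters exist only so the function can be exercised without real CUDA libraries.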
Build plumbing
- New CI workflow building inside quay.io/pypa/manylinux_2_28_x86_64 with CUDA toolkit + nvcomp redist tarball installed on top of it. The existing Dockerfile is nvidia/cuda:13.2.1-devel-ubuntu24.04, useful for reference but not manylinux-tagged.
- scikit-build-core already drives the build via pyproject.toml; no source changes needed in CMakeLists.
- auditwheel repair won't catch dlopen'd libs on its own. Either extend the audit step manually or ship a small post-build script that copies libcufile / libnvcomp into the wheel's data dir and patches RUNPATH.
- Publish to TestPyPI first; promote to PyPI after a fresh-host smoke install confirms import damacy works.
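The post-build script option could be sketched roughly as below. The damacy/_libs layout and the function names are assumptions, and the script expects patchelf on PATH and a wheel that has already been unpacked (for example with wheel unpack). It copies the dlopen'd libraries in and sets an $ORIGIN-relative run path on every shared object in the package.

```python
import os
import shutil
import subprocess
from pathlib import Path


def origin_runpath(elf_path, lib_dir):
    """Return an $ORIGIN-relative run path from elf_path's directory to lib_dir."""
    rel = os.path.relpath(lib_dir, start=Path(elf_path).parent)
    return "$ORIGIN" if rel == "." else f"$ORIGIN/{rel}"


def bundle_libs(unpacked_wheel, extra_libs, pkg="damacy",
                libs_subdir="_libs", run=subprocess.run):
    """Copy extra_libs into <pkg>/<libs_subdir> and patch run paths.

    Returns the list of files that were passed to patchelf.
    """
    pkg_dir = Path(unpacked_wheel) / pkg
    lib_dir = pkg_dir / libs_subdir
    lib_dir.mkdir(parents=True, exist_ok=True)

    # copy2 follows symlinks, so a versioned soname link is copied as a file.
    for src in extra_libs:
        shutil.copy2(src, lib_dir / Path(src).name)

    patched = []
    for elf in sorted(pkg_dir.rglob("*.so*")):
        if elf.is_file():
            # Point each shared object at the bundled lib dir.
            run(["patchelf", "--set-rpath", origin_runpath(elf, lib_dir), str(elf)],
                check=True)
            patched.append(elf)
    return patched
```

After patching, the directory would be repacked (wheel pack regenerates RECORD) before upload. Whether this runs before or after auditwheel repair is left open here; that ordering probably needs to be settled in the iteration loop.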
Where the real work lives
Compile is fast (single-digit minutes). The time goes into the dlopen / RPATH iteration loop: build wheel → install in toolchain-less venv → import damacy → debug the next missing transitive dlopen → repeat. Plan accordingly.
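One cheap check that could sit in front of that loop: confirm the built wheel actually contains the files the bundling plan calls for before spending a venv install on it. This is only a sketch. The expected names come from the bundling list above, the function name is hypothetical, and it assumes the manually copied libraries keep their original file names (auditwheel-grafted libraries get hashed names and would not match).

```python
import zipfile
from pathlib import PurePosixPath

# File names taken from the bundling plan; adjust if the layout changes.
EXPECTED = ("libcudart.so.13", "libnvcomp.so", "libcufile.so.0", "cufile.json")


def missing_from_wheel(wheel, expected=EXPECTED):
    """Return the expected file names that appear nowhere in the wheel archive."""
    with zipfile.ZipFile(wheel) as zf:
        # Compare on base names so the check does not depend on the lib dir.
        names = {PurePosixPath(n).name for n in zf.namelist()}
    return [name for name in expected if name not in names]
```

An empty result does not prove the import will succeed, since transitive dlopens can still fail at runtime; it only rules out the most basic packaging miss before the slower install-and-import step.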
Key files
- pyproject.toml: scikit-build-core config; already wheel-ready
- Dockerfile: current CUDA-on-Ubuntu build; the manylinux variant cribs from this
- flake.nix:106-135: the canonical env-var wiring (CUFILE_ENV_PATH_JSON, Nvcomp_ROOT, LD_LIBRARY_PATH ordering) the Python shim has to replicate at runtime
- python/damacy/__init__.py: where the lib-path shim goes