Add wheel support for Newton-Schulz method via cuSolverMp#3004
Add wheel support for Newton-Schulz method via cuSolverMp#3004ksivaman wants to merge 7 commits into
Conversation
Greptile SummaryThis PR enables cuSolverMp-backed Newton-Schulz matrix orthogonalization in published PyPI wheels by installing
Confidence Score: 4/5The build infrastructure changes are correct, but the runtime library loading path and wheel metadata generation both have correctness issues likely to break the feature for end users. The Dockerfile changes look solid — system packages are installed, symlinks are created, and ldconfig is called in the same layer. However, the runtime path in init.py searches for nvidia/cusolverMp/ (mixed case) while the installed Python package places its directory at nvidia/cusolvermp/ (all lowercase) on a case-sensitive Linux filesystem, so the RTLD_GLOBAL preload is silently skipped. Combined with the unconditional nvidia-cusolvermp-cuXX entry in install_reqs with no guard on NVTE_WITH_CUSOLVERMP, the generated wheel metadata forces this dependency on all TE users regardless of whether cuSolverMp support was compiled in. setup.py and transformer_engine/common/init.py need the most attention — the dependency guard and the library name casing are the two issues most likely to affect end users. Important Files Changed
Sequence DiagramsequenceDiagram
participant Docker as Dockerfile x86/aarch
participant Script as build_wheels.sh
participant Setup as setup.py
participant CMake as CMake build
participant Init as __init__.py runtime
participant PyPI as nvidia-cusolvermp-cuXX
Docker->>Docker: dnf install libcusolvermp0 and devel
Docker->>Docker: symlink to /opt/nvidia/cusolvermp include and lib
Docker->>Docker: ldconfig and export CUSOLVERMP_HOME
Docker->>Script: run build_wheels.sh
Script->>Script: "export NVTE_WITH_CUSOLVERMP=1"
Script->>Setup: python setup.py bdist_wheel
Setup->>Setup: "te_compile_args adds DNVTE_WITH_CUSOLVERMP=ON"
Setup->>CMake: cmake build links against system libcusolvermp
Setup->>Setup: setup_requirements adds nvidia-cusolvermp-cuXX unconditionally
Setup-->>Script: wheel with METADATA requiring nvidia-cusolvermp-cuXX
Note over Init,PyPI: At user runtime after pip install
Init->>Init: "_load_cuda_library_from_python cusolverMp strict=False"
Init->>PyPI: search nvidia/cusolverMp/ case-sensitive lookup
PyPI-->>Init: not found actual dir is nvidia/cusolvermp/ lowercase
Init->>Init: silently continues RTLD_GLOBAL preload skipped
Reviews (7): Last reviewed commit: "Merge branch 'main' into expand_wheel_bu..." | Re-trigger Greptile |
|
|
||
| SITE_PACKAGES=$(/opt/python/cp310-cp310/bin/python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])") | ||
| export CUBLASMP_HOME="${SITE_PACKAGES}/nvidia/cublasmp/cu${CUDA_MAJOR}" | ||
| export CUSOLVERMP_HOME="${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR}" |
There was a problem hiding this comment.
Likely incorrect
CUSOLVERMP_HOME path
The path ${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR} is missing the package-name segment. Every other NVIDIA Python package follows the layout site-packages/nvidia/<package-name>/cu<ver>/ — for example, nvidia-cublasmp-cu12 installs under nvidia/cublasmp/cu12/, so nvidia-cusolvermp-cu12 should install under nvidia/cusolvermp/cu12/. With the current path the .so symlink loop silently skips cuSolverMP's lib/ directory ([ -d "$lib_dir" ] || continue), no unversioned .so stubs are created, and the linker will not find cuSolverMP at build time even though NVTE_WITH_CUSOLVERMP=1 is exported.
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
522c631 to
df140b3
Compare
| # Enable optional build features. cuSolverMp is provided by the build image | ||
| # (see Dockerfile.x86 / Dockerfile.aarch), which also sets CUSOLVERMP_HOME. | ||
| export NVTE_WITH_CUSOLVERMP=1 |
There was a problem hiding this comment.
Three of the four advertised flags never get exported
The PR description and title claim to enable NVTE_WITH_CUSOLVERMP, NVTE_WITH_CUBLASMP, NVTE_ENABLE_NVSHMEM, and NVTE_UB_WITH_MPI in the wheel build. Only NVTE_WITH_CUSOLVERMP is exported here. Neither NVTE_WITH_CUBLASMP, NVTE_ENABLE_NVSHMEM, nor NVTE_UB_WITH_MPI are exported in build_wheels.sh, and no corresponding packages (cuBLASMP, NVSHMEM, OpenMPI) are installed in either Dockerfile. Wheels built from this script will silently omit those three features.
On the second thought - we need nvidia-cusolvermp-cu12/cu13 dependency at runtime, not just when building TE/Common
Signed-off-by: ksivamani <ksivamani@nvidia.com>
|
/te-ci |
|
/te-ci |
| "pydantic", | ||
| "importlib-metadata>=1.0", | ||
| "packaging", | ||
| cusolvermp_pypi_package_name(), |
There was a problem hiding this comment.
cuSolverMp added as unconditional install requirement
cusolvermp_pypi_package_name() is appended to install_reqs with no guard on NVTE_WITH_CUSOLVERMP, so the generated wheel's METADATA always lists nvidia-cusolvermp-cu12 (or cu13) as a mandatory runtime dependency — even when TE is built without cuSolverMp support (the default, since CMakeLists.txt has option(NVTE_WITH_CUSOLVERMP … OFF)). Any downstream user who installs a source-built wheel compiled without the flag will be forced to pull in the cuSolverMp library unnecessarily, and a pip solve in an environment that lacks the package will fail entirely.
The requirement should mirror the cmake-flag guard already used for cublasmp (line 75-80):
if bool(int(os.getenv("NVTE_WITH_CUSOLVERMP", "0"))):
install_reqs.append(cusolvermp_pypi_package_name())
Description
#2706 added distributed Newton-Schulz matrix orthogonalization API via cuSolverMp, this PR brings the support for the same via published wheels.
Type of change
Changes
NVTE_WITH_CUSOLVERMPTE build via PyPI wheel.Checklist: