Add wheel support for Newton-Schulz method via cuSolverMp #3004

Open
ksivaman wants to merge 7 commits into NVIDIA:main from ksivaman:expand_wheel_builds
@ksivaman
Member

@ksivaman ksivaman commented May 17, 2026

Description

#2706 added a distributed Newton-Schulz matrix orthogonalization API via cuSolverMp; this PR brings the same support to the published wheels.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Enable the NVTE_WITH_CUSOLVERMP build of TE in the published PyPI wheels.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ksivaman ksivaman requested review from cyanguwa, denera and mk-61 May 17, 2026 01:37
@ksivaman ksivaman marked this pull request as draft May 17, 2026 01:38
@greptile-apps
Contributor

greptile-apps Bot commented May 17, 2026

Greptile Summary

This PR enables cuSolverMp-backed Newton-Schulz matrix orthogonalization in published PyPI wheels by installing libcusolvermp from system packages in the wheel-build Dockerfiles, wiring up CUSOLVERMP_HOME, and adding nvidia-cusolvermp-cuXX as a runtime dependency.

  • Dockerfiles (x86 + aarch64): add libcusolvermp0-cuda-* / -devel-* via dnf, create a stable /opt/nvidia/cusolvermp symlink tree, register the lib path with ldconfig, and export CUSOLVERMP_HOME — all in a single RUN layer.
  • build_wheels.sh: exports NVTE_WITH_CUSOLVERMP=1 before any setup.py invocation so CMake picks up the flag.
  • setup.py + build_tools/utils.py: introduces cusolvermp_pypi_package_name() and unconditionally appends the result to install_reqs.
  • transformer_engine/common/__init__.py: adds a non-strict _load_cuda_library_from_python("cusolverMp") call at import time so the shared library is pre-loaded with RTLD_GLOBAL.
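
The non-strict preload described in the last bullet can be sketched as follows. This is a minimal illustration, not TE's actual _load_cuda_library_from_python: the helper name preload_shared_library and its strict parameter are assumptions made for the example. Only ctypes.CDLL and ctypes.RTLD_GLOBAL are real standard-library names.

```python
import ctypes
from typing import Optional


def preload_shared_library(path: str, strict: bool = False) -> Optional[ctypes.CDLL]:
    """Load a shared library with RTLD_GLOBAL so that its symbols are visible
    to libraries loaded afterwards.

    Hypothetical helper for illustration; not TE's implementation.
    """
    try:
        return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    except OSError:
        if strict:
            # Strict mode: a missing library is a hard error.
            raise
        # Non-strict mode: the caller continues without the preload,
        # and nothing is reported.
        return None
```

With strict=False a missing library produces no error at import time, which is why a wrong lookup path can go unnoticed until the feature is actually used.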

Confidence Score: 4/5

The build infrastructure changes are correct, but the runtime library loading path and wheel metadata generation both have correctness issues likely to break the feature for end users.

The Dockerfile changes look solid: system packages are installed, symlinks are created, and ldconfig is called in the same layer. However, the runtime path in __init__.py searches for nvidia/cusolverMp/ (mixed case) while the installed Python package places its directory at nvidia/cusolvermp/ (all lowercase) on a case-sensitive Linux filesystem, so the RTLD_GLOBAL preload is silently skipped. Combined with the unconditional nvidia-cusolvermp-cuXX entry in install_reqs with no guard on NVTE_WITH_CUSOLVERMP, the generated wheel metadata forces this dependency on all TE users regardless of whether cuSolverMp support was compiled in.

setup.py and transformer_engine/common/__init__.py need the most attention: the dependency guard and the library name casing are the two issues most likely to affect end users.
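
The casing issue can be reproduced in isolation. The sketch below assumes, as the review states, that the installed package directory is the lowercase nvidia/cusolvermp/. The function find_lib_dir is a hypothetical stand-in for the lookup, not TE code. It compares against the listed directory names, so the result is case-sensitive on any filesystem.

```python
import os
import tempfile
from typing import Optional


def find_lib_dir(nvidia_root: str, lib_name: str) -> Optional[str]:
    """Return nvidia_root/<lib_name> only if a directory with exactly that
    name exists. Hypothetical lookup helper for illustration."""
    if lib_name in os.listdir(nvidia_root):
        candidate = os.path.join(nvidia_root, lib_name)
        if os.path.isdir(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as root:
    # Lowercase directory, matching the layout described in the review.
    os.mkdir(os.path.join(root, "cusolvermp"))
    mixed_case = find_lib_dir(root, "cusolverMp")
    lower_case = find_lib_dir(root, "cusolverMp".lower())
    found_lower = lower_case is not None
```

Here the mixed-case lookup finds nothing while the lowercased name matches, so normalizing the name before building the path is one possible fix.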

Important Files Changed

| Filename | Overview |
| --- | --- |
| build_tools/wheel_utils/Dockerfile.x86 | Installs libcusolvermp system packages, creates /opt/nvidia/cusolvermp symlink tree, and sets CUSOLVERMP_HOME; ldconfig is correctly called in the same RUN layer for the cusolvermp conf. |
| build_tools/wheel_utils/Dockerfile.aarch | Identical cuSolverMp additions as Dockerfile.x86 for aarch64; uses /usr/lib64 which is correct for RHEL-based manylinux images. |
| build_tools/wheel_utils/build_wheels.sh | Exports NVTE_WITH_CUSOLVERMP=1 to enable the feature in all wheel builds. |
| setup.py | cusolvermp_pypi_package_name() is appended to install_reqs unconditionally with no NVTE_WITH_CUSOLVERMP guard, unlike the guarded cublasmp pattern; all wheels will carry the dependency regardless of build flags. |
| transformer_engine/common/__init__.py | Adds non-strict _load_cuda_library_from_python("cusolverMp") at import time; the mixed-case lib_name will not match the lowercase nvidia/cusolvermp/ directory on case-sensitive Linux, so the preload silently skips. |
| build_tools/utils.py | Adds cusolvermp_pypi_package_name() helper and cleans up the PackageNotFoundError import; logic is straightforward and correct. |

Sequence Diagram

sequenceDiagram
    participant Docker as Dockerfile x86/aarch
    participant Script as build_wheels.sh
    participant Setup as setup.py
    participant CMake as CMake build
    participant Init as __init__.py runtime
    participant PyPI as nvidia-cusolvermp-cuXX

    Docker->>Docker: dnf install libcusolvermp0 and devel
    Docker->>Docker: symlink to /opt/nvidia/cusolvermp include and lib
    Docker->>Docker: ldconfig and export CUSOLVERMP_HOME
    Docker->>Script: run build_wheels.sh
    Script->>Script: "export NVTE_WITH_CUSOLVERMP=1"
    Script->>Setup: python setup.py bdist_wheel
    Setup->>Setup: "te_compile_args adds DNVTE_WITH_CUSOLVERMP=ON"
    Setup->>CMake: cmake build links against system libcusolvermp
    Setup->>Setup: setup_requirements adds nvidia-cusolvermp-cuXX unconditionally
    Setup-->>Script: wheel with METADATA requiring nvidia-cusolvermp-cuXX

    Note over Init,PyPI: At user runtime after pip install
    Init->>Init: "_load_cuda_library_from_python cusolverMp strict=False"
    Init->>PyPI: search nvidia/cusolverMp/ case-sensitive lookup
    PyPI-->>Init: not found actual dir is nvidia/cusolvermp/ lowercase
    Init->>Init: silently continues RTLD_GLOBAL preload skipped
Reviews (7): Last reviewed commit: "Merge branch 'main' into expand_wheel_bu..."

Comment thread build_tools/wheel_utils/build_wheels.sh Outdated

SITE_PACKAGES=$(/opt/python/cp310-cp310/bin/python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
export CUBLASMP_HOME="${SITE_PACKAGES}/nvidia/cublasmp/cu${CUDA_MAJOR}"
export CUSOLVERMP_HOME="${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR}"
Contributor

P1 Likely incorrect CUSOLVERMP_HOME path

The path ${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR} is missing the package-name segment. Every other NVIDIA Python package follows the layout site-packages/nvidia/<package-name>/cu<ver>/ — for example, nvidia-cublasmp-cu12 installs under nvidia/cublasmp/cu12/, so nvidia-cusolvermp-cu12 should install under nvidia/cusolvermp/cu12/. With the current path the .so symlink loop silently skips cuSolverMP's lib/ directory ([ -d "$lib_dir" ] || continue), no unversioned .so stubs are created, and the linker will not find cuSolverMP at build time even though NVTE_WITH_CUSOLVERMP=1 is exported.
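
The layout described above can be written down as a small path helper. This is a sketch only: nvidia_package_home is a hypothetical name, the site-packages path in the usage note is made up, and the nvidia/<package-name>/cu<ver>/ layout is the one asserted in this comment rather than something verified here.

```python
from pathlib import PurePosixPath


def nvidia_package_home(site_packages: str, package: str, cuda_major: int) -> str:
    """Build site-packages/nvidia/<package>/cu<major>, the layout the review
    comment describes. Hypothetical helper for illustration."""
    return str(PurePosixPath(site_packages) / "nvidia" / package / f"cu{cuda_major}")


def nvidia_home_without_package(site_packages: str, cuda_major: int) -> str:
    """Mirror the flagged line, which omits the package-name segment."""
    return str(PurePosixPath(site_packages) / "nvidia" / f"cu{cuda_major}")
```

For a made-up prefix such as /opt/sp, nvidia_package_home("/opt/sp", "cusolvermp", 12) gives /opt/sp/nvidia/cusolvermp/cu12, whereas the flagged form gives /opt/sp/nvidia/cu12.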

Comment thread build_tools/wheel_utils/Dockerfile.x86 Outdated
Comment thread build_tools/wheel_utils/Dockerfile.aarch Outdated
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
@ksivaman ksivaman force-pushed the expand_wheel_builds branch from 522c631 to df140b3 Compare May 19, 2026 23:33
@ksivaman ksivaman marked this pull request as ready for review May 19, 2026 23:34
Comment on lines +28 to +30
# Enable optional build features. cuSolverMp is provided by the build image
# (see Dockerfile.x86 / Dockerfile.aarch), which also sets CUSOLVERMP_HOME.
export NVTE_WITH_CUSOLVERMP=1
Contributor

P1 Three of the four advertised flags never get exported

The PR description and title claim to enable NVTE_WITH_CUSOLVERMP, NVTE_WITH_CUBLASMP, NVTE_ENABLE_NVSHMEM, and NVTE_UB_WITH_MPI in the wheel build, but only NVTE_WITH_CUSOLVERMP is exported here. None of NVTE_WITH_CUBLASMP, NVTE_ENABLE_NVSHMEM, or NVTE_UB_WITH_MPI is exported in build_wheels.sh, and no corresponding packages (cuBLASMP, NVSHMEM, OpenMPI) are installed in either Dockerfile. Wheels built from this script will silently omit those three features.
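
A check of this kind can be automated. The following sketch scans a script's text for export statements and reports which of the four flag names are absent. The function missing_exports and the two-line sample script are assumptions made for illustration. The real build_wheels.sh is not reproduced here.

```python
import re
from typing import List, Sequence

ADVERTISED_FLAGS = (
    "NVTE_WITH_CUSOLVERMP",
    "NVTE_WITH_CUBLASMP",
    "NVTE_ENABLE_NVSHMEM",
    "NVTE_UB_WITH_MPI",
)


def missing_exports(script_text: str, names: Sequence[str] = ADVERTISED_FLAGS) -> List[str]:
    """Return the names that have no `export NAME=...` line in script_text."""
    exported = set(
        re.findall(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=", script_text, flags=re.MULTILINE)
    )
    return [name for name in names if name not in exported]


# Illustrative sample containing only the export quoted in this thread.
sample_script = "# Enable optional build features.\nexport NVTE_WITH_CUSOLVERMP=1\n"
missing = missing_exports(sample_script)
```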

@ksivaman ksivaman changed the title Add optional core lib features to wheel build Add wheel support for Newton-Schulz method via cuSolverMp May 19, 2026
vcherepanov-nv previously approved these changes May 20, 2026
Collaborator

@vcherepanov-nv vcherepanov-nv left a comment

LGTM

@vcherepanov-nv vcherepanov-nv dismissed their stale review May 20, 2026 00:39

On second thought: we need the nvidia-cusolvermp-cu12/cu13 dependency at runtime, not just when building TE/Common.

Comment thread transformer_engine/common/__init__.py
@ksivaman
Member Author

ksivaman commented Jun 2, 2026

/te-ci

vcherepanov-nv previously approved these changes Jun 2, 2026
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
@ksivaman
Member Author

ksivaman commented Jun 3, 2026

/te-ci

Comment thread setup.py
"pydantic",
"importlib-metadata>=1.0",
"packaging",
cusolvermp_pypi_package_name(),
Contributor

P1 cuSolverMp added as unconditional install requirement

cusolvermp_pypi_package_name() is appended to install_reqs with no guard on NVTE_WITH_CUSOLVERMP, so the generated wheel's METADATA always lists nvidia-cusolvermp-cu12 (or cu13) as a mandatory runtime dependency — even when TE is built without cuSolverMp support (the default, since CMakeLists.txt has option(NVTE_WITH_CUSOLVERMP … OFF)). Any downstream user who installs a source-built wheel compiled without the flag will be forced to pull in the cuSolverMp library unnecessarily, and a pip solve in an environment that lacks the package will fail entirely.

The requirement should mirror the cmake-flag guard already used for cublasmp (lines 75-80):

if bool(int(os.getenv("NVTE_WITH_CUSOLVERMP", "0"))):
    install_reqs.append(cusolvermp_pypi_package_name())
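
A self-contained version of the suggested guard, for experimentation outside setup.py, might look like this. It is a sketch under stated assumptions: cusolvermp_requirement is a hypothetical stand-in for cusolvermp_pypi_package_name(), whose body is not shown in this thread, and build_install_reqs is an illustrative wrapper rather than a function in the repository.

```python
from typing import List, Mapping


def env_flag(env: Mapping[str, str], name: str) -> bool:
    """Interpret an environment variable as a 0/1 flag, defaulting to off."""
    return bool(int(env.get(name, "0")))


def cusolvermp_requirement(cuda_major: int) -> str:
    # Hypothetical stand-in for cusolvermp_pypi_package_name(); it only
    # follows the nvidia-cusolvermp-cuXX naming used in this review.
    return f"nvidia-cusolvermp-cu{cuda_major}"


def build_install_reqs(env: Mapping[str, str], cuda_major: int = 12) -> List[str]:
    """Add the cuSolverMp requirement only when the build flag is set."""
    reqs = ["pydantic", "importlib-metadata>=1.0", "packaging"]
    if env_flag(env, "NVTE_WITH_CUSOLVERMP"):
        reqs.append(cusolvermp_requirement(cuda_major))
    return reqs
```

Because build_wheels.sh exports NVTE_WITH_CUSOLVERMP=1, the published wheels would still carry the runtime dependency under this guard, while source builds without the flag would not.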
