Add wheel support for Newton-Schulz method via cuSolverMp by ksivaman · Pull Request #3004 · NVIDIA/TransformerEngine

ksivaman · 2026-05-17T01:37:56Z

Description

#2706 added a distributed Newton-Schulz matrix orthogonalization API via cuSolverMp; this PR brings support for it to the published wheels.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Enable NVTE_WITH_CUSOLVERMP TE build via PyPI wheel.

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-05-17T01:40:52Z

Greptile Summary

This PR enables cuSolverMp-backed Newton-Schulz matrix orthogonalization in published PyPI wheels by installing libcusolvermp from system packages in the wheel-build Dockerfiles, wiring up CUSOLVERMP_HOME, and adding nvidia-cusolvermp-cuXX as a runtime dependency.

Dockerfiles (x86 + aarch64): add libcusolvermp0-cuda-* / -devel-* via dnf, create a stable /opt/nvidia/cusolvermp symlink tree, register the lib path with ldconfig, and export CUSOLVERMP_HOME — all in a single RUN layer.
build_wheels.sh: exports NVTE_WITH_CUSOLVERMP=1 before any setup.py invocation so CMake picks up the flag.
setup.py + build_tools/utils.py: introduces cusolvermp_pypi_package_name() and unconditionally appends the result to install_reqs.
transformer_engine/common/__init__.py: adds a non-strict _load_cuda_library_from_python("cusolverMp") call at import time so the shared library is pre-loaded with RTLD_GLOBAL.

Confidence Score: 4/5

The build infrastructure changes are correct, but the runtime library loading path and wheel metadata generation both have correctness issues likely to break the feature for end users.

The Dockerfile changes look solid: system packages are installed, symlinks are created, and ldconfig is called in the same layer. However, the runtime path in __init__.py searches for nvidia/cusolverMp/ (mixed case) while the installed Python package places its directory at nvidia/cusolvermp/ (all lowercase) on a case-sensitive Linux filesystem, so the RTLD_GLOBAL preload is silently skipped. Combined with the unconditional nvidia-cusolvermp-cuXX entry in install_reqs with no guard on NVTE_WITH_CUSOLVERMP, the generated wheel metadata forces this dependency on all TE users regardless of whether cuSolverMp support was compiled in.

setup.py and transformer_engine/common/__init__.py need the most attention: the dependency guard and the library name casing are the two issues most likely to affect end users.
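The casing problem described above can be illustrated with a small, self-contained sketch. It does not reproduce TE's `_load_cuda_library_from_python`; `find_package_dir` and the temporary directory layout are hypothetical and exist only to show why an exact-name lookup for `cusolverMp` misses a directory named `cusolvermp`.

```python
import os
import tempfile


def find_package_dir(nvidia_root, lib_name, ignore_case=False):
    """Return the subdirectory of nvidia_root matching lib_name, or None.

    Names are compared as strings taken from os.listdir, so the result does
    not depend on whether the underlying filesystem is case-sensitive.
    """
    for entry in sorted(os.listdir(nvidia_root)):
        if ignore_case:
            same = entry.lower() == lib_name.lower()
        else:
            same = entry == lib_name
        if same and os.path.isdir(os.path.join(nvidia_root, entry)):
            return os.path.join(nvidia_root, entry)
    return None


def demo():
    # Hypothetical layout mimicking site-packages/nvidia/cusolvermp/lib.
    with tempfile.TemporaryDirectory() as site_packages:
        nvidia_root = os.path.join(site_packages, "nvidia")
        os.makedirs(os.path.join(nvidia_root, "cusolvermp", "lib"))
        exact = find_package_dir(nvidia_root, "cusolverMp")
        relaxed = find_package_dir(nvidia_root, "cusolverMp", ignore_case=True)
        return exact, os.path.basename(relaxed)


if __name__ == "__main__":
    print(demo())
```

With a non-strict loader, the `None` result from the exact lookup corresponds to the silent skip the review describes; normalizing the name (or passing the lowercase name) is one possible way to avoid it.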

Important Files Changed

Filename	Overview
build_tools/wheel_utils/Dockerfile.x86	Installs libcusolvermp system packages, creates /opt/nvidia/cusolvermp symlink tree, and sets CUSOLVERMP_HOME; ldconfig is correctly called in the same RUN layer for the cusolvermp conf.
build_tools/wheel_utils/Dockerfile.aarch	Identical cuSolverMp additions as Dockerfile.x86 for aarch64; uses /usr/lib64 which is correct for RHEL-based manylinux images.
build_tools/wheel_utils/build_wheels.sh	Exports NVTE_WITH_CUSOLVERMP=1 to enable the feature in all wheel builds.
setup.py	cusolvermp_pypi_package_name() is appended to install_reqs unconditionally with no NVTE_WITH_CUSOLVERMP guard, unlike the guarded cublasmp pattern; all wheels will carry the dependency regardless of build flags.
transformer_engine/common/__init__.py	Adds non-strict _load_cuda_library_from_python("cusolverMp") at import time; the mixed-case lib_name will not match the lowercase nvidia/cusolvermp/ directory on case-sensitive Linux, so the preload silently skips.
build_tools/utils.py	Adds cusolvermp_pypi_package_name() helper and cleans up the PackageNotFoundError import; logic is straightforward and correct.

Sequence Diagram

sequenceDiagram
    participant Docker as Dockerfile x86/aarch
    participant Script as build_wheels.sh
    participant Setup as setup.py
    participant CMake as CMake build
    participant Init as __init__.py runtime
    participant PyPI as nvidia-cusolvermp-cuXX

    Docker->>Docker: dnf install libcusolvermp0 and devel
    Docker->>Docker: symlink to /opt/nvidia/cusolvermp include and lib
    Docker->>Docker: ldconfig and export CUSOLVERMP_HOME
    Docker->>Script: run build_wheels.sh
    Script->>Script: "export NVTE_WITH_CUSOLVERMP=1"
    Script->>Setup: python setup.py bdist_wheel
    Setup->>Setup: "te_compile_args adds DNVTE_WITH_CUSOLVERMP=ON"
    Setup->>CMake: cmake build links against system libcusolvermp
    Setup->>Setup: setup_requirements adds nvidia-cusolvermp-cuXX unconditionally
    Setup-->>Script: wheel with METADATA requiring nvidia-cusolvermp-cuXX

    Note over Init,PyPI: At user runtime after pip install
    Init->>Init: "_load_cuda_library_from_python cusolverMp strict=False"
    Init->>PyPI: search nvidia/cusolverMp/ case-sensitive lookup
    PyPI-->>Init: not found actual dir is nvidia/cusolvermp/ lowercase
    Init->>Init: silently continues RTLD_GLOBAL preload skipped

greptile-apps · 2026-05-17T01:40:56Z

+
+SITE_PACKAGES=$(/opt/python/cp310-cp310/bin/python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
+export CUBLASMP_HOME="${SITE_PACKAGES}/nvidia/cublasmp/cu${CUDA_MAJOR}"
+export CUSOLVERMP_HOME="${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR}"


Likely incorrect CUSOLVERMP_HOME path

The path ${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR} is missing the package-name segment. Every other NVIDIA Python package follows the layout site-packages/nvidia/<package-name>/cu<ver>/ — for example, nvidia-cublasmp-cu12 installs under nvidia/cublasmp/cu12/, so nvidia-cusolvermp-cu12 should install under nvidia/cusolvermp/cu12/. With the current path the .so symlink loop silently skips cuSolverMP's lib/ directory ([ -d "$lib_dir" ] || continue), no unversioned .so stubs are created, and the linker will not find cuSolverMP at build time even though NVTE_WITH_CUSOLVERMP=1 is exported.
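As a rough illustration of the layout the reviewer describes, the sketch below builds the expected home directory from its three parts. `nvidia_package_home` is a hypothetical helper, and the `nvidia/<package-name>/cu<ver>/` layout is taken from the comment above rather than verified against the published `nvidia-cusolvermp` package.

```python
from pathlib import PurePosixPath


def nvidia_package_home(site_packages, package, cuda_major):
    """Build site-packages/nvidia/<package>/cu<major> as a POSIX path string."""
    return str(PurePosixPath(site_packages) / "nvidia" / package / f"cu{cuda_major}")


if __name__ == "__main__":
    site_packages = "/opt/venv/lib/python3.10/site-packages"  # example value only
    # Path with the package-name segment, as the review suggests.
    print(nvidia_package_home(site_packages, "cusolvermp", 12))
    # Same pattern as the existing CUBLASMP_HOME export in the diff.
    print(nvidia_package_home(site_packages, "cublasmp", 12))
```

Building both paths through the same helper makes a dropped package-name segment harder to introduce by accident.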

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

greptile-apps · 2026-05-19T23:36:01Z

+# Enable optional build features. cuSolverMp is provided by the build image
+# (see Dockerfile.x86 / Dockerfile.aarch), which also sets CUSOLVERMP_HOME.
+export NVTE_WITH_CUSOLVERMP=1


Three of the four advertised flags never get exported

The PR description and title claim to enable NVTE_WITH_CUSOLVERMP, NVTE_WITH_CUBLASMP, NVTE_ENABLE_NVSHMEM, and NVTE_UB_WITH_MPI in the wheel build. Only NVTE_WITH_CUSOLVERMP is exported here. None of NVTE_WITH_CUBLASMP, NVTE_ENABLE_NVSHMEM, or NVTE_UB_WITH_MPI is exported in build_wheels.sh, and no corresponding packages (cuBLASMP, NVSHMEM, OpenMPI) are installed in either Dockerfile. Wheels built from this script will silently omit those three features.

vcherepanov-nv

LGTM

On second thought, we need the nvidia-cusolvermp-cu12/cu13 dependency at runtime, not just when building TE/Common.

Signed-off-by: ksivamani <ksivamani@nvidia.com>

ksivaman · 2026-06-02T21:24:06Z

/te-ci

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

ksivaman · 2026-06-03T19:14:01Z

/te-ci

greptile-apps · 2026-06-03T19:18:23Z

        "pydantic",
        "importlib-metadata>=1.0",
        "packaging",
+        cusolvermp_pypi_package_name(),


cuSolverMp added as unconditional install requirement

cusolvermp_pypi_package_name() is appended to install_reqs with no guard on NVTE_WITH_CUSOLVERMP, so the generated wheel's METADATA always lists nvidia-cusolvermp-cu12 (or cu13) as a mandatory runtime dependency — even when TE is built without cuSolverMp support (the default, since CMakeLists.txt has option(NVTE_WITH_CUSOLVERMP … OFF)). Any downstream user who installs a source-built wheel compiled without the flag will be forced to pull in the cuSolverMp library unnecessarily, and a pip solve in an environment that lacks the package will fail entirely.

The requirement should mirror the cmake-flag guard already used for cublasmp (lines 75-80):

if bool(int(os.getenv("NVTE_WITH_CUSOLVERMP", "0"))):
    install_reqs.append(cusolvermp_pypi_package_name())
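A minimal, self-contained sketch of the guard suggested above follows. It is not the actual setup.py: `env_flag` and `build_install_reqs` are hypothetical names, and `cusolvermp_requirement` is only a stand-in for `cusolvermp_pypi_package_name()`, whose real implementation in build_tools/utils.py is not shown in this thread.

```python
import os


def env_flag(name, default="0", environ=None):
    """Interpret an environment variable such as NVTE_WITH_CUSOLVERMP as a bool."""
    environ = os.environ if environ is None else environ
    return bool(int(environ.get(name, default)))


def cusolvermp_requirement(cuda_major):
    # Hypothetical stand-in for cusolvermp_pypi_package_name(); the naming
    # pattern nvidia-cusolvermp-cuXX is taken from the review text.
    return f"nvidia-cusolvermp-cu{cuda_major}"


def build_install_reqs(cuda_major, environ=None):
    """Return the install requirements, adding cuSolverMp only when enabled."""
    reqs = ["pydantic", "importlib-metadata>=1.0", "packaging"]
    if env_flag("NVTE_WITH_CUSOLVERMP", environ=environ):
        reqs.append(cusolvermp_requirement(cuda_major))
    return reqs


if __name__ == "__main__":
    print(build_install_reqs(12, {"NVTE_WITH_CUSOLVERMP": "1"}))
    print(build_install_reqs(12, {}))
```

Because build_wheels.sh exports NVTE_WITH_CUSOLVERMP=1, published wheels would still carry the runtime dependency under this scheme, while source builds that leave the flag unset would not.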

ksivaman requested review from cyanguwa, denera and mk-61 May 17, 2026 01:37

ksivaman marked this pull request as draft May 17, 2026 01:38

greptile-apps Bot reviewed May 17, 2026

Add NS via cusolvermp to wheel build

df140b3

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

ksivaman force-pushed the expand_wheel_builds branch from 522c631 to df140b3 Compare May 19, 2026 23:33

ksivaman marked this pull request as ready for review May 19, 2026 23:34

greptile-apps Bot reviewed May 19, 2026

ksivaman changed the title ~~Add optional core lib features to wheel build~~ Add wheel support for Newton-Schulz method via cuSolverMp May 19, 2026

vcherepanov-nv previously approved these changes May 20, 2026

ksivaman added 3 commits May 26, 2026 11:55

Merge branch 'NVIDIA:main' into expand_wheel_builds

0fe4daf

Merge branch 'main' into expand_wheel_builds

de2dd20

Build dep runtime

50f1753

Signed-off-by: ksivamani <ksivamani@nvidia.com>

greptile-apps Bot reviewed Jun 2, 2026

Comment thread transformer_engine/common/__init__.py

vcherepanov-nv previously approved these changes Jun 2, 2026

Fix

ccaccd5

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

ksivaman dismissed vcherepanov-nv’s stale review via ccaccd5 June 3, 2026 19:13

Merge branch 'main' into expand_wheel_builds

123f778

vcherepanov-nv approved these changes Jun 3, 2026

greptile-apps Bot reviewed Jun 3, 2026

Merge branch 'main' into expand_wheel_builds

1316c1e

ksivaman wants to merge 7 commits into NVIDIA:main from ksivaman:expand_wheel_builds
