Newton-Schulz via cuSOLVERMp#2706
Conversation
Greptile SummaryThis PR adds a distributed Newton-Schulz matrix orthogonalization API backed by cuSolverMp, with RAII-managed C++ context lifecycle, a Python Confidence Score: 4/5Safe to merge after addressing a handful of previously-identified items that remain open (option() placement, num_coefficients dead parameter, devInfo nullptr). The one new finding (Literal type annotation) is minor and non-blocking at runtime. All P0 concerns from prior threads have been resolved (RAII handles, contiguity/dtype guards, test timeout, row-major transpose trick correctly applied). Several P1/P2 items flagged in prior rounds remain open per the thread (option() placement, num_coefficients unused, devInfo nullptr convergence diagnostics, debug print in build_tools/utils.py), which keeps the score at 4 rather than 5. transformer_engine/pytorch/newton_schulz.py (Literal annotation), transformer_engine/common/newton_schulz/newton_schulz.cpp (devInfo nullptr, unused num_coefficients — both from prior threads still open) Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant CusolverMpCtx
participant newton_schulz_py as newton_schulz (Python)
participant tex as transformer_engine_torch (C++ ext)
participant cpp as nvte_newton_schulz (C++)
participant cuSolverMp
Caller->>CusolverMpCtx: CusolverMpCtx(group)
CusolverMpCtx->>tex: cusolvermp_ctx_create(nccl_comm_ptr, nranks, rank)
tex->>cpp: nvte_cusolvermp_ctx_create(comm, nranks, rank)
cpp->>cuSolverMp: cusolverMpCreate + cusolverMpCreateDeviceGrid
cpp-->>tex: NVTECusolverMpCtx*
tex-->>CusolverMpCtx: ctx_ptr (int64)
Caller->>newton_schulz_py: newton_schulz(x_local, ctx, num_iterations, coefficients)
newton_schulz_py->>newton_schulz_py: validate dtype, contiguity, dims
newton_schulz_py->>tex: newton_schulz(ctx_ptr, m, n, x, iters, coeffs)
tex->>cpp: nvte_newton_schulz(ctx, m, n, x_tensor, iters, coeffs, caller_stream)
cpp->>cpp: record in_ready event on caller_stream
cpp->>cpp: wait on internal stream
cpp->>cuSolverMp: cusolverMpNewtonSchulz_bufferSize
cpp->>cpp: allocate/grow workspace (ncclMemAlloc or cudaMalloc)
cpp->>cuSolverMp: cusolverMpNewtonSchulz (in-place on x^T)
cpp->>cpp: record out_ready event, caller stream waits
cpp-->>newton_schulz_py: return
newton_schulz_py-->>Caller: x_local modified in-place
Caller->>CusolverMpCtx: ctx.destroy()
CusolverMpCtx->>tex: cusolvermp_ctx_destroy(ctx_ptr)
tex->>cpp: nvte_cusolvermp_ctx_destroy(ctx)
cpp->>cuSolverMp: free workspace + destroy grid + destroy handle
Reviews (34): Last reviewed commit: "Couple num_iterations with coeff types i..." | Re-trigger Greptile |
| # Check: if X = A^{-1/2}, then X @ A @ X should be the identity matrix | ||
| if rank == 0: | ||
| XXT = X @ X.t() | ||
| I = torch.eye(N, device=XXT.device, dtype=XXT.dtype) | ||
| max_diff = (XXT - I).abs().max().item() | ||
| print(f"Max |X @ X.t() - I|: {max_diff:.6e}", flush=True) |
There was a problem hiding this comment.
verification doesn't match the comment - if X = A^{-1/2}, the check should be X @ A @ X ≈ I, not X @ X.t() ≈ I. The current check verifies X is orthogonal, not that X is the inverse square root of A. Note that A_orig is created on line 76 but never used.
| # Check: if X = A^{-1/2}, then X @ A @ X should be the identity matrix | |
| if rank == 0: | |
| XXT = X @ X.t() | |
| I = torch.eye(N, device=XXT.device, dtype=XXT.dtype) | |
| max_diff = (XXT - I).abs().max().item() | |
| print(f"Max |X @ X.t() - I|: {max_diff:.6e}", flush=True) | |
| # Check: if X = A^{-1/2}, then X @ A @ X should be the identity matrix | |
| XAX = X @ A_orig @ X | |
| I = torch.eye(N, device=XAX.device, dtype=XAX.dtype) | |
| max_diff = (XAX - I).abs().max().item() | |
| print(f"Max |X @ A @ X - I|: {max_diff:.6e}", flush=True) | |
| if torch.allclose(XAX, I, atol=args.atol, rtol=args.rtol): |
| nccl_backend = group._get_backend(torch.device("cuda")) | ||
| return nccl_backend._comm_ptr() |
There was a problem hiding this comment.
uses private PyTorch APIs (_get_backend, _comm_ptr) that may change in future versions
| quintic_coefficients = [ | ||
| 4.0848, | ||
| -6.8946, | ||
| 2.9270, | ||
| 3.9505, | ||
| -6.3029, | ||
| 2.6377, | ||
| 3.7418, | ||
| -5.5913, | ||
| 2.3037, | ||
| 2.8769, | ||
| -3.1427, | ||
| 1.2046, | ||
| 2.8366, | ||
| -3.0525, | ||
| 1.2012, | ||
| ] | ||
| coefficients = ( | ||
| quintic_coefficients if args.num_iterations == 5 else [1.5, -0.5, 0.0] * args.num_iterations | ||
| ) |
There was a problem hiding this comment.
coefficients mismatch with API defaults - test uses 15 coefficients for 5 iterations, but newton_schulz.py defaults to 5 coefficients. This inconsistency means default API behavior isn't tested.
| * \brief Functions for distributed Newton-Schulz inverse square root. | ||
| * | ||
| * This API is a TE-native binding to the cuSolverMp library. | ||
| * It computes an iterative Newton-Schulz inverse square root | ||
| * approximation on a distributed matrix. |
There was a problem hiding this comment.
Documentation claims this computes "inverse square root" but the test validates orthogonality (X @ X.t() ≈ I), and commit dd1dd0b states "it approximates orthogonal matrix, not inverse square root". If this computes the polar decomposition (orthogonal factor), the documentation should be updated to reflect that. Inverse square root would satisfy X @ A @ X ≈ I, which is different from orthogonality.
| m = x.size(0) * nranks # rows are distributed across ranks | ||
| n = x.size(1) |
There was a problem hiding this comment.
Assumes rows are evenly distributed (m = x.size(0) * nranks) but doesn't validate this. If matrix size isn't divisible by nranks, the computed global size m will be incorrect, leading to wrong results from cuSOLVERMp. Consider adding validation:
| m = x.size(0) * nranks # rows are distributed across ranks | |
| n = x.size(1) | |
| # Global matrix dimensions | |
| # Rows must be evenly distributed across ranks | |
| local_rows = x.size(0) | |
| m = local_rows * nranks | |
| n = x.size(1) |
Then add a validation check that all ranks have the same local_rows via dist.all_reduce.
| num_iterations: int = 5, | ||
| coefficients: Optional[List[float]] = None, | ||
| ) -> None: | ||
| """Compute Newton-Schulz inverse square root in-place on a distributed matrix. |
There was a problem hiding this comment.
Docstring says "inverse square root" but test checks orthogonality. Update to match actual behavior (see comment on header file).
| from transformer_engine.pytorch import optimizers | ||
| from transformer_engine.pytorch.export import onnx_export | ||
| from transformer_engine.pytorch.cross_entropy import parallel_cross_entropy | ||
| from transformer_engine.pytorch.newton_schulz import newton_schulz |
There was a problem hiding this comment.
Unconditional import of optional feature
newton_schulz is unconditionally imported and exported as part of the public API, even when TE is built without NVTE_WITH_CUSOLVERMP. While the function itself raises a runtime error when called, this exposes the symbol to all users and makes it appear as a supported feature in auto-complete and docs. Consider guarding this import behind a check (similar to how other optional features are handled), or at minimum adding a note in the docstring that the function requires NVTE_WITH_CUSOLVERMP=1 at build time.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| ctx_ptr = tex.cusolvermp_ctx_create(nccl_comm_ptr, nranks, rank) | ||
| try: | ||
| tex.newton_schulz(ctx_ptr, m, n, x, num_iterations, coefficients) | ||
| finally: | ||
| tex.cusolvermp_ctx_destroy(ctx_ptr) |
There was a problem hiding this comment.
Context created/destroyed per call wastes resources
A new NVTECusolverMpCtx is created and destroyed on every invocation of newton_schulz. Context creation involves cudaStreamCreate, two cudaEventCreate calls, cusolverMpCreate, and cusolverMpCreateDeviceGrid — all of which are heavyweight operations. And since the context is destroyed afterward, the grow-only workspace caching in the C++ layer (lines 170-177 of newton_schulz.cpp) is never actually reused.
Consider caching the context (e.g., in a module-level dict keyed by (nccl_comm_ptr, nranks, rank)) and reusing it across calls, or exposing the context lifecycle to callers so they can amortize the cost when calling newton_schulz repeatedly in a training loop.
| assert ( | ||
| len(coefficients) == num_iterations * 3 | ||
| ), f"Unexpected number of coefficients: {len(coefficients)} for {num_iterations} iterations" |
There was a problem hiding this comment.
use ValueError instead of assert for validation - assert can be disabled with Python's -O flag
| assert ( | |
| len(coefficients) == num_iterations * 3 | |
| ), f"Unexpected number of coefficients: {len(coefficients)} for {num_iterations} iterations" | |
| if len(coefficients) != num_iterations * 3: | |
| raise ValueError( | |
| f"Unexpected number of coefficients: {len(coefficients)} for {num_iterations} iterations" | |
| ) |
| PATHS ${CUSOLVERMP_DIR} | ||
| PATH_SUFFIXES lib | ||
| REQUIRED) | ||
| target_link_libraries(transformer_engine PUBLIC ${CUSOLVERMP_LIB}) |
There was a problem hiding this comment.
PUBLIC linkage exposes cuSOLVERMp to all downstream consumers of transformer_engine library. Since newton_schulz.h doesn't expose cuSOLVERMp types in the public API, PRIVATE linkage would provide better encapsulation (consumers don't need cuSOLVERMp at link time).
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| f"'nm' failed on {lib_path} (exit code {e.returncode}):\n{e.stderr}" | ||
| ) from e | ||
|
|
||
| return symbol in result.stdout |
There was a problem hiding this comment.
Substring match can produce false positives
symbol in result.stdout does a plain substring search over the entire nm output. If the library ever contains a symbol that has the target symbol as a prefix (e.g. nvte_cusolvermp_ctx_create_with_options), this check will incorrectly return True and enable cuSolverMp support in the PyTorch extension even though the real symbol is absent.
The nm -D output format is <address> <type> <name>\n per line. A safer check is to match against word boundaries:
import re
return bool(re.search(r'\b' + re.escape(symbol) + r'\b', result.stdout))or equivalently test for whitespace/line boundaries:
return any(line.split()[-1] == symbol for line in result.stdout.splitlines() if line.strip())
Additional Comments (1)
The CMake convention is to declare |
| #ifndef TRANSFORMER_ENGINE_COMMON_NEWTON_SCHULZ_H_ | ||
| #define TRANSFORMER_ENGINE_COMMON_NEWTON_SCHULZ_H_ | ||
|
|
||
| #include <nccl.h> |
There was a problem hiding this comment.
Unconditional #include <nccl.h> in a public header.
newton_schulz.h is installed as a public header (under include/transformer_engine/). The unconditional #include <nccl.h> means that any downstream project that includes this header — even one with no interest in Newton-Schulz — now requires NCCL in its include path.
ncclComm_t is only used in the function signatures of nvte_cusolvermp_ctx_create and nvte_newton_schulz, which are themselves only meaningful when NVTE_WITH_CUSOLVERMP is defined. Guarding the include and the declarations together would prevent the leakage:
| #include <nccl.h> | |
| #ifdef NVTE_WITH_CUSOLVERMP | |
| #include <nccl.h> | |
| // ... struct and function declarations ... | |
| #endif // NVTE_WITH_CUSOLVERMP |
There was a problem hiding this comment.
I'm also a little uneasy at exposing NCCL as a required dependency, but I see we already import the NCCL header elsewhere in the codebase:
There was a problem hiding this comment.
We use NCCL types (ncclComm_t) in nvte_ctx_create API, so if we're taking the route of always defining the functions - we need to pull NCCL headers unconditionally.
f11c55f to
2cf55f6
Compare
Additional Comments (3)
When This scenario is realistic: the common library might be built in a non-standard path, or a developer may be doing an incremental build of only the PyTorch extension. The function should return if lib_path is None:
import warnings
warnings.warn(
"Could not find libtransformer_engine.so; cuSOLVERMp auto-detection disabled.",
stacklevel=2,
)
return FalseThe
The file correctly skips when fewer than 2 GPUs are present, but there is no skip when TE was built without A simple guard at module level (or inside the test) would mirror the GPU-count check: import transformer_engine_torch as tex
if not hasattr(tex, "newton_schulz"):
pytest.skip(
"Newton-Schulz tests require TE built with NVTE_WITH_CUSOLVERMP=1.",
allow_module_level=True,
)
When If 15-iteration runs use the test's custom coefficients (which are different from both defaults), the fallback path is also never exercised by the current tests with the default API — making the generic path untested in practice. Consider either:
|
Additional Comments (12)
The test is added without any guard for Add a conditional:
The file skips when fewer than 2 GPUs are available, but not when Transformer Engine is built without Add a module-level skip check:
The cuSolverMp block silently defaults to Align with the NVSHMEM pattern:
Lines 20-21 use Consider using public APIs or documenting this dependency clearly in comments, noting that this code may need updates with new PyTorch releases.
The assertion on line 65 validates user input. Assertions can be disabled with Python's
The C++ code calls Also add dtype validation since the docstring specifies
The distributed test subprocess has no timeout. If it deadlocks (e.g., NCCL communication issue), the test will block indefinitely, hanging the CI job. Add a timeout:
cuSOLVERMp is linked with
The Move the
Guard this import similarly to other optional features, or add a note in the docstring that
Raw CUDA handles ( Wrap each handle in an RAII type (similar to |
Additional Comments (4)
When Compare to the if bool(int(os.getenv("NVTE_WITH_CUSOLVERMP", "0"))):
assert (
os.getenv("CUSOLVERMP_HOME") is not None
), "CUSOLVERMP_HOME must be set when compiling with NVTE_WITH_CUSOLVERMP=1"
cusolvermp_home = Path(os.getenv("CUSOLVERMP_HOME"))
...
This test is added unconditionally to the QA script, so it will always execute even when TE is built without Add a build-flag guard matching the build configuration: if [ "${NVTE_WITH_CUSOLVERMP:-0}" = "1" ]; then
python3 -m pytest -v -s --junitxml=$XML_LOG_DIR/pytest_test_newton_schulz.xml $TE_PATH/tests/pytorch/distributed/test_newton_schulz.py || test_fail "test_newton_schulz.py"
fi
The test only skips when fewer than 2 GPUs are available, but does not check whether TE was built with Add an early skip guard: import transformer_engine_torch as tex
if not hasattr(tex, "newton_schulz"):
pytest.skip("Newton-Schulz tests require TE built with NVTE_WITH_CUSOLVERMP=1.", allow_module_level=True)
When Consider either:
|
There was a problem hiding this comment.
I think we agreed on putting this file in transformer_engine/pytorch/optimizers?
Also, we want to fully replicate the functionality of newton_schulz_tp, so may need to take a look at supporting the partition_dim and mode parameters? Looking at Emerging-Optimizers' implementation, I think it's mostly wrapper code that we need to add in Python.
There was a problem hiding this comment.
Yea, I'd suggest something liketransformer_engine/pytorch/optimizers/muon.
There was a problem hiding this comment.
Actually, this depends on how we're planning on exposing this. If TE owns an implementation of Muon, then I agree we can treat this as internal utilities. If the Muon implementation lives in Megatron-LM or somewhere else, then we should treat this as a general API in something like transformer_engine/pytorch/cusolvermp.
| dist.all_gather(gathered, x_local) | ||
| X = torch.cat(gathered, dim=0) | ||
|
|
||
| # Check: the resulting matrix should be orthogonal |
There was a problem hiding this comment.
Checking whether the result is orthogonal might not be enough - I think we said we'd compare the actual values of the matrix with the result of a non-distributed version?
| @pytest.mark.parametrize("dtype", ["float32", "bfloat16"]) | ||
| @pytest.mark.parametrize("matrix_size", [256]) | ||
| @pytest.mark.parametrize("num_iterations", [5, 15]) | ||
| def test_newton_schulz(dtype, matrix_size, num_iterations): |
There was a problem hiding this comment.
Could we expand the tests a bit to cover more sizes, shapes, etc?
https://github.com/NVIDIA-NeMo/Emerging-Optimizers/blob/main/tests/test_muon_utils.py
16ba1fd to
ea4db43
Compare
Add a new distributed Newton-Schulz inverse square root API to Transformer Engine's common C library. This wraps the cusolverMpNewtonSchulz library function, following the same pattern as the existing cuBLASMp integration for comm_gemm. New files: - newton_schulz.h: Public C API header with context management and computation functions - newton_schulz/newton_schulz.cpp: Implementation with RAII wrappers for cuSolverMp handles Build integration: - New NVTE_WITH_CUSOLVERMP CMake option and CUSOLVERMP_HOME env var - NVTE_CHECK_CUSOLVERMP error checking macro in logging.h - Conditional compilation guarded by NVTE_WITH_CUSOLVERMP Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Add PyTorch-level bindings for the cuSolverMp Newton-Schulz inverse square root API introduced in the previous commit. New files: - pytorch/csrc/extensions/newton_schulz.cpp: C++ extension wrapping the C API with PyTorch tensor support - pytorch/newton_schulz.py: Python wrapper that extracts NCCL communicator from torch.distributed ProcessGroup - tests/pytorch/distributed/test_newton_schulz.py: pytest launcher - tests/pytorch/distributed/run_newton_schulz.py: distributed test worker with reference implementation for numerical validation Modified files: - pytorch/csrc/extensions.h: Function declarations - pytorch/csrc/extensions/pybind.cpp: pybind11 registrations - pytorch/__init__.py: Public API export Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Fix API mismatches discovered during compilation: - cusolverMpCreate takes (handle*, deviceId, stream), not (handle*, stream) - cusolverMpCreateDeviceGrid takes handle as first arg with different parameter order - Use cusolverMpGridMapping_t (not cusolverMpGridLayout_t) and CUSOLVERMP_GRID_MAPPING_COL_MAJOR - cusolverMpCreateMatrixDesc has different parameter order: (desc*, grid, dtype, M, N, MB, NB, RSRC, CSRC, LLD) - cusolverMpNewtonSchulzDescriptorCreate takes only (nsDesc*) with no iteration/coefficient args - No cusolverMpStreamSet exists; create handle per-call with user stream - cusolverMpNewtonSchulz requires computeType and info parameters - Switch from generic template RAII to explicit deleter structs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
…build Add NVTE_WITH_CUSOLVERMP compiler define and cusolverMp include/library paths to the PyTorch C++ extension build, following the same pattern as NVTE_UB_WITH_MPI and NVTE_ENABLE_NVSHMEM. Without this, the #ifdef NVTE_WITH_CUSOLVERMP guards in the PyTorch extension code would never be active since the define was only set as PRIVATE in the CMake build for the common library. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Two fixes: - Use ProcessGroupNCCL._comm_ptr() to extract the raw NCCL communicator pointer instead of the non-existent get_nccl_comm() method - Pass global matrix dimensions (m, n) from Python to C++ instead of using local tensor dimensions, which would produce incorrect ScaLAPACK block sizes in the distributed computation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
cuSolverMp handle and grid creation are expensive operations. Move them from per-call creation in nvte_newton_schulz into the NVTECusolverMpCtx, which is their natural home — the context exists to encapsulate the grid. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
cuSolverMp cannot work with the default CUDA stream. Create a dedicated stream inside nvte_cusolvermp_ctx_create and remove the stream parameter from both C API functions since the context now owns its stream. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
The internal dedicated stream was reading the input tensor before the caller's stream had finished producing it, resulting in all-zero output. Add event-based synchronisation: the internal stream waits for the caller's input to be ready, and the caller's stream waits for the output to be written. Replaces the blocking cudaStreamSynchronize. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
There was a problem hiding this comment.
Yea, I'd suggest something liketransformer_engine/pytorch/optimizers/muon.
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by: vcherepanov-nv <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: vcherepanov-nv <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
…t cuSOLVERMp support Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
for more information, see https://pre-commit.ci
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
timmoon10
left a comment
There was a problem hiding this comment.
We could tweak the design of the Python wrapper for the cuSOLVERMp context, but otherwise this looks good to me.
| #ifndef TRANSFORMER_ENGINE_COMMON_NEWTON_SCHULZ_H_ | ||
| #define TRANSFORMER_ENGINE_COMMON_NEWTON_SCHULZ_H_ | ||
|
|
||
| #include <nccl.h> |
There was a problem hiding this comment.
I'm also a little uneasy at exposing NCCL as a required dependency, but I see we already import the NCCL header elsewhere in the codebase:
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by: vcherepanov-nv <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
for more information, see https://pre-commit.ci
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
for more information, see https://pre-commit.ci
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* [Common] Add Newton-Schulz inverse square root C API via cuSolverMp Add a new distributed Newton-Schulz inverse square root API to Transformer Engine's common C library. This wraps the cusolverMpNewtonSchulz library function, following the same pattern as the existing cuBLASMp integration for comm_gemm. New files: - newton_schulz.h: Public C API header with context management and computation functions - newton_schulz/newton_schulz.cpp: Implementation with RAII wrappers for cuSolverMp handles Build integration: - New NVTE_WITH_CUSOLVERMP CMake option and CUSOLVERMP_HOME env var - NVTE_CHECK_CUSOLVERMP error checking macro in logging.h - Conditional compilation guarded by NVTE_WITH_CUSOLVERMP Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [PyTorch] Add Newton-Schulz PyTorch bindings and distributed tests Add PyTorch-level bindings for the cuSolverMp Newton-Schulz inverse square root API introduced in the previous commit. New files: - pytorch/csrc/extensions/newton_schulz.cpp: C++ extension wrapping the C API with PyTorch tensor support - pytorch/newton_schulz.py: Python wrapper that extracts NCCL communicator from torch.distributed ProcessGroup - tests/pytorch/distributed/test_newton_schulz.py: pytest launcher - tests/pytorch/distributed/run_newton_schulz.py: distributed test worker with reference implementation for numerical validation Modified files: - pytorch/csrc/extensions.h: Function declarations - pytorch/csrc/extensions/pybind.cpp: pybind11 registrations - pytorch/__init__.py: Public API export Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [Common] Fix cuSolverMp API signatures in Newton-Schulz implementation Fix API mismatches discovered during compilation: - cusolverMpCreate takes (handle*, deviceId, stream), not (handle*, stream) - cusolverMpCreateDeviceGrid takes handle as first arg with different parameter order - Use cusolverMpGridMapping_t (not cusolverMpGridLayout_t) and CUSOLVERMP_GRID_MAPPING_COL_MAJOR - cusolverMpCreateMatrixDesc has different parameter order: (desc*, grid, dtype, M, N, MB, NB, RSRC, CSRC, LLD) - cusolverMpNewtonSchulzDescriptorCreate takes only (nsDesc*) with no iteration/coefficient args - No cusolverMpStreamSet exists; create handle per-call with user stream - cusolverMpNewtonSchulz requires computeType and info parameters - Switch from generic template RAII to explicit deleter structs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [PyTorch] Propagate NVTE_WITH_CUSOLVERMP define to PyTorch extension build Add NVTE_WITH_CUSOLVERMP compiler define and cusolverMp include/library paths to the PyTorch C++ extension build, following the same pattern as NVTE_UB_WITH_MPI and NVTE_ENABLE_NVSHMEM. Without this, the #ifdef NVTE_WITH_CUSOLVERMP guards in the PyTorch extension code would never be active since the define was only set as PRIVATE in the CMake build for the common library. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [PyTorch] Fix NCCL comm extraction and pass global dims to Newton-Schulz Two fixes: - Use ProcessGroupNCCL._comm_ptr() to extract the raw NCCL communicator pointer instead of the non-existent get_nccl_comm() method - Pass global matrix dimensions (m, n) from Python to C++ instead of using local tensor dimensions, which would produce incorrect ScaLAPACK block sizes in the distributed computation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [Common] Cache cuSolverMp handle and grid in Newton-Schulz context cuSolverMp handle and grid creation are expensive operations. Move them from per-call creation in nvte_newton_schulz into the NVTECusolverMpCtx, which is their natural home — the context exists to encapsulate the grid. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [Common] Create dedicated CUDA stream in Newton-Schulz context cuSolverMp cannot work with the default CUDA stream. Create a dedicated stream inside nvte_cusolvermp_ctx_create and remove the stream parameter from both C API functions since the context now owns its stream. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [Common] Fix Newton-Schulz zero output with event-based stream sync The internal dedicated stream was reading the input tensor before the caller's stream had finished producing it, resulting in all-zero output. Add event-based synchronisation: the internal stream waits for the caller's input to be ready, and the caller's stream waits for the output to be written. Replaces the blocking cudaStreamSynchronize. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [Common] Fix Newton-Schulz NaNs by keeping host workspace alive cuSolverMp is asynchronous and uses the host workspace during multi-GPU execution. The event-based output sync did not block the host, so the local workspace_host vector was destroyed while the GPU was still reading from it. Restore cudaStreamSynchronize to ensure the host workspace remains valid for the full duration of the operation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [Common] Cache CUDA event in Newton-Schulz context Avoid creating and destroying a cudaEvent_t on every nvte_newton_schulz call by making it a persistent member of NVTECusolverMpCtx, matching the existing pattern for the stream. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [Common] Use separate in/out events for Newton-Schulz stream sync Replace single event with in_ready and out_ready events. After the cuSolverMp call, record out_ready on the internal stream and make the caller's stream wait on it, ensuring the output tensor is ready before the caller uses it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Correct coefficients Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * No stream synchronize Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [Test] Verify Newton-Schulz result with XAX=I identity check Replace reference-comparison test with a direct arithmetic check: if X is the inverse square root of A, then X @ A @ X must equal the identity matrix. This is more robust and removes the need for a separate reference implementation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Change test - it approximates orthogonal matrix, not inverse square root Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Generalize number of iterations in tests Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove extra info diag - everything should be in logs Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Add Newton-Schulz tests to the QA script Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix outdated comments Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove unused variable Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Move magic numbers from tests to impl Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix outdated comments Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Check num_coefficients Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Auto-detect cuSolverMp support from common library binary Instead of requiring NVTE_WITH_CUSOLVERMP env var to be set for both the common library and PyTorch extension builds, inspect the already-built libtransformer_engine.so for exported symbols. This is more robust for incremental builds and CI environments where the env var may not be propagated to the extension build step. The PyTorch extension only calls nvte_* C API functions, so it does not need cusolverMp headers or libraries — only the compile definition. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Conditionally exclude Newton-Schulz API from PyTorch extension When NVTE_WITH_CUSOLVERMP is not defined, omit the Newton-Schulz functions entirely from the pybind module instead of registering stubs that throw runtime errors. The Python wrapper checks for the attribute at call time and raises a clear error message. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Make symbol detection errors fatal in common_lib_has_symbol Raise FileNotFoundError when no libtransformer_engine.so is found in any candidate location, and raise RuntimeError when nm is unavailable or exits non-zero, rather than silently returning False in both cases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Search for libtransformer_engine.so via installed module location first In common_lib_has_symbol, prepend a candidate derived by importing transformer_engine via importlib.util.find_spec and using the package directory as the root. This correctly resolves the SO path for source and PyPI installs (where it lives inside transformer_engine/), before falling back to the repo-root and CMake build dir candidates. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Add site packages to search paths for TE common Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Revert "Auto-detect cuSolverMp support from common library binary" This reverts commit 8f50bd5. Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Remove unused import Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix incorrect 'inverse square root' references in Newton-Schulz comments Replace misleading 'inverse square root' descriptions with accurate 'matrix orthogonalization' in the module docstring, function docstring, and pybind11 binding docstring. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [PyTorch] Expose cuSolverMp context creation/destruction as public API Context creation is expensive and should not happen on every newton_schulz call. Introduce CusolverMpCtx and cusolvermp_ctx_create() so callers can create a context once from a ProcessGroup and reuse it. CusolverMpCtx supports explicit destroy() and use as a context manager. newton_schulz() now takes CusolverMpCtx instead of ProcessGroup. Export CusolverMpCtx and cusolvermp_ctx_create from the pytorch package. Update the distributed test worker to use explicit context lifecycle. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [PyTorch] Strengthen input validation in newton_schulz Replace assert with ValueError for the coefficients length check. Add dtype (float32/bfloat16) and contiguity checks for the input tensor. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Use ncclMemAlloc for cuSolverMp Newton-Schulz workspace Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Add Newton-Schulz reference tests Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix Newton-Schulz reference test logic Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Fix column-major usage of cuSOLVERMp; add rectangular test cases Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Avoid explicit transpose Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Cleanup Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * More cleanup Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Cleanup Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Update transformer_engine/common/newton_schulz/newton_schulz.cpp Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by: vcherepanov-nv <vcherepanov@nvidia.com> * Fix syntax Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Apply suggestions from code review Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: vcherepanov-nv <vcherepanov@nvidia.com> * Add timeout Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Use RAII for cusolvermp CUDA resources Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Make NS API declared unconditional, with stub / runtime errors without cuSOLVERMp support Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix index in diag Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * CMake fixes Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Update transformer_engine/pytorch/newton_schulz.py Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by: vcherepanov-nv <vcherepanov@nvidia.com> * Fix a typo Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * Cleanup context management Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Borrow more coefficient sets from Emerging Optimizers Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Couple num_iterations with coeff types in tests Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> --------- Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com> Signed-off-by: vcherepanov-nv <vcherepanov@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Description
Adds an API to call Newton-Schulz method on a distributed tensor.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: