cuBLASMp Library API examples

Description

This folder demonstrates cuBLASMp library API usage. Each sample is a self-contained program that initializes MPI, creates a process grid, and runs a distributed linear algebra operation with performance timing.

Samples

Tensor Parallelism Matmul (communication-overlapped variants):

| Sample | Description |
| --- | --- |
| tp_matmul | Tensor parallelism example covering AllGather + GEMM and GEMM + ReduceScatter |
| matmul_ag | AllGather + GEMM with configurable data types and scaling |
| matmul_rs | GEMM + ReduceScatter with configurable data types and scaling |
| matmul_ar | GEMM + AllReduce with configurable data types and scaling |
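
To make the three communication patterns concrete, here is a toy single-process sketch in pure Python. It is not the cuBLASMp API and does not use MPI or a GPU: each "rank" is just a list entry, and the sharding shown (K split for GEMM + AllReduce / ReduceScatter, row and column blocks for AllGather + GEMM) is one common tensor-parallel layout assumed for illustration, not necessarily the layout the samples use. All function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product for matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def gemm_allreduce(a_shards, b_shards):
    """Rank r holds a column slice of A and the matching row slice of B.

    Each rank computes a full-size partial product; the "all-reduce"
    sums the partials so every rank would end up with the full D.
    """
    partials = [matmul(a, b) for a, b in zip(a_shards, b_shards)]
    total = partials[0]
    for partial in partials[1:]:
        total = add(total, partial)
    return total


def gemm_reducescatter(a_shards, b_shards):
    """Same partial products, but each rank keeps only its row block of D."""
    total = gemm_allreduce(a_shards, b_shards)
    nranks = len(a_shards)
    rows = len(total) // nranks  # assumes the row count divides evenly
    return [total[r * rows:(r + 1) * rows] for r in range(nranks)]


def allgather_gemm(a_row_shards, b_col_shards):
    """Rank r holds a row block of A and a column block of B.

    The "all-gather" rebuilds the full A; each rank then computes
    its own column block of D.
    """
    full_a = [row for shard in a_row_shards for row in shard]
    return [matmul(full_a, b) for b in b_col_shards]
```

With A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] split across two ranks, all three variants reassemble the same product as a plain `matmul(A, B)`; the difference is only in which pieces each rank holds before and after the communication step.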

PBLAS-style operations (2D block-cyclic distribution):

| Sample | Description |
| --- | --- |
| gemm | General matrix-matrix multiply (GEMM) |
| trsm | Triangular solve (TRSM) |
| trmm | Triangular matrix-matrix multiply (TRMM) |
| syrk | Symmetric rank-k update (SYRK) |
| syr2k | Symmetric rank-2k update (SYR2K) |
| syrkx | Extended symmetric rank-k update (SYRKX) |
| symm | Symmetric matrix-matrix multiply (SYMM) |
| geadd | General matrix addition |
| tradd | Triangular matrix addition |
| gemr2d | General matrix redistribution between block-cyclic layouts |
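
For readers new to 2D block-cyclic distribution, the sketch below shows the index arithmetic along one dimension (the same logic applies independently to rows and columns). It follows the conventional ScaLAPACK `NUMROC` formula; the function names are hypothetical and this is not code from the samples.

```python
def owner(global_index, block_size, nprocs, src_proc=0):
    """Process coordinate that owns a 0-based global row or column index.

    Blocks of `block_size` consecutive indices are dealt out round-robin,
    starting at process `src_proc`.
    """
    return (src_proc + global_index // block_size) % nprocs


def local_count(n, block_size, proc, nprocs, src_proc=0):
    """Rows or columns of an n-long dimension stored on `proc`.

    Mirrors the ScaLAPACK NUMROC computation.
    """
    dist = (proc - src_proc) % nprocs      # distance from the first owner
    nblocks = n // block_size              # number of complete blocks
    count = (nblocks // nprocs) * block_size
    extra = nblocks % nprocs               # leftover complete blocks
    if dist < extra:
        count += block_size                # one more complete block
    elif dist == extra:
        count += n % block_size            # the trailing partial block
    return count
```

For example, 7 rows with block size 2 on 2 process rows are dealt out as blocks [0, 1], [2, 3], [4, 5], [6], so process row 0 stores 4 rows and process row 1 stores 3.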

Supported OSes

Linux

Supported CPU Architectures

x86_64
arm64-sbsa

Supported Compute Capabilities

Documentation

cuBLASMp documentation

Usage

Prerequisites

cuBLASMp is distributed through the NVIDIA Developer Zone, PyPI (CUDA 12, CUDA 13), Conda, conda-forge, and the HPC SDK. It requires the CUDA Toolkit and NCCL to be installed on the system. Building the samples additionally requires a C++17-compatible compiler and an MPI implementation (HPC-X is recommended).

Build Steps

git clone https://github.com/NVIDIA/CUDALibrarySamples.git
cd CUDALibrarySamples/cuBLASMp
mkdir build && cd build

export HPCXROOT=<path/to/hpcx>
export CUBLASMP_HOME=<path/to/cublasmp>
export NCCL_HOME=<path/to/nccl>
source ${HPCXROOT}/hpcx-mt-init-ompi.sh
hpcx_load

cmake .. -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES="75;80;90;100;120" \
    -DCUBLASMP_INCLUDE_DIRECTORIES=${CUBLASMP_HOME}/include \
    -DCUBLASMP_LIBRARIES=${CUBLASMP_HOME}/lib/libcublasmp.so \
    -DNCCL_INCLUDE_DIRECTORIES=${NCCL_HOME}/include \
    -DNCCL_LIBRARIES=${NCCL_HOME}/lib/libnccl.so
make -j

Running

The number of MPI processes must equal the process grid size (p * q). All samples accept -help for a full list of options.

Tensor Parallelism Matmul (1D grid, processes along one dimension):

# AllGather + GEMM with FP16
mpirun -n 2 ./matmul_ag -typeA fp16 -typeB fp16 -typeD fp16 -transA t -transB n

# GEMM + ReduceScatter with FP16
mpirun -n 2 ./matmul_rs -typeA fp16 -typeB fp16 -typeD fp16 -transA t -transB n

# GEMM + AllReduce with FP16
mpirun -n 2 ./matmul_ar -typeA fp16 -typeB fp16 -typeD fp16 -transA t -transB n

# End-to-end tensor parallelism
mpirun -n 2 ./tp_matmul

PBLAS-style operations (2D grid with -p rows and -q columns):

mpirun -n 2 ./gemm -p 2 -q 1
mpirun -n 2 ./trsm -p 2 -q 1
mpirun -n 2 ./trmm -p 2 -q 1
mpirun -n 2 ./syrk -p 2 -q 1
mpirun -n 2 ./syr2k -p 2 -q 1
mpirun -n 2 ./syrkx -p 2 -q 1
mpirun -n 2 ./symm -p 2 -q 1
mpirun -n 2 ./geadd -p 2 -q 1
mpirun -n 2 ./tradd -p 2 -q 1
mpirun -n 2 ./gemr2d -p 2 -q 1
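
Because the MPI process count has to match p * q, it can be convenient to derive `-n` from the grid dimensions rather than typing it separately. The helper below is a minimal sketch of that idea for the PBLAS-style samples; `mpirun_command` is a hypothetical name, and it only builds the command string from the flags already shown in this README.

```python
import shlex


def mpirun_command(sample, p, q, extra_args=()):
    """Build an mpirun command line whose -n always equals p * q."""
    if p < 1 or q < 1:
        raise ValueError("grid dimensions must be positive")
    nprocs = p * q  # one MPI process per grid cell
    argv = ["mpirun", "-n", str(nprocs), f"./{sample}",
            "-p", str(p), "-q", str(q), *extra_args]
    return shlex.join(argv)


if __name__ == "__main__":
    print(mpirun_command("gemm", 2, 1))
    print(mpirun_command("trsm", 2, 2, ["-verbose"]))
```

The resulting string can be pasted into a shell or handed to a job script; launching it is left out here so the sketch runs without MPI installed.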

Common Options

Individual operations may use only a subset of these options, and not every datatype or scaling-mode combination is valid for every sample.

| Option | Description |
| --- | --- |
| `-m`, `-n`, `-k` | Matrix dimensions |
| `-mbA`, `-nbA`, `-mbB`, `-nbB`, `-mbC`, `-nbC` | Block sizes for the distributed matrices |
| `-ia`, `-ja`, `-ib`, `-jb`, `-ic`, `-jc` | 1-based starting indices of the operated submatrices |
| `-p`, `-q` | Process grid dimensions (p rows, q columns) |
| `-typeA`, `-typeB`, `-typeC`, `-typeD` | Data types (`fp16`, `bf16`, `fp32`, `fp64`, `fp8_e4m3`, `fp8_e5m2`, `fp4_e2m1`, `cfp32`, `cfp64`) |
| `-transA`, `-transB` | Transpose operations (`n`, `t`, `c`) |
| `-scaleA`, `-scaleB`, `-scaleD`, `-scaleDOut` | Scaling modes (`scalar_fp32`, `vec16_ue4m3`, `vec32_ue8m0`, `outer_vec_fp32`, `vec128_fp32`, `blk128x128_fp32`) |
| `-gridLayout` | Process grid layout (`c` column-major, `r` row-major) |
| `-emulationStrategy` | FP emulation strategy (`default`, `performant`, `eager`) |
| `-checkResult` | Enable result verification (`true` or `false`; default: `true`) |
| `-no-check` | Disable result verification |
| `-cycles` | Number of iterations for timing |
| `-warmup` | Number of warmup iterations |
| `-verbose` | Print detailed output |
| `-help` | Print all available options |
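
The `-gridLayout` choice determines how MPI ranks map to (row, column) positions in the p x q grid. The sketch below assumes the usual BLACS-style meaning of column-major and row-major ordering; `rank_to_coords` is a hypothetical helper, not a function from the samples.

```python
def rank_to_coords(rank, p, q, layout="c"):
    """Map a rank to (row, col) in a p x q process grid.

    layout "c": column-major, consecutive ranks walk down a column.
    layout "r": row-major, consecutive ranks walk along a row.
    """
    if not 0 <= rank < p * q:
        raise ValueError("rank is outside the process grid")
    if layout == "c":
        return rank % p, rank // p
    if layout == "r":
        return rank // q, rank % q
    raise ValueError("layout must be 'c' or 'r'")
```

On a 2 x 3 grid, rank 1 sits at (1, 0) under column-major ordering and at (0, 1) under row-major ordering, so the same rank can own different matrix blocks depending on the layout.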

Matmul Scaling Modes

The matmul_ag, matmul_rs, and matmul_ar samples support both Hopper FP8 scaling modes (vec128_fp32, blk128x128_fp32, outer_vec_fp32) and Blackwell block scaling modes (vec32_ue8m0, vec16_ue4m3). Support for a given datatype and scaling-mode combination depends on the GPU architecture, CUDA Toolkit, and cuBLASLt support available at runtime.
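
As a rough guide to what the mode names imply, the sketch below counts how many scale factors one operand would carry under each mode. The granularities are an assumption read off the mode names and the analogous cuBLASLt scale modes (one factor per 16-, 32-, or 128-element run along the inner dimension, per 128 x 128 tile, per outer index, or one for the whole matrix); real scale tensors may also need padding or a specific memory layout, which this ignores. `scale_count` is a hypothetical helper.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def scale_count(mode, outer, inner):
    """Assumed number of scale factors for an outer x inner operand.

    `inner` is the contracted (innermost) dimension, `outer` the other one.
    """
    if mode == "scalar_fp32":
        return 1                                   # one factor per matrix
    if mode == "outer_vec_fp32":
        return outer                               # one factor per outer index
    if mode == "vec16_ue4m3":
        return outer * ceil_div(inner, 16)         # per 16-element run
    if mode == "vec32_ue8m0":
        return outer * ceil_div(inner, 32)         # per 32-element run
    if mode == "vec128_fp32":
        return outer * ceil_div(inner, 128)        # per 128-element run
    if mode == "blk128x128_fp32":
        return ceil_div(outer, 128) * ceil_div(inner, 128)  # per tile
    raise ValueError(f"unknown scaling mode: {mode}")
```

Under these assumptions, a 128 x 256 operand would carry 1024 factors with `vec32_ue8m0` but only 2 with `blk128x128_fp32`, which illustrates the storage cost of finer-grained scaling.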
