cuBLASMp Library API examples

Description

This folder demonstrates cuBLASMp library API usage. Each sample is a self-contained program that initializes MPI, creates a process grid, and runs a distributed linear algebra operation with performance timing.

Samples

Tensor Parallelism Matmul (communication-overlapped variants):

| Sample | Description |
| --- | --- |
| tp_matmul | Tensor parallelism example covering AllGather + GEMM and GEMM + ReduceScatter |
| matmul_ag | AllGather + GEMM with configurable data types and scaling |
| matmul_rs | GEMM + ReduceScatter with configurable data types and scaling |
| matmul_ar | GEMM + AllReduce with configurable data types and scaling |

PBLAS-style operations (2D block-cyclic distribution):

| Sample | Description |
| --- | --- |
| gemm | General matrix-matrix multiply (GEMM) |
| trsm | Triangular solve (TRSM) |
| trmm | Triangular matrix-matrix multiply (TRMM) |
| syrk | Symmetric rank-k update (SYRK) |
| syr2k | Symmetric rank-2k update (SYR2K) |
| syrkx | Extended symmetric rank-k update (SYRKX) |
| symm | Symmetric matrix-matrix multiply (SYMM) |
| geadd | General matrix addition |
| tradd | Triangular matrix addition |
| gemr2d | General matrix redistribution between block-cyclic layouts |
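
For readers new to 2D block-cyclic distribution, the following is a minimal sketch of how global indices map to processes along one grid dimension. The helper names `block_owner` and `local_count` are hypothetical and not part of the samples, and the sketch assumes the first block is placed on process 0; the counting rule mirrors the conventional ScaLAPACK NUMROC calculation rather than anything taken from the cuBLASMp sources.

```shell
#!/bin/sh
# Illustrative helpers, not part of the cuBLASMp samples.
# Assumption: the first block lives on process 0 of the grid dimension.

# Which process (0-based) along one grid dimension owns 1-based global index i?
block_owner() (
    i=$1; nb=$2; nprocs=$3
    echo $(( ((i - 1) / nb) % nprocs ))
)

# How many of n global rows (or columns) land on process iproc?
# Same counting rule as ScaLAPACK's NUMROC with source process 0.
local_count() (
    n=$1; nb=$2; iproc=$3; nprocs=$4
    nblocks=$(( n / nb ))                   # number of full blocks
    count=$(( (nblocks / nprocs) * nb ))    # complete rounds over all processes
    extra=$(( nblocks % nprocs ))           # leftover full blocks
    if [ "$iproc" -lt "$extra" ]; then
        count=$(( count + nb ))             # receives one extra full block
    elif [ "$iproc" -eq "$extra" ]; then
        count=$(( count + n % nb ))         # receives the trailing partial block
    fi
    echo "$count"
)

# 10 rows, block size 2, 2 process rows
block_owner 3 2 2
local_count 10 2 0 2
local_count 10 2 1 2
```

The same arithmetic applies independently to rows (block size mb over p process rows) and columns (block size nb over q process columns).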

Supported OSes

  • Linux

Supported CPU Architectures

  • x86_64
  • arm64-sbsa

Supported Compute Capabilities

Documentation

cuBLASMp documentation

Usage

Prerequisites

cuBLASMp is distributed through the NVIDIA Developer Zone, PyPI (CUDA 12, CUDA 13), Conda, conda-forge, and the HPC SDK. It requires the CUDA Toolkit and NCCL to be installed on the system. Building the samples additionally requires a C++17-compatible compiler and an MPI implementation (HPC-X is recommended).

Build Steps

git clone https://github.com/NVIDIA/CUDALibrarySamples.git
cd CUDALibrarySamples/cuBLASMp
mkdir build && cd build

export HPCXROOT=<path/to/hpcx>
export CUBLASMP_HOME=<path/to/cublasmp>
export NCCL_HOME=<path/to/nccl>
source ${HPCXROOT}/hpcx-mt-init-ompi.sh
hpcx_load

cmake .. -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES="75;80;90;100;120" \
    -DCUBLASMP_INCLUDE_DIRECTORIES=${CUBLASMP_HOME}/include \
    -DCUBLASMP_LIBRARIES=${CUBLASMP_HOME}/lib/libcublasmp.so \
    -DNCCL_INCLUDE_DIRECTORIES=${NCCL_HOME}/include \
    -DNCCL_LIBRARIES=${NCCL_HOME}/lib/libnccl.so
make -j
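
Before invoking cmake, it can help to confirm that the three environment variables used above point at real directories. The following is a small sketch of such a check; the function name `check_build_env` is hypothetical and not shipped with the samples, and it only tests that each variable is set and names a directory, not that the expected headers or libraries are inside.

```shell
#!/bin/sh
# Hypothetical preflight check, not part of the samples.
# Returns 0 if HPCXROOT, CUBLASMP_HOME and NCCL_HOME are all set to
# existing directories, 1 otherwise. Problems are reported on stderr.
check_build_env() {
    missing=0
    for var in HPCXROOT CUBLASMP_HOME NCCL_HOME; do
        eval "val=\${$var:-}"               # indirect lookup of the variable
        if [ -z "$val" ]; then
            echo "unset: $var" >&2
            missing=1
        elif [ ! -d "$val" ]; then
            echo "not a directory: $var=$val" >&2
            missing=1
        fi
    done
    return $missing
}
```

A typical use would be `check_build_env && cmake .. <options>`, so that configuration is skipped when a path is missing.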

Running

The number of MPI processes must equal the process grid size (p * q). All samples accept -help for a full list of options.
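
To illustrate the p * q rule, here is a minimal sketch that derives the process count from the grid dimensions and prints the resulting command. `grid_cmd` is a hypothetical helper, not part of the samples; it only echoes the command (a dry run) so the arithmetic can be checked before launching anything.

```shell
#!/bin/sh
# Hypothetical helper, not part of the samples: build an mpirun command
# whose process count matches the p x q process grid.
grid_cmd() (
    sample=$1; p=$2; q=$3
    shift 3                                 # remaining args are passed through
    np=$(( p * q ))                         # one MPI process per grid cell
    set -- mpirun -n "$np" "./$sample" -p "$p" -q "$q" "$@"
    echo "$*"
)

# Dry run: print the command instead of executing it.
grid_cmd gemm 2 2
grid_cmd trsm 2 1 -verbose
```

Piping the printed line to `sh` (or replacing `echo "$*"` with `"$@"`) would launch the sample; keeping it as a dry run by default makes a mismatched grid easy to spot.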

Tensor Parallelism Matmul (1D grid, processes along one dimension):

# AllGather + GEMM with FP16
mpirun -n 2 ./matmul_ag -typeA fp16 -typeB fp16 -typeD fp16 -transA t -transB n

# GEMM + ReduceScatter with FP16
mpirun -n 2 ./matmul_rs -typeA fp16 -typeB fp16 -typeD fp16 -transA t -transB n

# GEMM + AllReduce with FP16
mpirun -n 2 ./matmul_ar -typeA fp16 -typeB fp16 -typeD fp16 -transA t -transB n

# End-to-end tensor parallelism
mpirun -n 2 ./tp_matmul

PBLAS-style operations (2D grid with -p rows and -q columns):

mpirun -n 2 ./gemm -p 2 -q 1
mpirun -n 2 ./trsm -p 2 -q 1
mpirun -n 2 ./trmm -p 2 -q 1
mpirun -n 2 ./syrk -p 2 -q 1
mpirun -n 2 ./syr2k -p 2 -q 1
mpirun -n 2 ./syrkx -p 2 -q 1
mpirun -n 2 ./symm -p 2 -q 1
mpirun -n 2 ./geadd -p 2 -q 1
mpirun -n 2 ./tradd -p 2 -q 1
mpirun -n 2 ./gemr2d -p 2 -q 1

Common Options

Individual operations may use only a subset of these options, and not every datatype or scaling-mode combination is valid for every sample.

| Option | Description |
| --- | --- |
| -m, -n, -k | Matrix dimensions |
| -mbA, -nbA, -mbB, -nbB, -mbC, -nbC | Block sizes for the distributed matrices |
| -ia, -ja, -ib, -jb, -ic, -jc | 1-based starting indices of the operated submatrices |
| -p, -q | Process grid dimensions (p rows, q columns) |
| -typeA, -typeB, -typeC, -typeD | Data types (fp16, bf16, fp32, fp64, fp8_e4m3, fp8_e5m2, fp4_e2m1, cfp32, cfp64) |
| -transA, -transB | Transpose operations (n, t, c) |
| -scaleA, -scaleB, -scaleD, -scaleDOut | Scaling modes (scalar_fp32, vec16_ue4m3, vec32_ue8m0, outer_vec_fp32, vec128_fp32, blk128x128_fp32) |
| -gridLayout | Process grid layout (c column-major, r row-major) |
| -emulationStrategy | FP emulation strategy (default, performant, eager) |
| -checkResult | Enable result verification (true or false; default: true) |
| -no-check | Disable result verification |
| -cycles | Number of iterations for timing |
| -warmup | Number of warmup iterations |
| -verbose | Print detailed output |
| -help | Print all available options |
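
The -gridLayout option determines how MPI ranks are placed on the p x q grid. The sketch below shows the conventional meaning of the two layouts: column-major fills a grid column before moving to the next, row-major fills a grid row first. `rank_to_coords` is a hypothetical helper, and the mapping is the usual textbook convention, assumed here rather than read from the cuBLASMp sources.

```shell
#!/bin/sh
# Hypothetical helper, not part of the samples.
# Assumed convention: 'c' walks down a grid column first, 'r' walks along
# a grid row first. Prints "row col" (both 0-based) for a given rank.
rank_to_coords() (
    rank=$1; p=$2; q=$3; layout=$4
    case $layout in
        c) row=$(( rank % p )); col=$(( rank / p )) ;;
        r) row=$(( rank / q )); col=$(( rank % q )) ;;
        *) echo "unknown layout: $layout" >&2; exit 1 ;;
    esac
    echo "$row $col"
)

# Rank 1 on a 2 x 3 grid under each layout
rank_to_coords 1 2 3 c
rank_to_coords 1 2 3 r
```

On a 1D grid such as -p 2 -q 1 both layouts place the ranks identically, so the option only matters when both p and q exceed 1.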

Matmul Scaling Modes

The matmul_ag, matmul_rs, and matmul_ar samples support both Hopper FP8 scaling modes (vec128_fp32, blk128x128_fp32, outer_vec_fp32) and Blackwell block scaling modes (vec32_ue8m0, vec16_ue4m3). Support for a given datatype and scaling-mode combination depends on the GPU architecture, CUDA Toolkit, and cuBLASLt support available at runtime.
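
As a quick reference, the grouping described above can be written as a lookup. `scaling_family` is a hypothetical helper, not part of the samples; it only restates the grouping from this section and does not check whether a mode is usable on the current GPU. scalar_fp32 appears in the options list but is not assigned to either group here, so the sketch reports it separately.

```shell
#!/bin/sh
# Hypothetical lookup, not part of the samples: classify a scaling mode
# name according to the grouping given in this README.
scaling_family() {
    case $1 in
        vec128_fp32|blk128x128_fp32|outer_vec_fp32) echo hopper ;;
        vec32_ue8m0|vec16_ue4m3)                    echo blackwell ;;
        scalar_fp32)                                echo ungrouped ;;  # not assigned to either group above
        *) echo unknown; return 1 ;;
    esac
}

scaling_family vec128_fp32
scaling_family vec32_ue8m0
```

Whether a particular mode actually runs still depends on the GPU architecture, CUDA Toolkit, and cuBLASLt version, so a lookup like this is only a first filter.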