Add high-performance sparse BSR GEMM for Ada (sm_89) - 1.74× vs baseline #2739

bGOATnote · 2025-11-01T04:28:38Z

Performance:

52.1 TFLOPS on NVIDIA L4 (Ada, SM 8.9)
1.74× faster than CUTLASS 4.3.0 baseline (~30 TFLOPS)
63× faster than cuSPARSE (0.87 TFLOPS)
83% efficiency vs dense cuBLAS (62.5 TFLOPS)

Technical approach:

WMMA tensor cores (16×16×16 FP16)
2-stage pipeline with cp.async
Optimized tile sizes (BM=256, BN=128, BK=32)
Zero branch divergence (100% efficiency)
99.22% of theoretical occupancy

Validation:

Full Nsight Compute profiling
100-iteration benchmarks
Correctness verified vs cuSPARSE

Files:

examples/89_ada_sparse_bsr_gemm/89_ada_sparse_bsr_gemm.cu
examples/89_ada_sparse_bsr_gemm/CMakeLists.txt
examples/89_ada_sparse_bsr_gemm/README.md

Author: Brandon Dent, MD ([email protected])
License: BSD-3-Clause

Performance: - 52.1 TFLOPS on NVIDIA L4 (Ada, SM 8.9) - 1.74× faster than CUTLASS 4.3.0 baseline (~30 TFLOPS) - 63× faster than cuSPARSE (0.87 TFLOPS) - 83% efficiency vs dense cuBLAS (62.5 TFLOPS) Technical approach: - WMMA tensor cores (16×16×16 FP16) - 2-stage pipeline with cp.async - Optimized tile sizes (BM=256, BN=128, BK=32) - Zero branch divergence (100% efficiency) - 99.22% of theoretical occupancy Validation: - Full Nsight Compute profiling - 100-iteration benchmarks - Correctness verified vs cuSPARSE Files: - examples/89_ada_sparse_bsr_gemm/89_ada_sparse_bsr_gemm.cu - examples/89_ada_sparse_bsr_gemm/CMakeLists.txt - examples/89_ada_sparse_bsr_gemm/README.md Author: Brandon Dent, MD ([email protected]) License: BSD-3-Clause

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add high-performance sparse BSR GEMM for Ada (sm_89) - 1.74× vs baseline #2739

Add high-performance sparse BSR GEMM for Ada (sm_89) - 1.74× vs baseline #2739

Uh oh!

bGOATnote commented Nov 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Add high-performance sparse BSR GEMM for Ada (sm_89) - 1.74× vs baseline #2739

Are you sure you want to change the base?

Add high-performance sparse BSR GEMM for Ada (sm_89) - 1.74× vs baseline #2739

Uh oh!

Conversation

bGOATnote commented Nov 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant