A simple GEMM kernel library targeting peak performance on NVIDIA and AMD GPUs.

| OP | DEVICE | SHAPE | TIME | PERF | PEAK | RATIO |
|---|---|---|---|---|---|---|
| hgemm-wmma-bf16 | CUDA-RTX4090 | m=8192,n=8192,k=8192 | 6.7 ms | 163.5 TFLOPS | 165.2 TFLOPS | 99% |
| hgemm-wmma-bf16 | CUDA-H20 | m=8192,n=8192,k=8192 | 12.2 ms | 89.5 TFLOPS | 95 TFLOPS | 94% |
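
The PERF and RATIO columns can be sanity-checked from SHAPE and TIME. The sketch below assumes the conventional count of 2·m·n·k floating-point operations per GEMM. The helper names `gemm_tflops` and `peak_ratio` are illustrative and are not part of PeakGemm.

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """Achieved TFLOPS for one GEMM, assuming 2*m*n*k floating-point ops."""
    flops = 2.0 * m * n * k           # one multiply and one add per (m, n, k) triple
    seconds = time_ms * 1e-3
    return flops / seconds / 1e12


def peak_ratio(perf_tflops: float, peak_tflops: float) -> float:
    """Fraction of the device peak that was reached (0.0 to 1.0)."""
    return perf_tflops / peak_tflops


if __name__ == "__main__":
    # RTX 4090 row: m = n = k = 8192, TIME = 6.7 ms (rounded in the table)
    print(f"recomputed perf: {gemm_tflops(8192, 8192, 8192, 6.7):.1f} TFLOPS")
    # RATIO column, using the PERF and PEAK values exactly as listed
    print(f"RTX4090 ratio: {peak_ratio(163.5, 165.2):.0%}")
    print(f"H20 ratio:     {peak_ratio(89.5, 95.0):.0%}")
```

Because TIME is rounded to 0.1 ms in the table, the recomputed throughput lands close to, but not exactly on, the listed PERF value.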

Install:

```bash
git clone https://github.com/xytpai/PeakGemm
cd PeakGemm
python3 -m pip install -e . --no-build-isolation
```

Build and run the C++ half-precision test (ROCm build script):

```bash
bash build_rocm.sh test/cpp/test_gemm_half.cpp ; ./a.out
```

```
m:4, n:4096, k:8192, dtype=__half, maxdiff:0.125, ms:0.04416, gbps:1521.9, tflops:6.0787
m:2048, n:2048, k:2048, dtype=__half, maxdiff:0.03125, ms:0.07368, gbps:341.556, tflops:233.169
m:4096, n:4096, k:4096, dtype=__half, maxdiff:0.0625, ms:0.132921, gbps:757.317, tflops:1033.99
m:8192, n:8192, k:8192, dtype=__half, maxdiff:0.125, ms:0.806728, gbps:499.119, tflops:1362.93
m:16384, n:16384, k:16384, dtype=__half, maxdiff:0.125, ms:7.95713, gbps:202.411, tflops:1105.44
m:4, n:4096, k:8192, dtype=__bfloat16, maxdiff:1, ms:0.04488, gbps:1497.49, tflops:5.98118
m:2048, n:2048, k:2048, dtype=__bfloat16, maxdiff:0.25, ms:0.06856, gbps:367.063, tflops:250.582
m:4096, n:4096, k:4096, dtype=__bfloat16, maxdiff:0.5, ms:0.128801, gbps:781.541, tflops:1067.06
m:8192, n:8192, k:8192, dtype=__bfloat16, maxdiff:0.5, ms:0.776568, gbps:518.503, tflops:1415.86
m:16384, n:16384, k:16384, dtype=__bfloat16, maxdiff:1, ms:7.53928, gbps:213.629, tflops:1166.7
```

Run the Python test (FP32 shown here):

```bash
python3 test/test_gemm.py --m=4096 --n=4096 --k=4096 --dtype=f32
```

```
run: /workspace/xyt/PeakGemm/test/test_gemm.py, args: Namespace(m=4096, n=4096, k=4096, dtype='f32')
/workspace/xyt/PeakGemm/test/test_gemm.py:43: UserWarning: NOTE: The SGEMM has not been optimized. It's treated as a reference path.
  warnings.warn('NOTE: The SGEMM has not been optimized. It\'s treated as a reference path.')
maxdiff_out:0.0
maxdiff_out:0.0
maxdiff_out:0.0
maxdiff_out:0.0
maxdiff_out:0.0
===================== [REF] =====================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
sm80_xmma_gemm_f32f32_f32f32_f32_tn_n_tilesize64x64x...         0.00%       0.000us         0.00%       0.000us       0.000us     234.070ms       100.00%     234.070ms       5.852ms            40
                                    cudaLaunchKernelExC         0.09%     204.768us         0.81%       1.899ms      47.475us       0.000us         0.00%       0.000us       0.000us            40
                                Activity Buffer Request         0.72%       1.694ms         0.72%       1.694ms       1.694ms       0.000us         0.00%       0.000us       0.000us             1
                                  cudaDeviceSynchronize        99.19%     232.244ms        99.19%     232.244ms       5.664ms       0.000us         0.00%       0.000us       0.000us            41
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 234.143ms
Self CUDA time total: 234.070ms
===================== [PeakGemm] =====================
/workspace/xyt/PeakGemm/test/test_gemm.py:43: UserWarning: NOTE: The SGEMM has not been optimized. It's treated as a reference path.
  warnings.warn('NOTE: The SGEMM has not been optimized. It\'s treated as a reference path.')
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
void sgemm::sgemm_kernel<float, 16, 4, 2, 2, 2, 4, 8...         0.00%       0.000us         0.00%       0.000us       0.000us     219.071ms       100.00%     219.071ms       5.477ms            40
                                       cudaLaunchKernel         0.09%     186.927us         0.86%       1.897ms      47.434us       0.000us         0.00%       0.000us       0.000us            40
                                Activity Buffer Request         0.78%       1.710ms         0.78%       1.710ms       1.710ms       0.000us         0.00%       0.000us       0.000us             1
                                  cudaDeviceSynchronize        99.14%     217.459ms        99.14%     217.459ms       5.304ms       0.000us         0.00%       0.000us       0.000us            41
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 219.357ms
Self CUDA time total: 219.071ms
```
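
The `gbps` and `tflops` fields printed by the C++ half-precision test appear consistent with a simple model: 2·m·n·k floating-point operations, and memory traffic equal to reading A (m×k) and B (k×n) once and writing C (m×n) once, at 2 bytes per element. This model is an assumption inferred from the logged numbers, not taken from the PeakGemm source. The sketch below parses one log line and recomputes both fields; `LOG_LINE` and `recompute_metrics` are hypothetical names used only for illustration.

```python
import re

# Matches one line of the C++ test log, e.g.
# "m:8192, n:8192, k:8192, dtype=__half, maxdiff:0.125, ms:0.806728, gbps:499.119, tflops:1362.93"
LOG_LINE = re.compile(
    r"m:(\d+), n:(\d+), k:(\d+), dtype=(\w+), maxdiff:([\d.]+), "
    r"ms:([\d.]+), gbps:([\d.]+), tflops:([\d.]+)"
)


def recompute_metrics(line: str, bytes_per_elem: int = 2) -> tuple[float, float]:
    """Return (gbps, tflops) recomputed from the m, n, k and ms fields of a log line."""
    match = LOG_LINE.search(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    m, n, k = (int(match.group(i)) for i in (1, 2, 3))
    seconds = float(match.group(6)) * 1e-3
    # Assumed traffic model: A and B read once, C written once.
    traffic_bytes = (m * k + k * n + m * n) * bytes_per_elem
    gbps = traffic_bytes / seconds / 1e9
    tflops = 2.0 * m * n * k / seconds / 1e12
    return gbps, tflops


if __name__ == "__main__":
    sample = ("m:8192, n:8192, k:8192, dtype=__half, maxdiff:0.125, "
              "ms:0.806728, gbps:499.119, tflops:1362.93")
    gbps, tflops = recompute_metrics(sample)
    print(f"gbps={gbps:.3f} tflops={tflops:.2f}")
```

Under this model, the small-`m` shape (m=4) shows high `gbps` and low `tflops`, while the large square shapes show the opposite, which is the usual signature of memory-bound versus compute-bound GEMMs.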