Hand-written CUDA SGEMM kernel achieving ~90% of cuBLAS FP32 performance on NVIDIA L40S.
Build and run with `make && ./gemm`.
- `gemm_optimized.cuh`: kernel implementation
- `GEMM.cu`: benchmark harness
- `Makefile`: build system (targets: `all`, `run`, `clean`)
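
For readers who want to interpret the "~90% of cuBLAS" figure, the sketch below shows one common way such a number is derived. It assumes the standard SGEMM operation count of 2·M·N·K floating-point operations and elapsed times in milliseconds. The function names are hypothetical and are not taken from `GEMM.cu`, whose harness may measure and report differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: FLOP count for C = A * B with A (m x k) and B (k x n).
// Each of the m*n outputs needs k multiplies and k adds, hence 2*m*n*k.
double sgemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Hypothetical helper: throughput in GFLOP/s given elapsed time in milliseconds.
double gflops(std::int64_t m, std::int64_t n, std::int64_t k, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return sgemm_flops(m, n, k) / (elapsed_ms * 1e-3) / 1e9;
}

// Hypothetical helper: kernel throughput as a percentage of a reference
// (for example cuBLAS) on the same problem size. With equal FLOP counts,
// the throughput ratio reduces to the inverse ratio of elapsed times.
double percent_of_reference(double kernel_ms, double reference_ms) {
    assert(kernel_ms > 0.0);
    return 100.0 * reference_ms / kernel_ms;
}
```

With made-up timings, calling `percent_of_reference(10.0, 9.0)` corresponds to a kernel that takes 10 ms where the reference takes 9 ms, which is the kind of gap a "~90%" claim describes. On the GPU, the elapsed times would typically come from CUDA events averaged over several runs after a warm-up launch.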