Commit fd7e058

Added examples to enable the unity build (NVIDIA#102)
* Updated documentation of fused GEMM example and removed UNITY BUILD batch size. The default batch size when unity build is enabled tends to be favorable.
1 parent 1ab1027 commit fd7e058

3 files changed, +34 -5 lines changed

examples/13_fused_two_gemms/fused_gemm.cu

+25 -1
@@ -22,8 +22,32 @@
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 **************************************************************************************************/
+/*
+
+This example shows fusing two GEMM mainloops into one kernel. The first GEMM computes relu(alpha*A*B) and
+the second GEMM computes relu(alpha*A*B+beta*C). The performance measuring environment compares against
+two unfused GEMM operations, demonstrating a speedup of the fused kernel on the
+NVIDIA Turing GPU architecture.
+
+Problem size:
+
+GEMM1 (M,N,K): 128*1600, 64, 576
+GEMM2 (M,N,K): 128*1600, 128, 64
+
+Note that GEMM1_N = GEMM2_K
+
+The example requires the number of threadblocks be the same across 2 GEMMs and
+thread_block_tile_N = problem_N so the data required by each layer is threadblock-resident. It
+also requires warp_tile_N = thread_block_tile_N so the data required by each warp is
+register-file-resident.
+
+Performance:
+
+- fp16 on Tesla T4 @ 1590MHz (non-fused vs. fused): 1.39011 ms vs. 1.26035 ms
+- int8 on Tesla T4 @ 1590MHz (non-fused vs. fused): 0.751759 ms vs. 0.62971 ms
+- fp16 on Quadro RTX 8000 @ 1890MHz (non-fused vs. fused): 0.721144 ms vs. 0.629864 ms
+- int8 on Quadro RTX 8000 @ 1890MHz (non-fused vs. fused): 0.379049 ms vs. 0.324764 ms
 
-/**
 */
 
 #include "b2b_gemm_f16t_f16n_f16t_tensor_op_f16_sm75.h"
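The computation described in the new comment can be sketched as a plain CPU reference. This is a minimal illustration, not the CUTLASS kernel: it reads the second GEMM's `A` operand as the output of the first GEMM (which is what `GEMM1_N = GEMM2_K` suggests), and the `Matrix` type, the function names, and the separate `alpha0`/`alpha1` scalars are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major float matrix; a simple stand-in for device tensors.
struct Matrix {
  std::size_t rows;
  std::size_t cols;
  std::vector<float> data;

  Matrix(std::size_t r, std::size_t c, float fill = 0.0f)
      : rows(r), cols(c), data(r * c, fill) {}

  float& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// D = relu(alpha * A * B + beta * C). Pass C == nullptr to drop the beta term.
Matrix gemm_relu(float alpha, const Matrix& A, const Matrix& B,
                 float beta = 0.0f, const Matrix* C = nullptr) {
  assert(A.cols == B.rows);
  assert(C == nullptr || (C->rows == A.rows && C->cols == B.cols));
  Matrix D(A.rows, B.cols);
  for (std::size_t i = 0; i < A.rows; ++i) {
    for (std::size_t j = 0; j < B.cols; ++j) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < A.cols; ++k) {
        acc += A.at(i, k) * B.at(k, j);
      }
      float value = alpha * acc;
      if (C != nullptr) {
        value += beta * C->at(i, j);
      }
      D.at(i, j) = std::max(value, 0.0f);  // ReLU
    }
  }
  return D;
}

// Unfused reference for the back-to-back pattern:
//   D0 = relu(alpha0 * A0 * B0)
//   D1 = relu(alpha1 * D0 * B1 + beta1 * C1)
// The first GEMM's N extent must equal the second GEMM's K extent.
Matrix b2b_gemm_reference(float alpha0, const Matrix& A0, const Matrix& B0,
                          float alpha1, const Matrix& B1,
                          float beta1, const Matrix& C1) {
  assert(B0.cols == B1.rows);  // GEMM1_N == GEMM2_K
  Matrix D0 = gemm_relu(alpha0, A0, B0);
  return gemm_relu(alpha1, D0, B1, beta1, &C1);
}
```

With the problem sizes quoted in the comment, `B0` would be 576 x 64 and `B1` would be 64 x 128, so the shape check passes. A fused kernel avoids writing `D0` back to global memory, which is presumably where the reported speedup comes from.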

media/docs/profiler.md

+3 -1
@@ -15,10 +15,12 @@ $ make cutlass_profiler -j
 To limit compilation time, only one tile size (128x128) is instantiated for each data type, math instruction, and layout.
 To instantiate all sizes, set the following CMake variable when running CMake from an empty `build/` directory.
 ```bash
-$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all
+$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON
 ...
 $ make cutlass_profiler -j
 ```
+Enabling the unity build places multiple kernel instances in one compilation unit, thereby reducing the size of the compiled
+binary and avoiding linker limitations on some platforms.
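To make "multiple kernel instances in one compilation unit" concrete: a unity build is commonly implemented by generating a source file that `#include`s several others, so the compiler sees them as one translation unit. The sketch below illustrates only that batching idea. It is not CUTLASS's CMake logic, and the function name, file names, and batch size are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Groups `sources` into batches of at most `batch_size` files and returns the
// text of one generated "unity" source per batch. Each generated source just
// includes its members, so N original files compile as ceil(N / batch_size)
// translation units. A batch size of 0 yields no output.
std::vector<std::string> make_unity_sources(
    const std::vector<std::string>& sources, std::size_t batch_size) {
  std::vector<std::string> unity;
  if (batch_size == 0) {
    return unity;
  }
  for (std::size_t begin = 0; begin < sources.size(); begin += batch_size) {
    const std::size_t end = std::min(begin + batch_size, sources.size());
    std::string text = "// generated unity source\n";
    for (std::size_t i = begin; i < end; ++i) {
      text += "#include \"" + sources[i] + "\"\n";
    }
    unity.push_back(text);
  }
  return unity;
}
```

For example, five hypothetical kernel sources with a batch size of two produce three generated files, the last of which includes only the fifth source.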
 
 The CUTLASS Profiler sources are stored in
 ```bash

media/docs/quickstart.md

+6 -3
@@ -403,7 +403,7 @@ $ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=sgemm
 Compiling only the kernels desired reduces compilation time.
 
 To instantiate kernels of all tile sizes, data types, and alignment constraints, specify
-`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.
+`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.
 
 Several recipes are defined below for convenience. They may be combined as a comma-delimited list.
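The `CUTLASS_LIBRARY_KERNELS` values that appear in this diff (`sgemm`, `tensorop*gemm`, `all`, and comma-delimited combinations) suggest a filter list applied to kernel names. The sketch below shows one way such a filter could behave. It rests on assumptions, not on the CUTLASS generator: each entry is treated as a glob where `*` matches any run of characters, an entry matches if it occurs anywhere in the kernel name, and `all` selects everything. The function names and the kernel names used in the usage note are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Whole-string glob match: '*' matches any (possibly empty) run of characters,
// every other character matches itself.
bool glob_match(const std::string& pattern, const std::string& text) {
  const std::size_t npos = std::string::npos;
  std::size_t p = 0, t = 0, star = npos, mark = 0;
  while (t < text.size()) {
    if (p < pattern.size() && pattern[p] == '*') {
      star = p++;  // remember the wildcard and where it started matching
      mark = t;
    } else if (p < pattern.size() && pattern[p] == text[t]) {
      ++p;
      ++t;
    } else if (star != npos) {
      p = star + 1;  // backtrack: let the last '*' absorb one more character
      t = ++mark;
    } else {
      return false;
    }
  }
  while (p < pattern.size() && pattern[p] == '*') {
    ++p;
  }
  return p == pattern.size();
}

// Splits a comma-delimited list, skipping empty entries.
std::vector<std::string> split_commas(const std::string& list) {
  std::vector<std::string> entries;
  std::string current;
  for (char c : list) {
    if (c == ',') {
      if (!current.empty()) entries.push_back(current);
      current.clear();
    } else {
      current.push_back(c);
    }
  }
  if (!current.empty()) entries.push_back(current);
  return entries;
}

// Assumed semantics: a kernel is selected if any entry is "all" or matches
// somewhere inside the kernel name.
bool kernel_selected(const std::string& filter_list,
                     const std::string& kernel_name) {
  for (const std::string& entry : split_commas(filter_list)) {
    if (entry == "all" || glob_match("*" + entry + "*", kernel_name)) {
      return true;
    }
  }
  return false;
}
```

Under these assumptions, the list `sgemm,tensorop*gemm` would select both a hypothetical `cutlass_simt_sgemm_128x128_nn` and a hypothetical `cutlass_tensorop_h1688gemm_128x128_nt`, while `sgemm` alone would select only the first.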
 
@@ -412,9 +412,12 @@ Several recipes are defined below for convenience. They may be combined as a com
 $ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm
 ```
 
-**Example.** All kernels for NVIDIA Volta, Turing, and Ampere architectures.
+**Example.** All kernels for NVIDIA Volta, Turing, and Ampere architectures. Enabling
+the "unity build" instantiates multiple kernel instances in each compilation unit, thereby
+reducing binary size and avoiding linker limitations on some platforms.
 ```bash
-$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all
+$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all \
+  -DCUTLASS_UNITY_BUILD_ENABLED=ON
 ```
 
 **Example.** All GEMM kernels targeting Turing Tensor Cores.
