support group_gemm_offset, group_gemm_offset_swapAB #116
+1,593
−53
Add support for two grouped GEMM offset types: group_gemm_offset and group_gemm_offset_swapAB.
Perf (num_groups=2, expected_m_per_group= 16, n=4096, k=7168): 36 us | throughput: 53 TFLOPS, 1665 GB/s
Perf (num_groups=4, expected_m_per_group= 16, n=4096, k=7168): 65 us | throughput: 58 TFLOPS, 1813 GB/s
Perf (num_groups=2, expected_m_per_group= 32, n=4096, k=7168): 35 us | throughput: 106 TFLOPS, 1685 GB/s
Perf (num_groups=9, expected_m_per_group= 32, n=4096, k=7168): 141 us | throughput: 120 TFLOPS, 1900 GB/s
Perf (num_groups=2, expected_m_per_group= 32, n=4096, k=7168): 35 us | throughput: 106 TFLOPS, 1689 GB/s
Perf (num_groups=4, expected_m_per_group= 32, n=4096, k=7168): 66 us | throughput: 115 TFLOPS, 1822 GB/s
Perf (num_groups=32, expected_m_per_group= 64, n=4096, k=7168): 485 us | throughput: 248 TFLOPS, 2002 GB/s
Perf (num_groups= 2, expected_m_per_group= 16, n=4096, k=7168): 27 us | throughput: 71 TFLOPS, 2226 GB/s
Perf (num_groups= 4, expected_m_per_group= 16, n=4096, k=7168): 46 us | throughput: 82 TFLOPS, 2587 GB/s
Perf (num_groups= 2, expected_m_per_group= 32, n=4096, k=7168): 28 us | throughput: 134 TFLOPS, 2136 GB/s
Perf (num_groups= 9, expected_m_per_group= 32, n=4096, k=7168): 93 us | throughput: 183 TFLOPS, 2902 GB/s
Perf (num_groups= 2, expected_m_per_group= 32, n=4096, k=7168): 28 us | throughput: 135 TFLOPS, 2143 GB/s
Perf (num_groups= 4, expected_m_per_group= 32, n=4096, k=7168): 49 us | throughput: 152 TFLOPS, 2414 GB/s
Perf (num_groups=32, expected_m_per_group= 64, n=4096, k=7168): 479 us | throughput: 251 TFLOPS, 2029 GB/s
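For readers who want to sanity-check the throughput columns, here is a minimal sketch of how such numbers can be derived from the logged shapes and timings. It is not taken from this PR's benchmark code. The function name `grouped_gemm_throughput` is hypothetical, and the element sizes (1 byte for A and B, 2 bytes for the output) are an assumption that happens to reproduce the logged figures. Every group is also assumed to have exactly `expected_m_per_group` rows, which the real benchmark may not guarantee, and the logged times are rounded to whole microseconds, so small deviations on other rows are expected.

```python
def grouped_gemm_throughput(num_groups, m_per_group, n, k, time_us,
                            a_bytes=1, b_bytes=1, d_bytes=2):
    """Estimate (TFLOPS, GB/s) for a grouped GEMM.

    Assumptions (not confirmed by the PR): each group multiplies an
    (m x k) matrix by a (k x n) matrix, inputs are 1 byte per element,
    and the output is 2 bytes per element.
    """
    seconds = time_us * 1e-6
    # One multiply and one add per (m, n, k) triple, for every group.
    flops = 2 * num_groups * m_per_group * n * k
    # Each operand is read once and the output is written once.
    bytes_moved = num_groups * (
        m_per_group * k * a_bytes
        + n * k * b_bytes
        + m_per_group * n * d_bytes
    )
    return flops / seconds / 1e12, bytes_moved / seconds / 1e9


# Row: num_groups=32, expected_m_per_group=64, n=4096, k=7168, 485 us
tflops, gbps = grouped_gemm_throughput(32, 64, 4096, 7168, 485)
print(f"{tflops:.0f} TFLOPS, {gbps:.0f} GB/s")  # 248 TFLOPS, 2002 GB/s
```

At small `expected_m_per_group` the byte count is dominated by the `n * k` weight term, which is why those rows report high GB/s but comparatively low TFLOPS.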