Add CuTe Matrix Transpose tutorial #562
@sspintel Thanks for the example. Could you please explain why we need this? Can you share the improvement achieved on the BMG/Xe2 platform?
Pull Request Overview
This PR adds a comprehensive CuTe Matrix Transpose tutorial, based on Colfax's article, demonstrating various transpose implementation strategies for Intel GPUs using SYCL and CuTe abstractions.
Key changes:
- Implements multiple transpose kernels (naive, SMEM-based, block 2D) with performance benchmarking
- Adds utility functions for random data generation and validation
- Fixes SYCL compatibility issues in platform headers
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| include/cutlass/platform/platform.h | Adds lowest() method to numeric_limits<float> for proper float range support |
| include/cute/util/compat/traits.hpp | Fixes SYCL item template to use correct non-offset variant |
| examples/cute/tutorial/transpose/util.h | Provides benchmarking and validation utilities for transpose operations |
| examples/cute/tutorial/transpose/transpose_sycl.cpp | Implements reference SYCL transpose kernels for comparison |
| examples/cute/tutorial/transpose/transpose_smem.h | Implements CuTe SMEM-based transpose with optional swizzling |
| examples/cute/tutorial/transpose/transpose_naive.h | Implements naive CuTe transpose without SMEM |
| examples/cute/tutorial/transpose/main.cpp | Entry point that benchmarks all transpose implementations |
| examples/cute/tutorial/transpose/copy_smem.h | Implements SMEM copy kernel as baseline |
| examples/cute/tutorial/transpose/copy_direct.h | Implements direct GMEM copy kernel as baseline |
| examples/cute/tutorial/transpose/block_2d_transposed_copy.h | Implements Intel Xe block 2D transposed load operations |
| examples/cute/tutorial/CMakeLists.txt | Registers new transpose tutorial executable |
| examples/common/sycl_cute_common.hpp | Adds std::vector overloads for random_fill and zero_fill |

    CUTLASS_HOST_DEVICE
    static constexpr float infinity() noexcept { return bit_cast<float, int32_t>(0x7f800000);}
    CUTLASS_HOST_DEVICE
    static constexpr float lowest() noexcept { return -bit_cast<float, int32_t>(0x7f7fffff) - 1;}

Copilot
AI
Nov 5, 2025
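The expression for lowest() is misleading. -bit_cast<float, int32_t>(0x7f7fffff) - 1 negates the bit-cast float and then subtracts 1 in floating-point arithmetic, not from the integer bit pattern before the cast. At that magnitude the subtraction is absorbed by rounding, so the result is still the most negative finite float, but the trailing - 1 does nothing and obscures the intent. Use bit_cast<float, int32_t>(0xff7fffff), the IEEE 754 bit pattern for the most negative finite float, instead.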
Suggested change:

    - static constexpr float lowest() noexcept { return -bit_cast<float, int32_t>(0x7f7fffff) - 1;}
    + static constexpr float lowest() noexcept { return bit_cast<float, int32_t>(0xff7fffff);}

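The bit-pattern argument can be checked on the host with a small standalone sketch. It assumes a 32-bit IEEE 754 `float` and uses `std::memcpy` as a portable stand-in for `bit_cast`; the helper names `float_from_bits`, `lowest_from_bits`, and `lowest_from_expression` are hypothetical and are not part of CUTLASS.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Reinterpret a 32-bit pattern as a float (stand-in for bit_cast; assumes IEEE 754).
inline float float_from_bits(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Most negative finite float, built directly from its bit pattern
// (sign bit set, largest finite exponent, all mantissa bits set).
inline float lowest_from_bits() { return float_from_bits(0xff7fffffu); }

// The expression from the PR: negate the largest finite float, then subtract 1
// in floating-point arithmetic.
inline float lowest_from_expression() { return -float_from_bits(0x7f7fffffu) - 1; }
```

Comparing both helpers against `std::numeric_limits<float>::lowest()` is a quick way to confirm which pattern is intended.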

    auto transpose_function = make_layout(tensor_shape_S, LayoutRight{});
    for (size_t i = 0; i < h_D.size(); ++i)
      if (h_D[i] != h_S[transpose_function(i)])
        bad++;

Copilot
AI
Nov 5, 2025
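The transpose validation logic is hard to verify at a glance. It relies on CuTe's convention that a 1-D index passed to a layout is first converted to a coordinate in colexicographic (column-major) order. Assuming tensor_shape_S is (M, N), transpose_function(i) with LayoutRight strides therefore evaluates to (i % M) * N + i / M rather than i, which is the transposed index only when the source is row-major M x N and the destination is row-major N x M. For a validation that does not depend on that convention, convert linear index i to (row, col) in the destination, swap to (col, row), then compute the linear index in the source.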
Suggested change:

    - auto transpose_function = make_layout(tensor_shape_S, LayoutRight{});
    - for (size_t i = 0; i < h_D.size(); ++i)
    -   if (h_D[i] != h_S[transpose_function(i)])
    -     bad++;
    + // Explicit validation: map destination index to source index via transpose
    + for (size_t i = 0; i < h_D.size(); ++i) {
    +   // Destination shape: N x M, so row = i / M, col = i % M
    +   size_t row = i / M;
    +   size_t col = i % M;
    +   // Source shape: M x N, so source index = col * N + row
    +   size_t source_index = col * N + row;
    +   if (h_D[i] != h_S[source_index])
    +     bad++;
    + }

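The index mapping in the suggestion can be exercised without CuTe. The following is a minimal host-only sketch, assuming a row-major M x N source and a row-major N x M destination; `transposed_source_index`, `transpose_reference`, and `count_transpose_mismatches` are hypothetical helper names, not functions from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset into the row-major M x N source of the element that lands at
// linear index i of the row-major N x M destination.
inline std::size_t transposed_source_index(std::size_t i, std::size_t M, std::size_t N) {
    const std::size_t row = i / M;  // destination row, in [0, N)
    const std::size_t col = i % M;  // destination column, in [0, M)
    return col * N + row;           // source element (col, row)
}

// Straightforward reference transpose used to produce known-good output.
inline std::vector<float> transpose_reference(const std::vector<float>& src,
                                              std::size_t M, std::size_t N) {
    std::vector<float> dst(src.size());
    for (std::size_t r = 0; r < M; ++r)
        for (std::size_t c = 0; c < N; ++c)
            dst[c * M + r] = src[r * N + c];
    return dst;
}

// Count destination elements that differ from the transposed source.
inline std::size_t count_transpose_mismatches(const std::vector<float>& h_S,
                                              const std::vector<float>& h_D,
                                              std::size_t M, std::size_t N) {
    std::size_t bad = 0;
    for (std::size_t i = 0; i < h_D.size(); ++i)
        if (h_D[i] != h_S[transposed_source_index(i, M, N)])
            ++bad;
    return bad;
}
```

Note that `col * N + row` here equals `(i % M) * N + i / M`, so a non-square test shape such as 3 x 5 is useful for catching a mapping that silently degenerates to the identity.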

    constexpr size_t numIters = 100;

    typedef unsigned int uint;

Copilot
AI
Nov 5, 2025
Using typedef for uint is outdated C++ style. Replace with using uint = unsigned int; for consistency with modern C++ conventions.
Suggested change:

    - typedef unsigned int uint;
    + using uint = unsigned int;

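As a small illustration of the difference, the two spellings name the same type, but only `using` extends to alias templates. The names `uint_typedef`, `uint_alias`, and `HostBuffer` below are hypothetical and chosen to avoid clashing with any `uint` a system header may already define.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

typedef unsigned int uint_typedef;  // legacy spelling
using uint_alias = unsigned int;    // alias declaration, names the same type

// Alias templates can only be written with `using`.
template <class T>
using HostBuffer = std::vector<T>;

static_assert(std::is_same<uint_typedef, uint_alias>::value,
              "typedef and using produce the same type");
```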
Hi @Antonyvance, the PR is still a WIP and is meant to be an introduction to memory access patterns in CuTe (similar to the Colfax article in the description). The tutorial has samples for GMEM copy, naive transpose, and SMEM copy and transpose that demonstrate how to avoid strided accesses to and from GMEM by routing them through SMEM instead. The main program runs these kernels separately and benchmarks the effective bandwidth each one achieves. I will also explore other CuTe concepts such as swizzling to avoid bank conflicts, tv-layouts, sgv-layouts, and block 2D copy atoms for copy and transpose. Overall, this is a learning exercise for me and might also serve as an introduction for others learning CuTe. I will add a detailed README once the implementation is done.
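For reference, effective bandwidth for a copy or transpose is commonly computed as bytes read plus bytes written, divided by elapsed time. The sketch below assumes each of the M x N elements is read once and written once; `effective_bandwidth_gbps` is a hypothetical helper, and the PR's own benchmarking code may compute this differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for moving an M x N matrix:
// every element is assumed to be read once and written once.
inline double effective_bandwidth_gbps(std::size_t M, std::size_t N,
                                       std::size_t elem_bytes, double seconds) {
    const double bytes_moved = 2.0 * static_cast<double>(M) *
                               static_cast<double>(N) *
                               static_cast<double>(elem_bytes);
    return bytes_moved / seconds / 1e9;
}
```

When timing a loop of several kernel launches, the per-launch time (total time divided by the iteration count) is the value to pass as `seconds`.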
Based on Colfax's article: https://research.colfax-intl.com/tutorial-matrix-transpose-in-cutlass/