
Conversation


@sspintel sspintel commented Oct 15, 2025

@Antonyvance Antonyvance added the "examples" label (Label for adding examples, complex kernels development using cutlass or cute APIS) Oct 17, 2025

Antonyvance commented Oct 17, 2025

@sspintel Thanks for the example. Could you please explain why we need this? Can you share the improvement achieved on the BMG/Xe2 platform?

@Antonyvance Antonyvance added the "information required" label (The PR requires more information to review them properly) Oct 17, 2025
@sspintel sspintel force-pushed the dev/matrix-transpose branch from 43a06f2 to 3578f8e Compare October 29, 2025 05:39
@Antonyvance Antonyvance requested a review from Copilot November 5, 2025 07:01

Copilot AI left a comment


Pull Request Overview

This PR adds a comprehensive CuTe Matrix Transpose tutorial, based on Colfax's article, demonstrating various transpose implementation strategies for Intel GPUs using SYCL and CuTe abstractions.

Key changes:

  • Implements multiple transpose kernels (naive, SMEM-based, block 2D) with performance benchmarking
  • Adds utility functions for random data generation and validation
  • Fixes SYCL compatibility issues in platform headers

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| include/cutlass/platform/platform.h | Adds lowest() method to numeric_limits<float> for proper float range support |
| include/cute/util/compat/traits.hpp | Fixes SYCL item template to use correct non-offset variant |
| examples/cute/tutorial/transpose/util.h | Provides benchmarking and validation utilities for transpose operations |
| examples/cute/tutorial/transpose/transpose_sycl.cpp | Implements reference SYCL transpose kernels for comparison |
| examples/cute/tutorial/transpose/transpose_smem.h | Implements CuTe SMEM-based transpose with optional swizzling |
| examples/cute/tutorial/transpose/transpose_naive.h | Implements naive CuTe transpose without SMEM |
| examples/cute/tutorial/transpose/main.cpp | Entry point that benchmarks all transpose implementations |
| examples/cute/tutorial/transpose/copy_smem.h | Implements SMEM copy kernel as baseline |
| examples/cute/tutorial/transpose/copy_direct.h | Implements direct GMEM copy kernel as baseline |
| examples/cute/tutorial/transpose/block_2d_transposed_copy.h | Implements Intel Xe block 2D transposed load operations |
| examples/cute/tutorial/CMakeLists.txt | Registers new transpose tutorial executable |
| examples/common/sycl_cute_common.hpp | Adds std::vector overloads for random_fill and zero_fill |

CUTLASS_HOST_DEVICE
static constexpr float infinity() noexcept { return bit_cast<float, int32_t>(0x7f800000);}
CUTLASS_HOST_DEVICE
static constexpr float lowest() noexcept { return -bit_cast<float, int32_t>(0x7f7fffff) - 1;}

Copilot AI Nov 5, 2025


The computation of lowest() is incorrect. The expression -bit_cast<float, int32_t>(0x7f7fffff) - 1 applies negation and subtraction to the bit-casted float, not to the integer before casting. It should be bit_cast<float, int32_t>(0xff7fffff) to represent the IEEE 754 bit pattern for the most negative finite float.

Suggested change
- static constexpr float lowest() noexcept { return -bit_cast<float, int32_t>(0x7f7fffff) - 1;}
+ static constexpr float lowest() noexcept { return bit_cast<float, int32_t>(0xff7fffff);}
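
The two forms of lowest() discussed above can be compared on the host with a small standalone check. This is only a sketch: `float_from_bits` and `lowest_forms_agree` are hypothetical helpers, not CUTLASS functions, and `std::memcpy` stands in for the `bit_cast` used in platform.h. It assumes a 32-bit IEEE 754 `float` with the default round-to-nearest mode.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper (not part of CUTLASS): reinterpret a 32-bit pattern as a float.
inline float float_from_bits(std::uint32_t bits) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
  float value;
  std::memcpy(&value, &bits, sizeof(value));
  return value;
}

// Returns true when the suggested bit pattern, the standard library value,
// and the arithmetic form from the original line all produce the same float.
inline bool lowest_forms_agree() {
  const float max_finite = float_from_bits(0x7f7fffffu);    // largest finite float
  const float from_pattern = float_from_bits(0xff7fffffu);  // same bits with the sign bit set
  const float from_arithmetic = -max_finite - 1;            // float arithmetic, as in the original line
  return from_pattern == std::numeric_limits<float>::lowest() &&
         from_pattern == -max_finite &&
         from_arithmetic == from_pattern;
}
```

Under those assumptions, subtracting 1 from a value of this magnitude rounds back to the same float, so both forms should compare equal. The explicit bit pattern mainly makes the intent easier to read.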

Copilot uses AI. Check for mistakes.
Comment on lines +69 to +72
auto transpose_function = make_layout(tensor_shape_S, LayoutRight{});
for (size_t i = 0; i < h_D.size(); ++i)
if (h_D[i] != h_S[transpose_function(i)])
bad++;

Copilot AI Nov 5, 2025


The transpose validation logic is incorrect. transpose_function(i) with a LayoutRight layout on tensor_shape_S will compute the same linear index i, not the transposed index. For proper transpose validation, you need to convert linear index i to (row, col) in the destination, swap to (col, row), then compute the linear index in the source.

Suggested change
- auto transpose_function = make_layout(tensor_shape_S, LayoutRight{});
- for (size_t i = 0; i < h_D.size(); ++i)
-   if (h_D[i] != h_S[transpose_function(i)])
-     bad++;
+ // Correct validation: map destination index to source index via transpose
+ for (size_t i = 0; i < h_D.size(); ++i) {
+   // Destination shape: N x M, so row = i / M, col = i % M
+   size_t row = i / M;
+   size_t col = i % M;
+   // Source shape: M x N, so source index = col * N + row
+   size_t source_index = col * N + row;
+   if (h_D[i] != h_S[source_index])
+     bad++;
+ }
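
The index arithmetic in the suggested change can be tried in isolation, without CuTe or SYCL. The sketch below assumes both buffers are row-major, with the source of shape M x N and the destination of shape N x M. `count_transpose_mismatches` and `reference_transpose` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts elements of D (N x M, row-major) that differ
// from the transpose of S (M x N, row-major).
template <class T>
std::size_t count_transpose_mismatches(const std::vector<T>& S,
                                       const std::vector<T>& D,
                                       std::size_t M, std::size_t N) {
  std::size_t bad = 0;
  for (std::size_t i = 0; i < D.size(); ++i) {
    const std::size_t row = i / M;                   // row in D, in [0, N)
    const std::size_t col = i % M;                   // column in D, in [0, M)
    const std::size_t source_index = col * N + row;  // element (col, row) of S
    if (D[i] != S[source_index]) ++bad;
  }
  return bad;
}

// Hypothetical host-side reference transpose, used to build known-good data.
template <class T>
std::vector<T> reference_transpose(const std::vector<T>& S,
                                   std::size_t M, std::size_t N) {
  std::vector<T> D(S.size());
  for (std::size_t m = 0; m < M; ++m)
    for (std::size_t n = 0; n < N; ++n)
      D[n * M + m] = S[m * N + n];
  return D;
}
```

CuTe's layout documentation describes a single integer coordinate as being unfolded colexicographically across the shape before the strides are applied. It may therefore be worth checking whether the original `transpose_function(i)` form already yields this same source index.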


constexpr size_t numIters = 100;

typedef unsigned int uint;

Copilot AI Nov 5, 2025


Using typedef for uint is outdated C++ style. Replace with using uint = unsigned int; for consistency with modern C++ conventions.

Suggested change
- typedef unsigned int uint;
+ using uint = unsigned int;

sspintel (Author) commented Nov 6, 2025

> @sspintel Thanks for the example. Could you please explain why we need this? Can you share the improvement achieved on the BMG/Xe2 platform?

Hi @Antonyvance, the PR is still a WIP and is meant as an introduction to memory access patterns in CuTe (similar to the Colfax article in the description). The tutorial has samples for GMEM copy, naive transpose, and SMEM copy and transpose, which demonstrate how to avoid strided accesses to and from GMEM by staging them through SMEM instead. The main program runs these kernels separately and compares the effective bandwidth each one achieves.

I will also explore other CuTe concepts, such as swizzling to avoid bank conflicts, tv-layouts, sgv-layouts, and using block 2D copy atoms for copy and transpose. Overall, this is a learning exercise for me and might also serve as an introduction for others learning CuTe. I will add a detailed README once the implementation is done.
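
For readers unfamiliar with the "effective bandwidth" metric mentioned above, here is a minimal sketch of one common way to compute it. It assumes each element is read once and written once, which may not match the accounting the tutorial's util.h actually uses. `effective_bandwidth_gbps` is a hypothetical name.

```cpp
#include <cstddef>

// Hypothetical helper: effective bandwidth in GB/s for an M x N copy or transpose.
// Assumption: every element is read once from global memory and written once,
// so the total traffic is 2 * M * N * element_bytes.
inline double effective_bandwidth_gbps(std::size_t M, std::size_t N,
                                       std::size_t element_bytes,
                                       double seconds) {
  const double bytes_moved = 2.0 * static_cast<double>(M) *
                             static_cast<double>(N) *
                             static_cast<double>(element_bytes);
  return bytes_moved / seconds / 1e9;  // 1 GB taken as 1e9 bytes
}
```

With per-kernel timings averaged over several iterations, this figure lets a transpose kernel be compared directly against the plain copy baselines, since both move the same number of bytes.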

