Example of BF16/FP16 MoE Grouped GEMM with CuTe interface #600
base: main
Conversation
What datatypes are covered by this PR, and what is the current performance?
`reorder(tArA, tCrA);`
`reorder(tBrB, tCrB);`
For fp16/bf16, what's the purpose of these two reorders? If we read the data of A and B with the demanded layout, we don't need an extra reorder.
Thanks for reviewing!
> If we read the data of A and B with demand layout, we don't need to add reorder again

In that case, it's a no-op. This is explicitly mentioned in the rearch documentation:

> reorder acts as a "pipe" connecting copy and MMA operations (or any other subgroup-scope operations). With reorders, the kernel writer does not need to worry about perfectly matching layouts between copy and MMA atoms. In case the layouts do match perfectly (as make_block_2d_copy_{A,B,C} try to do), the compiler is able to remove the reorder entirely, making it a no-op.
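For context, here is a schematic of the mainloop pattern the quoted lines come from. The tensor names follow the quoted diff; the loop skeleton itself is an assumption in the style of the rearch tutorial, not a verbatim excerpt from this PR:

```cpp
// K-tile mainloop: the copy and MMA atoms each work in their preferred
// layouts, with reorder as the "pipe" between them. If make_block_2d_copy_{A,B}
// already produced MMA-matching layouts, both reorders compile away.
for (int k = 0; k < k_tile_count; ++k) {
  copy(copy_a, tAgA(_,_,_,k), tArA);   // load A tile in the copy atom's layout
  copy(copy_b, tBgB(_,_,_,k), tBrB);   // load B tile in the copy atom's layout
  reorder(tArA, tCrA);                 // re-lay A fragments for the MMA atom
  reorder(tBrB, tCrB);                 // re-lay B fragments for the MMA atom
  gemm(mma, tCrA, tCrB, tCrC);         // accumulate into the C fragments
}
```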
tdeng5 left a comment:
We have folders for MoE, like 09_bmg_grouped_gemm_fp8 and 10_bmg_grouped_gemm_mixed_dtype; are you doing something similar? If yes, can we follow the existing naming convention for the examples folders?
Will do, thanks!
Pull Request Overview
This PR introduces a Mixture of Experts (MoE) GEMM implementation for Intel GPUs using SYCL and CuTe, enabling MoE computations without collectives that can be called directly from GPU kernels.
Key Changes
- New persistent tile scheduler for MoE GEMM workloads with custom work distribution
- Core MoE GEMM kernels supporting both standard 16-bit floating-point and quantized MXFP4 formats
- Example implementation demonstrating multi-expert GEMM execution with real workload patterns
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| examples/cute/tutorial/moe/moe_tile_scheduler.hpp | Implements persistent tile scheduler adapted for MoE workloads with per-expert work distribution |
| examples/cute/tutorial/moe/moe_grouped_gemm.hpp | Main MoE GEMM orchestration handling expert batching and tensor pointer updates |
| examples/cute/tutorial/moe/moe_gemms.hpp | Device-side GEMM kernels with support for bf16/fp16 and MXFP4 quantized operations |
| examples/cute/tutorial/moe/moe_example.cpp | Host-side launcher and example with realistic multi-layer expert workload patterns |
| examples/cute/tutorial/CMakeLists.txt | Adds build target for MoE GEMM example |
Comments suppressed due to low confidence (1)
examples/cute/tutorial/moe/moe_tile_scheduler.hpp:1
- Corrected spelling of 'Othwerwise' to 'Otherwise' in comment at line 301 of moe_gemms.hpp
We need an MXFP4 GEMM that supports packed weights & scales
jiyang1011 left a comment:
LGTM
There is nothing special in this MoE tile scheduler. Why not use the persistent scheduler that CUTLASS supplies?
Or @sanchitintel, could you help point out your motivation?
Hi @jiyang1011, thanks for reviewing the PR!
This file has some custom code, as MoE GEMM is a special case of Grouped GEMM.
The Grouped GEMM API requires creating a vector of ProblemShape objects, one per GEMM problem, which is used by the Grouped GEMM tile scheduler. If there are 32 groups, a vector of 32 ProblemShape objects is created.
Since these would not be known at compile time for a framework, they would have to be created at run-time instead.
However, for MoE GEMM, I just provide one dummy shape, and the custom code in the tile scheduler derives the shape of each GEMM problem.
Also, another reason for having a separate scheduler is that I'll change the scheduling algorithm down the line.
Thanks!
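To illustrate the difference (hypothetical names, not the PR's actual types): instead of consuming a host-built vector of ProblemShape objects, the MoE scheduler can derive each group's shape on the device from per-expert token counts plus the shared N and K:

```cpp
// Hypothetical sketch: one "dummy" shape plus a device array of per-expert
// token counts is enough to reconstruct every group's (M, N, K) on the fly.
struct MoEWorkInfo {
  const int* rows_per_expert; // device array: M_i for expert i
  int num_experts;
  int N, K;                   // identical for every expert's weight

  // Shape of group g, derived on demand inside the scheduler loop,
  // instead of being read from a precomputed vector of ProblemShapes.
  void group_shape(int g, int& M_g, int& N_g, int& K_g) const {
    M_g = rows_per_expert[g];
    N_g = N;
    K_g = K;
  }
};
```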
Hi @rolandschulz @ratnampa, can you please clarify why the CI is using `-DDPCPP_HOST_COMPILER=g++-13`? It is causing some issues in #600. Thanks!
@sanchitintel, this is for supporting g++ as the host compiler, rather than using icpx as the host compiler.
If I change CMakeLists.txt with `if (CMAKE_CXX_COMPILER_ID STREQUAL "IntelLLVM")`, that somehow also works when g++ is used as the host compiler, although it should not, so I'm making the corresponding changes in C++ instead.
Added corresponding changes in CMake
Avoid extra is_valid() calls after group changes
Fixed a bug for the case in which the first GEMM problem's output has fewer than xe_core_count WG tiles. In practice, this issue can't happen on BMG with OOB models.
Summary
MoE GEMM without using collectives (uses CuTe interface). Can be called directly from another GPU kernel.
New copy and MMA atoms have been used. Some of the code has been adapted from the new API example.
The MMA code is similar to the Grouped GEMM MMA collective code. The epilogue was eliminated by using an FP32 -> BF16/FP16 reorder & store.
The implementation is extensible & MXFP4/MXFP8 GEMMs can also be added.
Currently, the example uses RowMajor B because ColumnMajor B performance is worse for the underlying vanilla GEMM.
To integrate this code in a framework, users should use appropriate TiledMMAs (with suitable WG, SG tile shapes), copy atoms & rasterization.
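As a rough illustration of that integration point (the CuTe calls are real, but `AtomXe` and the subgroup layout below are placeholders, not this PR's defaults):

```cpp
#include <cute/tensor.hpp>
#include <cute/atom/mma_atom.hpp>

// Sketch: build the TiledMMA from whichever Xe MMA atom matches your dtype,
// and pick the subgroup layout (and thus WG/SG tile shapes) per workload.
// `AtomXe` is a placeholder for a real Xe MMA atom type.
template <class AtomXe>
auto make_framework_tiled_mma() {
  return cute::make_tiled_mma(
      cute::MMA_Atom<AtomXe>{},                                   // dtype-specific atom
      cute::Layout<cute::Shape<cute::_4, cute::_2, cute::_1>>{}); // SGs in (M,N,K)
}
```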
Introduction
For Mixture of Experts used in Deep Learning models such as LLMs, the MoE GEMM use-case is something like this: each expert (corresponding to a group) has an associated weight sized `N * K`, which is essentially a column-major `B` matrix (serving frameworks may change it to RowMajor as well). All the `B` matrices are contiguous w.r.t. each other, i.e. their total size is `num_groups * N * K`. `M` for each group (an individual GEMM problem) may be different. All `A` matrices are also contiguous w.r.t. each other. Each set of tokens routed to an expert makes up the `A` matrix for that group. Since a token may be routed to multiple experts, tokens are duplicated such that the activations being routed to an expert are contiguous; this happens before the MoE Grouped GEMMs are called, though.
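A small sketch of the pointer arithmetic this layout implies (illustrative names and a stand-in 16-bit type, not the PR's API):

```cpp
#include <cstddef>
#include <cstdint>

using bf16 = std::uint16_t;  // stand-in for a 16-bit bfloat storage type

// All B matrices are back-to-back: expert g's weight starts g * N * K in.
const bf16* B_for_group(const bf16* B_base, int g, int N, int K) {
  return B_base + std::size_t(g) * N * K;
}

// A matrices are contiguous too; m_offsets[g] is the prefix sum
// M_0 + ... + M_{g-1} of per-expert (duplicated) token counts.
const bf16* A_for_group(const bf16* A_base, const int* m_offsets, int g, int K) {
  return A_base + std::size_t(m_offsets[g]) * K;
}
```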
When multiple GEMMs are to be computed, each with its own canonical `A`, `B`, `C`, `D` matrices, Grouped GEMM is useful for ensuring high GPU utilization & preventing the launch overhead that'd otherwise occur with multiple GEMM kernel launches. In CUTLASS, the vanilla Grouped GEMM uses a persistent kernel approach: the number of workgroups launched is equal to the number of Xe cores, and they keep looping as long as they have work (here, "work" is the mainloop that computes one of the output tiles of any one of the GEMMs being computed with the Grouped GEMM API). MoE GEMM thus seems to be a natural candidate for leveraging Grouped GEMM. However, Grouped GEMM's API requires providing pointers for `A`, `B`, `C`, `D` of each GEMM problem, as well as a vector of input shapes of the individual GEMM problems, so the API interface is not suitable for the MoE GEMM use-case; the MoE GEMM interface is cleaner. Moreover, MoE GEMM doesn't use the canonical `C` matrix.
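Schematically, the persistent approach looks like this (illustrative names, not the PR's scheduler code):

```cpp
// One workgroup per Xe core; each strides through the global list of output
// WG tiles until no work remains.
void persistent_schedule(int wg_id, int num_wgs /* == Xe core count */,
                         int total_wg_tiles) {
  for (int tile = wg_id; tile < total_wg_tiles; tile += num_wgs) {
    // 1. Map the flat tile index to (group, output tile within that group).
    // 2. Run the GEMM mainloop for that output tile.
  }
}
```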
Performance
When M-occupancy is high & all the individual GEMM problems are compute-bound, the throughput is close to peak performance.
When M-dimension of individual GEMM problems is small (typical for decoding), and WG_M, SG_M are also small (e.g. 8), then the implementation attains close to peak memory bandwidth utilization.
MoE GEMM is unlike vanilla GEMM. At runtime, the token distribution, especially for prefill, may be highly skewed, so M-occupancy may be low. If M-occupancy is 60%, that means 40% of the MMA compute is wasteful.
Using a smaller WG_M would improve M-occupancy, but smaller A tiles mean more frequent memory transfers with less compute per transfer than larger A WG tiles, so they usually only help when the M dim of an individual GEMM problem is <= WG_M. Otherwise, they don't let us use the hardware as efficiently. A quick worked example follows below.
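The numbers here are illustrative, not measurements from this PR:

```cpp
#include <cstdio>

// M-occupancy of one group: M / (ceil(M / WG_M) * WG_M).
double m_occupancy(int M, int WG_M) {
  int tiles_m = (M + WG_M - 1) / WG_M;   // rows of WG tiles covering M
  return double(M) / (tiles_m * WG_M);
}

int main() {
  // Decode-like group: a small M suffers badly under a tall WG tile.
  std::printf("M=8,  WG_M=64 -> %.1f%%\n", 100 * m_occupancy(8, 64));  // 12.5%
  std::printf("M=8,  WG_M=8  -> %.1f%%\n", 100 * m_occupancy(8, 8));   // 100.0%
  std::printf("M=40, WG_M=64 -> %.1f%%\n", 100 * m_occupancy(40, 64)); // 62.5%
}
```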
Requirements
Please use IGC 2.20 or newer.
Please set the minimum GPU frequency to the maximum frequency value for benchmarking.
Build instructions
Please do not use `-DDPCPP_HOST_COMPILER=g++-13` for now; I'll later revise the code to make it compatible with g++ (it's related to the SYCL kernel launch).
cc @EikanWang @CaoZhongZ