[feature][Blackwell] Add SM120 float4_e2m1fn FP4 GEMM support. #2171
TerminusAkivili wants to merge 1 commit into tile-ai:main from …
Conversation
📝 Walkthrough

This PR implements SM120 (compute capability 12.0) FP4 (float4_e2m1fn) GEMM support across CUDA/TIR codegen, lowering, intrinsics, MMA dispatch, and layout/macro generation, and provides two runnable examples with host-side FP4 unpacking and validation.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: 4 passed, 1 failed (warning)
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run …. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Actionable comments posted: 3
🧹 Nitpick comments (1)

src/tl_templates/cuda/cuda_fp4.h (1)

166-187: ⚡ Quick win — verify register allocation for `fp4_e2_t values[64]` in device code.

The 64-element local array is constant-indexed throughout (`values[0]`–`values[63]`), so nvcc at `-O2` and above should scalar-replace it into registers. However, unlike the explicitly parameterized `make_fp4_e2_32_t`, which guarantees register-only arguments, register spilling to local memory is possible at lower optimization levels or under larger surrounding register pressure (spills are reported as "bytes spill stores/loads" in `ptxas -v` output). Consider adding a `__forceinline__` annotation to maximize inlining and scalar replacement at call sites.

Proposed annotation:

```diff
-template <typename... Args>
-TL_DEVICE fp4_e2_64_t make_fp4_e2_64_t(Args... args) {
+template <typename... Args>
+TL_DEVICE __forceinline__ fp4_e2_64_t make_fp4_e2_64_t(Args... args) {
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/tl_templates/cuda/cuda_fp4.h` around lines 166–187: the local array `fp4_e2_t values[64]` in `make_fp4_e2_64_t` may be spilled under some compile conditions; annotate the function to force inlining (e.g., add a `__forceinline__`/always-inline device attribute to `make_fp4_e2_64_t`) so nvcc can scalar-replace `values[0]`..`values[63]` into registers and inline the `make_fp4_e2_32_t` calls; update the declaration of `make_fp4_e2_64_t` accordingly (keeping `fp4_e2_t values[64]` and the existing `make_fp4_e2_32_t` usages unchanged).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/backend/cuda/codegen/codegen_cuda.cc`:
- Around lines 1973–2003: The FP4 padded shared-memory vector path (`IsFp4PaddedSharedStorage`, plus the code using `GetFp4PaddedSharedIndex` and the `byte_offset` lambda when constructing the reinterpret cast for `t.lanes()`) can incorrectly span the padded 16-element row boundary. Add a guard or split logic: either assert the logical base alignment (e.g., ensure `base % 16 == 0` for the requested load/store), or detect when the access crosses a 16-element row by computing the start and end logical indices (`base + offset` and `base + offset + t.lanes() - 1`) and comparing their 16-element row indices (`truncdiv(..., 16)`). If it crosses, split the operation into two row-aligned fragments (like the existing `t.lanes() == 32` two-fragment approach) and merge them; otherwise keep the current single contiguous byte reinterpretation. Apply the same fix to the other similar blocks identified around the other ranges mentioned (see the sketch after this list).
- Around lines 4428–4444: The allocator treats only scope == "local" as the path that emits local backing arrays, but FP4 fragments use the semantic storage name "local.fragment", so allocations for these still hit the unsupported-scope branch. Update the scope checks around `is_int4_scalar_local`, the FP4 `alignas(16)` branch, and the place that prints/omits the storage scope to treat "local.fragment" as equivalent to "local" (either normalize the scope to "local" earlier, or change conditions from `scope == "local"` to `(scope == "local" || scope == "local.fragment")`), ensuring `PrintStorageScope`/`PrintType` and the backing-array emission path handle FP4 fragments the same as regular local allocations (references: `is_int4_scalar_local`, `op->dtype.is_float4_e2m1fn()`, `PrintStorageScope`, `PrintType`, and the "local.fragment" semantic storage).
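To make the row-boundary check in the first comment concrete, here is a minimal Python sketch of the index arithmetic (the real fix belongs in the C++ codegen; `base`, `offset`, and `lanes` mirror the names used in the comment):

```python
def crosses_fp4_row(base: int, offset: int, lanes: int, row_elems: int = 16) -> bool:
    """Return True when a `lanes`-wide FP4 access starting at logical index
    `base + offset` spans two padded 16-element rows.

    Mirrors the suggested guard: compare the row index (truncdiv by 16) of
    the first and last logical element of the access."""
    start = base + offset
    end = start + lanes - 1
    return start // row_elems != end // row_elems

# A 32-lane access starting at logical index 8 crosses a row boundary and
# must be split into two row-aligned fragments, like the existing
# t.lanes() == 32 two-fragment path.
assert crosses_fp4_row(base=0, offset=8, lanes=32)
# A row-aligned 16-lane access stays within a single padded row.
assert not crosses_fp4_row(base=16, offset=0, lanes=16)
```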
In `@tilelang/cuda/intrinsics/macro/mma_macro_generator.py`:
- Around lines 121–124: The FP4 fast-path in mma_macro_generator.py sets `self.k_dim = 32` without respecting `self.chunk`, causing `micro_size_k` to exceed `chunk` when `chunk < 32`. Update the FP4 branch in the initializer (the block setting `self.k_dim`) to clamp `k_dim` by `self.chunk` (e.g., `self.k_dim = min(32, self.chunk)`) and add the same clamp/guard in the subclass override (the code around lines 873–877) so both places respect `chunk`; optionally emit a clear ValueError or assertion if `chunk` is below the required minimum, to fail early with a helpful message referencing the dtype and chunk size (a sketch follows below).
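A runnable sketch of the suggested clamp (the function name and the minimum-chunk guard are illustrative, not actual code from `mma_macro_generator.py`):

```python
def fp4_k_dim(chunk: int, native_k: int = 32) -> int:
    """Clamp the FP4 MMA k_dim by the user's chunk size.

    `native_k` is the K extent of the m16n8k32 instruction; when `chunk`
    along K is smaller, micro_size_k must not exceed it."""
    if chunk <= 0:
        raise ValueError(f"chunk must be positive for FP4 MMA, got {chunk}")
    return min(native_k, chunk)

assert fp4_k_dim(64) == 32  # large chunks keep the native k_dim
assert fp4_k_dim(16) == 16  # small chunks are respected
```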
📒 Files selected for processing (16)

- examples/gemm_fp4/example_gemm_a8w4_sm120.py
- examples/gemm_fp4/example_gemm_fp4_sm120.py
- src/backend/cuda/codegen/codegen_cuda.cc
- src/backend/cuda/codegen/codegen_cuda.h
- src/backend/cuda/op/copy.cc
- src/backend/cuda/op/copy_analysis.cc
- src/tl_templates/cuda/cuda_fp4.h
- src/tl_templates/cuda/gemm_mma.h
- src/tl_templates/cuda/instruction/mma.h
- src/tl_templates/cuda/ldsm.h
- src/transform/lower_ptx_async_copy.cc
- src/transform/ptx_async_copy_injector.h
- tilelang/cuda/intrinsics/layout/mma_layout.py
- tilelang/cuda/intrinsics/layout/utils.py
- tilelang/cuda/intrinsics/macro/mma_macro_generator.py
- tilelang/cuda/op/gemm/gemm_mma.py
TerminusAkivili force-pushed from a255e60 to 3e5823d, then from 3e5823d to 7f254a9.
Summary
This PR adds SM120 fragment-MMA GEMM support for `T.float4_e2m1fn`, including plain FP4 GEMM and explicit FP8 e4m3 / FP4 mixed GEMM.

Supported combinations:

- `T.float4_e2m1fn` x `T.float4_e2m1fn` -> `T.float32`
- `T.float8_e4m3fn` x `T.float4_e2m1fn` -> `T.float32`
- `T.float4_e2m1fn` x `T.float8_e4m3fn` -> `T.float32`

The TileLang-facing API stays dtype-semantic: kernels declare FP4 tensors as `T.float4_e2m1fn`. Packed byte storage is handled by lowering/codegen and by host-side example setup, not by using `uint8` as a GEMM dtype. A minimal kernel sketch follows below.
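For illustration, a minimal TileLang-style kernel sketch that declares FP4 operands by dtype alone, modeled on the repository's existing GEMM examples (block sizes, pipelining depth, and the exact helper signatures are assumptions, not code from this PR):

```python
import tilelang
import tilelang.language as T

def fp4_matmul(M, N, K, block_M=64, block_N=64, block_K=64):
    dtype, accum = "float4_e2m1fn", "float32"  # dtype-semantic: no uint8 in the kernel

    @T.prim_func
    def main(A: T.Tensor((M, K), dtype),
             B: T.Tensor((K, N), dtype),
             C: T.Tensor((M, N), accum)):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),
                      threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum)
            T.clear(C_local)
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=2):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                # Lowered to the SM120 m16n8k32 FP4 MMA on supported GPUs.
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])

    return main
```

The mixed FP8/FP4 variants would follow the same shape, with one operand declared with the `float8_e4m3fn` dtype instead.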
Guide

The SM120 FP4 path has three contracts that need to line up:

- the `b4x16_p64` `ldmatrix` path (`ldsm.h`, `codegen_cuda.cc`, `copy.cc`, `utils.py`, `mma_layout.py`)
- … (`codegen_cuda.cc`, `codegen_cuda.h`)
- the `m16n8k32` MMA for explicit FP4/FP8 dtype pairs (`instruction/mma.h`, `gemm_mma.h`, `mma_macro_generator.py`, `gemm_mma.py`)
The main implementation detail is the shared-memory layout: SM120 `b4x16_p64` uses packed FP4 bytes with a padded shared row layout. The copy and ldmatrix lowering paths therefore compute packed global offsets and padded shared offsets separately, while local fragments keep their declared names and types. A sketch of the split offset computation follows below.
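A Python sketch of the split offset computation. Packing two FP4 values per byte is standard; the padded row stride below is a labeled assumption, standing in for the actual `b4x16_p64` constants used by `GetFp4PaddedSharedIndex` and the layout helpers:

```python
def packed_global_byte_offset(logical_idx: int) -> int:
    # Two float4_e2m1fn elements are packed into each global-memory byte.
    return logical_idx // 2

def padded_shared_index(logical_idx: int,
                        row_elems: int = 16,
                        padded_row_elems: int = 24) -> int:
    # Hypothetical padding: each 16-element logical row is stored with extra
    # pad elements (e.g., to keep rows bank-conflict-free). The stride 24 is
    # an assumption, not the real b4x16_p64 constant.
    row, col = divmod(logical_idx, row_elems)
    return row * padded_row_elems + col

# Logical element 17 lives in global byte 8, but in shared memory it sits
# at padded index 25 (row 1, column 1) under the assumed stride.
assert packed_global_byte_offset(17) == 8
assert padded_shared_index(17) == 25
```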
Main changes

CUDA templates
- `src/tl_templates/cuda/ldsm.h`: `ptx_ldmatrix_b4x16_x{1,2,4}` helpers with architecture guard
- `src/tl_templates/cuda/instruction/mma.h`: `cute::SM120_16x8x32_TN` dispatch for FP4xFP4, FP8xFP4, and FP4xFP8 to FP32
- `src/tl_templates/cuda/gemm_mma.h`
- `src/tl_templates/cuda/cuda_fp4.h`

CUDA lowering
- `src/backend/cuda/codegen/codegen_cuda.cc`: `ptx_ldmatrix_b4x16_x{1,2,4}` for explicit `float4_e2m1fn` ldmatrix loads; `_packed` aliases
- `src/backend/cuda/op/copy.cc`
- `src/transform/lower_ptx_async_copy.cc`
- `src/backend/cuda/op/copy_analysis.cc` / `src/transform/ptx_async_copy_injector.h`

Python lowering
- `tilelang/cuda/intrinsics/layout/utils.py`: `float4_e2m1fn` support
- `tilelang/cuda/intrinsics/layout/mma_layout.py`
- `tilelang/cuda/intrinsics/macro/mma_macro_generator.py`: `m16n8k32` MMA granularity
- `tilelang/cuda/op/gemm/gemm_mma.py`

Examples
- `examples/gemm_fp4/example_gemm_fp4_sm120.py`
- `examples/gemm_fp4/example_gemm_a8w4_sm120.py`

Notes
- `uint8` is only a host storage/interoperability detail in the examples (see the unpacking sketch below).
- The new ldmatrix offset handling applies only to `float4_e2m1fn`; existing int4/uint4 ldmatrix offset behavior stays on the existing path.
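For reference, a self-contained sketch of the kind of host-side unpacking the examples perform: expanding packed `uint8` bytes, two e2m1 nibbles each, into floats for validation. The low-nibble-first layout is an assumption; the e2m1 decode follows the standard 1-sign/2-exponent/1-mantissa format:

```python
import numpy as np

def decode_e2m1(nibble: int) -> float:
    """Decode one float4_e2m1fn nibble (bit layout: s e1 e0 m)."""
    sign = -1.0 if (nibble >> 3) & 1 else 1.0
    exp = (nibble >> 1) & 0x3
    man = nibble & 0x1
    if exp == 0:                      # subnormal: 0.0 or 0.5
        return sign * man * 0.5
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

def unpack_fp4(packed: np.ndarray) -> np.ndarray:
    """Unpack a uint8 array holding two FP4 values per byte.

    Assumes the low nibble holds the even-indexed element."""
    lo = packed & 0xF
    hi = packed >> 4
    nibbles = np.stack([lo, hi], axis=-1).reshape(*packed.shape[:-1], -1)
    return np.vectorize(decode_e2m1)(nibbles).astype(np.float32)

# The eight non-negative e2m1 values:
assert [decode_e2m1(v) for v in range(8)] == [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```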