[Example] Add CLC-pipelined 2-CTA GEMM example for sm100 #2169
ighoshsubho wants to merge 2 commits into tile-ai:main
Conversation
📝 Walkthrough

This PR adds a new pipelined CLC-staged GEMM kernel variant and replaces the benchmark harness.

Changes: pipelined CLC-staged GEMM kernel.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Producer as Producer Threads
    participant MMA as MMA Compute Threads
    participant Scheduler as Scheduler Threads
    participant Consumer as Consumer Threads
    participant Shared as Shared/Schedule State
    rect rgba(135, 206, 250, 0.5)
        note over Producer,Scheduler: CLC Stage s
        Producer->>Shared: wait schedule_finish[c_cons]
        Producer->>Shared: TMA load A, B
        Producer->>Shared: arrive schedule_arrive[s_cons]
        Scheduler->>Shared: wait prior_completion[s_prod]
        Scheduler->>Shared: multicast cancel & set tile_id[s_prod]
        MMA->>Shared: wait schedule_arrive[s_cons]
        MMA->>Shared: TCGen05 compute → TMEM outputs
        Consumer->>Shared: read tile_id[s_cons]
        Consumer->>Shared: wait schedule_arrive[s_cons]
        Consumer->>Shared: move TMEM → C
    end
    note over Producer,Consumer: Advance to next CLC stage (cyclic)
```
Estimated code review effort: 🎯 4 (Complex), ⏱️ ~45 minutes
Pre-merge checks: ✅ 4 passed, ❌ 1 failed (warning)
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run the pre-merge checks. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
🧹 Nitpick comments (1)
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py (1)
364-388: 💤 Low value. Benchmark harness LGTM; consider sweeping `clc_stages`.

Lambdas correctly capture `a`/`b` by default arg, and `base_args`/`group_size` are stable across the loop, so the benchmark closures are sound. One nit: `clc_stages` is hardcoded to 3 in two places, so users can't easily reproduce the clc=2/clc>3 numbers from the PR description without editing the script. A small sweep (or a `for clc in (2, 3, 4):` inside the size loop) would make this more useful as a tuning playground.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py` around lines 364 - 388, The benchmark currently hardcodes clc_stages=3 when calling gemm_clc_persistent_2cta_pipelined_clc and in its benchmark lambda; change this to sweep a small set (e.g., for clc in (2,3,4)) inside the M,N,K loop and run the call and its do_bench for each clc value so users can reproduce clc=2 / clc>3 results; update the two places referencing the literal 3 (the call to gemm_clc_persistent_2cta_pipelined_clc and the lambda passed to do_bench) to use the loop variable clc and include clc in the printed TFLOPS line to differentiate results.
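The "capture by default arg" point in the nitpick refers to a classic Python closure gotcha: lambdas created in a loop bind the loop variable late unless you snapshot it. A minimal standalone illustration (the string values here are placeholders, not names from the script):

```python
# Late binding: every closure reads the comprehension variable after the
# loop finishes, so all three return the final value.
late = [lambda: m for m in ("a", "b", "c")]
assert [f() for f in late] == ["c", "c", "c"]

# Default-arg capture: each closure snapshots the current value at
# definition time, which is what the benchmark lambdas rely on.
bound = [lambda m=m: m for m in ("a", "b", "c")]
assert [f() for f in bound] == ["a", "b", "c"]
```

The same pattern is why a `for clc in (2, 3, 4):` sweep would need `clc=clc` (or an equivalent default arg) in any lambda passed to the benchmark runner.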
Hi @ighoshsubho, thanks for your contribution! I'm the author of the original CLC kernel. I remember the original CLC kernel achieves ~1700 TFLOPS @ (8192, 8192, 8192) on B200. Besides, the torch result reported in this script is actually inaccurate. In fact, GEMM on B200 significantly suffers from power issues, so the 2nd kernel to run in the example will see severe performance degradation. You can validate this by running the torch kernel only (I remember it was about 1720 TFLOPS, not sure). Back to your kernel, could you please shed more light on the difference compared to the original one? Thanks!
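The power-throttling point above means that whichever kernel runs second in a single process inherits a hotter, down-clocked GPU. A hedged sketch of a fairer measurement pattern: warm up each candidate separately and take the median of repeated runs (the `bench` helper and the CPU stub workload are illustrative stand-ins, not the script's actual `do_bench` call):

```python
import statistics
import time

def bench(fn, warmup=3, iters=10):
    """Median latency in ms; warmup runs absorb clock-ramp effects."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)

# Stub workload standing in for a GEMM launch; on a real GPU you would
# also synchronize the device inside the timed region.
ms = bench(lambda: sum(i * i for i in range(10_000)))
assert ms > 0.0
```

Running each kernel in a separate process (or with a cooldown between candidates) goes further still, since median-of-runs cannot undo sustained thermal throttling.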
yeah sure, the point is to issue the next tile's `clusterlaunchcontrol.try_cancel` while the current tile is still being computed.
I will try with the CUPTI backend. Also, yeah, B200 does suffer on GEMMs due to power issues; I will run some more tests and share the results. I was getting >1700 TFLOPS for clc=3 a day ago on the same config. cc: @Rachmanino
I roughly understand your point. Thanks again!

After a couple of bench runs, it did clock near 1700 TFLOPS. Let me know if you try it on your side and find something different. cc: @Rachmanino
Adds `gemm_clc_persistent_2cta_pipelined_clc` next to the existing single-stage CLC kernel. It pipelines the CLC tile-id handshake across `clc_stages` slots so the next tile's `clusterlaunchcontrol.try_cancel` can issue while the current tile is still being computed.

The scheduler runs `clc_stages` ahead of the consumer. Slot s is written by scheduler iters s, s+clc_stages, ... and read by consumer iters s+1, s+1+clc_stages, ..., which are non-overlapping. The `schedule_finished` arrive count is set to 5 (consumer arrives only); the scheduler does not arrive on it. This breaks the circular dependency that would otherwise deadlock for clc_stages > 1.

Numbers on B200, bf16, baseline = single-stage CLC kernel. Run the example to reproduce; it sweeps clc_stages ∈ {2, 3, 4} per shape:
(All TFLOPS.) Pipelined wins at 16384³ both vs the baseline (+8%) and vs cuBLAS (+20%). At smaller shapes the single-stage scheduler already overlaps with compute well enough that pipelining doesn't pay off.
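For reference, the TFLOPS figures follow from 2·M·N·K FLOPs per GEMM (one multiply plus one add per inner-product term) divided by measured latency; the 0.647 ms latency below is an illustrative value chosen to land near the ~1700 TFLOPS mark at 8192³, not a number from the PR.

```python
def gemm_tflops(m, n, k, latency_ms):
    """2*M*N*K FLOPs over wall time, reported in TFLOPS."""
    return (2 * m * n * k) / (latency_ms * 1e-3) / 1e12

t = gemm_tflops(8192, 8192, 8192, 0.647)
assert 1690 < t < 1710  # ~1699 TFLOPS at 0.647 ms
```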