[Example] Add CLC-pipelined 2-CTA GEMM example for sm100#2169

Open
ighoshsubho wants to merge 2 commits into tile-ai:main from ighoshsubho:examples/gemm_clc_pipelined

Conversation

@ighoshsubho

@ighoshsubho ighoshsubho commented May 8, 2026

Adds gemm_clc_persistent_2cta_pipelined_clc next to the existing single-stage CLC kernel. It pipelines the CLC tile-id handshake across clc_stages slots so the next tile's clusterlaunchcontrol.try_cancel can issue while the current tile is still being computed.

The scheduler runs clc_stages ahead of the consumer. Slot s is written by scheduler iterations s, s+clc_stages, ... and read by consumer iterations s+1, s+1+clc_stages, ..., so writes and reads never overlap. The schedule_finished arrive count is set to 5 (consumer arrivals only); the scheduler does not arrive on it. This breaks the circular dependency that would otherwise deadlock for clc_stages > 1.
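The slot schedule above can be checked with a tiny Python model (illustrative only, not TileLang code; clc_stages and the iteration count are arbitrary):

```python
# Toy model of the CLC slot schedule described above.
clc_stages = 3
iters = 12

def write_slot(k):
    return k % clc_stages        # scheduler writes this slot at iteration k

def read_slot(k):
    return (k - 1) % clc_stages  # consumer reads the slot written one iteration earlier

for s in range(clc_stages):
    writes = [k for k in range(iters) if write_slot(k) == s]
    reads = [k for k in range(1, iters) if read_slot(k) == s]
    # slot s is written at iters s, s+clc_stages, ... and read at s+1, s+1+clc_stages, ...
    assert writes == list(range(s, iters, clc_stages))
    assert reads == list(range(s + 1, iters, clc_stages))
    # every read happens exactly one iteration after a matching write: no overlap
    assert all(r - 1 in writes for r in reads)
```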

Numbers on B200, bf16, baseline = single-stage CLC kernel. Run the example to reproduce; it sweeps clc_stages ∈ {2, 3, 4} per shape:

M=N=K   base   pipe(2)   pipe(3)   pipe(4)   torch
4096    1478      1441      1355      1416    1437
8192    1579      1597      1386      1410    1380
16384   1375      1470      1565      1588    1421

(All numbers are TFLOPS.) Pipelining wins at 16384³: pipe(4) at 1588 beats both the baseline (1375, about +15%) and the torch/cuBLAS result (1421, about +12%). At smaller shapes the single-stage scheduler already overlaps with compute well enough that pipelining doesn't pay off.
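For reference, TFLOPS figures in tables like this follow from the standard GEMM flop count (a generic helper sketch, not code from the example):

```python
def gemm_tflops(m, n, k, latency_ms):
    # A GEMM performs 2*M*N*K floating-point operations (one multiply plus one
    # add per inner-product term); throughput = flops / seconds, scaled to 1e12.
    return 2 * m * n * k / (latency_ms * 1e-3) / 1e12

# e.g. a 4096^3 GEMM finishing in 1 ms sustains ~137.4 TFLOPS
print(round(gemm_tflops(4096, 4096, 4096, 1.0), 1))  # 137.4
```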

Summary by CodeRabbit

  • New Features

    • Added a configurable pipelined GEMM kernel with multi-stage scheduling for improved throughput.
  • Refactor

    • Replaced the example entrypoint to run multi-size correctness checks and full performance benchmarks.
  • Tests

    • Added automated correctness validation across multiple matrix sizes and a reference bfloat16 matmul for comparisons, including TFLOPS reporting against a PyTorch baseline.

@coderabbitai
Contributor

coderabbitai Bot commented May 8, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between f4bbefc and 48908f3.

📒 Files selected for processing (1)
  • examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py

📝 Walkthrough

Walkthrough

This PR adds a new pipelined CLC-staged GEMM kernel variant and replaces the benchmark harness. The new gemm_clc_persistent_2cta_pipelined_clc kernel generalizes CLC scheduling to support configurable pipeline stages with stage-indexed synchronization. A reference helper and multi-size benchmarking loop are introduced. Existing kernel comments are cleaned up without functional changes.

Changes

Pipelined CLC-Staged GEMM Kernel

Layer / File(s) Summary
New Kernel Allocations & Data Structures
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py
Function signature with clc_stages parameter; CLC barriers, synchronization, and state arrays sized by clc_stages.
Producer Path with Per-Stage Synchronization
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py
Producer threads compute per-iteration stage indices and use stage-specific barriers while performing TMA loads.
MMA Compute & CLC Scheduler Paths
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py
MMA threads synchronize on stage-specific schedule arrival and compute; scheduler threads operate per CLC stage with multicast cancellation and tile-id derivation.
Consumer/Epilogue Path with Per-Stage Tile Selection
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py
Consumer threads read per-stage tile-id, synchronize on schedule arrival, and move TMEM fragments into C output.
Reference Implementation & Benchmarking Harness
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py
New ref_program() helper and __main__ loop run correctness checks and TFLOPS benchmarking across multiple problem sizes.
Existing Kernel Comment Cleanup
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py
Comment annotations removed from four thread-partition branches in gemm_clc_persistent_2cta without logic changes.

Sequence Diagram(s)

sequenceDiagram
  participant Producer as Producer Threads
  participant MMA as MMA Compute Threads
  participant Scheduler as Scheduler Threads
  participant Consumer as Consumer Threads
  participant Shared as Shared/Schedule State

  rect rgba(135, 206, 250, 0.5)
  note over Producer,Scheduler: CLC Stage s
  Producer->>Shared: wait schedule_finish[c_cons]
  Producer->>Shared: TMA load A, B
  Producer->>Shared: arrive schedule_arrive[s_cons]
  Scheduler->>Shared: wait prior_completion[s_prod]
  Scheduler->>Shared: multicast cancel & set tile_id[s_prod]
  MMA->>Shared: wait schedule_arrive[s_cons]
  MMA->>Shared: TCGen05 compute → TMEM outputs
  Consumer->>Shared: read tile_id[s_cons]
  Consumer->>Shared: wait schedule_arrive[s_cons]
  Consumer->>Shared: move TMEM → C
  end

  note over Producer,Consumer: Advance to next CLC stage (cyclic)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • tile-ai/tilelang#2029: Introduces CLC-based GEMM scheduling that this PR extends with pipelined stages.

Suggested reviewers

  • LeiWang1999

Poem

🐰 A kernel blooms with stages, clear and true,
Per-CLC indices weave through and through,
Producers load, MMAs hum with might,
Schedulers pick tiles, consumers write—
Pipelined hops bring TFLOPS to light!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): the title clearly and specifically describes the main change: adding a CLC-pipelined 2-CTA GEMM example kernel for the sm100 architecture.
  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.



@github-actions

github-actions Bot commented May 8, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py (1)

Lines 364-388:

Benchmark harness LGTM; consider sweeping clc_stages.

Lambdas correctly capture a/b by default arg, and base_args/group_size are stable across the loop, so the benchmark closures are sound. One nit: clc_stages is hardcoded to 3 in two places, so users can't easily reproduce the clc=2 / clc>3 numbers from the PR description without editing the script. A small sweep (or a for clc in (2, 3, 4): inside the size loop) would make this more useful as a tuning playground.



📥 Commits

Reviewing files that changed from the base of the PR and between a797e51 and f4bbefc.

📒 Files selected for processing (1)
  • examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py

@Rachmanino
Collaborator

Rachmanino commented May 8, 2026

Hi @ighoshsubho, thanks for your contribution!

I'm the author of the original CLC kernel. As I recall, the original CLC kernel achieves ~1700 TFLOPS at (8192, 8192, 8192) on B200. Also, the torch result reported in this script is inaccurate: GEMM on B200 suffers significantly from power issues, so the second kernel to run in the example sees a severe performance degradation. You can validate this by running the torch kernel alone (I remember it was about 1720T, though I'm not sure).

Back to your kernel, could you please shed more light on the difference compared to the original one? Thanks!

@ighoshsubho
Author

ighoshsubho commented May 8, 2026

Back to your kernel, could you please shed more light on the difference compared to the original one? Thanks!

Yeah, sure. The point is to issue the next tile's clusterlaunchcontrol.try_cancel while the current tile is still computing, so the CLC handshake latency is hidden. The single-stage baseline only overlaps the CLC handshake with a single tile's compute; this version overlaps it with clc_stages tiles in flight.

  • The scheduler runs clc_stages - 1 ahead: at iter k it writes slot k % clc_stages; consumers (TMA / MMA / epilogue) at iter k read the slot the scheduler wrote at iter k - 1. So slot s is written at iters s, s + clc_stages, ... and read at iters s + 1, s + 1 + clc_stages, ..., which never overlap.
  • The schedule_finished arrive count is 5 (not 7); the scheduler does NOT arrive on it. Per cycle, the 5 consumer arrivals (producer 2 + MMA 1 + epilogue 2) flip the parity on their own. This breaks the circular dependency that would otherwise deadlock for clc_stages > 1.
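The arrive-count argument in the second bullet can be modeled with a toy barrier (a Python sketch of mbarrier-style phase semantics, not actual TileLang/PTX):

```python
class ToyBarrier:
    """Minimal model of an arrive-count barrier with phase (parity) semantics."""

    def __init__(self, expected):
        self.expected = expected
        self.count = 0
        self.phase = 0

    def arrive(self):
        self.count += 1
        if self.count == self.expected:
            self.count = 0
            self.phase ^= 1  # parity flips once all expected arrivals land

# schedule_finished expects 5 arrivals: producer 2 + MMA 1 + epilogue 2.
# The scheduler never arrives, so the consumers alone advance the phase each cycle.
bar = ToyBarrier(expected=5)
for _ in range(5):
    bar.arrive()
assert bar.phase == 1  # one full cycle of consumer arrivals flips the parity
```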

In fact, GEMM on B200 significantly suffers from power issues

I will try with the cupti backend. And yes, B200 does suffer on GEMMs due to power issues; I will run some more tests and share the results. I was getting >1700 for clc=3 a day ago on the same config.

cc: @Rachmanino

@Rachmanino
Collaborator

I roughly understand your point. Thanks again!

@ighoshsubho
Author

shape    torch   base   pipe(clc=2)   pipe(clc=3)   pipe(clc=4)
4096³     1514   1419          1425          1388          1404
8192³     1637   1657          1571          1563          1660
16384³    1564   1543          1672          1636          1582

After a couple of bench runs, it did clock near 1700. Let me know if you try it on your side and find something different.

cc: @Rachmanino
