[Example] Add CLC-pipelined 2-CTA GEMM example for sm100#2169

Open
ighoshsubho wants to merge 2 commits into tile-ai:main from ighoshsubho:examples/gemm_clc_pipelined

Conversation

@ighoshsubho

@ighoshsubho ighoshsubho commented May 8, 2026

Adds gemm_clc_persistent_2cta_pipelined_clc next to the existing single-stage CLC kernel. It pipelines the CLC tile-id handshake across clc_stages slots so the next tile's clusterlaunchcontrol.try_cancel can issue while the current tile is still being computed.

The scheduler runs clc_stages ahead of the consumer. Slot s is written by scheduler iterations s, s+clc_stages, ... and read by consumer iterations s+1, s+1+clc_stages, ..., so writes and reads never overlap. The schedule_finished arrive count is set to 5 (consumer arrivals only); the scheduler does not arrive on it. This breaks the circular dependency that would otherwise deadlock for clc_stages > 1.
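The slot schedule above can be checked with a tiny Python model (illustrative only, not TileLang code; clc_stages and the iteration count are arbitrary):

```python
# Toy model of the CLC slot schedule described above.
clc_stages = 3
iters = 12

def write_slot(k):
    return k % clc_stages        # scheduler writes this slot at iteration k

def read_slot(k):
    return (k - 1) % clc_stages  # consumer reads the slot written one iteration earlier

for s in range(clc_stages):
    writes = [k for k in range(iters) if write_slot(k) == s]
    reads = [k for k in range(1, iters) if read_slot(k) == s]
    # slot s is written at iters s, s+clc_stages, ... and read at s+1, s+1+clc_stages, ...
    assert writes == list(range(s, iters, clc_stages))
    assert reads == list(range(s + 1, iters, clc_stages))
    # every read happens exactly one iteration after a matching write: no overlap
    assert all(r - 1 in writes for r in reads)
```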

Numbers on B200, bf16, baseline = single-stage CLC kernel. Run the example to reproduce; it sweeps clc_stages ∈ {2, 3, 4} per shape:

M=N=K   base   pipe(2)   pipe(3)   pipe(4)   torch
4096    1478      1441      1355      1416    1437
8192    1579      1597      1386      1410    1380
16384   1375      1470      1565      1588    1421

(All numbers are TFLOPS.) Pipelining wins at 16384³: pipe(4) at 1588 beats both the baseline (1375, about +15%) and the torch/cuBLAS result (1421, about +12%). At smaller shapes the single-stage scheduler already overlaps with compute well enough that pipelining doesn't pay off.
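For reference, TFLOPS figures in tables like this follow from the standard GEMM flop count (a generic helper sketch, not code from the example):

```python
def gemm_tflops(m, n, k, latency_ms):
    # A GEMM performs 2*M*N*K floating-point operations (one multiply plus one
    # add per inner-product term); throughput = flops / seconds, scaled to 1e12.
    return 2 * m * n * k / (latency_ms * 1e-3) / 1e12

# e.g. a 4096^3 GEMM finishing in 1 ms sustains ~137.4 TFLOPS
print(round(gemm_tflops(4096, 4096, 4096, 1.0), 1))  # 137.4
```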

Summary by CodeRabbit

  • New Features

    • Added a configurable pipelined GEMM kernel with multi-stage scheduling for improved throughput.
  • Refactor

    • Replaced the example entrypoint to run multi-size correctness checks and full performance benchmarks.
  • Tests

    • Added automated correctness validation across multiple matrix sizes and a reference bfloat16 matmul for comparisons, including TFLOPS reporting against a PyTorch baseline.

@coderabbitai
Contributor

coderabbitai Bot commented May 8, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between f4bbefc and 48908f3.

📒 Files selected for processing (1)
  • examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py

📝 Walkthrough

Walkthrough

This PR adds a new pipelined CLC-staged GEMM kernel variant and replaces the benchmark harness. The new gemm_clc_persistent_2cta_pipelined_clc kernel generalizes CLC scheduling to support configurable pipeline stages with stage-indexed synchronization. A reference helper and multi-size benchmarking loop are introduced. Existing kernel comments are cleaned up without functional changes.

Changes

Pipelined CLC-Staged GEMM Kernel

Layer / File(s) Summary
New Kernel Allocations & Data Structures
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py
Function signature with clc_stages parameter; CLC barriers, synchronization, and state arrays sized by clc_stages.
Producer Path with Per-Stage Synchronization
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py
Producer threads compute per-iteration stage indices and use stage-specific barriers while performing TMA loads.
MMA Compute & CLC Scheduler Paths
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py
MMA threads synchronize on stage-specific schedule arrival and compute; scheduler threads operate per CLC stage with multicast cancellation and tile-id derivation.
Consumer/Epilogue Path with Per-Stage Tile Selection
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py
Consumer threads read per-stage tile-id, synchronize on schedule arrival, and move TMEM fragments into C output.
Reference Implementation & Benchmarking Harness
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py
New ref_program() helper and __main__ loop run correctness checks and TFLOPS benchmarking across multiple problem sizes.
Existing Kernel Comment Cleanup
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py
Comment annotations removed from four thread-partition branches in gemm_clc_persistent_2cta without logic changes.

Sequence Diagram(s)

sequenceDiagram
  participant Producer as Producer Threads
  participant MMA as MMA Compute Threads
  participant Scheduler as Scheduler Threads
  participant Consumer as Consumer Threads
  participant Shared as Shared/Schedule State

  rect rgba(135, 206, 250, 0.5)
  note over Producer,Scheduler: CLC Stage s
  Producer->>Shared: wait schedule_finish[c_cons]
  Producer->>Shared: TMA load A, B
  Producer->>Shared: arrive schedule_arrive[s_cons]
  Scheduler->>Shared: wait prior_completion[s_prod]
  Scheduler->>Shared: multicast cancel & set tile_id[s_prod]
  MMA->>Shared: wait schedule_arrive[s_cons]
  MMA->>Shared: TCGen05 compute → TMEM outputs
  Consumer->>Shared: read tile_id[s_cons]
  Consumer->>Shared: wait schedule_arrive[s_cons]
  Consumer->>Shared: move TMEM → C
  end

  note over Producer,Consumer: Advance to next CLC stage (cyclic)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • tile-ai/tilelang#2029: Introduces CLC-based GEMM scheduling that this PR extends with pipelined stages.

Suggested reviewers

  • LeiWang1999

Poem

🐰 A kernel blooms with stages, clear and true,
Per-CLC indices weave through and through,
Producers load, MMAs hum with might,
Schedulers pick tiles, consumers write—
Pipelined hops bring TFLOPS to light!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): the title clearly and specifically describes the main change: adding a CLC-pipelined 2-CTA GEMM example kernel for the sm100 architecture.
  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.



@github-actions

github-actions Bot commented May 8, 2026

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py (1)

Lines 364-388:

Benchmark harness LGTM; consider sweeping clc_stages.

Lambdas correctly capture a/b by default arg, and base_args/group_size are stable across the loop, so the benchmark closures are sound. One nit: clc_stages is hardcoded to 3 in two places, so users can't easily reproduce the clc=2 / clc>3 numbers from the PR description without editing the script. A small sweep (or a for clc in (2, 3, 4): inside the size loop) would make this more useful as a tuning playground.



📥 Commits

Reviewing files that changed from the base of the PR and between a797e51 and f4bbefc.

📒 Files selected for processing (1)
  • examples/gemm_sm100/gemm_tcgen5mma_ws_clc.py

@Rachmanino
Collaborator

Rachmanino commented May 8, 2026

Hi @ighoshsubho, thanks for your contribution!

I'm the author of the original CLC kernel. As I recall, the original CLC kernel achieves ~1700 TFLOPS at (8192, 8192, 8192) on B200. Also, the torch result reported in this script is inaccurate: GEMM on B200 suffers significantly from power issues, so the second kernel to run in the example sees a severe performance degradation. You can validate this by running the torch kernel alone (I remember it was about 1720T, though I'm not sure).

Back to your kernel, could you please shed more light on the difference compared to the original one? Thanks!

@ighoshsubho
Author

ighoshsubho commented May 8, 2026

Back to your kernel, could you please shed more light on the difference compared to the original one? Thanks!

Yeah, sure. The point is to issue the next tile's clusterlaunchcontrol.try_cancel while the current tile is still computing, so the CLC handshake latency is hidden. The single-stage baseline only overlaps the CLC handshake with a single tile's compute; this version overlaps it with clc_stages tiles in flight.

  • The scheduler runs clc_stages - 1 ahead: at iter k it writes slot k % clc_stages; consumers (TMA / MMA / epilogue) at iter k read the slot the scheduler wrote at iter k - 1. So slot s is written at iters s, s + clc_stages, ... and read at iters s + 1, s + 1 + clc_stages, ..., which never overlap.
  • The schedule_finished arrive count is 5 (not 7); the scheduler does NOT arrive on it. Per cycle, the 5 consumer arrivals (producer 2 + MMA 1 + epilogue 2) flip the parity on their own. This breaks the circular dependency that would otherwise deadlock for clc_stages > 1.
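The arrive-count argument in the second bullet can be modeled with a toy barrier (a Python sketch of mbarrier-style phase semantics, not actual TileLang/PTX):

```python
class ToyBarrier:
    """Minimal model of an arrive-count barrier with phase (parity) semantics."""

    def __init__(self, expected):
        self.expected = expected
        self.count = 0
        self.phase = 0

    def arrive(self):
        self.count += 1
        if self.count == self.expected:
            self.count = 0
            self.phase ^= 1  # parity flips once all expected arrivals land

# schedule_finished expects 5 arrivals: producer 2 + MMA 1 + epilogue 2.
# The scheduler never arrives, so the consumers alone advance the phase each cycle.
bar = ToyBarrier(expected=5)
for _ in range(5):
    bar.arrive()
assert bar.phase == 1  # one full cycle of consumer arrivals flips the parity
```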

In fact, GEMM on B200 significantly suffers from power issues

I will try with the cupti backend. And yes, B200 does suffer on GEMMs due to power issues; I will run some more tests and share the results. I was getting >1700 for clc=3 a day ago on the same config.

cc: @Rachmanino

@Rachmanino
Collaborator

I roughly understand your point. Thanks again!

@ighoshsubho
Author

shape    torch   base   pipe(clc=2)   pipe(clc=3)   pipe(clc=4)
4096³     1514   1419          1425          1388          1404
8192³     1637   1657          1571          1563          1660
16384³    1564   1543          1672          1636          1582

After a couple of bench runs, it did clock near 1700. Let me know if you try it on your side and find something different.

cc: @Rachmanino
