Summary
Cycle-level profiling of the WS GQA forward kernel's IntraWGOverlap + InterWG scheduling pipeline, measured on H200 (locked freq, GPU 7). Shape: B=4 S=4096 H=64 Hkv=8 D=128 fp16, block_m=128 block_n=128, non-causal.
Goal: quantify where time goes inside the steady-state K-loop iteration, and compare with FA3.
Measurement methods
1. clock64 instrumentation (CUDA Core side)
Insert clock64() probes at key points in WG1's steady-state loop via TileLang T.call_extern("int64", "clk::read_clock"), accumulate via atomicAdd across all CTAs.
Probed regions in the original pipeline:
t0 → issue QK[N] (8× wgmma_ss)
rescale O
barrier_wait(v_full)
issue PV[N-1] (8× wgmma_rs)
t1 → wait_wgmma<1> (QK done)
t2 → softmax (reduce_max, exp2f, reduce_sum, rescale ls)
t3 → wait_wgmma<0> (PV done)
t4
2. NCU metrics (Tensor Core utilization)
ncu --kernel-name regex:"main_kernel" \
--metrics sm__pipe_tensor_op_hmma_cycles_active.avg.pct_of_peak_sustained_elapsed,\
sm__pipe_tensor_op_hmma_cycles_active.avg,\
sm__cycles_elapsed.avg \
--launch-skip 5 --launch-count 1 python3 bench.py
3. In-pipeline PV measurement
Move wait_wgmma<0> before softmax (kills IntraWGOverlap for measurement only) to directly time PV execution on TC:
issue QK → issue PV → wait<1> → wait<0> → softmax
↑ TC PV = this wait
TileOPs results (threadbind kernel, _test_ws_fa3_v2_threadbind.py)
Full pipeline (clock64)
| Region |
Cycles |
% |
| CC issue (QK+rescale+waitV+PV) |
657 |
31.7% |
| wait<1> (QK done on TC) |
59 |
2.9% |
| softmax |
1347 |
65.1% |
| wait<0> (PV residual) |
7 |
0.3% |
| TOTAL |
2070 |
100% |
wait<0> ≈ 0 → PV finishes before softmax ends → softmax is the pacing bottleneck.
TC GEMM times
| Metric |
Value |
Source |
| TC utilization |
65% |
NCU hmma_cycles_active.pct_of_peak_sustained_elapsed |
| TC PV (wgmma_rs) |
450 cyc |
In-pipeline wait<0> measurement |
| TC QK (wgmma_ss) |
≈222 cyc |
Derived: (2070 × 0.65)/2 − 450 |
| TC per WG (QK+PV) |
≈672 cyc |
|
| TC idle per iter |
≈726 cyc |
During softmax, no WGMMA issued |
NCU stall breakdown
| Stall |
Ratio |
| long_scoreboard |
3.09 |
| wait |
1.57 |
| short_scoreboard |
0.32 |
| math_pipe_throttle |
0.02 |
Timeline diagram

Three rows: WG1 CC, WG2 CC, shared Tensor Core.
- Red (softmax) dominates CC time at 65%
- TC has 726-cycle idle gap each iteration while both WGs do softmax
- Scheduler barrier (yellow arrows) serializes WG1/WG2 WGMMA issue
Key findings
- Softmax is the bottleneck (65% of loop, 1347 cycles). TC is 65% utilized with 35% idle time.
- IntraWGOverlap works correctly: PV GEMM executes on TC in parallel with softmax on CC. wait<0> residual ≈ 0.
- InterWG overlap works: scheduler barrier alternates WGMMA issue between WG1/WG2. One WG's softmax overlaps with the other's WGMMA.
- Softmax bottleneck comes from
AllReduce (cross-warp reduce_max + reduce_sum with named barrier sync).
- long_scoreboard = 3.09 (vs FA3's 0.19) — this is the K/V TMA bandwidth gap, separate from the softmax bottleneck. FA3 uses cluster multicast TMA to halve this.
TODO: FA3 comparison
Run the same NCU + clock analysis on FA3's kernel for the same shape to get:
- FA3 TC utilization
- FA3 softmax time
- FA3 stall breakdown
- Side-by-side comparison to identify where the remaining 6-16% gap comes from
Files
_bench_clock_intra_wg.py — full pipeline clock64 instrumentation
_bench_clock_tc_gemm.py — no-softmax TC GEMM measurement (dual WG)
_bench_clock_tc_single_wg.py — single WG TC GEMM (no contention)
_bench_clock_tc_in_pipeline.py — in-pipeline PV measurement (wait<0> before softmax)
_plot_intra_wg_timeline.py — timeline visualization
_intra_wg_timeline.png — output diagram
All on branch fix/ws-fa3-v2-epilogue-fence.
Summary
Cycle-level profiling of the WS GQA forward kernel's IntraWGOverlap + InterWG scheduling pipeline, measured on H200 (locked freq, GPU 7). Shape:
B=4 S=4096 H=64 Hkv=8 D=128 fp16, block_m=128 block_n=128, non-causal.Goal: quantify where time goes inside the steady-state K-loop iteration, and compare with FA3.
Measurement methods
1. clock64 instrumentation (CUDA Core side)
Insert
clock64()probes at key points in WG1's steady-state loop via TileLangT.call_extern("int64", "clk::read_clock"), accumulate viaatomicAddacross all CTAs.Probed regions in the original pipeline:
2. NCU metrics (Tensor Core utilization)
ncu --kernel-name regex:"main_kernel" \ --metrics sm__pipe_tensor_op_hmma_cycles_active.avg.pct_of_peak_sustained_elapsed,\ sm__pipe_tensor_op_hmma_cycles_active.avg,\ sm__cycles_elapsed.avg \ --launch-skip 5 --launch-count 1 python3 bench.py3. In-pipeline PV measurement
Move
wait_wgmma<0>before softmax (kills IntraWGOverlap for measurement only) to directly time PV execution on TC:TileOPs results (threadbind kernel,
_test_ws_fa3_v2_threadbind.py)Full pipeline (clock64)
wait<0> ≈ 0 → PV finishes before softmax ends → softmax is the pacing bottleneck.
TC GEMM times
hmma_cycles_active.pct_of_peak_sustained_elapsedNCU stall breakdown
Timeline diagram
Three rows: WG1 CC, WG2 CC, shared Tensor Core.
Key findings
AllReduce(cross-warp reduce_max + reduce_sum with named barrier sync).TODO: FA3 comparison
Run the same NCU + clock analysis on FA3's kernel for the same shape to get:
Files
_bench_clock_intra_wg.py— full pipeline clock64 instrumentation_bench_clock_tc_gemm.py— no-softmax TC GEMM measurement (dual WG)_bench_clock_tc_single_wg.py— single WG TC GEMM (no contention)_bench_clock_tc_in_pipeline.py— in-pipeline PV measurement (wait<0> before softmax)_plot_intra_wg_timeline.py— timeline visualization_intra_wg_timeline.png— output diagramAll on branch
fix/ws-fa3-v2-epilogue-fence.