Skip to content

Cycle-level profiling: IntraWG/InterWG overlap breakdown (TileOPs vs FA3) #12

@superAngGao

Description

@superAngGao

Summary

Cycle-level profiling of the WS GQA forward kernel's IntraWGOverlap + InterWG scheduling pipeline, measured on H200 (locked freq, GPU 7). Shape: B=4 S=4096 H=64 Hkv=8 D=128 fp16, block_m=128 block_n=128, non-causal.

Goal: quantify where time goes inside the steady-state K-loop iteration, and compare with FA3.

Measurement methods

1. clock64 instrumentation (CUDA Core side)

Insert clock64() probes at key points in WG1's steady-state loop via TileLang T.call_extern("int64", "clk::read_clock"), accumulate via atomicAdd across all CTAs.

Probed regions in the original pipeline:

t0 → issue QK[N]  (8× wgmma_ss)
     rescale O
     barrier_wait(v_full)
     issue PV[N-1] (8× wgmma_rs)
t1 → wait_wgmma<1>  (QK done)
t2 → softmax (reduce_max, exp2f, reduce_sum, rescale ls)
t3 → wait_wgmma<0>  (PV done)
t4

2. NCU metrics (Tensor Core utilization)

ncu --kernel-name regex:"main_kernel" \
  --metrics sm__pipe_tensor_op_hmma_cycles_active.avg.pct_of_peak_sustained_elapsed,\
            sm__pipe_tensor_op_hmma_cycles_active.avg,\
            sm__cycles_elapsed.avg \
  --launch-skip 5 --launch-count 1 python3 bench.py

3. In-pipeline PV measurement

Move wait_wgmma<0> before softmax (kills IntraWGOverlap for measurement only) to directly time PV execution on TC:

issue QK → issue PV → wait<1> → wait<0> → softmax
                                 ↑ TC PV = this wait

TileOPs results (threadbind kernel, _test_ws_fa3_v2_threadbind.py)

Full pipeline (clock64)

Region Cycles %
CC issue (QK+rescale+waitV+PV) 657 31.7%
wait<1> (QK done on TC) 59 2.9%
softmax 1347 65.1%
wait<0> (PV residual) 7 0.3%
TOTAL 2070 100%

wait<0> ≈ 0 → PV finishes before softmax ends → softmax is the pacing bottleneck.

TC GEMM times

Metric Value Source
TC utilization 65% NCU hmma_cycles_active.pct_of_peak_sustained_elapsed
TC PV (wgmma_rs) 450 cyc In-pipeline wait<0> measurement
TC QK (wgmma_ss) ≈222 cyc Derived: (2070 × 0.65)/2 − 450
TC per WG (QK+PV) ≈672 cyc
TC idle per iter ≈726 cyc During softmax, no WGMMA issued

NCU stall breakdown

Stall Ratio
long_scoreboard 3.09
wait 1.57
short_scoreboard 0.32
math_pipe_throttle 0.02

Timeline diagram

IntraWG timeline

Three rows: WG1 CC, WG2 CC, shared Tensor Core.

  • Red (softmax) dominates CC time at 65%
  • TC has 726-cycle idle gap each iteration while both WGs do softmax
  • Scheduler barrier (yellow arrows) serializes WG1/WG2 WGMMA issue

Key findings

  1. Softmax is the bottleneck (65% of loop, 1347 cycles). TC is 65% utilized with 35% idle time.
  2. IntraWGOverlap works correctly: PV GEMM executes on TC in parallel with softmax on CC. wait<0> residual ≈ 0.
  3. InterWG overlap works: scheduler barrier alternates WGMMA issue between WG1/WG2. One WG's softmax overlaps with the other's WGMMA.
  4. Softmax bottleneck comes from AllReduce (cross-warp reduce_max + reduce_sum with named barrier sync).
  5. long_scoreboard = 3.09 (vs FA3's 0.19) — this is the K/V TMA bandwidth gap, separate from the softmax bottleneck. FA3 uses cluster multicast TMA to halve this.

TODO: FA3 comparison

Run the same NCU + clock analysis on FA3's kernel for the same shape to get:

  • FA3 TC utilization
  • FA3 softmax time
  • FA3 stall breakdown
  • Side-by-side comparison to identify where the remaining 6-16% gap comes from

Files

  • _bench_clock_intra_wg.py — full pipeline clock64 instrumentation
  • _bench_clock_tc_gemm.py — no-softmax TC GEMM measurement (dual WG)
  • _bench_clock_tc_single_wg.py — single WG TC GEMM (no contention)
  • _bench_clock_tc_in_pipeline.py — in-pipeline PV measurement (wait<0> before softmax)
  • _plot_intra_wg_timeline.py — timeline visualization
  • _intra_wg_timeline.png — output diagram

All on branch fix/ws-fa3-v2-epilogue-fence.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions