Cycle-level profiling: IntraWG/InterWG overlap breakdown (TileOPs vs FA3)

## Summary

Cycle-level profiling of the WS GQA forward kernel's IntraWGOverlap + InterWG scheduling pipeline, measured on H200 (locked freq, GPU 7). Shape: `B=4 S=4096 H=64 Hkv=8 D=128 fp16, block_m=128 block_n=128, non-causal`.

Goal: quantify where time goes inside the steady-state K-loop iteration, and compare with FA3.

## Measurement methods

### 1. clock64 instrumentation (CUDA Core side)

Insert `clock64()` probes at key points in WG1's steady-state loop via TileLang `T.call_extern("int64", "clk::read_clock")`, then accumulate the per-region deltas with `atomicAdd` across all CTAs.

Probed regions in the original pipeline:
```
t0 → issue QK[N]  (8× wgmma_ss)
     rescale O
     barrier_wait(v_full)
     issue PV[N-1] (8× wgmma_rs)
t1 → wait_wgmma<1>  (QK done)
t2 → softmax (reduce_max, exp2f, reduce_sum, rescale ls)
t3 → wait_wgmma<0>  (PV done)
t4
```
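A minimal host-side sketch of how the five probe timestamps map onto the four regions reported in the results. The names `region_deltas`, `mean_cycles`, and the region keys are hypothetical and not taken from the benchmark scripts. The example timestamps are illustrative, chosen only so the deltas match the reported averages.

```python
# Region order follows the probe diagram: t0->t1, t1->t2, t2->t3, t3->t4.
# All names here are hypothetical; the real scripts may be organized differently.
REGIONS = ("cc_issue", "wait_qk", "softmax", "wait_pv")


def region_deltas(timestamps):
    """Split five probe timestamps [t0, t1, t2, t3, t4] into four region durations."""
    if len(timestamps) != len(REGIONS) + 1:
        raise ValueError("expected five timestamps t0..t4")
    return {
        name: timestamps[i + 1] - timestamps[i]
        for i, name in enumerate(REGIONS)
    }


def mean_cycles(accumulated, num_samples):
    """Average atomicAdd-accumulated cycle sums over the number of recorded samples."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return {name: total / num_samples for name, total in accumulated.items()}


# Illustrative timestamps only (not raw measurements).
example = region_deltas([0, 657, 716, 2063, 2070])
print(example)
```

Here `num_samples` would be the number of (CTA, iteration) pairs that contributed to each accumulator, which is an assumption about how the sums are normalized.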

### 2. NCU metrics (Tensor Core utilization)

```bash
ncu --kernel-name regex:"main_kernel" \
  --metrics sm__pipe_tensor_op_hmma_cycles_active.avg.pct_of_peak_sustained_elapsed,\
            sm__pipe_tensor_op_hmma_cycles_active.avg,\
            sm__cycles_elapsed.avg \
  --launch-skip 5 --launch-count 1 python3 bench.py
```
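The two raw metrics collected alongside the percentage allow a quick sanity check. The sketch below assumes the tensor pipe's peak sustained rate is one active cycle per elapsed cycle, so that the percentage is approximately `active / elapsed`. The function name and the input values are placeholders, not measurements from this run.

```python
def tc_utilization_pct(hmma_cycles_active_avg, cycles_elapsed_avg):
    """Approximate tensor-pipe utilization (%) from the two raw NCU averages.

    Assumes peak sustained activity is one active cycle per elapsed cycle.
    """
    if cycles_elapsed_avg <= 0:
        raise ValueError("cycles_elapsed_avg must be positive")
    return 100.0 * hmma_cycles_active_avg / cycles_elapsed_avg


# Placeholder inputs for illustration only.
pct = tc_utilization_pct(hmma_cycles_active_avg=650_000.0, cycles_elapsed_avg=1_000_000.0)
print(f"TC utilization ~ {pct:.1f}%")
```

If the value computed this way diverges noticeably from the reported `pct_of_peak_sustained_elapsed`, the one-cycle-per-cycle assumption does not hold for this metric and the reported percentage should be trusted instead.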

### 3. In-pipeline PV measurement

Move `wait_wgmma<0>` before softmax (kills IntraWGOverlap for measurement only) to directly time PV execution on TC:
```
issue QK → issue PV → wait<1> → wait<0> → softmax
                                 ↑ TC PV = this wait
```

## TileOPs results (threadbind kernel, `_test_ws_fa3_v2_threadbind.py`)

### Full pipeline (clock64)

| Region | Cycles | % |
|---|---|---|
| CC issue (QK+rescale+waitV+PV) | 657 | 31.7% |
| wait\<1\> (QK done on TC) | 59 | 2.9% |
| **softmax** | **1347** | **65.1%** |
| wait\<0\> (PV residual) | 7 | 0.3% |
| **TOTAL** | **2070** | 100% |

wait\<0\> ≈ 0 → **PV finishes before softmax ends → softmax is the pacing bottleneck.**
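The percentage column can be reproduced from the cycle counts with a few lines of Python. This is only a check of the table's arithmetic; the dictionary keys are shorthand labels for the table rows.

```python
# Per-region cycle counts from the clock64 table above.
cycles = {
    "CC issue": 657,
    "wait<1>": 59,
    "softmax": 1347,
    "wait<0>": 7,
}

total = sum(cycles.values())
shares = {name: 100.0 * count / total for name, count in cycles.items()}

for name, pct in shares.items():
    print(f"{name:10s} {cycles[name]:5d} cyc  {pct:5.1f}%")
print(f"{'TOTAL':10s} {total:5d} cyc")
```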

### TC GEMM times

| Metric | Value | Source |
|---|---|---|
| TC utilization | **65%** | NCU `hmma_cycles_active.pct_of_peak_sustained_elapsed` |
| TC PV (wgmma_rs) | **450 cyc** | In-pipeline wait\<0\> measurement |
| TC QK (wgmma_ss) | **≈222 cyc** | Derived: (2070 × 0.65)/2 − 450 |
| TC per WG (QK+PV) | **≈672 cyc** | |
| TC idle per iter | **≈726 cyc** | During softmax, no WGMMA issued |
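The derived rows follow from the measured loop time, the NCU utilization, and the in-pipeline PV time. The sketch below redoes that arithmetic, assuming the division by 2 reflects two consumer warpgroups sharing the Tensor Core (as in the three-row timeline). Variable names are illustrative.

```python
total_cycles = 2070   # steady-state iteration, from the clock64 table
tc_util = 0.65        # NCU tensor-pipe utilization
tc_pv = 450           # in-pipeline wait<0> measurement
num_wgs = 2           # assumption: two consumer WGs share the TC

tc_busy = total_cycles * tc_util      # TC-active cycles per iteration (both WGs)
tc_per_wg = tc_busy / num_wgs         # QK + PV for one WG
tc_qk = tc_per_wg - tc_pv             # QK alone, by subtraction
tc_idle = total_cycles - tc_busy      # cycles with no WGMMA executing

print(f"TC per WG ~ {tc_per_wg:.2f} cyc, TC QK ~ {tc_qk:.2f} cyc, TC idle ~ {tc_idle:.2f} cyc")
```

Without intermediate rounding the idle time comes out near 724.5 cycles. The table's ≈726 follows from first truncating the per-WG time to 672 and then computing 2070 − 2 × 672.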

### NCU stall breakdown

| Stall | Ratio |
|---|---|
| long_scoreboard | 3.09 |
| wait | 1.57 |
| short_scoreboard | 0.32 |
| math_pipe_throttle | 0.02 |

### Timeline diagram

![IntraWG timeline](_intra_wg_timeline.png)

Three rows: WG1 CC, WG2 CC, shared Tensor Core.
- Red (softmax) dominates CC time at 65%
- TC has a ≈726-cycle idle gap each iteration while both WGs do softmax
- Scheduler barrier (yellow arrows) serializes WG1/WG2 WGMMA issue

## Key findings

1. **Softmax is the bottleneck** (1347 of 2070 cycles, 65.1% of the loop). Separately, NCU reports the TC as 65% utilized, leaving 35% idle time.
2. **IntraWGOverlap works correctly**: PV GEMM executes on TC in parallel with softmax on CC. wait\<0\> residual ≈ 0.
3. **InterWG overlap works**: scheduler barrier alternates WGMMA issue between WG1/WG2. One WG's softmax overlaps with the other's WGMMA.
4. **Softmax bottleneck comes from `AllReduce`** (cross-warp reduce_max + reduce_sum with named barrier sync).
5. **long_scoreboard = 3.09** (vs FA3's 0.19) — this is the K/V TMA bandwidth gap, separate from the softmax bottleneck. FA3 uses cluster multicast TMA to halve this.

## TODO: FA3 comparison

Run the same NCU + clock analysis on FA3's kernel for the same shape to get:
- FA3 TC utilization
- FA3 softmax time
- FA3 stall breakdown
- Side-by-side comparison to identify where the remaining 6-16% gap comes from

## Files

- `_bench_clock_intra_wg.py` — full pipeline clock64 instrumentation
- `_bench_clock_tc_gemm.py` — no-softmax TC GEMM measurement (dual WG)
- `_bench_clock_tc_single_wg.py` — single WG TC GEMM (no contention)
- `_bench_clock_tc_in_pipeline.py` — in-pipeline PV measurement (wait\<0\> before softmax)
- `_plot_intra_wg_timeline.py` — timeline visualization
- `_intra_wg_timeline.png` — output diagram

All on branch `fix/ws-fa3-v2-epilogue-fence`.
