3-WG WS GEMM 2-Tile: Full-size fragments, correct & fast

## Summary

Successfully implemented 3-WG warp-specialized GEMM where each consumer WG computes a **full output tile** (not half-rows). One CTA covers 2 adjacent M-tiles with grid halved accordingly.

## Design

```
Grid: ceildiv(N, block_N) × ceildiv(M, 2*block_M)
      ↑ each CTA covers 2 M-tiles

CTA (threads=384):
  WG0 (producer): TMA load A1, A2, B (shared)
  WG1 (consumer 1): C[row_base_1] += A_shared_1 @ B_shared  (full block_M × block_N)
  WG2 (consumer 2): C[row_base_2] += A_shared_2 @ B_shared  (full block_M × block_N)
```

Key insight: **don't split a tile's rows between WGs — give each WG its own complete tile**. This avoids the half_m fragment bug and preserves full WGMMA efficiency.

## Results

| Config | max_diff | Status |
|--------|----------|--------|
| M=1024 N=1024 K=1024 | 0.0614 | ✅ |
| M=2048 N=2048 K=1024 | 0.0618 | ✅ |
| M=512 N=512 K=512 | 0.0313 | ✅ |

## Why This Works (vs previous failures)

| Approach | Fragment size | Status | Issue |
|----------|--------------|--------|-------|
| `T.ws(1, 2)` merged | block_M | ❌ | TileLang WGMMA lowering bug (#5) |
| Separate WGs, half_m fragments | half_M | ❌ | Fragment/WGMMA conflict at half size |
| Separate WGs, full block_M, same tile | block_M | ✅ | Works but redundant compute |
| Separate WGs, full block_M, split-KV | block_M | ✅ | Works but slow (4x shared mem) |
| **Separate WGs, full block_M, 2 tiles per CTA** | **block_M** | **✅** | **Correct & efficient** |

Each consumer uses the same full block_M × block_N fragment size that works with single-WG. The only difference from the 2-WG baseline is that the grid is halved and the producer loads 2 A-tiles + 1 shared B-tile.

## Implications for GQA Attention

Same pattern applies:
- Grid: `ceildiv(seq_len, 2*block_m)` instead of `ceildiv(seq_len, block_m)`
- WG1: full attention for Q[2*bx*block_m : (2*bx+1)*block_m]
- WG2: full attention for Q[(2*bx+1)*block_m : (2*bx+2)*block_m]  
- Both share K/V tiles from producer
- No combine needed — each consumer writes its own output independently
- Full block_m fragments everywhere

## File
`_test_ws_3wg_gemm_2tile.py`

## Related
- #4 — GQA WS parent issue
- #5 — `T.ws(1,2)` merged consumer bug
- #6 — Split-KV approach (correct but slow)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

3-WG WS GEMM 2-Tile: Full-size fragments, correct & fast #7

Summary

Design

Results

Why This Works (vs previous failures)

Implications for GQA Attention

File

Related

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Config	max_diff	Status
M=1024 N=1024 K=1024	0.0614	✅
M=2048 N=2048 K=1024	0.0618	✅
M=512 N=512 K=512	0.0313	✅

Approach	Fragment size	Status	Issue
`T.ws(1, 2)` merged	block_M	❌	TileLang WGMMA lowering bug (#5)
Separate WGs, half_m fragments	half_M	❌	Fragment/WGMMA conflict at half size
Separate WGs, full block_M, same tile	block_M	✅	Works but redundant compute
Separate WGs, full block_M, split-KV	block_M	✅	Works but slow (4x shared mem)
Separate WGs, full block_M, 2 tiles per CTA	block_M	✅	Correct & efficient

3-WG WS GEMM 2-Tile: Full-size fragments, correct & fast #7

Description

Summary

Design

Results

Why This Works (vs previous failures)

Implications for GQA Attention

File

Related

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions