| Operator | Impl | Bottleneck | Best Speedup | Status | Details |
|---|---|---|---|---|---|
| topk_softmax | HIP | Sync | 15.04× | V1 done | Warp-level DPP fallback replacing BlockReduce |
| moe_fused_gate | HIP | Memory+Sync | 5.3× | V6 done | Fast-path dispatch to biased_grouped_topk |
| topk_per_row | HIP | Memory | 2.07× | V4 done | Persistent multi-block 10-bit radix |
| gated_rmsnorm_quant | HIP | Memory | — | Baseline optimal | Compiler already auto-vectorizes; 5 attempts all regressed |

Each operator directory contains three kinds of file:

    kernels/<op>/
    ├── baseline.cu      # Baseline source (copied from aiter)
    ├── v{N}_{tag}.cu    # Optimized version (e.g., v4_persistent_radix10.cu)
    └── notes.md         # Optimization record (bottleneck analysis + version history + measured data)


`topk_softmax`: Fallback path for non-power-of-2 expert counts. It replaces the BlockReduce + `__syncthreads` reduction with padding plus a warp-level `multithread_reduce`. The gain is largest at large batch sizes (16384 tokens / 48 experts: 260 μs → 17 μs).
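
The padding idea can be illustrated with a small host-side model. This is only a sketch under assumptions: it pads to the next power of two with negative infinity and uses an XOR butterfly exchange, which is one common way to write a warp-level reduction. The actual kernel uses DPP through `multithread_reduce`, whose details are not shown here, and the names `next_pow2` and `padded_butterfly_max` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Round n up to the next power of two (returns 1 for n <= 1).
std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// CPU model of a butterfly max-reduction across "lanes".
// Assumption: padding lanes hold -inf so they never win the max.
// Each loop iteration stands in for one cross-lane exchange step;
// no block-wide barrier is needed because all data stays within the lane set.
std::vector<float> padded_butterfly_max(const std::vector<float>& scores) {
    assert(!scores.empty());
    const std::size_t width = next_pow2(scores.size());
    std::vector<float> lanes(width, -std::numeric_limits<float>::infinity());
    std::copy(scores.begin(), scores.end(), lanes.begin());

    for (std::size_t offset = width / 2; offset > 0; offset >>= 1) {
        std::vector<float> next(width);
        for (std::size_t lane = 0; lane < width; ++lane) {
            // Each lane combines its value with the partner lane's value.
            next[lane] = std::max(lanes[lane], lanes[lane ^ offset]);
        }
        lanes = next;
    }
    return lanes;  // every lane now holds the maximum
}
```

With 48 scores the model pads to 64 lanes and finishes in 6 exchange steps, after which every lane holds the row maximum.
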

`moe_fused_gate`: Targets the DeepSeek V3 configuration (256 experts, 8 groups, topk=8). The host dispatches this configuration directly to the existing `biased_grouped_topk` kernel, which switches the selection from an O(topk × VPT) iterative argmax to an O(log²N) bitonic sort.
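
To make the O(log²N) figure concrete, here is a host-side sketch of a bitonic sorting network used for top-k. It is not the `biased_grouped_topk` kernel: the bias and group-selection logic are omitted, and `Scored`, `bitonic_sort_desc`, and `bitonic_topk` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Scored = std::pair<float, int>;  // (score, expert index)

// Sorts `a` in descending order with a bitonic network and returns the
// number of compare-exchange stages. a.size() must be a power of two.
int bitonic_sort_desc(std::vector<Scored>& a) {
    const std::size_t n = a.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    int stages = 0;
    for (std::size_t k = 2; k <= n; k <<= 1) {          // size of merged runs
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // partner distance
            ++stages;  // all pairs within a stage are independent
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner <= i) continue;  // handle each pair once
                const bool descending = (i & k) == 0;
                const bool out_of_order =
                    descending ? (a[i] < a[partner]) : (a[partner] < a[i]);
                if (out_of_order) std::swap(a[i], a[partner]);
            }
        }
    }
    return stages;
}

// Returns the indices of the `topk` largest scores, best first.
std::vector<int> bitonic_topk(const std::vector<float>& scores,
                              std::size_t topk, int* stages_out = nullptr) {
    assert(topk <= scores.size());
    std::vector<Scored> a(scores.size());
    for (std::size_t i = 0; i < scores.size(); ++i) {
        a[i] = Scored(scores[i], static_cast<int>(i));
    }
    const int stages = bitonic_sort_desc(a);
    if (stages_out) *stages_out = stages;
    std::vector<int> idx(topk);
    for (std::size_t i = 0; i < topk; ++i) idx[i] = a[i].second;
    return idx;
}
```

For N = 256 the network has log₂N · (log₂N + 1) / 2 = 36 stages regardless of topk, whereas an iterative argmax repeats a full scan for each of the topk picks.
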

`topk_per_row`: Row-wise TopK over 60K elements with K=2048. Rewritten from a single-block 11-bit radix to a persistent multi-block 10-bit radix. Small batches use multiple blocks to exploit CU parallelism; large batches fall back to a single block to avoid cross-block synchronization overhead.
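
The following host-side sketch shows one way radix-based top-K selection can work with 10-bit digits. It is an illustration under assumptions, not the kernel's code: the float-to-key mapping, the 10/10/10/2 split of a 32-bit key, and the names `float_to_key`, `kth_largest_key`, and `radix_topk_indices` are all hypothetical, and the persistent multi-block scheduling is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Map a float to a uint32 whose unsigned order matches the float order.
std::uint32_t float_to_key(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

// Find the k-th largest key (k is 1-based), most significant digit first.
// Each pass histograms one digit of the keys that still match the prefix.
std::uint32_t kth_largest_key(const std::vector<std::uint32_t>& keys,
                              std::size_t k, int digit_bits = 10) {
    assert(k >= 1 && k <= keys.size());
    assert(digit_bits >= 1 && digit_bits <= 16);
    std::uint32_t prefix = 0, mask = 0;
    for (int remaining = 32; remaining > 0;) {
        const int bits = std::min(digit_bits, remaining);
        const int shift = remaining - bits;
        const std::uint32_t digit_mask = (1u << bits) - 1u;
        std::vector<std::size_t> hist(std::size_t(1) << bits, 0);
        for (std::uint32_t key : keys) {
            if ((key & mask) == prefix) ++hist[(key >> shift) & digit_mask];
        }
        // Walk buckets from the largest digit until the k-th element is inside.
        std::uint32_t digit = digit_mask;
        while (hist[digit] < k) { k -= hist[digit]; --digit; }
        prefix |= digit << shift;
        mask |= digit_mask << shift;
        remaining = shift;
    }
    return prefix;
}

// Return the indices of the k largest values in `row` (in no particular order).
std::vector<int> radix_topk_indices(const std::vector<float>& row, std::size_t k) {
    assert(k <= row.size());
    std::vector<int> out;
    if (k == 0) return out;
    std::vector<std::uint32_t> keys(row.size());
    for (std::size_t i = 0; i < row.size(); ++i) keys[i] = float_to_key(row[i]);
    const std::uint32_t threshold = kth_largest_key(keys, k);
    out.reserve(k);
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (keys[i] > threshold) out.push_back(static_cast<int>(i));
    }
    // Fill the remaining slots with elements equal to the threshold.
    for (std::size_t i = 0; i < keys.size() && out.size() < k; ++i) {
        if (keys[i] == threshold) out.push_back(static_cast<int>(i));
    }
    return out;
}
```

In this model each pass is a histogram over the row followed by a short bucket scan. A 10-bit digit needs 1024 buckets per histogram where an 11-bit digit needs 2048.
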

`gated_rmsnorm_quant`: Fused operator (RMSNorm + SiLU + FP8 quantization). The compiler already auto-vectorizes loads to `global_load_dwordx4`, and the baseline measures at 54% of the HBM bandwidth ceiling (after accounting for unavoidable data-dependency re-reads). Five manual optimization attempts (buffer descriptors, inline asm, unrolling, thread layout, wave-level reduce) all regressed.
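
A bandwidth figure of this kind can be derived with simple traffic accounting. The helper below is a sketch under assumptions: it counts only the input tensor (read `read_passes` times to model re-reads) and the output tensor, ignoring weights and any gate input, and `Traffic`, `row_traffic`, and `bandwidth_fraction` are hypothetical names. The element sizes, timing, and peak bandwidth are supplied by the caller; none are taken from the measurements above.

```cpp
#include <cassert>
#include <cstddef>

struct Traffic {
    double bytes_read;
    double bytes_written;
};

// Bytes moved by one launch over a rows x cols tensor.
// read_passes > 1 models input that must be re-read, e.g. once for the
// reduction and once more to produce the output.
Traffic row_traffic(std::size_t rows, std::size_t cols,
                    std::size_t in_elem_bytes, std::size_t out_elem_bytes,
                    double read_passes) {
    assert(read_passes >= 1.0);
    const double elems = static_cast<double>(rows) * static_cast<double>(cols);
    return Traffic{elems * in_elem_bytes * read_passes, elems * out_elem_bytes};
}

// Achieved bandwidth as a fraction of the peak (1.0 means the ceiling).
double bandwidth_fraction(const Traffic& t, double seconds,
                          double peak_bytes_per_second) {
    assert(seconds > 0.0 && peak_bytes_per_second > 0.0);
    return (t.bytes_read + t.bytes_written) / seconds / peak_bytes_per_second;
}
```

Counting re-reads in the numerator matters: a kernel that must read its input twice can look far from the ceiling when only the nominal tensor sizes are counted.
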