kernel-forge Optimization Leaderboard

| Operator | Impl | Bottleneck | Best Speedup | Status | Details |
| --- | --- | --- | --- | --- | --- |
| topk_softmax | HIP | Sync | 15.04× | V1 done | Warp-level DPP fallback replacing BlockReduce |
| moe_fused_gate | HIP | Memory+Sync | 5.3× | V6 done | Fast-path dispatch to biased_grouped_topk |
| topk_per_row | HIP | Memory | 2.07× | V4 done | Persistent multi-block radix-10 |
| gated_rmsnorm_quant | HIP | Memory | | Baseline optimal | Compiler already auto-vectorizes; 5 attempts all regressed |

Directory Convention

Each operator directory contains three kinds of file:

kernels/<op>/
├── baseline.cu          # Baseline source (copied from aiter)
├── v{N}_{tag}.cu        # Optimized version (e.g., v4_persistent_radix10.cu)
└── notes.md             # Optimization record (bottleneck analysis + version history + measured data)

Per-Operator Summaries

topk_softmax — V1: warp-level DPP fallback

Fallback path for non-power-of-2 expert counts. Replaces BlockReduce + __syncthreads with padding plus a warp-level multithread_reduce. The gain is largest at large batch sizes (16384 tokens / 48 experts: 260 μs → 17 μs).
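The padding idea can be illustrated on the host. The sketch below is not the V1 kernel and does not reproduce aiter's multithread_reduce; it only emulates a generic butterfly (XOR lane exchange) max reduction in plain C++, assuming the expert row is padded to the next power of two with negative infinity so that padded lanes never win. The names `next_pow2` and `padded_butterfly_max` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Smallest power of two >= n (n >= 1). Hypothetical helper.
inline std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Host-side emulation of a butterfly max reduction over "lanes".
// On a GPU each exchange step would be a cross-lane operation within
// a warp, so no block-wide barrier is needed; here it is a plain loop.
inline float padded_butterfly_max(const std::vector<float>& logits) {
    assert(!logits.empty());
    const std::size_t width = next_pow2(logits.size());

    // Pad with -inf: the identity element for max.
    std::vector<float> lanes(width, -std::numeric_limits<float>::infinity());
    std::copy(logits.begin(), logits.end(), lanes.begin());

    // log2(width) exchange steps; lane i pairs with lane i ^ offset.
    for (std::size_t offset = width / 2; offset > 0; offset >>= 1) {
        std::vector<float> next(width);
        for (std::size_t lane = 0; lane < width; ++lane) {
            next[lane] = std::max(lanes[lane], lanes[lane ^ offset]);
        }
        lanes.swap(next);
    }
    // After the last step every lane holds the row maximum.
    return lanes[0];
}
```

With 48 experts the row is padded to 64 lanes and reduced in 6 exchange steps. The same pattern applies to the softmax sum if the padding value is 0 instead of negative infinity.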

moe_fused_gate — V6: fast-path dispatch

Targets the DeepSeek V3 configuration (256 experts, 8 groups, topk=8). The host dispatches directly to the existing biased_grouped_topk kernel, which replaces the O(topk × VPT) iterative argmax with an O(log²N) bitonic sort.
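To make the O(log²N) figure concrete, here is a minimal host-side bitonic sort in C++. It is a textbook sketch, not the biased_grouped_topk kernel: the function name `bitonic_sort_desc` is hypothetical, the input length is assumed to be a power of two, and the inner loop over `i` stands in for the threads that would perform each compare-exchange stage in parallel on a GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sorts `a` in descending order with a bitonic network and returns the
// number of compare-exchange stages executed. Requires a power-of-two size.
inline std::size_t bitonic_sort_desc(std::vector<float>& a) {
    const std::size_t n = a.size();
    assert(n > 0 && (n & (n - 1)) == 0);

    std::size_t stages = 0;
    for (std::size_t k = 2; k <= n; k <<= 1) {          // size of merged runs
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // partner distance
            ++stages;
            for (std::size_t i = 0; i < n; ++i) {       // parallel on a GPU
                const std::size_t partner = i ^ j;
                if (partner <= i) continue;             // handle each pair once
                // Runs alternate direction so adjacent runs form bitonic
                // sequences; in the final merge (k == n) all are descending.
                const bool descending = (i & k) == 0;
                const bool out_of_order =
                    descending ? (a[i] < a[partner]) : (a[i] > a[partner]);
                if (out_of_order) std::swap(a[i], a[partner]);
            }
        }
    }
    return stages;
}
```

For N = 256 the network has log₂N · (log₂N + 1) / 2 = 36 stages, and the top 8 scores are simply the first 8 elements after sorting.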

topk_per_row — V4: persistent multi-block radix-10

Row-wise TopK for 60K elements / K=2048. Rewritten from a single-block 11-bit radix to a multi-block persistent 10-bit radix. Small batches use multiple blocks to exploit CU parallelism; large batches fall back to a single block to avoid cross-block sync overhead.
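The core of a radix-based TopK is a digit-by-digit search for the K-th largest key. The following host-side C++ sketch shows that search with 10-bit digits. It is an illustration under assumptions, not the V4 kernel: keys are assumed to be 32-bit unsigned integers already in sortable order (any float-to-key mapping is omitted), the function name `radix_select_kth_largest` is hypothetical, and the multi-block persistent scheduling is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the k-th largest key (k is 1-based) by fixing 10 bits per pass,
// most significant digit first. Every element >= the result is in the top k.
inline std::uint32_t radix_select_kth_largest(
        const std::vector<std::uint32_t>& keys, std::size_t k) {
    assert(k >= 1 && k <= keys.size());
    constexpr int kRadixBits = 10;

    std::uint32_t prefix = 0;       // bits of the answer decided so far
    std::uint32_t prefix_mask = 0;  // which bit positions are decided
    int bits_left = 32;

    while (bits_left > 0) {
        // The last pass covers the leftover 2 bits of a 32-bit key.
        const int width = std::min(kRadixBits, bits_left);
        const int shift = bits_left - width;
        const std::uint32_t digit_mask = (1u << width) - 1u;

        // Histogram the current digit over keys that match the prefix.
        std::vector<std::size_t> hist(std::size_t{1} << width, 0);
        for (std::uint32_t key : keys) {
            if ((key & prefix_mask) == prefix) {
                ++hist[(key >> shift) & digit_mask];
            }
        }

        // Walk buckets from the largest digit down until the bucket
        // containing the k-th largest candidate is found.
        std::uint32_t digit = digit_mask;
        while (hist[digit] < k) {
            k -= hist[digit];
            --digit;
        }

        prefix |= digit << shift;
        prefix_mask |= digit_mask << shift;
        bits_left -= width;
    }
    return prefix;
}
```

In a GPU version the histogram is the part that can be split across blocks, which is also where the cross-block synchronization cost mentioned above would arise.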

gated_rmsnorm_quant — Baseline already optimal

Fused operator (RMSNorm + SiLU + FP8 quantization). The compiler already auto-vectorizes loads to global_load_dwordx4, and the baseline measures at 54% of the HBM bandwidth ceiling (after accounting for unavoidable data-dependency re-reads). Five manual optimization attempts (buffer descriptor, inline asm, unroll, thread layout, wave-level reduce) all caused regressions.