kernel-forge Optimization Leaderboard

Operator	Impl	Bottleneck	Best Speedup	Status	Details
topk_softmax	HIP	Sync	15.04×	V1 done	Warp-level DPP fallback replacing BlockReduce
moe_fused_gate	HIP	Memory+Sync	5.3×	V6 done	Fast-path dispatch to biased_grouped_topk
topk_per_row	HIP	Memory	2.07×	V4 done	Persistent multi-block radix-10
gated_rmsnorm_quant	HIP	Memory	—	Baseline optimal	Compiler already auto-vectorizes; 5 attempts all regressed

Directory Convention

Each operator directory contains 3 file types:

kernels/<op>/
├── baseline.cu          # Baseline source (copied from aiter)
├── v{N}_{tag}.cu        # Optimized version (e.g., v4_persistent_radix10.cu)
└── notes.md             # Optimization record (bottleneck analysis + version history + measured data)

Per-Operator Summaries

topk_softmax — V1: warp-level DPP fallback

Fallback path for non-power-of-2 expert counts. Replaces BlockReduce + __syncthreads with padding + warp-level multithread_reduce. The gain is largest at large batch sizes (16384 tokens, 48 experts: 260 μs → 17 μs).
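The padding idea can be illustrated on the host. The following is a minimal C++ sketch, not the kernel itself: it models each lane's register as a vector element, pads a non-power-of-2 count (such as 48) up to the next power of two with -inf, and runs a butterfly (XOR-partner) max reduction. The names next_pow2 and butterfly_max are hypothetical helpers introduced for illustration; they are not the aiter multithread_reduce API, and the real kernel may reduce differently.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Smallest power of two >= n (requires n >= 1).
inline std::size_t next_pow2(std::size_t n) {
    assert(n >= 1);
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Host-side model of a butterfly max reduction across lanes.
// Hypothetical helper for illustration only.
inline float butterfly_max(std::vector<float> lanes) {
    assert(!lanes.empty());
    const std::size_t width = next_pow2(lanes.size());
    // Pad with -inf so the extra lanes can never win the max.
    lanes.resize(width, -std::numeric_limits<float>::infinity());
    for (std::size_t offset = width / 2; offset > 0; offset /= 2) {
        std::vector<float> next(width);
        for (std::size_t lane = 0; lane < width; ++lane) {
            const float other = lanes[lane ^ offset];  // partner lane's value
            next[lane] = lanes[lane] > other ? lanes[lane] : other;
        }
        lanes.swap(next);
    }
    return lanes[0];  // after log2(width) steps every lane holds the max
}
```

In this model no block-wide barrier is needed: every step only exchanges values between pairs of lanes, which is what makes the warp-level form attractive compared with a block-level reduction.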

moe_fused_gate — V6: fast-path dispatch

Targets the DeepSeek V3 configuration (256 experts, 8 groups, topk=8). The host dispatches directly to the existing biased_grouped_topk kernel, replacing the O(topk × VPT) iterative argmax with an O(log²N) bitonic sort.
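To make the O(log²N) figure concrete, here is a minimal host-side C++ sketch of a bitonic sorting network used for top-k selection. It is an illustration under stated assumptions, not the biased_grouped_topk kernel: the function names are hypothetical, the input length must be a power of two, and grouping and bias handling are omitted. For N = 256 the network has 8 × 9 / 2 = 36 compare-exchange stages, and all comparisons within a stage are independent of one another.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sorts v in descending order with a bitonic network.
// v.size() must be a power of two.
inline void bitonic_sort_desc(std::vector<float>& v) {
    const std::size_t n = v.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    for (std::size_t k = 2; k <= n; k <<= 1) {          // size of the runs being merged
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // compare distance in this stage
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner <= i) continue;  // handle each pair once
                const bool descending = (i & k) == 0;
                if (descending ? v[i] < v[partner] : v[i] > v[partner])
                    std::swap(v[i], v[partner]);
            }
        }
    }
}

// Hypothetical helper: returns the k largest scores in descending order.
inline std::vector<float> topk_bitonic(std::vector<float> scores, std::size_t k) {
    assert(k <= scores.size());
    bitonic_sort_desc(scores);
    scores.resize(k);
    return scores;
}
```

Unlike an iterative argmax, whose cost grows with topk, the network's stage count depends only on N, so selecting 8 values costs the same as selecting 1.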

topk_per_row — V4: persistent multi-block radix-10

Row-wise TopK over 60K elements with K=2048. Rewritten from a single-block 11-bit radix to a persistent multi-block 10-bit radix. Small batches use multiple blocks to exploit CU parallelism; large batches fall back to a single block to avoid cross-block sync overhead.
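The radix approach can be sketched on the host as an MSB-first radix select that finds the K-th largest key, which then serves as the TopK threshold. This is a single-threaded C++ illustration under assumptions: keys are taken to be 32-bit unsigned integers, the function name is hypothetical, and the persistent multi-block scheduling is not modelled. With 32-bit keys, a 10-bit digit needs four passes over 1024-bucket histograms, while an 11-bit digit needs three passes over 2048-bucket histograms.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the k-th largest key (k is 1-based) by MSB-first radix select.
// Hypothetical host-side helper; digit_bits = 10 mirrors the radix-10 variant.
inline std::uint32_t radix_select_kth_largest(const std::vector<std::uint32_t>& keys,
                                              std::size_t k, unsigned digit_bits = 10) {
    assert(k >= 1 && k <= keys.size());
    assert(digit_bits >= 1 && digit_bits <= 16);
    std::uint32_t prefix = 0;       // bits of the answer fixed so far
    std::uint32_t prefix_mask = 0;  // which bit positions of prefix are fixed
    unsigned remaining_bits = 32;
    while (remaining_bits > 0) {
        const unsigned bits = std::min<unsigned>(digit_bits, remaining_bits);
        const unsigned shift = remaining_bits - bits;
        const std::uint32_t digit_mask = (1u << bits) - 1u;
        // Histogram of the current digit over keys that still match the prefix.
        std::vector<std::size_t> hist(std::size_t{1} << bits, 0);
        for (std::uint32_t key : keys)
            if ((key & prefix_mask) == prefix) ++hist[(key >> shift) & digit_mask];
        // Walk buckets from the largest digit down to the one holding the k-th key.
        std::uint32_t d = digit_mask;
        while (hist[d] < k) { k -= hist[d]; --d; }
        prefix |= d << shift;
        prefix_mask |= digit_mask << shift;
        remaining_bits -= bits;
    }
    return prefix;
}
```

Every key greater than the returned threshold belongs to the TopK set; ties at the threshold fill the remaining slots.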

gated_rmsnorm_quant — Baseline already optimal

Fused operator (RMSNorm + SiLU + FP8 quantization). The compiler already auto-vectorizes loads to global_load_dwordx4, and the kernel reaches 54% of the HBM bandwidth ceiling (after accounting for unavoidable data-dependency re-reads). Five manual optimization attempts (buffer descriptor, inline asm, unroll, thread layout, wave-level reduce) all regressed.
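One plausible reading of the data-dependency re-reads is that each reduction must finish before the values that depend on it can be produced. The scalar C++ reference below sketches that structure under loudly stated assumptions: the gating is taken to be rmsnorm(x) × SiLU(gate), the learned norm weight is omitted, quantization uses a per-row dynamic scale with an FP8 maximum of 448 (the E4M3 finite maximum), and values are only scaled and clamped, not rounded to FP8 bit patterns. The struct and function names are hypothetical, and the actual aiter operator may differ in every one of these choices.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct QuantRow {
    std::vector<float> scaled;  // values scaled into the FP8 range, not yet rounded
    float scale;                // dequantization scale: original ~= scaled * scale
};

inline float silu(float v) { return v / (1.0f + std::exp(-v)); }

// Hypothetical scalar reference for one row; see the assumptions in the lead-in.
inline QuantRow gated_rmsnorm_quant_ref(const std::vector<float>& x,
                                        const std::vector<float>& gate,
                                        float eps = 1e-6f, float fp8_max = 448.0f) {
    assert(!x.empty() && x.size() == gate.size());
    // Pass 1: the sum of squares must be complete before any output exists.
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    const float inv_rms = 1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
    // Pass 2: re-read x, apply norm and gate, and track the absolute maximum.
    std::vector<float> y(x.size());
    float amax = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * inv_rms * silu(gate[i]);
        amax = std::max(amax, std::fabs(y[i]));
    }
    // Pass 3: the scale depends on amax, so the values are visited once more.
    const float scale = amax > 0.0f ? amax / fp8_max : 1.0f;
    for (float& v : y) v = std::min(fp8_max, std::max(-fp8_max, v / scale));
    return {std::move(y), scale};
}
```

In this sketch the row is traversed three times for one output write, which is why a fused kernel of this shape cannot approach the bandwidth ceiling computed from a single read and a single write.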
