TileLang DeepSeek-V4 porting by MirkoDeVita98 · Pull Request #123 · huawei-csl/pto-dsl

MirkoDeVita98 · 2026-04-26T19:38:11Z

Adds PTO DSL ports of the six custom kernels used by DeepSeek-V4, plus
benchmarks and documentation. Each kernel follows the standard examples-tree
layout (compile.sh + run_*.py) and is exercised by
examples/validate_all_examples.py.

What's new — `examples/aot/deepseek_v4/`

| Kernel | Pipe | What it does |
| --- | --- | --- |
| `act_quant/` | vector | per-row absmax fp16 → int8 quant |
| `fp4_act_quant/` | vector | per-row fp16 → mxfp4 (e2m1) quant |
| `fp8_gemm/` | cube + vec | per-channel fp8 GEMM |
| `fp4_gemm/` | cube + vec | per-channel fp4 GEMM |
| `hc_split_sinkhorn/` | vector | fused MoE-router head: pre/post sigmoid + 20-iter Sinkhorn |
| `sparse_attn/` | vector | FlashAttention with indexed top-k KV gather + per-head sink logit |
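For readers unfamiliar with the first row of the table, per-row absmax quantization can be sketched in NumPy as follows. This is only an illustrative host-side reference, not the PTO kernel: the function name `act_quant_ref`, the `eps` guard, and the symmetric `[-127, 127]` range are assumptions that do not come from this PR.

```python
import numpy as np

def act_quant_ref(x, eps=1e-8):
    """Hypothetical reference: per-row absmax quantization, fp16 -> int8."""
    x32 = x.astype(np.float32)
    # One scale per row, chosen so the largest magnitude maps to 127.
    absmax = np.max(np.abs(x32), axis=1, keepdims=True)
    scale = np.maximum(absmax, eps) / 127.0  # eps avoids division by zero
    q = np.clip(np.rint(x32 / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.default_rng(0).standard_normal((4, 64)).astype(np.float16)
q, scale = act_quant_ref(x)
x_hat = q.astype(np.float32) * scale  # dequantized approximation of x
```

The round-trip error of each element is bounded by half of its row's scale, which makes this a convenient correctness oracle for a quantization kernel.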

Each folder ships a <name>_builder.py, <name>_util.py, caller.cpp,
compile.sh, run_<name>.py, README.md, and .gitignore.

Notable design choices

- fp8_gemm / fp4_gemm keep the matmul pure (cube fp32 → fp16) and
  fold the per-channel Sa rescale into a host-side pre-scale of A,
  leaving only Sb on the vector pipe. This avoids two extra cube
  fragments per tile and matches the reference within 5 × 10⁻³.
- hc_split_sinkhorn runs all three heads (pre / post / 20-iter
  Sinkhorn over [n, 4, 4]) inside one vector_section. It is up to 18×
  faster than eager PyTorch on small batches (n ≤ 1024).
- sparse_attn is a pure vector_section FlashAttention. Per-head
  softmax stats are stored as full [H, D] tiles replicated across D
  to dodge a col-major⇄row-major reshape alias that the auto-sync
  analysis can miss. KV is gathered one position at a time via
  pto.load_scalar → dynamic pto.slice_view → pto.load. It beats a
  hand-written torch.gather + npu_fused_infer_attention_score baseline
  by 1.2–1.6× across the small/medium shapes typical of this op.
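The Sa fold described for fp8_gemm / fp4_gemm relies on row scaling commuting with the matmul. A minimal NumPy check of that identity is sketched below. It assumes Sa holds one scale per row of A and Sb one scale per column of B, and it uses float32 stand-ins rather than real fp8/fp4 data; the shapes and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 8, 16, 4  # hypothetical tile sizes
A = rng.standard_normal((M, K)).astype(np.float32)
B = rng.standard_normal((K, N)).astype(np.float32)
Sa = rng.uniform(0.5, 2.0, size=(M, 1)).astype(np.float32)  # per-row scale of A
Sb = rng.uniform(0.5, 2.0, size=(1, N)).astype(np.float32)  # per-column scale of B

# Rescale after the matmul: both scales applied to the accumulator.
ref = Sa * (A @ B) * Sb

# Fold Sa into A up front, so only Sb remains as a post-matmul step.
folded = ((Sa * A) @ B) * Sb

max_abs_diff = float(np.max(np.abs(ref - folded)))
```

Both forms agree up to floating-point rounding, which is why only the Sb multiply needs to stay on the vector pipe.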

How to build, test and benchmark

See examples/aot/deepseek_v4/OVERVIEW.md.
TL;DR:

# build + correctness-check every kernel
python examples/validate_all_examples.py

# benchmarks
python examples/aot/deepseek_v4/sparse_attn/bench_sparse_attn.py
python examples/aot/deepseek_v4/hc_split_sinkhorn/bench_hc_split_sinkhorn.py

Sample bench results

sparse_attn, vs torch.gather + npu_fused_infer_attention_score (MQA).
The last two columns are speedups of the PTO kernel, i.e. baseline
time divided by pto time:

  B   M     N    K     pto us     ref us   fused us   pto/ref  pto/fused
------------------------------------------------------------------------
  1   1   128   64     161.15     533.05     265.03     3.31x      1.64x
  1   4   256  128     209.56    1692.93     252.36     8.08x      1.20x
  4   4  1024  128     207.77    6071.60     246.57    29.22x      1.19x
  8   8  2048  128     304.49   24658.49     244.67    80.98x      0.80x
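As a rough picture of the op being timed, the sketch below gathers the selected KV rows and runs softmax attention with an extra per-head sink logit. It is a NumPy illustration under stated assumptions, not the kernel's code: a single query position, K and V shared across heads, and a sink that contributes only to the softmax denominator. The name `sparse_attn_ref` and all argument names are hypothetical.

```python
import numpy as np

def sparse_attn_ref(q, k, v, idx, sink, scale):
    """Hypothetical reference for one query position.

    q: [H, D] queries, k/v: [N, D] shared across heads,
    idx: [K] selected KV positions, sink: [H] per-head sink logit.
    """
    kg, vg = k[idx], v[idx]               # gather the top-k rows: [K, D]
    logits = (q @ kg.T) * scale           # [H, K]
    # Numerically stable softmax whose max also covers the sink logit.
    m = np.maximum(logits.max(axis=1), sink)
    p = np.exp(logits - m[:, None])
    # Assumption: the sink adds mass to the denominator but has no value row.
    denom = p.sum(axis=1) + np.exp(sink - m)
    return (p @ vg) / denom[:, None]

rng = np.random.default_rng(1)
q = rng.standard_normal((2, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
idx = np.array([3, 0, 7, 12])
out = sparse_attn_ref(q, k, v, idx, sink=np.zeros(2), scale=8 ** -0.5)
```

With a very negative sink the result reduces to ordinary softmax attention over the gathered rows, which is a handy sanity check.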

hc_split_sinkhorn, vs eager PyTorch reference:

      n     pto us     ref us  speedup
----------------------------------------
     64     173.27    2803.42   16.18x
   1024     218.70    2761.33   12.63x
  16384    1786.32    2741.09    1.53x
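For context, a 20-iteration Sinkhorn normalization over a batch of 4×4 matrices can be sketched in NumPy as below. This covers only the Sinkhorn part, not the pre/post sigmoid heads, and it is an assumption-laden illustration: exponentiating the logits first, the row-then-column order, and the `eps` guard are not taken from this PR, and `sinkhorn_ref` is a hypothetical name.

```python
import numpy as np

def sinkhorn_ref(logits, n_iters=20, eps=1e-6):
    """Hypothetical reference: alternate row/column normalization on [n, 4, 4]."""
    # Subtract the per-matrix max so the exponentials stay bounded.
    p = np.exp(logits - logits.max(axis=(1, 2), keepdims=True))
    for _ in range(n_iters):
        p = p / (p.sum(axis=2, keepdims=True) + eps)  # rows sum to ~1
        p = p / (p.sum(axis=1, keepdims=True) + eps)  # columns sum to ~1
    return p

logits = np.random.default_rng(0).standard_normal((8, 4, 4))
p = sinkhorn_ref(logits)
col_sums = p.sum(axis=1)  # [8, 4]
row_sums = p.sum(axis=2)  # [8, 4]
```

After the final column step the column sums are essentially 1, and the row sums approach 1 as the iteration converges toward a doubly stochastic matrix.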


tilelang deepseek v4 kernels porting

311fb06

MirkoDeVita98 marked this pull request as draft April 26, 2026 19:42

mirkodevita added 3 commits April 26, 2026 19:51

tilelang deepseek v4 kernels removed pytest to adapt to other examples format

56818d3

added padding and sentinel in sparse attn example

1ef4d36

added more interesting shapes in benchmark of sparse attention

ce858e0

MirkoDeVita98 wants to merge 4 commits into main from deepseek_v4
