TileLang DeepSeek-V4 porting #123

Draft
MirkoDeVita98 wants to merge 4 commits into main from deepseek_v4

@MirkoDeVita98 commented Apr 26, 2026

Adds PTO DSL ports of the six custom kernels used by DeepSeek-V4, plus
benchmarks and documentation. Each kernel follows the standard examples-tree
layout (compile.sh + run_*.py) and is exercised by
examples/validate_all_examples.py.

What's new — examples/aot/deepseek_v4/

Kernel               Pipe        What it does
------------------------------------------------------------------------------------------------
act_quant/           vector      per-row absmax fp16 → int8 quant
fp4_act_quant/       vector      per-row fp16 → mxfp4 (e2m1) quant
fp8_gemm/            cube + vec  per-channel fp8 GEMM
fp4_gemm/            cube + vec  per-channel fp4 GEMM
hc_split_sinkhorn/   vector      fused MoE-router head: pre/post sigmoid + 20-iter Sinkhorn
sparse_attn/         vector      FlashAttention with indexed top-k KV gather + per-head sink logit

Each folder ships a <name>_builder.py, <name>_util.py, caller.cpp,
compile.sh, run_<name>.py, README.md, and .gitignore.
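
As a point of reference for the act_quant row in the table above, here is a minimal NumPy sketch of per-row absmax fp16 → int8 quantization. It is not the PTO kernel: the function name `act_quant_ref`, the `eps` guard, and the symmetric [-127, 127] range are assumptions made for illustration.

```python
import numpy as np

def act_quant_ref(x: np.ndarray, eps: float = 1e-8):
    """Hypothetical host-side reference: per-row absmax int8 quantization."""
    x32 = x.astype(np.float32)
    # One scale per row, chosen so the row's largest magnitude maps to 127.
    absmax = np.max(np.abs(x32), axis=-1, keepdims=True)
    scale = np.maximum(absmax, eps) / 127.0
    # Round to nearest and clamp to the symmetric int8 range (assumed).
    q = np.clip(np.rint(x32 / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float16)
q, scale = act_quant_ref(x)

# Dequantize and measure the round-trip error against the fp16 input.
x_hat = q.astype(np.float32) * scale
max_err = float(np.max(np.abs(x_hat - x.astype(np.float32))))
```

With round-to-nearest, the round-trip error per element should stay within half a quantization step, i.e. `scale / 2` for that row.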

Notable design choices

  • fp8_gemm / fp4_gemm keep the matmul pure (cube fp32 → fp16) and
    fold the per-channel Sa rescale into a host-side pre-scale of A,
    leaving only Sb on the vector pipe. Avoids two extra cube fragments
    per tile; matches reference within 5 × 10⁻³.
  • hc_split_sinkhorn runs all three heads (pre / post / 20-iter
    Sinkhorn over [n, 4, 4]) inside one vector_section. Up to 18×
    faster than eager PyTorch on small batches (n ≤ 1024).
  • sparse_attn is pure vector_section FlashAttention. Per-head
    softmax stats are stored as full [H, D] tiles replicated across D
    to dodge a col-major⇄row-major reshape alias the auto-sync analysis
    can miss. KV is gathered one position at a time via pto.load_scalar
    → dynamic pto.slice_view → pto.load. Beats a hand-written
    torch.gather + npu_fused_infer_attention_score baseline by 1.2–1.6×
    on the small/medium shapes typical of this op; at the largest
    benchmarked shape (B=8, M=8, N=2048, K=128) it is slower (0.80×).
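
The scale folding described in the first bullet rests on a simple identity: a per-row scale on A commutes with the matmul, so it can be applied to A beforehand. The NumPy sketch below checks only that algebra in float64, not fp8/fp4 rounding. The shapes (Sa as [M, 1], one scale per row of A; Sb as [1, N], one scale per output column) are assumptions, since the PR text does not spell them out.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 8, 32, 16

# Stand-ins for the quantized operands, held in float64 for the algebra check.
A = rng.standard_normal((M, K))
B = rng.standard_normal((K, N))
Sa = rng.uniform(0.5, 2.0, size=(M, 1))  # assumed: one scale per row of A
Sb = rng.uniform(0.5, 2.0, size=(1, N))  # assumed: one scale per column of B

# Unfolded form: plain matmul, then both rescales on the output tile.
C_two_scales = (A @ B) * Sa * Sb

# Folded form: pre-scale A (host side), plain matmul, then only Sb afterwards.
A_prescaled = A * Sa
C_folded = (A_prescaled @ B) * Sb

max_abs_diff = float(np.max(np.abs(C_two_scales - C_folded)))
```

In exact arithmetic the two results are identical. Any difference in the real kernels comes from where rounding happens, which is what the 5 × 10⁻³ tolerance quoted above accounts for.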

How to build, test and benchmark

See examples/aot/deepseek_v4/OVERVIEW.md.
TL;DR:

# build + correctness-check every kernel
python examples/validate_all_examples.py

# benchmarks
python examples/aot/deepseek_v4/sparse_attn/bench_sparse_attn.py
python examples/aot/deepseek_v4/hc_split_sinkhorn/bench_hc_split_sinkhorn.py

Sample bench results

sparse_attn, vs torch.gather + npu_fused_infer_attention_score (MQA):

  B   M     N    K     pto us     ref us   fused us   ref/pto  fused/pto
------------------------------------------------------------------------
  1   1   128   64     161.15     533.05     265.03     3.31x      1.64x
  1   4   256  128     209.56    1692.93     252.36     8.08x      1.20x
  4   4  1024  128     207.77    6071.60     246.57    29.22x      1.19x
  8   8  2048  128     304.49   24658.49     244.67    80.98x      0.80x
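
For context on the computation these rows measure, here is a rough NumPy sketch of attention over gathered top-k KV positions with a per-head sink logit, processed one KV position at a time with an online softmax. It is a sketch under assumptions, not the kernel: all function and argument names are hypothetical, K and V are taken as separate [N, D] arrays shared by every head (MQA), and the sink logit is assumed to join the softmax denominator without contributing a value.

```python
import numpy as np

def sparse_attn_online(q, k, v, idx, sink, scale):
    """q: [H, D]; k, v: [N, D]; idx: [T] gathered positions; sink: [H]."""
    H = q.shape[0]
    m = sink.astype(np.float64).copy()   # running max, seeded with the sink logit
    l = np.ones(H)                       # running denominator: exp(sink - m) == 1
    acc = np.zeros((H, v.shape[1]))      # the sink adds no value (assumed)
    for j in idx:                        # one KV position at a time
        s = (q @ k[j]) * scale           # [H] scores for this position
        m_new = np.maximum(m, s)
        alpha = np.exp(m - m_new)        # rescales the previous partial sums
        p = np.exp(s - m_new)
        l = l * alpha + p
        acc = acc * alpha[:, None] + p[:, None] * v[j][None, :]
        m = m_new
    return acc / l[:, None]

def sparse_attn_dense(q, k, v, idx, sink, scale):
    """Same result computed with one dense softmax over [scores, sink]."""
    s = (q @ k[idx].T) * scale                           # [H, T]
    logits = np.concatenate([s, sink[:, None]], axis=1)  # sink as an extra column
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)
    return w[:, :-1] @ v[idx]                            # drop the sink column

rng = np.random.default_rng(0)
H, D, N, T = 4, 16, 64, 8
q = rng.standard_normal((H, D))
k = rng.standard_normal((N, D))
v = rng.standard_normal((N, D))
sink = rng.standard_normal(H)
idx = rng.choice(N, size=T, replace=False)

out_online = sparse_attn_online(q, k, v, idx, sink, D ** -0.5)
out_dense = sparse_attn_dense(q, k, v, idx, sink, D ** -0.5)
```

The online form is the one that matches a gather-one-position-at-a-time loop; the dense form is only there as a cross-check.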

hc_split_sinkhorn, vs eager PyTorch reference:

      n     pto us     ref us  speedup
----------------------------------------
     64     173.27    2803.42   16.18x
   1024     218.70    2761.33   12.63x
  16384    1786.32    2741.09    1.53x
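
For readers unfamiliar with the Sinkhorn step timed above, the sketch below shows a generic 20-iteration Sinkhorn normalization batched over [n, 4, 4] in NumPy. It is an illustration only: the `sinkhorn` helper, the exponentiation of the input, and the row-then-column order are assumptions, and the kernel's pre/post sigmoid heads are not modeled.

```python
import numpy as np

def sinkhorn(logits: np.ndarray, n_iters: int = 20) -> np.ndarray:
    """Alternately normalize rows and columns of exp(logits), batched over [n, 4, 4]."""
    # Subtract the per-matrix max before exp for numerical stability.
    p = np.exp(logits - logits.max(axis=(-2, -1), keepdims=True))
    for _ in range(n_iters):
        p = p / p.sum(axis=-1, keepdims=True)  # each row sums to 1
        p = p / p.sum(axis=-2, keepdims=True)  # each column sums to 1
    return p

rng = np.random.default_rng(0)
x = 0.5 * rng.standard_normal((64, 4, 4))
p = sinkhorn(x)

# After the final column step the columns are normalized exactly (up to rounding);
# the rows are only approximately normalized, with the gap shrinking per iteration.
row_err = float(np.abs(p.sum(axis=-1) - 1.0).max())
col_err = float(np.abs(p.sum(axis=-2) - 1.0).max())
```

Each matrix is tiny (4 × 4), so the work per element is small and fixed. That is consistent with the fused kernel's advantage shrinking at large n, where the eager reference amortizes its launch overhead.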

@MirkoDeVita98 marked this pull request as draft on April 26, 2026, 19:42