
Optimize DeepSeek V4 qkv_proj_rope decode (S=2): partial-sum reduces, amax fold, K-tile/stage tuning #339

Merged
zhangqi-chen merged 2 commits into hw-native-sys:main from wangqin1723-max:perf/dsv4-qkv-proj-rope-grouped-chunking on May 21, 2026


@wangqin1723-max wangqin1723-max commented May 20, 2026

Summary

Continues the grouped-chunking series (Opt J-P) with five new optimizations targeting residual serial AIV reduces and cube K-loop tuning. S=2 (T=128) wall-clock 3-run median 624us → 545us (−12.6%); cumulative vs pre-tuning baseline 1868us → 545us (−70.8%).

New optimizations (on top of the prior Opt A+B+E+G+J+K+L+M-revert+N+O+P endpoint)

  • Opt S: attn_norm_rms serial reduce → 2-way parallel partial-sum + final reduce. chunked_loop_optimizer is required on the partial scope to fit the 192KB Vec UB at S=2. PARTIALS=2 (not 4+) is intentional: it keeps the FP32 add deterministic, preserving q validation across devices.
  • Opt T: fold qr_quant_amax into qr_norm_apply. Each qr_norm_apply task additionally writes a per-task partial amax to qr_amax_partial[Q_BLOCKS, T]; the residual qr_quant_amax scope shrinks from 30us to 1.7us. Partial amax is computed on qr_normed_bf16, which is bit-identical to the value the original scope would have re-read via GM, so the INT8 quant scale is unchanged (qr validation atol=1 holds).
  • Opt U: qr_rms 2-way parallel partial-sum (mirrors Opt S). qr_fp32 is already FP32 so the inner loop is cast-free; the chunked_loop_optimizer cost from Opt S doesn't apply.
  • Opt V: Q_PROJ_CHUNK 256 → 512 (K-tile). qproj_matmul per-task Exec 74us → 56us (−25%). N-tile (Q_PROJ_OUT_CHUNK) is intentionally left at 128: the prior Opt B run on this kernel confirmed Q_PROJ_OUT_CHUNK=256 triggers ACL_ERROR_RT_AICORE_TIMEOUT (CANN template limit).
  • Opt X: qr_proj_matmul / kv_proj_matmul K-loop pl.pipeline(stage=2 → stage=4) per docs/performance-tuning.md Part 2 §1. Both have D_BLOCKS=32, enough iterations for 4-deep ping-pong. qproj_matmul is skipped (post-Opt V it has Q_PROJ_BLOCKS=2, too few iterations).
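
A minimal plain-Python sketch of the partial-sum pattern behind Opt S and Opt U, not the pypto kernel. The function names, the eps value, and the sample row are illustrative assumptions. The second half shows one plausible reading of the PARTIALS=2 remark (the PR does not spell out the reasoning): a single floating-point add is commutative, so two partials combine to the same bits in either order, whereas a chain of three or more adds can depend on the order in which partials arrive. Python floats are FP64 rather than FP32, but the same rounding behaviour applies.

```python
import math

def inv_rms_serial(row, eps=1e-6):
    """Reference: one serial sum of squares over the whole row."""
    sq_sum = 0.0
    for v in row:
        sq_sum += v * v
    return 1.0 / math.sqrt(sq_sum / len(row) + eps)

def inv_rms_partial(row, partials=2, eps=1e-6):
    """Reduce `partials` equal slices independently (the parallel tasks),
    then combine the partial sums in a small final reduce."""
    assert len(row) % partials == 0, "row length must divide evenly"
    span = len(row) // partials
    partial_sums = []
    for p in range(partials):
        acc = 0.0
        for v in row[p * span:(p + 1) * span]:
            acc += v * v
        partial_sums.append(acc)
    sq_sum = 0.0
    for s in partial_sums:  # final reduce over the partials
        sq_sum += s
    return 1.0 / math.sqrt(sq_sum / len(row) + eps)

# Illustrative row; every square and sum here is exactly representable.
row = [0.5 * (i % 7) - 1.0 for i in range(64)]
serial = inv_rms_serial(row)
two_way = inv_rms_partial(row, partials=2)

# Order sensitivity with more than two addends (made-up values).
vals = [1e16, 1.0, -1e16, 1.0]
in_order = ((vals[0] + vals[1]) + vals[2]) + vals[3]
reordered = ((vals[0] + vals[2]) + vals[1]) + vals[3]
print(serial == two_way, in_order, reordered)
```

With only two partials the final reduce is a single add, so there is no ordering choice left to make.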

Incidental cleanup

  • Renamed Python locals to break pypto AST's implicit init_values chain across sibling scopes: d0 → rms_d0 / apply_d0 / qr_d0, qr_chunk → qr_norm_chunk. Without this, removing the original serial attn_norm_rms scope broke downstream scopes with "Variable 'd0_inlineNN' used outside its defining scope" SSA errors.
  • Trimmed comment block redundancy and historical tuning narrative from earlier commits.

Test plan

  • q / kv / qr / qr_scale validation PASS at documented tolerances
  • 3-run wall-clock median on python models/deepseek/v4/qkv_proj_rope.py -p a2a3 --enable-l2-swimlane (S=2, T=128, FLASH config)
  • Re-validated across 3 different a2a3 devices (5, 6, 8)
  • CI: pre-commit + unit-tests + sim/a2a3 PR-changed-file runs

Restructure scopes to amortise per-task launch overhead and lift Vec/Cube
utilisation on decode:

- Split fused attn_norm into serial RMS reduce + parallel ATTN_NORM_GROUP
  apply; drop the FP32 norm intermediate, write token_x_bf16 directly.
- Split qr_rms apply from RMS reduce and chunk by QR_NORM_GROUP.
- Decouple qr_quant into amax-reduce + parallel apply over Q_LORA chunks.
- Chunk qproj_matmul by Q_PROJ_GROUP and decouple qproj_dequant via a global
  INT32 col_acc_all staging buffer, letting dequant run at a larger
  Q_PROJ_DEQUANT_GROUP without slowing matmul.
- Split per-head fused RMS+NOPE+RoPE into q_head_rms_nope + q_head_rope so
  the RoPE scope stays within the 192KB Vec UB budget at T=128 (S=2).
- Pull q_rope_reassemble/q_rope_write out of the per-head loop and chunk
  by HEAD_GROUP via a [H*T, ROPE_DIM] pair staging buffer.
- Chunk kv_proj_matmul by KV_PROJ_GROUP.

Tuning constants: ROPE_CHUNK 32->64, Q_PROJ_CHUNK 128->256,
QUANT_APPLY_CHUNK 256. Adds divisibility asserts for each new group.

coderabbitai Bot commented May 20, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This PR refactors the DeepSeek-V4 fused single-token QKV attention pipeline by decoupling compute stages, updating tiling/grouping parameters, and reorganizing RMSNorm and RoPE into separate grouped passes with intermediate staging buffers. Quantization and Q-projection matmul/dequant are split into independent stages; per-head RoPE is reassembled via cross-head staging; and KV RoPE reassembly uses a full FP32 buffer.

Changes

DeepSeek-V4 QKV Projection and RoPE Pipeline Refactoring

| Layer | File(s) | Summary |
|---|---|---|
| Tiling and Grouping Constants | models/deepseek/v4/qkv_proj_rope.py | Chunk/group sizes and derived block counts updated; new grouping parameters (Q_PROJ_GROUP, Q_PROJ_DEQUANT_GROUP, QUANT_APPLY_CHUNK) and adjusted QUANT_* chunking/divisibility asserts. |
| Attention RMS denom reduction and grouped normalization | models/deepseek/v4/qkv_proj_rope.py | Fused attn_norm replaced with FP32 partial RMS denominator reductions followed by grouped parallel normalization and FP32→BF16 cast into token_x_bf16. |
| QR compute and RMS reduction | models/deepseek/v4/qkv_proj_rope.py | Computes qr_fp32 via grouped Q-block matmuls with pipelined accumulation, then reduces QR RMS using partial-sum pattern to produce qr_inv_rms. |
| QR normalization, amax/scale, and INT8 apply | models/deepseek/v4/qkv_proj_rope.py | Fuses normalization+BF16 cast while computing per-task amax partials, reduces to INT8 scales, and applies quantization in a parallel chunked stage to write qr (INT8). |
| Q-projection INT32 matmul and decoupled dequant | models/deepseek/v4/qkv_proj_rope.py | Performs grouped W8A8C16 INT32 accumulations into col_acc_all, then a separate dequant stage (INT32→FP32 with per-channel scales) to emit q_proj_fp32. |
| Per-head RMSNorm, cross-head RoPE staging, and reassemble | models/deepseek/v4/qkv_proj_rope.py | Splits RMSNorm/NOPE BF16 construction and inverse-RMS recording, computes RoPE into q_rope_pair_stage in HEAD_GROUP chunks, then reassembles via cube-style matmuls and writes final BF16 q. |
| KV RoPE full-FP32 reassembly and BF16 cast | models/deepseek/v4/qkv_proj_rope.py | Reassembles KV RoPE into a full FP32 kv_rope_full buffer and casts from that buffer into the BF16 kv output slice. |
| Test harness arg reorder | models/deepseek/v4/qkv_proj_rope.py | Reorders run_jit tolerance/compare/config arguments and moves a Q-projection quant round-off comment; comparison thresholds unchanged. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#332: Directly overlaps by refactoring the same qkv_proj_rope pipeline stages, intermediate buffers, and dequant/reassembly flow.
  • hw-native-sys/pypto-lib#234: Related to INT8 qr + qr_scale quant/dequant path adjustments and golden computation for Q-projection.
  • hw-native-sys/pypto-lib#298: Modifies codegen/pl.at annotations in the same kernel area and may interact with fused-stage scheduling metadata.

Suggested labels

enhancement

Poem

🐰 Pipelines hop in tidy rows,
Chunks and stages, where logic flows,
RoPE and norms split by design,
Buffers cross-head, bits align—
A rabbit cheers: new paths compose!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: optimization of DeepSeek V4 qkv_proj_rope decode through partial-sum reduces, amax folding, and K-tile/stage tuning, which aligns with the file changes and objectives. |
| Description check | ✅ Passed | The pull request description clearly relates to the changeset, detailing specific optimizations (Opt S through X) and structural changes to the qkv_proj_rope decode pipeline with measurable performance improvements. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@models/deepseek/v4/qkv_proj_rope.py`:
- Around line 49-52: Add direct guards that H is divisible by grouping factors:
insert assert H % HEAD_GROUP == 0 with a clear message near the existing
HEAD_BLOCKS check so the grouped q-RoPE loops (pl.parallel(0, H, HEAD_GROUP) /
pl.range(HEAD_GROUP)) cannot index past H; likewise add assert H % Q_PROJ_GROUP
== 0 next to the Q_PROJ_GROUP/Q_PROJ_OUT_CHUNK assertions (and repeat the same
explicit H % GROUP checks in the other similar block around lines 277-318) so
grouped loop stride invariants are enforced explicitly.
- Line 48: Add explicit divisibility guards for the new Q_LORA tiling
assumptions: assert Q_LORA % Q_PROJ_CHUNK == 0 (so Q_PROJ_BLOCKS = Q_LORA //
Q_PROJ_CHUNK won't drop a tail), assert Q_LORA % QUANT_CHUNK == 0 (so inner
quant tiles don't run past range), and assert (Q_LORA // QUANT_CHUNK) %
QUANT_APPLY_CHUNK == 0 (so each outer group contains a full QUANT_APPLY_CHUNK of
QUANT_CHUNK tiles). Place these grouped asserts near the top of the module
(around the existing QUANT_APPLY_CHUNK definition) and/or at the start of
qproj_matmul and qr_quant_apply to clearly document and enforce the contracts
referenced in lines ~48, 174-181 and 202-217.
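
The three Q_LORA guards listed above could be grouped roughly as follows. This is a sketch only: check_q_lora_tiling and covered_columns are hypothetical helpers, and the numeric values are illustrative rather than the kernel's real constants. The second helper shows what "dropping a tail" means when a block count is derived with floor division.

```python
def check_q_lora_tiling(q_lora, q_proj_chunk, quant_chunk, quant_apply_chunk):
    """Raise AssertionError if any Q_LORA tiling contract is violated.

    Returns (q_proj_blocks, quant_apply_groups) when all contracts hold.
    """
    assert q_lora % q_proj_chunk == 0, \
        "Q_LORA must be divisible by Q_PROJ_CHUNK"
    assert q_lora % quant_chunk == 0, \
        "Q_LORA must be divisible by QUANT_CHUNK"
    quant_tiles = q_lora // quant_chunk
    assert quant_tiles % quant_apply_chunk == 0, \
        "QUANT tile count must be divisible by QUANT_APPLY_CHUNK"
    return q_lora // q_proj_chunk, quant_tiles // quant_apply_chunk

def covered_columns(total, chunk):
    """Columns actually visited when the block count is total // chunk."""
    return (total // chunk) * chunk

# Illustrative values only (assumed, not taken from the kernel source).
blocks = check_q_lora_tiling(q_lora=1024, q_proj_chunk=512,
                             quant_chunk=32, quant_apply_chunk=8)
tail_dropped = 1536 - covered_columns(1536, 1024)
print(blocks, tail_dropped)
```

Plain assert statements are skipped under python -O, so a kernel that must always enforce these contracts may prefer explicit exceptions.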

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9d300a81-7e87-4c00-b76e-18e22fc6e355

📥 Commits

Reviewing files that changed from the base of the PR and between 058894f and e7af204.

📒 Files selected for processing (1)
  • models/deepseek/v4/qkv_proj_rope.py

Comment thread models/deepseek/v4/qkv_proj_rope.py
Comment on lines 49 to +52
assert (H * HEAD_DIM) % (HEAD_CHUNK * HEAD_GROUP) == 0, \
    "HEAD_BLOCKS must be divisible by HEAD_GROUP"
assert ((H * HEAD_DIM) // Q_PROJ_OUT_CHUNK) % Q_PROJ_GROUP == 0, \
    "Q_PROJ_HEAD_BLOCKS must be divisible by Q_PROJ_GROUP"

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Guard HEAD_GROUP against H directly.

The grouped q-RoPE loops step in head units (pl.parallel(0, H, HEAD_GROUP) + pl.range(HEAD_GROUP)), so the invariant they need is H % HEAD_GROUP == 0. The current HEAD_BLOCKS assert is only an indirect proxy and can pass for configs where the last h_inner still indexes past H.

Suggested guard addition
+assert H % HEAD_GROUP == 0, \
+    "H must be divisible by HEAD_GROUP"
 assert (H * HEAD_DIM) % (HEAD_CHUNK * HEAD_GROUP) == 0, \
     "HEAD_BLOCKS must be divisible by HEAD_GROUP"

Also applies to: 277-318
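
To make the gap between the two asserts concrete, here is a small plain-Python check. The configuration is hypothetical and chosen only to expose the problem; grouped_head_indices is an assumed stand-in for the pl.parallel(0, H, HEAD_GROUP) / pl.range(HEAD_GROUP) loop pair, not pypto code.

```python
def grouped_head_indices(h, head_group):
    """Head indices visited by an outer loop stepping by head_group
    with an inner loop of head_group iterations."""
    return [h0 + h_inner
            for h0 in range(0, h, head_group)
            for h_inner in range(head_group)]

# Hypothetical configuration, not the kernel's real values.
H, HEAD_DIM, HEAD_CHUNK, HEAD_GROUP = 6, 128, 64, 4

# Existing (indirect) assert: 768 % 256 == 0, so it passes.
indirect_ok = (H * HEAD_DIM) % (HEAD_CHUNK * HEAD_GROUP) == 0
# Suggested (direct) assert: 6 % 4 != 0, so it would fire.
direct_ok = H % HEAD_GROUP == 0
# Heads the grouped loops would touch beyond the valid range [0, H).
out_of_range = [i for i in grouped_head_indices(H, HEAD_GROUP) if i >= H]
print(indirect_ok, direct_ok, out_of_range)
```

In this configuration the indirect check passes while the second group still reaches heads 6 and 7, which is the case the direct guard is meant to catch.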



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces significant performance optimizations to the qkv_proj_rope module by increasing chunk sizes and implementing task-grouping for normalization, projection, and quantization stages. Key changes include the decoupling of query projection dequantization into a separate parallel pass and the use of staging buffers to manage memory constraints and reduce launch overhead. Review feedback suggests further refining the placement of tensor initializations within loops to prevent potential compiler-induced state dependencies and simplifying redundant loops where chunk sizes now match full dimensions.


# Pre-declared outside pl.range to satisfy pypto's loop-carried init_values
# threading. The dummy value is overwritten by pl.matmul on the first iter.
col_acc = pl.create_tensor([T, Q_PROJ_OUT_CHUNK], dtype=pl.INT32)

medium

The col_acc tensor is created outside the h_inner loop but inside the pl.parallel scope. While the logic correctly resets col_acc on the first iteration of the qb loop (when qr_proj_col0 == 0), reusing the same tensor object across serial iterations of h_inner can sometimes lead to unexpected loop-carried state dependencies in the PyPTO compiler if not explicitly handled. Since this is intended as a per-head-block scratch buffer, consider moving the creation inside the h_inner loop or ensuring the compiler correctly hoists it as a task-local scratch.
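
The reset-on-first-iteration pattern under discussion can be modelled in plain Python as below. This only illustrates the Python-level value semantics (a pre-declared accumulator whose dummy content is overwritten on the first K step); it says nothing about how the PyPTO compiler analyses loop-carried state. matmul, matmul_acc, and run_head_blocks are hypothetical stand-ins for the pl.* calls, and the data is made up.

```python
def matmul(a, b):
    """Fresh product: replaces whatever the accumulator held before."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matmul_acc(acc, a, b):
    """Accumulating product: acc + a @ b."""
    prod = matmul(a, b)
    return [[acc[i][j] + prod[i][j] for j in range(len(prod[0]))]
            for i in range(len(prod))]

def run_head_blocks(a_chunks, b_chunks_per_head):
    # Declared once, outside both loops, with a deliberately wrong dummy value.
    acc = [[-999]]
    results = []
    for b_chunks in b_chunks_per_head:            # serial h_inner-style loop
        for k, (a, b) in enumerate(zip(a_chunks, b_chunks)):  # K loop
            if k == 0:
                acc = matmul(a, b)                # first K step overwrites acc
            else:
                acc = matmul_acc(acc, a, b)
        results.append(acc)
    return results

a_chunks = [[[1, 2]], [[3, 4]]]                   # two 1x2 K-chunks
b_chunks_per_head = [
    [[[1], [1]], [[1], [1]]],                     # head block 0
    [[[2], [0]], [[0], [1]]],                     # head block 1
]
results = run_head_blocks(a_chunks, b_chunks_per_head)
print(results)
```

Neither the dummy value nor the previous head block's total leaks into a later result, because the k == 0 branch discards the old accumulator.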

wkv_chunk = wkv[d0 : d0 + D_CHUNK, kv_col0 : kv_col0 + KV_CHUNK]
kv_acc = pl.matmul_acc(kv_acc, kv_x_chunk_bf16, wkv_chunk)
kv_fp32[:, kv_col0 : kv_col0 + KV_CHUNK] = kv_acc
kv_acc = pl.create_tensor([T, KV_CHUNK], dtype=pl.FP32)

medium

Similar to the qproj_matmul scope, kv_acc is created outside the k_inner loop. Although it is correctly initialized via pl.matmul when d0 == 0, reusing the tensor across serial iterations of k_inner might trigger loop-carried state analysis. Moving the creation inside the k_inner loop would make the task-local nature of the buffer more explicit to the optimizer.

 kv_rope_full = pl.create_tensor([T, ROPE_DIM], dtype=pl.FP32)
 with pl.at(level=pl.Level.CORE_GROUP, name_hint="kv_rope_reassemble"):
-    for rope_col in pl.pipeline(0, ROPE_DIM, ROPE_CHUNK, stage=2):
+    for rope_col in pl.range(0, ROPE_DIM, ROPE_CHUNK):

medium

The kv_rope_reassemble scope still uses a pl.range loop over ROPE_CHUNK, even though ROPE_CHUNK is now equal to ROPE_DIM (64). This makes the loop run only once. For consistency with the optimized q_rope_reassemble (lines 299-313), which processes the full ROPE_DIM at once, you could remove this loop and the associated slicing to simplify the code.
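
A quick plain-Python check of the single-iteration claim, assuming ROPE_DIM = ROPE_CHUNK = 64 as stated in the comment. copy_chunked and copy_full are hypothetical stand-ins for the sliced and loop-free forms of the reassemble step, not pypto code.

```python
ROPE_DIM = 64
ROPE_CHUNK = 64  # equal to ROPE_DIM, per the review comment

def copy_chunked(src, dim, chunk):
    """Copy src one column range at a time, as a chunked loop would."""
    dst = [0.0] * dim
    starts = list(range(0, dim, chunk))
    for col in starts:
        dst[col:col + chunk] = src[col:col + chunk]
    return dst, len(starts)

def copy_full(src):
    """Loop-free equivalent: handle the whole row in one step."""
    return list(src)

row = [float(i) for i in range(ROPE_DIM)]
chunked, iterations = copy_chunked(row, ROPE_DIM, ROPE_CHUNK)
print(iterations, chunked == copy_full(row))
```

When the chunk equals the full dimension the loop body runs once and the slice covers the whole row, so the loop and slicing add no behaviour.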

@wangqin1723-max wangqin1723-max changed the title from "Optimize DeepSeek V4 qkv_proj_rope decode via grouped chunking" to "Optimize DeepSeek V4 qkv_proj_rope decode (S=2): partial-sum reduces, amax fold, K-tile/stage tuning" on May 21, 2026
@wangqin1723-max wangqin1723-max force-pushed the perf/dsv4-qkv-proj-rope-grouped-chunking branch from bf43f45 to 6cfd357 on May 21, 2026 08:23
…es, amax fold, K-tile/stage tuning

S=2 (T=128) wall-clock 3-run median: 624us -> 545us (-12.6%).
Cumulative since pre-tuning baseline: 1868us -> 545us (-70.8%).

Five accepted optimizations on top of the existing grouped-chunking endpoint:

- Opt S: attn_norm_rms serial -> 2-way parallel partial sum + final reduce
  (chunked_loop_optimizer required on the partial scope to fit Vec UB;
   PARTIALS=2 keeps FP32 add deterministic across devices)
- Opt T: fold qr_quant_amax into qr_norm_apply via a per-task partial-amax
  buffer; the residual qr_quant_amax scope shrinks 30us -> 1.7us. Partial
  amax is computed on qr_normed_bf16 (bit-identical to the original GM
  round-trip) so the INT8 quant scale is unchanged
- Opt U: qr_rms 2-way parallel partial sum (mirrors Opt S; qr_fp32 is
  already FP32 so no chunked_loop_optimizer cast-split cost)
- Opt V: Q_PROJ_CHUNK 256 -> 512 (K-tile only; N-tile already known to
  trigger ACL_ERROR_RT_AICORE_TIMEOUT per prior Opt B). qproj_matmul
  per-task Exec 74us -> 56us (-25%)
- Opt X: qr_proj_matmul / kv_proj_matmul K-loop pl.pipeline stage 2 -> 4
  (D_BLOCKS=32 has enough iters for 4-deep ping-pong)

Side effect: a few Python locals are renamed (d0 -> rms_d0 / apply_d0 /
qr_d0, qr_chunk -> qr_norm_chunk) to break the implicit pypto AST
init_values chain that links same-named locals across sibling scopes.

Validation PASS on q / kv / qr / qr_scale across 3 runs on 3 different
devices.
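
A plain-Python sketch of why the Opt T fold described in the commit message can leave the quant scale unchanged: max is exact (no rounding), so the max of per-task partial amax values equals the amax of the full row regardless of how the row is partitioned. The row values, the Q_BLOCKS value, and the symmetric amax / 127 scale formula are assumptions for illustration; the kernel's actual scale formula is not shown in this PR.

```python
def amax(values):
    """Largest absolute value in a sequence."""
    return max(abs(v) for v in values)

def int8_scale(amax_value, qmax=127.0):
    # Assumed symmetric INT8 scale, for illustration only.
    return amax_value / qmax

# One token's normalised row, split across Q_BLOCKS tasks (made-up values).
Q_BLOCKS = 4
row = [0.25, -3.5, 1.0, 0.75, -0.125, 2.0, -6.25, 0.5]
span = len(row) // Q_BLOCKS

# Each norm-apply task records the amax of the slice it just produced.
partial_amax = [amax(row[b * span:(b + 1) * span]) for b in range(Q_BLOCKS)]

# The residual reduce only scans Q_BLOCKS values instead of the whole row.
folded = max(partial_amax)
reference = amax(row)
print(partial_amax, folded, int8_scale(folded) == int8_scale(reference))
```

The residual scope then reads Q_BLOCKS values per token rather than re-reading the full normalised row, which is consistent with the reported drop in that scope's time.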
@zhangqi-chen zhangqi-chen merged commit 825a785 into hw-native-sys:main May 21, 2026
7 checks passed