Optimize DeepSeek V4 qkv_proj_rope decode (S=2): partial-sum reduces, amax fold, K-tile/stage tuning by wangqin1723-max · Pull Request #339 · hw-native-sys/pypto-lib

wangqin1723-max · 2026-05-20T09:41:41Z

Summary

Continues the grouped-chunking series (Opt J-P) with five new optimizations targeting residual serial AIV reduces and cube K-loop tuning. S=2 (T=128) wall-clock 3-run median 624us → 545us (−12.6%); cumulative vs pre-tuning baseline 1868us → 545us (−70.8%).

New optimizations (on top of the prior Opt A+B+E+G+J+K+L+M-revert+N+O+P endpoint)

Opt S — attn_norm_rms serial reduce → 2-way parallel partial-sum + final reduce. chunked_loop_optimizer is required on the partial scope to fit the 192KB Vec UB at S=2. PARTIALS=2 (not 4+) is intentional: it keeps the FP32 add deterministic, preserving q validation across devices.
Opt T — fold qr_quant_amax into qr_norm_apply. Each qr_norm_apply task additionally writes a per-task partial amax to qr_amax_partial[Q_BLOCKS, T]; the residual qr_quant_amax scope shrinks from 30us → 1.7us. Partial amax is computed on qr_normed_bf16 — bit-identical to the value the original scope would have re-read via GM — so the INT8 quant scale is unchanged (qr validation atol=1 holds).
Opt U — qr_rms 2-way parallel partial-sum (mirrors Opt S). qr_fp32 is already FP32 so the inner loop is cast-free; the chunked_loop_optimizer cost from Opt S doesn't apply.
Opt V — Q_PROJ_CHUNK 256 → 512 (K-tile). qproj_matmul per-task Exec 74us → 56us (−25%). N-tile (Q_PROJ_OUT_CHUNK) is intentionally left at 128 — the prior Opt B run on this kernel confirmed Q_PROJ_OUT_CHUNK=256 triggers ACL_ERROR_RT_AICORE_TIMEOUT (CANN template limit).
Opt X — qr_proj_matmul / kv_proj_matmul K-loop pl.pipeline(stage=2 → stage=4) per docs/performance-tuning.md Part 2 §1. Both have D_BLOCKS=32, enough iter count for 4-deep ping-pong. qproj_matmul skipped (post-Opt V it has Q_PROJ_BLOCKS=2, too few iters).
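The partial-sum pattern behind Opt S and Opt U can be sketched in plain Python. This is a minimal model under stated assumptions, not the pypto kernel: the helper names, `eps`, and the row contents are hypothetical, and Python floats are FP64 rather than FP32. It also illustrates one plausible reading of the PARTIALS=2 note: a two-operand floating-point add is commutative, while a reduce over three or more operands depends on grouping.

```python
import math

PARTIALS = 2  # the PR's stated choice; every other name here is illustrative


def inv_rms_serial(row, eps=1e-6):
    """Reference: one serial pass over the whole row."""
    total = 0.0
    for v in row:
        total += v * v
    return 1.0 / math.sqrt(total / len(row) + eps)


def inv_rms_partial(row, partials=PARTIALS, eps=1e-6):
    """Reduce `partials` equal slices independently, then do a short final reduce."""
    assert len(row) % partials == 0, "row length must be divisible by partials"
    span = len(row) // partials
    partial_sums = []
    for p in range(partials):  # each slice could be its own parallel task
        acc = 0.0
        for v in row[p * span:(p + 1) * span]:
            acc += v * v
        partial_sums.append(acc)
    total = 0.0
    for s in partial_sums:  # final reduce over only `partials` values
        total += s
    return 1.0 / math.sqrt(total / len(row) + eps)


row = [0.5 * ((i % 7) - 3) for i in range(64)]
serial = inv_rms_serial(row)
partial = inv_rms_partial(row)

# Two operands: the order in which the partials arrive cannot change the sum.
a, b, c = 0.1, 0.2, 0.3
two_way_is_order_free = (a + b) == (b + a)
# Three operands: the grouping changes the rounded result.
three_way_differs = ((a + b) + c) != (a + (b + c))
```

In this model the final reduce touches only `PARTIALS` values, so the serial tail shrinks from the full row to a single add.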

Incidental cleanup

Renamed Python locals to break pypto AST's implicit init_values chain across sibling scopes: d0 → rms_d0 / apply_d0 / qr_d0, qr_chunk → qr_norm_chunk. Without the rename, removing the original serial attn_norm_rms scope broke downstream scopes with SSA errors of the form "Variable 'd0_inlineNN' used outside its defining scope".
Trimmed comment block redundancy and historical tuning narrative from earlier commits.

Test plan

q / kv / qr / qr_scale validation PASS at documented tolerances
3-run wall-clock median on python models/deepseek/v4/qkv_proj_rope.py -p a2a3 --enable-l2-swimlane (S=2, T=128, FLASH config)
Re-validated across 3 different a2a3 devices (5, 6, 8)
CI: pre-commit + unit-tests + sim/a2a3 PR-changed-file runs

Restructure scopes to amortise per-task launch overhead and lift Vec/Cube utilisation on decode:

- Split fused attn_norm into serial RMS reduce + parallel ATTN_NORM_GROUP apply; drop the FP32 norm intermediate, write token_x_bf16 directly.
- Split qr_rms apply from RMS reduce and chunk by QR_NORM_GROUP.
- Decouple qr_quant into amax-reduce + parallel apply over Q_LORA chunks.
- Chunk qproj_matmul by Q_PROJ_GROUP and decouple qproj_dequant via a global INT32 col_acc_all staging buffer, letting dequant run at a larger Q_PROJ_DEQUANT_GROUP without slowing matmul.
- Split per-head fused RMS+NOPE+RoPE into q_head_rms_nope + q_head_rope so the RoPE scope stays within the 192KB Vec UB budget at T=128 (S=2).
- Pull q_rope_reassemble/q_rope_write out of the per-head loop and chunk by HEAD_GROUP via a [H*T, ROPE_DIM] pair staging buffer.
- Chunk kv_proj_matmul by KV_PROJ_GROUP.

Tuning constants: ROPE_CHUNK 32->64, Q_PROJ_CHUNK 128->256, QUANT_APPLY_CHUNK 256. Adds divisibility asserts for each new group.
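The matmul/dequant decoupling through a staging buffer can be modelled in plain Python. This is a rough sketch, not the kernel: the tiny shapes, the group sizes, and the per-token activation scale `x_scale` are assumptions made for illustration; only the idea of an INT32 `col_acc_all` buffer between the two stages comes from the commit message.

```python
T, K, N = 2, 4, 6      # hypothetical tiny shapes
MATMUL_GROUP = 2       # output columns per matmul task (illustrative)
DEQUANT_GROUP = 3      # output columns per dequant task (illustrative)
assert N % MATMUL_GROUP == 0 and N % DEQUANT_GROUP == 0

x_i8 = [[1, -2, 3, 4], [0, 5, -6, 7]]                             # [T, K]
w_i8 = [[(k + 1) * (n - 2) for n in range(N)] for k in range(K)]  # [K, N]
x_scale = [0.5, 0.25]                           # per-token scale (assumed)
w_scale = [0.125 * (n + 1) for n in range(N)]   # per-channel scale


def int_dot(t, n):
    """Integer dot product of activation row t with weight column n."""
    return sum(x_i8[t][k] * w_i8[k][n] for k in range(K))


# Stage 1: matmul tasks write raw integer accumulators to the staging buffer.
col_acc_all = [[0] * N for _ in range(T)]
for n0 in range(0, N, MATMUL_GROUP):
    for n in range(n0, n0 + MATMUL_GROUP):
        for t in range(T):
            col_acc_all[t][n] = int_dot(t, n)

# Stage 2: dequant tasks read the staging buffer at their own group size.
q_proj_fp32 = [[0.0] * N for _ in range(T)]
for n0 in range(0, N, DEQUANT_GROUP):
    for n in range(n0, n0 + DEQUANT_GROUP):
        for t in range(T):
            q_proj_fp32[t][n] = col_acc_all[t][n] * x_scale[t] * w_scale[n]

# Fused reference: dequantise immediately after each dot product.
fused = [[int_dot(t, n) * x_scale[t] * w_scale[n] for n in range(N)]
         for t in range(T)]
```

Because the staging buffer holds exact integers, the two-stage result matches the fused one, which is what lets the two group sizes be tuned independently.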

coderabbitai · 2026-05-20T09:41:56Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


Walkthrough

This PR refactors the DeepSeek-V4 fused single-token QKV attention pipeline by decoupling compute stages, updating tiling/grouping parameters, and reorganizing RMSNorm and RoPE into separate grouped passes with intermediate staging buffers. Quantization and Q-projection matmul/dequant are split into independent stages; per-head RoPE is reassembled via cross-head staging; and KV RoPE reassembly uses a full FP32 buffer.

Changes

DeepSeek-V4 QKV Projection and RoPE Pipeline Refactoring

Layer / File(s)	Summary
Tiling and Grouping Constants `models/deepseek/v4/qkv_proj_rope.py`	Chunk/group sizes and derived block counts updated; new grouping parameters (Q_PROJ_GROUP, Q_PROJ_DEQUANT_GROUP, QUANT_APPLY_CHUNK) and adjusted QUANT_* chunking/divisibility asserts.
Attention RMS denom reduction and grouped normalization `models/deepseek/v4/qkv_proj_rope.py`	Fused attn_norm replaced with FP32 partial RMS denominator reductions followed by grouped parallel normalization and FP32→BF16 cast into `token_x_bf16`.
QR compute and RMS reduction `models/deepseek/v4/qkv_proj_rope.py`	Computes `qr_fp32` via grouped Q-block matmuls with pipelined accumulation, then reduces QR RMS using partial-sum pattern to produce `qr_inv_rms`.
QR normalization, amax/scale, and INT8 apply `models/deepseek/v4/qkv_proj_rope.py`	Fuses normalization+BF16 cast while computing per-task amax partials, reduces to INT8 scales, and applies quantization in a parallel chunked stage to write `qr` (INT8).
Q-projection INT32 matmul and decoupled dequant `models/deepseek/v4/qkv_proj_rope.py`	Performs grouped W8A8C16 INT32 accumulations into `col_acc_all`, then a separate dequant stage (INT32→FP32 with per-channel scales) to emit `q_proj_fp32`.
Per-head RMSNorm, cross-head RoPE staging, and reassemble `models/deepseek/v4/qkv_proj_rope.py`	Splits RMSNorm/NOPE BF16 construction and inverse-RMS recording, computes RoPE into `q_rope_pair_stage` in HEAD_GROUP chunks, then reassembles via cube-style matmuls and writes final BF16 `q`.
KV RoPE full-FP32 reassembly and BF16 cast `models/deepseek/v4/qkv_proj_rope.py`	Reassembles KV RoPE into a full FP32 `kv_rope_full` buffer and casts from that buffer into the BF16 `kv` output slice.
Test harness arg reorder `models/deepseek/v4/qkv_proj_rope.py`	Reorders `run_jit` tolerance/compare/config arguments and moves a Q-projection quant round-off comment; comparison thresholds unchanged.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hw-native-sys/pypto-lib#332: Directly overlaps by refactoring the same qkv_proj_rope pipeline stages, intermediate buffers, and dequant/reassembly flow.
hw-native-sys/pypto-lib#234: Related to INT8 qr + qr_scale quant/dequant path adjustments and golden computation for Q-projection.
hw-native-sys/pypto-lib#298: Modifies codegen/pl.at annotations in the same kernel area and may interact with fused-stage scheduling metadata.

Suggested labels

enhancement

Poem

🐰 Pipelines hop in tidy rows,
Chunks and stages, where logic flows,
RoPE and norms split by design,
Buffers cross-head, bits align—
A rabbit cheers: new paths compose!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title accurately summarizes the main changes: optimization of DeepSeek V4 qkv_proj_rope decode through partial-sum reduces, amax folding, and K-tile/stage tuning, which aligns with the file changes and objectives.
Description check	✅ Passed	The pull request description clearly relates to the changeset, detailing specific optimizations (Opt S through X) and structural changes to the qkv_proj_rope decode pipeline with measurable performance improvements.


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@models/deepseek/v4/qkv_proj_rope.py`:
- Around line 49-52: Add direct guards that H is divisible by grouping factors:
insert assert H % HEAD_GROUP == 0 with a clear message near the existing
HEAD_BLOCKS check so the grouped q-RoPE loops (pl.parallel(0, H, HEAD_GROUP) /
pl.range(HEAD_GROUP)) cannot index past H; likewise add assert H % Q_PROJ_GROUP
== 0 next to the Q_PROJ_GROUP/Q_PROJ_OUT_CHUNK assertions (and repeat the same
explicit H % GROUP checks in the other similar block around lines 277-318) so
grouped loop stride invariants are enforced explicitly.
- Line 48: Add explicit divisibility guards for the new Q_LORA tiling
assumptions: assert Q_LORA % Q_PROJ_CHUNK == 0 (so Q_PROJ_BLOCKS = Q_LORA //
Q_PROJ_CHUNK won't drop a tail), assert Q_LORA % QUANT_CHUNK == 0 (so inner
quant tiles don't run past range), and assert (Q_LORA // QUANT_CHUNK) %
QUANT_APPLY_CHUNK == 0 (so each outer group contains a full QUANT_APPLY_CHUNK of
QUANT_CHUNK tiles). Place these grouped asserts near the top of the module
(around the existing QUANT_APPLY_CHUNK definition) and/or at the start of
qproj_matmul and qr_quant_apply to clearly document and enforce the contracts
referenced in lines ~48, 174-181 and 202-217.
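The Q_LORA contracts listed in this comment can be expressed as a small checker. This is a sketch only: the function name and the numeric values are hypothetical, and the conditions simply mirror the three asserts the reviewer proposes.

```python
def check_q_lora_tiling(q_lora, q_proj_chunk, quant_chunk, quant_apply_chunk):
    """Return the list of violated contracts; an empty list means the tiling is safe."""
    problems = []
    if q_lora % q_proj_chunk != 0:
        problems.append("Q_LORA % Q_PROJ_CHUNK != 0")
    if q_lora % quant_chunk != 0:
        problems.append("Q_LORA % QUANT_CHUNK != 0")
    elif (q_lora // quant_chunk) % quant_apply_chunk != 0:
        problems.append("(Q_LORA // QUANT_CHUNK) % QUANT_APPLY_CHUNK != 0")
    return problems


# Hypothetical configurations, chosen only to exercise the checks.
ok = check_q_lora_tiling(1024, 512, 64, 4)
bad = check_q_lora_tiling(1024, 384, 64, 4)

# What floor division silently drops in the bad configuration:
# Q_PROJ_BLOCKS = 1024 // 384 = 2 covers only 768 of 1024 columns.
dropped_tail = 1024 - (1024 // 384) * 384
```

Without the first guard, the bad configuration would run to completion and leave the last 256 columns of the K dimension out of the accumulation.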


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9d300a81-7e87-4c00-b76e-18e22fc6e355

📥 Commits

Reviewing files that changed from the base of the PR and between 058894f and e7af204.

📒 Files selected for processing (1)

models/deepseek/v4/qkv_proj_rope.py

coderabbitai · 2026-05-20T09:45:48Z

 assert (H * HEAD_DIM) % (HEAD_CHUNK * HEAD_GROUP) == 0, \
    "HEAD_BLOCKS must be divisible by HEAD_GROUP"
+assert ((H * HEAD_DIM) // Q_PROJ_OUT_CHUNK) % Q_PROJ_GROUP == 0, \
+    "Q_PROJ_HEAD_BLOCKS must be divisible by Q_PROJ_GROUP"


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Guard HEAD_GROUP against H directly.

The grouped q-RoPE loops step in head units (pl.parallel(0, H, HEAD_GROUP) + pl.range(HEAD_GROUP)), so the invariant they need is H % HEAD_GROUP == 0. The current HEAD_BLOCKS assert is only an indirect proxy and can pass for configs where the last h_inner still indexes past H.

Suggested guard addition

+assert H % HEAD_GROUP == 0, \
+    "H must be divisible by HEAD_GROUP"
 assert (H * HEAD_DIM) % (HEAD_CHUNK * HEAD_GROUP) == 0, \
     "HEAD_BLOCKS must be divisible by HEAD_GROUP"

Also applies to: 277-318
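A small counterexample shows why the existing assert is only a proxy. The constant values below are hypothetical and chosen only to trigger the gap; plain `range` stands in for `pl.parallel` and `pl.range`.

```python
# Hypothetical configuration: the proxy check passes, the direct check fails.
H, HEAD_DIM, HEAD_CHUNK, HEAD_GROUP = 3, 128, 64, 2

proxy_passes = (H * HEAD_DIM) % (HEAD_CHUNK * HEAD_GROUP) == 0  # 384 % 128
direct_passes = H % HEAD_GROUP == 0                             # 3 % 2

# Head indices visited by the grouped loop nest.
visited = [h0 + h_inner
           for h0 in range(0, H, HEAD_GROUP)
           for h_inner in range(HEAD_GROUP)]
out_of_range = [h for h in visited if h >= H]
```

Here the grouped loops visit head index 3 even though valid heads are 0 through 2, which is the out-of-bounds access the direct guard rules out.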


gemini-code-assist

Code Review

This pull request introduces significant performance optimizations to the qkv_proj_rope module by increasing chunk sizes and implementing task-grouping for normalization, projection, and quantization stages. Key changes include the decoupling of query projection dequantization into a separate parallel pass and the use of staging buffers to manage memory constraints and reduce launch overhead. Review feedback suggests further refining the placement of tensor initializations within loops to prevent potential compiler-induced state dependencies and simplifying redundant loops where chunk sizes now match full dimensions.

gemini-code-assist · 2026-05-20T09:47:13Z

-
+            # Pre-declared outside pl.range to satisfy pypto's loop-carried init_values
+            # threading. The dummy value is overwritten by pl.matmul on the first iter.
+            col_acc = pl.create_tensor([T, Q_PROJ_OUT_CHUNK], dtype=pl.INT32)


The col_acc tensor is created outside the h_inner loop but inside the pl.parallel scope. While the logic correctly resets col_acc on the first iteration of the qb loop (when qr_proj_col0 == 0), reusing the same tensor object across serial iterations of h_inner can sometimes lead to unexpected loop-carried state dependencies in the PyPTO compiler if not explicitly handled. Since this is intended as a per-head-block scratch buffer, consider moving the creation inside the h_inner loop or ensuring the compiler correctly hoists it as a task-local scratch.
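The overwrite-then-accumulate pattern under discussion can be modelled numerically in plain Python. This sketch says nothing about how the PyPTO compiler analyses loop-carried state; it only shows that, as long as the first K-iteration overwrites the scratch, declaring it outside or inside the `h_inner` loop yields the same values. The sizes and contributions are hypothetical.

```python
K_BLOCKS = 4
HEAD_GROUP = 3
# Hypothetical per-(head block, k block) tile results.
contrib = [[(h + 1) * (k + 2) for k in range(K_BLOCKS)]
           for h in range(HEAD_GROUP)]


def scratch_outside():
    """Scratch declared once, before the h_inner loop."""
    out = []
    acc = -999  # dummy value, overwritten on the first K iteration
    for h_inner in range(HEAD_GROUP):
        for qb in range(K_BLOCKS):
            if qb == 0:
                acc = contrib[h_inner][qb]        # overwrite, like pl.matmul
            else:
                acc = acc + contrib[h_inner][qb]  # accumulate, like pl.matmul_acc
        out.append(acc)
    return out


def scratch_inside():
    """Scratch declared fresh for every head block."""
    out = []
    for h_inner in range(HEAD_GROUP):
        acc = -999
        for qb in range(K_BLOCKS):
            if qb == 0:
                acc = contrib[h_inner][qb]
            else:
                acc = acc + contrib[h_inner][qb]
        out.append(acc)
    return out


outside = scratch_outside()
inside = scratch_inside()
```

The two variants agree numerically, so the choice between them is about what dependency the compiler infers, not about the computed result.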

gemini-code-assist · 2026-05-20T09:47:13Z

-                wkv_chunk = wkv[d0 : d0 + D_CHUNK, kv_col0 : kv_col0 + KV_CHUNK]
-                kv_acc = pl.matmul_acc(kv_acc, kv_x_chunk_bf16, wkv_chunk)
-            kv_fp32[:, kv_col0 : kv_col0 + KV_CHUNK] = kv_acc
+            kv_acc = pl.create_tensor([T, KV_CHUNK], dtype=pl.FP32)


Similar to the qproj_matmul scope, kv_acc is created outside the k_inner loop. Although it is correctly initialized via pl.matmul when d0 == 0, reusing the tensor across serial iterations of k_inner might trigger loop-carried state analysis. Moving the creation inside the k_inner loop would make the task-local nature of the buffer more explicit to the optimizer.

gemini-code-assist · 2026-05-20T09:47:13Z

+    kv_rope_full = pl.create_tensor([T, ROPE_DIM], dtype=pl.FP32)
    with pl.at(level=pl.Level.CORE_GROUP, name_hint="kv_rope_reassemble"):
-        for rope_col in pl.pipeline(0, ROPE_DIM, ROPE_CHUNK, stage=2):
+        for rope_col in pl.range(0, ROPE_DIM, ROPE_CHUNK):


The kv_rope_reassemble scope still uses a pl.range loop over ROPE_CHUNK, even though ROPE_CHUNK is now equal to ROPE_DIM (64). This makes the loop run only once. For consistency with the optimized q_rope_reassemble (lines 299-313), which processes the full ROPE_DIM at once, you could remove this loop and the associated slicing to simplify the code.
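The single-iteration claim is easy to check with plain `range`, used here as a stand-in for `pl.range`. The row contents are hypothetical; the two constants follow the values quoted in this comment.

```python
ROPE_DIM = 64
ROPE_CHUNK = 64  # equal to ROPE_DIM, per the comment above

starts = list(range(0, ROPE_DIM, ROPE_CHUNK))

row = [float(i) for i in range(ROPE_DIM)]  # hypothetical data

# Loop form: slice per chunk.
looped = []
for rope_col in range(0, ROPE_DIM, ROPE_CHUNK):
    looped.extend(row[rope_col:rope_col + ROPE_CHUNK])

# Simplified form: process the full ROPE_DIM at once.
direct = row[:]
```

With the chunk equal to the full dimension, the loop body runs once at offset 0 and the slice covers the whole row.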

…es, amax fold, K-tile/stage tuning

S=2 (T=128) wall-clock 3-run median: 624us -> 545us (-12.6%). Cumulative since pre-tuning baseline: 1868us -> 545us (-70.8%).

Five accepted optimizations on top of the existing grouped-chunking endpoint:

- Opt S: attn_norm_rms serial -> 2-way parallel partial sum + final reduce (chunked_loop_optimizer required on the partial scope to fit Vec UB; PARTIALS=2 keeps FP32 add deterministic across devices)
- Opt T: fold qr_quant_amax into qr_norm_apply via a per-task partial-amax buffer; the residual qr_quant_amax scope shrinks 30us -> 1.7us. Partial amax is computed on qr_normed_bf16 (bit-identical to the original GM round-trip) so the INT8 quant scale is unchanged
- Opt U: qr_rms 2-way parallel partial sum (mirrors Opt S; qr_fp32 is already FP32 so no chunked_loop_optimizer cast-split cost)
- Opt V: Q_PROJ_CHUNK 256 -> 512 (K-tile only; N-tile already known to trigger ACL_ERROR_RT_AICORE_TIMEOUT per prior Opt B). qproj_matmul per-task Exec 74us -> 56us (-25%)
- Opt X: qr_proj_matmul / kv_proj_matmul K-loop pl.pipeline stage 2 -> 4 (D_BLOCKS=32 has enough iters for 4-deep ping-pong)

Side effects from renaming a few Python local names (d0 -> rms_d0 / apply_d0 / qr_d0, qr_chunk -> qr_norm_chunk) to break the implicit pypto AST init_values chain that links same-named locals across sibling scopes.

Validation PASS on q / kv / qr / qr_scale across 3 runs on 3 different devices.
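Opt T's amax fold works because max is an exact, order-independent reduction: the maximum of per-block maxima equals the maximum over the whole row. A minimal plain-Python sketch follows. The shapes, the data, and the symmetric `amax / 127` scale formula are assumptions for illustration; only the `[Q_BLOCKS, T]` layout of `qr_amax_partial` comes from the PR description.

```python
Q_BLOCKS, T, BLOCK = 4, 2, 8  # hypothetical sizes
qr_normed = [[((t * 31 + c * 17) % 23 - 11) / 4.0
              for c in range(Q_BLOCKS * BLOCK)]
             for t in range(T)]

# Folded stage: each normalisation task also records the amax of its own block.
qr_amax_partial = [[0.0] * T for _ in range(Q_BLOCKS)]
for qb in range(Q_BLOCKS):
    for t in range(T):
        block = qr_normed[t][qb * BLOCK:(qb + 1) * BLOCK]
        qr_amax_partial[qb][t] = max(abs(v) for v in block)

# Residual stage: reduce [Q_BLOCKS, T] down to [T].
amax_folded = [max(qr_amax_partial[qb][t] for qb in range(Q_BLOCKS))
               for t in range(T)]

# Reference: a separate pass that re-reads every normalised value.
amax_reference = [max(abs(v) for v in qr_normed[t]) for t in range(T)]

# Assumed symmetric INT8 scale; the kernel's actual formula is not shown in the PR.
scale = [a / 127.0 for a in amax_folded]
```

The residual reduce reads only `Q_BLOCKS * T` values instead of the full normalised tensor, which is consistent with the reported shrink of that scope.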

coderabbitai Bot reviewed May 20, 2026

gemini-code-assist Bot reviewed May 20, 2026

wangqin1723-max changed the title ~~Optimize DeepSeek V4 qkv_proj_rope decode via grouped chunking~~ Optimize DeepSeek V4 qkv_proj_rope decode (S=2): partial-sum reduces, amax fold, K-tile/stage tuning May 21, 2026

wangqin1723-max force-pushed the perf/dsv4-qkv-proj-rope-grouped-chunking branch from bf43f45 to 6cfd357 Compare May 21, 2026 08:23

zhangqi-chen merged commit 825a785 into hw-native-sys:main May 21, 2026
7 checks passed

zhangqi-chen merged 2 commits into hw-native-sys:main from wangqin1723-max:perf/dsv4-qkv-proj-rope-grouped-chunking
