
Update DeepSeek V4 ratio-128 compressor to align with ratio-4 #270

Merged

zhangqi-chen merged 1 commit into hw-native-sys:main from bumble0918:feature/2026-05-13 on May 14, 2026

Conversation

bumble0918 (Contributor) commented May 13, 2026

#198

  • Replace half-vector RoPE with selector-based RoPE via even/odd select matrices (see the sketch after this list)
  • Switch to online softmax (m/l/o accumulator) for pooling
  • Merge kv and score projection into single fused block, transpose weight layout from [OUT_DIM, D] to [D, OUT_DIM]
  • Add BF16 intermediate precision in RMSNorm
  • Add runtime rotate scalar for conditional Hadamard
  • Add kv_cache output for compressed KV storage
  • Use bf16_allclose_or_ulp for kv_cache comparison
  • Pull model constants from config module

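A loose PyTorch illustration of the selector-based RoPE idea — a sketch under assumed shapes and an assumed interleaved even/odd pairing, not the kernel's actual code:

```python
import torch

ROPE_HEAD_DIM = 64  # assumed size, for illustration only

def make_selectors(d: int):
    # even_sel picks x[..., 0::2], odd_sel picks x[..., 1::2],
    # so the strided gathers become plain matmuls
    even = torch.zeros(d, d // 2)
    odd = torch.zeros(d, d // 2)
    for i in range(d // 2):
        even[2 * i, i] = 1.0
        odd[2 * i + 1, i] = 1.0
    return even, odd

def rope_selector(x, cos, sin, even_sel, odd_sel):
    # x: [..., d]; cos/sin: [..., d // 2], one angle per even/odd pair
    x_even = x @ even_sel
    x_odd = x @ odd_sel
    rot_even = x_even * cos - x_odd * sin
    rot_odd = x_even * sin + x_odd * cos
    # the transposed selectors re-interleave the two rotated halves
    return rot_even @ even_sel.T + rot_odd @ odd_sel.T

even_sel, odd_sel = make_selectors(ROPE_HEAD_DIM)
x = torch.randn(4, ROPE_HEAD_DIM)
theta = torch.rand(4, ROPE_HEAD_DIM // 2)
y = rope_selector(x, torch.cos(theta), torch.sin(theta), even_sel, odd_sel)
assert y.shape == x.shape
```

Expressing the gather as a matmul is what lets the select matrices ride the same matrix pipeline as the projections, instead of requiring a separate half-vector slicing step.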
coderabbitai (Bot) commented May 13, 2026


📝 Walkthrough

This PR rewrites the DeepSeek-V4 KV compressor kernel for the ratio=128 non-overlap decode path. The kernel transitions from a stateless single-output design to an incremental state-based architecture that conditionally compresses KV data using online softmax+pool, selector-based RoPE transformations, and indexed KV-cache writes. Test infrastructure and validation are fully updated to match.
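For readers unfamiliar with the m/l/o accumulator pattern named above, here is a minimal numerically stable sketch in PyTorch; the names, shapes, and the -inf initialization (cf. the FP32_NEG_INF constant below) are illustrative, not the kernel's API:

```python
import torch

def online_softmax_init(batch: int, dim: int):
    # m: running max, l: running sum of exponentials, o: running weighted sum
    m = torch.full((batch, 1), float("-inf"))
    l = torch.zeros(batch, 1)
    o = torch.zeros(batch, dim)
    return m, l, o

def online_softmax_step(m, l, o, score_t, v_t):
    # score_t: [batch, 1], v_t: [batch, dim]
    m_new = torch.maximum(m, score_t)
    scale = torch.exp(m - m_new)    # rescale old accumulators to the new max
    p = torch.exp(score_t - m_new)  # weight of the incoming element
    return m_new, l * scale + p, o * scale + p * v_t

# One pass over T timesteps reproduces softmax(scores) @ values:
scores, values = torch.randn(2, 8), torch.randn(2, 8, 4)
m, l, o = online_softmax_init(2, 4)
for t in range(8):
    m, l, o = online_softmax_step(m, l, o, scores[:, t : t + 1], values[:, t])
ref = torch.softmax(scores, dim=1).unsqueeze(1).matmul(values).squeeze(1)
assert torch.allclose(o / l, ref, atol=1e-5)
```

The payoff is that each timestep touches the accumulators exactly once, so the pool can run incrementally inside the decode loop instead of waiting for the full block of scores.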

Changes

DeepSeek-V4 compressor kernel rewrite

| Layer / File(s) | Summary |
| --- | --- |
| Kernel contract and constant definitions — models/deepseek/v4/compressor_ratio128.py (lines 11–48) | Module docstring reflects non-overlapping state layout and online softmax pooling. New constants MAX_SEQ_LEN, IDX_KV_LEN, FP32_NEG_INF introduced; OUT_CHUNK and derived block-count parameters updated for larger output tiling and KV-cache indexing support. |
| Core compressor kernel implementation — models/deepseek/v4/compressor_ratio128.py (lines 50–244) | Kernel API and behavior completely rewritten: always scatters the current timestep's projected kv and score into kv_state/score_state at slot start_pos % COMPRESS_RATIO; conditionally executes online softmax+pool, RMSNorm, selector-based RoPE (using even_select/odd_select matrices instead of cosine/sine), optional Hadamard rotation, and the kv_cache write only when (start_pos + 1) % COMPRESS_RATIO == 0. Returns the tuple (kv, kv_state, score_state, kv_cache), replacing the prior single-tensor output (see the control-flow sketch below). |
| Testing infrastructure and validation — models/deepseek/v4/compressor_ratio128.py (lines 247–423) | compressor_test wrapper updated to expose the new outputs and inputs; golden_compressor reference implementation now reflects incremental state scatter, conditional softmax+pool, and selector-based RoPE with KV-cache updates and early-return logic; build_tensor_specs adds selector matrix specs and a KV-cache output; the test runner imports bf16_allclose_or_ulp, tightens numeric tolerances, and adds a custom comparison function for kv_cache validation. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#264: Validates kv_cache with the new bf16_allclose_or_ulp() comparator in compressor validation, directly matching this PR's addition of that BF16 ULP-based validator for KV-cache checking (a simplified stand-in is sketched after this list).
  • hw-native-sys/pypto-lib#243: Targets the same DeepSeek-V4 ratio=128 non-overlap decode-compressor logic with incremental KV/score state and conditional softmax+pool flow alignment.
  • hw-native-sys/pypto-lib#260: Implements the same compressor architecture overhaul (incremental state, selector-based RoPE, KV-cache indexing) for the ratio=4 variant, establishing a consistent pattern across compressor implementations.
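The bf16_allclose_or_ulp comparator itself isn't shown on this page; a simplified stand-in conveying the idea (same-sign finite values only, and the threshold defaults are assumptions) could be:

```python
import torch

def bf16_allclose_or_ulp(actual, expected, atol=1e-3, max_ulp=1):
    # Pass when values are absolutely close OR within max_ulp BF16 steps
    a = actual.to(torch.bfloat16)
    e = expected.to(torch.bfloat16)
    close = torch.isclose(a.float(), e.float(), rtol=0.0, atol=atol)
    # BF16 bit patterns reinterpreted as integers are monotone within a
    # sign, so their difference counts units in the last place; sign-
    # boundary and NaN cases are ignored in this sketch
    ai = a.view(torch.int16).to(torch.int32)
    ei = e.view(torch.int16).to(torch.int32)
    within_ulp = (ai - ei).abs() <= max_ulp
    return bool((close | within_ulp).all())
```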

Poem

🐰 A kernel springs forth, state by state,
No fixed outputs here—compressed, not late,
Online softmax pools the slots so neat,
Selectors dance with RoPE's beat,
Cache writes bloom where tokens complete! 🌱

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.56%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly and concisely describes the main change: updating the DeepSeek V4 ratio-128 compressor to align with the ratio-4 architecture. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description directly references the key changes in the changeset, including selector-based RoPE, online softmax, the weight transpose, BF16 precision, the kv_cache output, and config updates. |


gemini-code-assist (Bot) left a comment

Code Review

This pull request refactors the DeepSeek-V4 KV Compressor (ratio=128 non-overlap) to implement online softmax pooling and selector-based RoPE. Key changes include the addition of KV cache support, Hadamard rotation logic, and updated RMSNorm with BF16 intermediates. The compressor function signature and the corresponding Torch reference were updated to accommodate these architectural changes. Feedback suggests using pl.parallel instead of pl.range for the Hadamard multiplication loop to enhance performance.
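For the RMSNorm change the summary mentions, a minimal sketch of the BF16-intermediate pattern — the exact cast point and names are assumptions, not the kernel's code:

```python
import torch

def rmsnorm_bf16_intermediate(x, weight, eps=1e-6):
    # Accumulate the mean-square statistic in FP32 for accuracy...
    x32 = x.float()
    inv_rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + eps)
    # ...then hold the normalized intermediate in BF16, matching the
    # "BF16 intermediate precision" described in the PR
    normed = (x32 * inv_rms).to(torch.bfloat16)
    return normed * weight.to(torch.bfloat16)
```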

Comment on lines +219 to +220

    if rotate:
        for o0 in pl.range(0, HEAD_DIM, OUT_CHUNK):

Severity: medium

The rotate branch for Hadamard multiplication uses pl.range, whereas the else branch uses pl.parallel. Since the iterations across o0 are independent and write to distinct slices of kv_flat, using pl.parallel in the rotate branch would improve performance by allowing the compiler to parallelize these operations across available compute units.

Suggested change

-    if rotate:
-        for o0 in pl.range(0, HEAD_DIM, OUT_CHUNK):
+    if rotate:
+        for o0 in pl.parallel(0, HEAD_DIM, OUT_CHUNK):
+            with pl.at(level=pl.Level.CORE_GROUP, name_hint="kv_hadamard"):

coderabbitai (Bot) left a comment

🧹 Nitpick comments (2)
models/deepseek/v4/compressor_ratio128.py (2)

358-367: 💤 Low value

Local M shadows the module-level M from config.

M is imported at the top of the file (from config import DEMO as M) and used as the model-config alias. The local M = torch.zeros(...) inside init_odd_select / init_even_select shadows it within these functions. Currently the bodies don't reference the global M, so this is benign, but the name reuse is a footgun if these initializers ever need to read a model constant. Renaming to mat (or sel) avoids any future surprise.

♻️ Suggested rename
     def init_odd_select():
-        M = torch.zeros((ROPE_HEAD_DIM, ROPE_HEAD_DIM // 2))
+        mat = torch.zeros((ROPE_HEAD_DIM, ROPE_HEAD_DIM // 2))
         for i in range(ROPE_HEAD_DIM // 2):
-            M[2*i+1, i] = 1
-        return M
+            mat[2*i+1, i] = 1
+        return mat
     def init_even_select():
-        M = torch.zeros((ROPE_HEAD_DIM, ROPE_HEAD_DIM // 2))
+        mat = torch.zeros((ROPE_HEAD_DIM, ROPE_HEAD_DIM // 2))
         for i in range(ROPE_HEAD_DIM // 2):
-            M[2*i, i] = 1
-        return M
+            mat[2*i, i] = 1
+        return mat
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/compressor_ratio128.py` around lines 358 - 367, The local
variable name M inside init_odd_select and init_even_select shadows the
module-level config alias M; rename the local tensor (e.g., to mat or sel) and
update all references inside those functions (the torch.zeros(...) assignment
and the indexed assignments like mat[2*i+1, i] = 1 / mat[2*i, i] = 1) so the
functions no longer declare a local M and cannot accidentally shadow the
imported config symbol.

219-230: 💤 Low value

Hoist the loop-invariant read; loop form inconsistency has no semantic impact today.

The rotate and no-rotate branches use different loop forms (pl.range vs pl.parallel), but no semantic difference exists between them in pypto.language today—both iterations serialize at the compiler level due to loop-carried assemble dependencies, regardless of annotation. The parallelization opportunity here is a known issue proposed for future optimization via disjointness analysis, not a current gap.

The more valuable change is hoisting kv_proj_tile = normed_kv[:, 0 : HEAD_DIM] out of the loop body, as it is loop-invariant and recomputed needlessly each iteration:

♻️ Suggested refactor
     if rotate:
+        kv_proj_tile = normed_kv[:, 0 : HEAD_DIM]
         for o0 in pl.range(0, HEAD_DIM, OUT_CHUNK):
             with pl.at(level=pl.Level.CORE_GROUP, name_hint="kv_hadamard"):
-                kv_proj_tile = normed_kv[:, 0 : HEAD_DIM]
                 hadamard_tile = hadamard[0 : HEAD_DIM, o0 : o0 + OUT_CHUNK]
                 kv_hadamard_acc = pl.matmul(kv_proj_tile, hadamard_tile, out_dtype=pl.FP32)
                 kv_flat = pl.assemble(kv_flat, kv_hadamard_acc, [0, o0])
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/compressor_ratio128.py` around lines 219 - 230, The rotate
branch recomputes the loop-invariant slice kv_proj_tile = normed_kv[:, 0 :
HEAD_DIM] on every iteration and uses a different loop form (pl.range) than the
non-rotate branch (pl.parallel); hoist the invariant slice out of the for-loop
in the rotate path so kv_proj_tile is computed once before the loop, keep the
existing pl.matmul into kv_hadamard_acc and pl.assemble into kv_flat inside the
loop, and ensure the assemble indices and types (OUT_CHUNK, HEAD_DIM, pl.FP32)
remain identical to preserve semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 15e1763a-f549-451c-838c-7bb6f721b13a

📥 Commits

Reviewing files that changed from the base of the PR and between 64668f8 and 4a83a8a.

📒 Files selected for processing (1)
  • models/deepseek/v4/compressor_ratio128.py

