
Update DeepSeek V4 ratio-128 compressor to align with ratio-4 #270

Merged

zhangqi-chen merged 1 commit into hw-native-sys:main from bumble0918:feature/2026-05-13 on May 14, 2026

Conversation

bumble0918 (Contributor) commented May 13, 2026

#198

  • Replace half-vector RoPE with selector-based RoPE via even/odd select matrices (see the sketch after this list)
  • Switch to online softmax (m/l/o accumulator) for pooling
  • Merge kv and score projection into single fused block, transpose weight layout from [OUT_DIM, D] to [D, OUT_DIM]
  • Add BF16 intermediate precision in RMSNorm
  • Add runtime rotate scalar for conditional Hadamard
  • Add kv_cache output for compressed KV storage
  • Use bf16_allclose_or_ulp for kv_cache comparison
  • Pull model constants from config module

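A loose PyTorch illustration of the selector-based RoPE idea — a sketch under assumed shapes and an assumed interleaved even/odd pairing, not the kernel's actual code:

```python
import torch

ROPE_HEAD_DIM = 64  # assumed size, for illustration only

def make_selectors(d: int):
    # even_sel picks x[..., 0::2], odd_sel picks x[..., 1::2],
    # so the strided gathers become plain matmuls
    even = torch.zeros(d, d // 2)
    odd = torch.zeros(d, d // 2)
    for i in range(d // 2):
        even[2 * i, i] = 1.0
        odd[2 * i + 1, i] = 1.0
    return even, odd

def rope_selector(x, cos, sin, even_sel, odd_sel):
    # x: [..., d]; cos/sin: [..., d // 2], one angle per even/odd pair
    x_even = x @ even_sel
    x_odd = x @ odd_sel
    rot_even = x_even * cos - x_odd * sin
    rot_odd = x_even * sin + x_odd * cos
    # the transposed selectors re-interleave the two rotated halves
    return rot_even @ even_sel.T + rot_odd @ odd_sel.T

even_sel, odd_sel = make_selectors(ROPE_HEAD_DIM)
x = torch.randn(4, ROPE_HEAD_DIM)
theta = torch.rand(4, ROPE_HEAD_DIM // 2)
y = rope_selector(x, torch.cos(theta), torch.sin(theta), even_sel, odd_sel)
assert y.shape == x.shape
```

Expressing the gather as a matmul is what lets the select matrices ride the same matrix pipeline as the projections, instead of requiring a separate half-vector slicing step.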
coderabbitai (Bot) commented May 13, 2026


📝 Walkthrough

This PR rewrites the DeepSeek-V4 KV compressor kernel for the ratio=128 non-overlap decode path. The kernel transitions from a stateless single-output design to an incremental state-based architecture that conditionally compresses KV data using online softmax+pool, selector-based RoPE transformations, and indexed KV-cache writes. Test infrastructure and validation are fully updated to match.
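For readers unfamiliar with the m/l/o accumulator pattern named above, here is a minimal numerically stable sketch in PyTorch; the names, shapes, and the -inf initialization (cf. the FP32_NEG_INF constant below) are illustrative, not the kernel's API:

```python
import torch

def online_softmax_init(batch: int, dim: int):
    # m: running max, l: running sum of exponentials, o: running weighted sum
    m = torch.full((batch, 1), float("-inf"))
    l = torch.zeros(batch, 1)
    o = torch.zeros(batch, dim)
    return m, l, o

def online_softmax_step(m, l, o, score_t, v_t):
    # score_t: [batch, 1], v_t: [batch, dim]
    m_new = torch.maximum(m, score_t)
    scale = torch.exp(m - m_new)    # rescale old accumulators to the new max
    p = torch.exp(score_t - m_new)  # weight of the incoming element
    return m_new, l * scale + p, o * scale + p * v_t

# One pass over T timesteps reproduces softmax(scores) @ values:
scores, values = torch.randn(2, 8), torch.randn(2, 8, 4)
m, l, o = online_softmax_init(2, 4)
for t in range(8):
    m, l, o = online_softmax_step(m, l, o, scores[:, t : t + 1], values[:, t])
ref = torch.softmax(scores, dim=1).unsqueeze(1).matmul(values).squeeze(1)
assert torch.allclose(o / l, ref, atol=1e-5)
```

The payoff is that each timestep touches the accumulators exactly once, so the pool can run incrementally inside the decode loop instead of waiting for the full block of scores.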

Changes

DeepSeek-V4 compressor kernel rewrite

| Layer / File(s) | Summary |
| --- | --- |
| Kernel contract and constant definitions — models/deepseek/v4/compressor_ratio128.py (lines 11–48) | Module docstring reflects non-overlapping state layout and online softmax pooling. New constants MAX_SEQ_LEN, IDX_KV_LEN, FP32_NEG_INF introduced; OUT_CHUNK and derived block-count parameters updated for larger output tiling and KV-cache indexing support. |
| Core compressor kernel implementation — models/deepseek/v4/compressor_ratio128.py (lines 50–244) | Kernel API and behavior completely rewritten: always scatters the current timestep's projected kv and score into kv_state/score_state at slot start_pos % COMPRESS_RATIO; conditionally executes online softmax+pool, RMSNorm, selector-based RoPE (using even_select/odd_select matrices instead of cosine/sine), optional Hadamard rotation, and the kv_cache write only when (start_pos + 1) % COMPRESS_RATIO == 0. Returns the tuple (kv, kv_state, score_state, kv_cache), replacing the prior single-tensor output (see the control-flow sketch below). |
| Testing infrastructure and validation — models/deepseek/v4/compressor_ratio128.py (lines 247–423) | compressor_test wrapper updated to expose the new outputs and inputs; golden_compressor reference implementation now reflects incremental state scatter, conditional softmax+pool, and selector-based RoPE with KV-cache updates and early-return logic; build_tensor_specs adds selector matrix specs and a KV-cache output; the test runner imports bf16_allclose_or_ulp, tightens numeric tolerances, and adds a custom comparison function for kv_cache validation. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#264: Validates kv_cache with the new bf16_allclose_or_ulp() comparator in compressor validation, directly matching this PR's addition of that BF16 ULP-based validator for KV-cache checking (a simplified stand-in is sketched after this list).
  • hw-native-sys/pypto-lib#243: Targets the same DeepSeek-V4 ratio=128 non-overlap decode-compressor logic with incremental KV/score state and conditional softmax+pool flow alignment.
  • hw-native-sys/pypto-lib#260: Implements the same compressor architecture overhaul (incremental state, selector-based RoPE, KV-cache indexing) for the ratio=4 variant, establishing a consistent pattern across compressor implementations.
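The bf16_allclose_or_ulp comparator itself isn't shown on this page; a simplified stand-in conveying the idea (same-sign finite values only, and the threshold defaults are assumptions) could be:

```python
import torch

def bf16_allclose_or_ulp(actual, expected, atol=1e-3, max_ulp=1):
    # Pass when values are absolutely close OR within max_ulp BF16 steps
    a = actual.to(torch.bfloat16)
    e = expected.to(torch.bfloat16)
    close = torch.isclose(a.float(), e.float(), rtol=0.0, atol=atol)
    # BF16 bit patterns reinterpreted as integers are monotone within a
    # sign, so their difference counts units in the last place; sign-
    # boundary and NaN cases are ignored in this sketch
    ai = a.view(torch.int16).to(torch.int32)
    ei = e.view(torch.int16).to(torch.int32)
    within_ulp = (ai - ei).abs() <= max_ulp
    return bool((close | within_ulp).all())
```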

Poem

🐰 A kernel springs forth, state by state,
No fixed outputs here—compressed, not late,
Online softmax pools the slots so neat,
Selectors dance with RoPE's beat,
Cache writes bloom where tokens complete! 🌱

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.56%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly and concisely describes the main change: updating the DeepSeek V4 ratio-128 compressor to align with the ratio-4 architecture. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description directly references the key changes in the changeset, including selector-based RoPE, online softmax, the weight transpose, BF16 precision, the kv_cache output, and config updates. |


gemini-code-assist (Bot) left a comment

Code Review

This pull request refactors the DeepSeek-V4 KV Compressor (ratio=128 non-overlap) to implement online softmax pooling and selector-based RoPE. Key changes include the addition of KV cache support, Hadamard rotation logic, and updated RMSNorm with BF16 intermediates. The compressor function signature and the corresponding Torch reference were updated to accommodate these architectural changes. Feedback suggests using pl.parallel instead of pl.range for the Hadamard multiplication loop to enhance performance.
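For the RMSNorm change the summary mentions, a minimal sketch of the BF16-intermediate pattern — the exact cast point and names are assumptions, not the kernel's code:

```python
import torch

def rmsnorm_bf16_intermediate(x, weight, eps=1e-6):
    # Accumulate the mean-square statistic in FP32 for accuracy...
    x32 = x.float()
    inv_rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + eps)
    # ...then hold the normalized intermediate in BF16, matching the
    # "BF16 intermediate precision" described in the PR
    normed = (x32 * inv_rms).to(torch.bfloat16)
    return normed * weight.to(torch.bfloat16)
```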

Comment on lines +219 to +220

    if rotate:
        for o0 in pl.range(0, HEAD_DIM, OUT_CHUNK):

Severity: medium

The rotate branch for Hadamard multiplication uses pl.range, whereas the else branch uses pl.parallel. Since the iterations across o0 are independent and write to distinct slices of kv_flat, using pl.parallel in the rotate branch would improve performance by allowing the compiler to parallelize these operations across available compute units.

Suggested change

-    if rotate:
-        for o0 in pl.range(0, HEAD_DIM, OUT_CHUNK):
+    if rotate:
+        for o0 in pl.parallel(0, HEAD_DIM, OUT_CHUNK):
+            with pl.at(level=pl.Level.CORE_GROUP, name_hint="kv_hadamard"):

coderabbitai (Bot) left a comment

🧹 Nitpick comments (2)
models/deepseek/v4/compressor_ratio128.py (2)

358-367: 💤 Low value

Local M shadows the module-level M from config.

M is imported at the top of the file (from config import DEMO as M) and used as the model-config alias. The local M = torch.zeros(...) inside init_odd_select / init_even_select shadows it within these functions. Currently the bodies don't reference the global M, so this is benign, but the name reuse is a footgun if these initializers ever need to read a model constant. Renaming to mat (or sel) avoids any future surprise.

♻️ Suggested rename
     def init_odd_select():
-        M = torch.zeros((ROPE_HEAD_DIM, ROPE_HEAD_DIM // 2))
+        mat = torch.zeros((ROPE_HEAD_DIM, ROPE_HEAD_DIM // 2))
         for i in range(ROPE_HEAD_DIM // 2):
-            M[2*i+1, i] = 1
-        return M
+            mat[2*i+1, i] = 1
+        return mat
     def init_even_select():
-        M = torch.zeros((ROPE_HEAD_DIM, ROPE_HEAD_DIM // 2))
+        mat = torch.zeros((ROPE_HEAD_DIM, ROPE_HEAD_DIM // 2))
         for i in range(ROPE_HEAD_DIM // 2):
-            M[2*i, i] = 1
-        return M
+            mat[2*i, i] = 1
+        return mat
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/compressor_ratio128.py` around lines 358 - 367, The local
variable name M inside init_odd_select and init_even_select shadows the
module-level config alias M; rename the local tensor (e.g., to mat or sel) and
update all references inside those functions (the torch.zeros(...) assignment
and the indexed assignments like mat[2*i+1, i] = 1 / mat[2*i, i] = 1) so the
functions no longer declare a local M and cannot accidentally shadow the
imported config symbol.

219-230: 💤 Low value

Hoist the loop-invariant read; loop form inconsistency has no semantic impact today.

The rotate and no-rotate branches use different loop forms (pl.range vs pl.parallel), but no semantic difference exists between them in pypto.language today—both iterations serialize at the compiler level due to loop-carried assemble dependencies, regardless of annotation. The parallelization opportunity here is a known issue proposed for future optimization via disjointness analysis, not a current gap.

The more valuable change is hoisting kv_proj_tile = normed_kv[:, 0 : HEAD_DIM] out of the loop body, as it is loop-invariant and recomputed needlessly each iteration:

♻️ Suggested refactor
     if rotate:
+        kv_proj_tile = normed_kv[:, 0 : HEAD_DIM]
         for o0 in pl.range(0, HEAD_DIM, OUT_CHUNK):
             with pl.at(level=pl.Level.CORE_GROUP, name_hint="kv_hadamard"):
-                kv_proj_tile = normed_kv[:, 0 : HEAD_DIM]
                 hadamard_tile = hadamard[0 : HEAD_DIM, o0 : o0 + OUT_CHUNK]
                 kv_hadamard_acc = pl.matmul(kv_proj_tile, hadamard_tile, out_dtype=pl.FP32)
                 kv_flat = pl.assemble(kv_flat, kv_hadamard_acc, [0, o0])
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/compressor_ratio128.py` around lines 219 - 230, The rotate
branch recomputes the loop-invariant slice kv_proj_tile = normed_kv[:, 0 :
HEAD_DIM] on every iteration and uses a different loop form (pl.range) than the
non-rotate branch (pl.parallel); hoist the invariant slice out of the for-loop
in the rotate path so kv_proj_tile is computed once before the loop, keep the
existing pl.matmul into kv_hadamard_acc and pl.assemble into kv_flat inside the
loop, and ensure the assemble indices and types (OUT_CHUNK, HEAD_DIM, pl.FP32)
remain identical to preserve semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 15e1763a-f549-451c-838c-7bb6f721b13a

📥 Commits

Reviewing files that changed from the base of the PR and between 64668f8 and 4a83a8a.

📒 Files selected for processing (1)
  • models/deepseek/v4/compressor_ratio128.py

