Update DeepSeek V4 ratio-128 compressor to align with ratio-4 #270
Conversation
- Replace half-vector RoPE with selector-based RoPE via even/odd select matrices
- Switch to online softmax (m/l/o accumulator) for pool (see the sketch below)
- Merge kv and score projection into a single fused block; transpose weight layout from [OUT_DIM, D] to [D, OUT_DIM]
- Add BF16 intermediate precision in RMSNorm
- Add runtime rotate scalar for conditional Hadamard
- Add kv_cache output for compressed KV storage
- Use bf16_allclose_or_ulp for kv_cache comparison
- Pull model constants from config module
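For reviewers unfamiliar with the m/l/o formulation mentioned above, here is a minimal torch sketch of an online-softmax pool; the chunk shapes and names are illustrative, not the kernel's identifiers:

```python
import torch

def online_softmax_pool(score_chunks, value_chunks):
    """One-pass softmax-weighted pool over streamed chunks.

    m: running max, l: running normalizer sum, o: running weighted
    value sum. Mathematically equal to softmax(scores) @ values.
    """
    m = l = o = None
    for s, v in zip(score_chunks, value_chunks):
        # s: [rows, n_chunk], v: [n_chunk, d]
        chunk_max = s.max(dim=-1, keepdim=True).values
        if m is None:
            m = chunk_max
            p = torch.exp(s - m)
            l, o = p.sum(dim=-1, keepdim=True), p @ v
        else:
            new_m = torch.maximum(m, chunk_max)
            scale = torch.exp(m - new_m)   # rescales stale accumulators
            p = torch.exp(s - new_m)
            l = l * scale + p.sum(dim=-1, keepdim=True)
            o = o * scale + p @ v
            m = new_m
    return o / l
```

Each incoming chunk only rescales the running accumulators by exp(m - new_m), so the pool never materializes the full score row; the final o / l matches a two-pass softmax pool up to rounding.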
📝 Walkthrough

This PR rewrites the DeepSeek-V4 KV compressor kernel for the ratio=128 non-overlap decode path. The kernel transitions from a stateless single-output design to an incremental state-based architecture that conditionally compresses KV data using online softmax+pool, selector-based RoPE transformations, and indexed KV-cache writes. Test infrastructure and validation are fully updated to match.

Changes

- DeepSeek-V4 compressor kernel rewrite
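The fused kv/score projection described above can be pictured with a small torch sketch. The dimensions and weight names below (`D`, `KV_DIM`, `SCORE_DIM`, `w_kv`, `w_score`) are placeholders, not the kernel's actual constants; the point is why a [D, OUT_DIM] layout lets a single matmul feed both outputs:

```python
import torch

D, KV_DIM, SCORE_DIM = 512, 576, 1   # hypothetical sizes

# Original layout: separate [OUT_DIM, D] weights applied as x @ W.T
w_kv = torch.randn(KV_DIM, D)
w_score = torch.randn(SCORE_DIM, D)

# Fused layout: transpose each weight to [D, OUT_DIM] and concatenate
# along the output axis, so one x @ w_fused yields both projections.
w_fused = torch.cat([w_kv.t(), w_score.t()], dim=1)  # [D, KV_DIM + SCORE_DIM]

x = torch.randn(8, D)
fused_out = x @ w_fused
kv, score = fused_out.split([KV_DIM, SCORE_DIM], dim=-1)

assert torch.allclose(kv, x @ w_kv.t(), atol=1e-5)
assert torch.allclose(score, x @ w_score.t(), atol=1e-5)
```

Storing the weights pre-transposed as [D, OUT_DIM] also means the matmul consumes them directly, with no transpose needed at kernel time.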
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs: #198
Code Review
This pull request refactors the DeepSeek-V4 KV Compressor (ratio=128 non-overlap) to implement online softmax pooling and selector-based RoPE. Key changes include the addition of KV-cache support, Hadamard rotation logic, and updated RMSNorm with BF16 intermediates. The compressor function signature and the corresponding Torch reference were updated to accommodate these architectural changes. Feedback suggests using `pl.parallel` instead of `pl.range` for the Hadamard multiplication loop to improve performance.
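As a reference point for the RMSNorm change, a minimal torch sketch follows. Where exactly the kernel casts to BF16 is an assumption here (after normalization, before the weight multiply); treat it as illustrative rather than the kernel's exact recipe:

```python
import torch

def rms_norm_bf16_intermediate(x: torch.Tensor, weight: torch.Tensor,
                               eps: float = 1e-6) -> torch.Tensor:
    # Mean-square and rsqrt stay in FP32 for numerical stability.
    ms = x.float().pow(2).mean(dim=-1, keepdim=True)
    normed = x.float() * torch.rsqrt(ms + eps)
    # Hypothetical BF16 intermediate: round the normalized activations
    # to BF16 before the weight multiply, mirroring a kernel that keeps
    # the intermediate tile in BF16.
    return normed.to(torch.bfloat16) * weight.to(torch.bfloat16)
```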
```python
if rotate:
    for o0 in pl.range(0, HEAD_DIM, OUT_CHUNK):
```
The rotate branch for Hadamard multiplication uses `pl.range`, whereas the else branch uses `pl.parallel`. Since the iterations across `o0` are independent and write to distinct slices of `kv_flat`, using `pl.parallel` in the rotate branch would improve performance by allowing the compiler to parallelize these operations across available compute units.
```diff
 if rotate:
-    for o0 in pl.range(0, HEAD_DIM, OUT_CHUNK):
+    for o0 in pl.parallel(0, HEAD_DIM, OUT_CHUNK):
         with pl.at(level=pl.Level.CORE_GROUP, name_hint="kv_hadamard"):
```
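For context on what this loop computes: the rotate path multiplies the flattened KV tile by a Hadamard matrix in OUT_CHUNK-wide column tiles. Below is a dense torch equivalent using a Sylvester-constructed matrix; the construction is an assumption, since how the kernel initializes its `hadamard` operand is not shown in this hunk:

```python
import torch

def sylvester_hadamard(n: int) -> torch.Tensor:
    """Hadamard matrix of size n (a power of two), scaled so the
    rotation is orthonormal."""
    h = torch.ones(1, 1)
    while h.shape[0] < n:
        h = torch.cat([torch.cat([h, h], dim=1),
                       torch.cat([h, -h], dim=1)], dim=0)
    return h / (n ** 0.5)

HEAD_DIM = 128  # placeholder size
hadamard = sylvester_hadamard(HEAD_DIM)

def maybe_rotate(kv_flat: torch.Tensor, rotate: bool) -> torch.Tensor:
    # Dense equivalent of the tiled kernel loop: one matmul over the
    # full [HEAD_DIM, HEAD_DIM] matrix instead of OUT_CHUNK column tiles.
    return kv_flat @ hadamard if rotate else kv_flat
```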
🧹 Nitpick comments (2)
models/deepseek/v4/compressor_ratio128.py (2)
358-367: 💤 Low value: Local `M` shadows the module-level `M` from `config`.

`M` is imported at the top of the file (`from config import DEMO as M`) and used as the model-config alias. The local `M = torch.zeros(...)` inside `init_odd_select`/`init_even_select` shadows it within these functions. Currently the bodies don't reference the global `M`, so this is benign, but the name reuse is a footgun if these initializers ever need to read a model constant. Renaming to `mat` (or `sel`) avoids any future surprise.

♻️ Suggested rename
```diff
 def init_odd_select():
-    M = torch.zeros((ROPE_HEAD_DIM, ROPE_HEAD_DIM // 2))
+    mat = torch.zeros((ROPE_HEAD_DIM, ROPE_HEAD_DIM // 2))
     for i in range(ROPE_HEAD_DIM // 2):
-        M[2*i+1, i] = 1
-    return M
+        mat[2*i+1, i] = 1
+    return mat

 def init_even_select():
-    M = torch.zeros((ROPE_HEAD_DIM, ROPE_HEAD_DIM // 2))
+    mat = torch.zeros((ROPE_HEAD_DIM, ROPE_HEAD_DIM // 2))
     for i in range(ROPE_HEAD_DIM // 2):
-        M[2*i, i] = 1
-    return M
+        mat[2*i, i] = 1
+    return mat
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/deepseek/v4/compressor_ratio128.py` around lines 358 - 367, The local variable name M inside init_odd_select and init_even_select shadows the module-level config alias M; rename the local tensor (e.g., to mat or sel) and update all references inside those functions (the torch.zeros(...) assignment and the indexed assignments like mat[2*i+1, i] = 1 / mat[2*i, i] = 1) so the functions no longer declare a local M and cannot accidentally shadow the imported config symbol.
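For context, the selectors produced by these initializers let the interleaved RoPE rotation be expressed purely as matmuls, which is presumably why the kernel favors them over a half-vector split. A hedged torch sketch; `cos`/`sin` are assumed per-position half-dim tables, and the kernel's exact broadcasting may differ:

```python
import torch

ROPE_HEAD_DIM = 64  # placeholder

def init_select(offset: int) -> torch.Tensor:
    sel = torch.zeros(ROPE_HEAD_DIM, ROPE_HEAD_DIM // 2)
    for i in range(ROPE_HEAD_DIM // 2):
        sel[2 * i + offset, i] = 1
    return sel

even_sel, odd_sel = init_select(0), init_select(1)

def rope_via_selectors(x, cos, sin):
    """x: [..., ROPE_HEAD_DIM]; cos/sin: [..., ROPE_HEAD_DIM // 2].

    out[2i]   = x[2i]   * cos[i] - x[2i+1] * sin[i]
    out[2i+1] = x[2i+1] * cos[i] + x[2i]   * sin[i]
    """
    x_even = x @ even_sel           # gather even lanes into a half vector
    x_odd = x @ odd_sel             # gather odd lanes
    out_even = x_even * cos - x_odd * sin
    out_odd = x_odd * cos + x_even * sin
    # Scatter the halves back to interleaved positions via the transposes.
    return out_even @ even_sel.t() + out_odd @ odd_sel.t()
```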
219-230: 💤 Low value: Hoist the loop-invariant read; the loop-form inconsistency has no semantic impact today.

The rotate and no-rotate branches use different loop forms (`pl.range` vs `pl.parallel`), but no semantic difference exists between them in pypto.language today: both iterations serialize at the compiler level due to loop-carried assemble dependencies, regardless of annotation. The parallelization opportunity here is a known issue proposed for future optimization via disjointness analysis, not a current gap.

The more valuable change is hoisting `kv_proj_tile = normed_kv[:, 0 : HEAD_DIM]` out of the loop body, as it is loop-invariant and recomputed needlessly on each iteration:

♻️ Suggested refactor
```diff
 if rotate:
+    kv_proj_tile = normed_kv[:, 0 : HEAD_DIM]
     for o0 in pl.range(0, HEAD_DIM, OUT_CHUNK):
         with pl.at(level=pl.Level.CORE_GROUP, name_hint="kv_hadamard"):
-            kv_proj_tile = normed_kv[:, 0 : HEAD_DIM]
             hadamard_tile = hadamard[0 : HEAD_DIM, o0 : o0 + OUT_CHUNK]
             kv_hadamard_acc = pl.matmul(kv_proj_tile, hadamard_tile, out_dtype=pl.FP32)
             kv_flat = pl.assemble(kv_flat, kv_hadamard_acc, [0, o0])
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/deepseek/v4/compressor_ratio128.py` around lines 219 - 230, The rotate branch recomputes the loop-invariant slice kv_proj_tile = normed_kv[:, 0 : HEAD_DIM] on every iteration and uses a different loop form (pl.range) than the non-rotate branch (pl.parallel); hoist the invariant slice out of the for-loop in the rotate path so kv_proj_tile is computed once before the loop, keep the existing pl.matmul into kv_hadamard_acc and pl.assemble into kv_flat inside the loop, and ensure the assemble indices and types (OUT_CHUNK, HEAD_DIM, pl.FP32) remain identical to preserve semantics.
📒 Files selected for processing (1)
models/deepseek/v4/compressor_ratio128.py