Update DeepSeek attention fixtures #256
Conversation
- Pass sparse attention local RoPE selector tensors from SWA and HCA examples
- Initialize KV cache fixtures with deterministic non-zero data (see the sketch below)
- Align SWA and HCA decode precision tolerances after NPU validation
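For readers skimming the second bullet: a minimal sketch of what a deterministic, normalized, non-zero cache initializer can look like. The helper name, shapes, and torch-based body are illustrative assumptions, not the fixtures actually added in this PR.

```python
import torch

def init_normalized_cache_sketch(shape, seed=0, dtype=torch.bfloat16):
    """Hypothetical fixture initializer: seeded random values, normalized
    along the last dim so every cache entry is non-zero and unit-scale."""
    gen = torch.Generator().manual_seed(seed)
    cache = torch.randn(shape, generator=gen, dtype=torch.float32)
    # Normalizing keeps attention logits in a predictable range for the
    # tolerance checks mentioned in the third bullet.
    cache = cache / cache.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return cache.to(dtype)

kv_cache = init_normalized_cache_sketch((2, 1024, 128))  # [batch, seq, head_dim] -- illustrative shape
```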
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Adds local sparse-RoPE selector tensors and sizing constants, wires them through HCA and SWA into sparse_attn (and golden references), refactors CSA to split ori/cmp KV pools with deterministic top-k and INT8 quantization, updates tensor initializers to seeded normalized caches, and adjusts JIT tolerances.

Changes: Sparse RoPE Local Selector Integration & CSA refactor
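One detail in the walkthrough worth unpacking is the "deterministic top-k": plain `torch.topk` does not guarantee the order of tied values, so reproducible selection needs an explicit tie-break. A hedged sketch of one such approach follows; it is an illustration, not the PR's actual CSA code.

```python
import torch

def deterministic_topk(scores: torch.Tensor, k: int):
    """Top-k with reproducible tie-breaking: a stable descending sort
    resolves equal scores to the lower index on every run."""
    vals, idx = torch.sort(scores, dim=-1, descending=True, stable=True)
    return vals[..., :k], idx[..., :k]

scores = torch.tensor([0.5, 0.9, 0.5, 0.9])
vals, idx = deterministic_topk(scores, k=2)
print(idx.tolist())  # [1, 3]: the tied 0.9s resolve in index order
```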
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)

⚠️ Review ran into problems: timed out fetching pipeline failures after 30000ms.
Code Review
This pull request implements sparse RoPE chunking for HCA and SWA attention mechanisms by introducing local selection tensors and related constants. The test suites for both modules were updated to use randomized, normalized KV caches instead of zero-initialized ones, and the precision tolerances were adjusted accordingly. I have no feedback to provide.
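For context on the tolerance adjustment: once the caches hold random values instead of zeros, kernel and golden outputs differ by real rounding error, so the comparison needs explicit bounds. A generic sketch of such a check follows; the rtol/atol values are placeholders, not the PR's NPU-validated numbers.

```python
import torch

kernel_out = torch.randn(4, 128, dtype=torch.bfloat16)
golden_out = kernel_out + 1e-3 * torch.randn_like(kernel_out)  # stand-in for a real kernel/golden pair

# Placeholder bounds; the PR tunes the real values per-op after NPU validation.
torch.testing.assert_close(kernel_out, golden_out, rtol=2e-2, atol=2e-2)
```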
🧹 Nitpick comments (1)
models/deepseek/v4/attention_hca.py (1)
577-593: ⚡ Quick win: Consider extracting duplicated test helpers to a shared module.

The functions `init_even_select_local`, `init_odd_select_local`, and `init_normalized_cache` are duplicated identically in `attention_swa.py` (lines 444-460). Extracting these to a shared test utilities module would improve maintainability and ensure consistency if these initialization strategies evolve (a sketch of such a module follows the AI prompt below).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/deepseek/v4/attention_hca.py` around lines 577 - 593, The three helper functions init_even_select_local, init_odd_select_local, and init_normalized_cache are duplicated; extract them into a shared test utilities module (e.g., tests.utils or models.deepseek.utils) and replace the local definitions in attention_hca.py and attention_swa.py with imports from that module; ensure the shared module exposes the same function names and signatures, update any relative imports, and run tests to verify behavior unchanged.
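For illustration, the shared module this comment asks for could look like the sketch below. The function names come from the review itself; the module path and bodies are hypothetical stand-ins for the real implementations in the two test files.

```python
# tests/utils.py -- hypothetical shared location suggested by the review.
import torch

def init_even_select_local(num_slots: int) -> torch.Tensor:
    """Hypothetical: selector tensor picking even local positions (0, 2, 4, ...)."""
    return torch.arange(0, 2 * num_slots, 2, dtype=torch.int32)

def init_odd_select_local(num_slots: int) -> torch.Tensor:
    """Hypothetical: selector tensor picking odd local positions (1, 3, 5, ...)."""
    return torch.arange(1, 2 * num_slots, 2, dtype=torch.int32)

def init_normalized_cache(shape) -> torch.Tensor:
    """Hypothetical: non-zero random cache, normalized along the last dim."""
    cache = torch.randn(shape, dtype=torch.float32)
    return cache / cache.norm(dim=-1, keepdim=True).clamp_min(1e-6)
```

Both `attention_hca.py` and `attention_swa.py` would then import these instead of defining local copies.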
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 95a9a9ef-798e-405b-a78a-dd2b6dcb6616
📒 Files selected for processing (2)
- models/deepseek/v4/attention_hca.py
- models/deepseek/v4/attention_swa.py
- Remove fixed seeds from SWA and HCA KV cache initialization
- Keep normalized non-zero cache values for decode attention precision coverage
Force-pushed from 151d74d to cf59218
🧹 Nitpick comments (1)
models/deepseek/v4/attention_csa_draft.py (1)
228-243: 💤 Low value: Pass an actual identity (or drop the parameter) instead of an uninitialized tensor named `hadamard_identity`.

`hadamard_identity` is created via `pl.create_tensor(...)` with no initializer and then handed to `compressor(...)`. It works today only because `compressor_ratio4.compressor` has `ROTATE=False` and never reads the `hadamard` argument, but the name implies it is an identity matrix, and the golden_compressor explicitly passes `torch.eye(HEAD_DIM, dtype=torch.bfloat16)` for the same slot. If `compressor` ever consumes `hadamard`, this fixture will silently drive it with garbage.

♻️ One option: explicitly initialize, matching the golden

```diff
 cmp_out = pl.create_tensor([B, HEAD_DIM], dtype=pl.BF16)
-hadamard_identity = pl.create_tensor([HEAD_DIM, HEAD_DIM], dtype=pl.BF16)
+# `compressor_ratio4` has ROTATE=False and ignores `hadamard`, but the
+# golden passes torch.eye; keep the kernel side consistent so a future
+# ROTATE=True flip does not silently use uninitialized memory.
+hadamard_identity = pl.full([HEAD_DIM, HEAD_DIM], dtype=pl.BF16, value=0.0)
+# (or build an actual identity via an eye-like helper if available)
```

Alternatively, drop the `hadamard` parameter from the inline `compressor` signature since the ratio-4 path is fixed at `ROTATE=False` (a small illustration follows the AI prompt below).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/deepseek/v4/attention_csa_draft.py` around lines 228 - 243, The uninitialized tensor hadamard_identity passed into compressor should be a real identity matrix (or the parameter removed); replace the pl.create_tensor([HEAD_DIM, HEAD_DIM], dtype=pl.BF16) placeholder with an explicit identity tensor matching HEAD_DIM and dtype (same semantics as golden_compressor's torch.eye(..., dtype=torch.bfloat16)) so compressor(x_mixed, ..., hadamard_identity, ...) receives a valid identity, or if you prefer and ROTATE is guaranteed False, remove the hadamard argument from the compressor signature and all call sites (including the inline compressor and any compressor_ratio4.compressor usages).
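To make the comment's point concrete, here is a small self-contained illustration (torch-only; the `HEAD_DIM` value is assumed) of the identity the golden passes and why a true identity makes the rotate slot safe:

```python
import torch

HEAD_DIM = 128  # name from the comment; value is illustrative
hadamard_identity = torch.eye(HEAD_DIM, dtype=torch.bfloat16)  # what the golden passes

x = torch.randn(4, HEAD_DIM, dtype=torch.bfloat16)
# With a true identity the rotate step is a no-op; an uninitialized tensor in
# this slot would scramble x silently if a ROTATE=True path ever reads it.
assert torch.allclose(x.float() @ hadamard_identity.float(), x.float())
```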
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4cde4e06-c00b-4506-af1a-ebe8b23d5f0c
📒 Files selected for processing (4)
- models/deepseek/v4/attention_csa_draft.py
- models/deepseek/v4/attention_hca.py
- models/deepseek/v4/attention_swa.py
- models/deepseek/v4/compressor_ratio4.py
🚧 Files skipped from review as they are similar to previous changes (2)
- models/deepseek/v4/attention_hca.py
- models/deepseek/v4/attention_swa.py
Summary
Related Issues
None