Add FLASH config support to DSv4 CSA attention#296

Merged
zhangqi-chen merged 1 commit into hw-native-sys:main from sjduan:feat/dsv4-csa-flash-support
May 15, 2026

Conversation

@sjduan
Contributor

@sjduan sjduan commented May 15, 2026

Summary

  • Change attention_csa.py to use FLASH config instead of DEMO
  • Rename compressor to indexer_compressor in indexer_compressor.py
  • Update indexer.py to use indexer_compressor and set IDX_TOPK from config (M.index_topk)
  • Update sparse_attn.py to align with FLASH config parameters
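The config migration in the bullets above can be sketched like this. Only the names M, FLASH, DEMO, and IDX_TOPK come from the PR description; the ModelConfig dataclass and its field values are hypothetical stand-ins:

```python
# Illustrative sketch of the config swap. FLASH, DEMO, M, and IDX_TOPK
# follow the PR description; this ModelConfig class and its values are
# made up for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    num_attention_heads: int
    index_topk: int

DEMO = ModelConfig(num_attention_heads=4, index_topk=64)       # small demo shapes
FLASH = ModelConfig(num_attention_heads=128, index_topk=2048)  # full model shapes

M = FLASH                # previously: M = DEMO
IDX_TOPK = M.index_topk  # previously a hardcoded constant
```

Deriving IDX_TOPK from M means every constant downstream tracks whichever config is selected, instead of silently diverging from it.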

Dependencies:

  • pypto: cf8b954a (ci: bump Ascend CANN version from 8.5.0 to 9.0.0 #1374)
  • simpler: a94d5140 (Add: tool-smoke gate inside each DFX scene test #771)
  • pto-isa: 4b0e4c8e ([Bugfix] A5 ttrans and tconcatidx bugfix for Blue Zone Assembly Line)

- Change attention_csa.py to use FLASH config instead of DEMO
- Rename compressor to indexer_compressor in indexer_compressor.py
- Update indexer.py to use indexer_compressor and set IDX_TOPK from config

@coderabbitai

coderabbitai Bot commented May 15, 2026

📝 Walkthrough

The PR consolidates configuration, naming, and computation changes across the DeepSeek v4 attention stack. attention_csa.py switches to FLASH configuration; indexer_compressor.py and indexer.py rename and wire the compressor module with config-driven constants; sparse_attn.py vectorizes head-block attention computation.

Changes

DeepSeek v4 Refactoring and Optimization

  • Configuration migration to FLASH (models/deepseek/v4/attention_csa.py): Module constant M switches from config.DEMO to config.FLASH, propagating FLASH configuration values through all derived compile-time constants and tensor shape calculations.
  • Compressor module refactoring and indexer integration (models/deepseek/v4/indexer_compressor.py, models/deepseek/v4/indexer.py): Compressor kernel renamed from compressor to indexer_compressor with updated imports and call sites; IDX_TOPK now reads M.index_topk from the model config instead of a hardcoded value.
  • Sparse attention head-block vectorization (models/deepseek/v4/sparse_attn.py): Attention projection loop changes from single-head iteration to MATMUL_ROW_PAD-sized head blocks; q_batch/kv_batch creation updated for block parallelism; sink bias computation vectorized with attn_stage_row reshaped from [1, 1] to [MATMUL_ROW_PAD, 1].

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#289: Matches the CSA orchestration and config-driven changes to attention_csa.py and compressor wiring in the indexer pipeline.
  • hw-native-sys/pypto-lib#265: Introduces the indexer-compressor integration path that this PR's compressor renaming and wiring directly builds upon.
  • hw-native-sys/pypto-lib#256: Updates sparse_attn.py fixture and RoPE threading in the same attention computation path affected by this PR's head-block vectorization.

Poem

🐰 A rabbit hops through configs bright,
From DEMO's shade to FLASH's light!
The compressor dons a fancier name,
While heads now dance in vectored flame—
Attention optimized, efficiency gained! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 25.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4 passed)

  • Title check: The title accurately and specifically summarizes the main change: adding FLASH config support to DSv4 CSA attention components.
  • Description check: The description is directly related to the changeset, providing clear bullet points of the modifications made across four files to support FLASH configuration.
  • Linked Issues check: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Skipped because no linked issues were found for this pull request.



@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
models/deepseek/v4/sparse_attn.py (1)

179-220: ⚡ Quick win

Guard the head-block tail.

This loop now assumes H is a multiple of MATMUL_ROW_PAD. The q_flat/attn_sink slices and the final assemble are all unconditional h0 : h0 + MATMUL_ROW_PAD, so the last block will go out of bounds if that stops being true.

Suggested guard
 MATMUL_ROW_PAD = 16
+assert H % MATMUL_ROW_PAD == 0, (
+    f"num_attention_heads={H} must be divisible by MATMUL_ROW_PAD={MATMUL_ROW_PAD}"
+)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/sparse_attn.py` around lines 179 - 220, The loop over h0
assumes H is divisible by MATMUL_ROW_PAD causing out-of-bounds slices on q_flat,
attn_sink and oi_out; modify the loop body in the parallel block that uses
q_flat, kv_topk_batch, attn_sink and oi_out (symbols: H, MATMUL_ROW_PAD, q_flat,
kv_topk_batch, attn_sink, oi_out, attn_stage_row) to compute a tail_len =
min(MATMUL_ROW_PAD, H - h0) and use that length for all slices and reshapes (or
explicitly pad temporary buffers to MATMUL_ROW_PAD and mask results) so the
final cast/assemble only indexes 0:tail_len where needed; ensure all per-block
computations (q_batch, kv_batch, oi, li, mi, sink_bias, oi_out and the final
attn_stage_row write) respect tail_len to avoid OOB accesses.
models/deepseek/v4/indexer.py (1)

35-35: ⚡ Quick win

Assert index_topk fits the score buffer.

Now that IDX_TOPK comes from config, the later top-k path assumes IDX_TOPK <= SCORE_LEN without checking it. A larger value will overrun the sorted-pair slice contract and the fixed [1, IDX_TOPK] scratch shape.

Suggested guard
 IDX_TOPK = M.index_topk
+assert 0 <= IDX_TOPK <= SCORE_LEN, (
+    f"index_topk={IDX_TOPK} must satisfy 0 <= index_topk <= SCORE_LEN={SCORE_LEN}"
+)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/indexer.py` at line 35, The code sets IDX_TOPK =
M.index_topk but never validates it against the fixed score buffer length,
risking buffer overruns; add a guard where IDX_TOPK is derived (near IDX_TOPK /
M.index_topk) that checks IDX_TOPK <= SCORE_LEN and either clamp it (IDX_TOPK =
min(M.index_topk, SCORE_LEN)) or raise a clear ValueError/Assertion if
M.index_topk > SCORE_LEN, and update any dependent assumptions about the
sorted-pair slice/scratch shape ([1, IDX_TOPK]) accordingly so callers using
IDX_TOPK cannot exceed the SCORE_LEN buffer.
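The validate-or-raise option from the prompt above can be sketched like this. SCORE_LEN's value and the helper name resolve_idx_topk are hypothetical; only IDX_TOPK, M.index_topk, and SCORE_LEN appear in the review:

```python
# Sketch of validating the config-driven top-k against the fixed score
# buffer. SCORE_LEN's value and the helper name are made up.
SCORE_LEN = 4096

def resolve_idx_topk(index_topk: int, score_len: int = SCORE_LEN) -> int:
    """Reject top-k values that would overrun the score buffer."""
    if not 0 <= index_topk <= score_len:
        raise ValueError(
            f"index_topk={index_topk} must satisfy 0 <= index_topk <= {score_len}"
        )
    return index_topk

IDX_TOPK = resolve_idx_topk(2048)
```

Raising at constant-derivation time surfaces a bad config immediately, rather than as an out-of-bounds access deep inside the top-k kernel.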

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 09000a80-d75d-4290-97f7-467ffa40b819

📥 Commits

Reviewing files that changed from the base of the PR and between aee5258 and 4e615d9.

📒 Files selected for processing (4)
  • models/deepseek/v4/attention_csa.py
  • models/deepseek/v4/indexer.py
  • models/deepseek/v4/indexer_compressor.py
  • models/deepseek/v4/sparse_attn.py

@zhangqi-chen zhangqi-chen merged commit 4f8388c into hw-native-sys:main May 15, 2026
6 checks passed
