Fix scalar-indexed slices in V3.2 / Qwen3-32B + sparse_attn rope refactor#288
Conversation
Rewrite scalar-leading-axis subscripts `x[scalar, expr]` to range-slice `x[scalar : scalar + 1, expr]` across 28 sites in two decode kernels. Extends the moe_expert fix (hw-native-sys#281) from rank-3 weights to rank-2 rope and per-batch projection tensors.

For offsets that were originally `ctx_len - 1`, introduce `pos = ctx_len - 1` first — the form `rope_cos[ctx_len - 1 : ctx_len, ...]` does not compile because pypto IR does not fold `(ctx_len) - (ctx_len - 1) = 1` into a static row dim, and downstream `pl.col_expand_mul` rejects it (filed as pypto#1377).

Verified on a2a3:
- qwen3_32b_decode: compile + runtime + golden PASS (17.4s)
- deepseek_v3_2_decode_front: compile + runtime + golden PASS (4/4)
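A shape-level sketch of the rewrite, with NumPy standing in for the pypto tensor DSL (which is not public); the names `rope_cos`, `ctx_len`, and `pos` follow the description above, and the shapes are made up for illustration:

```python
import numpy as np

# Toy rope table: (seq, rope_dim); values and shape are illustrative only.
rope_cos = np.arange(12, dtype=np.float32).reshape(4, 3)
ctx_len = 4

# Scalar indexing drops the leading axis: result has shape (3,).
row_scalar = rope_cos[ctx_len - 1]

# Range-slicing keeps a singleton leading axis: result has shape (1, 3),
# the form a downstream op with a static row-dim check can accept.
# Binding the offset to a name first mirrors the `pos = ctx_len - 1`
# workaround for the IR bound-folding bug (pypto#1377).
pos = ctx_len - 1
row_slice = rope_cos[pos : pos + 1]

assert row_scalar.shape == (3,)
assert row_slice.shape == (1, 3)
assert (row_slice[0] == row_scalar).all()
```

In NumPy both forms read the same data; the difference is only the rank of the result, which is exactly the property the static row-dim check in the compiler cares about.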
Split the single `cfa_proj_rope_assemble` scope into `_matmul` and `_combine` scopes, with FP32 intermediate buffers for the even/odd interleave streams. The BF16 add + cast that was inlined inside the matmul loop now runs once over the full ROPE_DIM tile after the even/odd matmul outputs are assembled.
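The `_combine` step can be sketched in NumPy under stated assumptions: `even`/`odd` stand in for the two FP32 matmul output streams, float16 stands in for BF16 (NumPy has no bfloat16), and the `bias` term plus all shapes are hypothetical — the actual kernel's scope split and buffer placement are hardware-specific:

```python
import numpy as np

H, ROPE_DIM = 2, 8
rng = np.random.default_rng(0)

# Stand-ins for the even/odd matmul outputs, kept in FP32 intermediates.
even = rng.standard_normal((H, ROPE_DIM // 2)).astype(np.float32)
odd = rng.standard_normal((H, ROPE_DIM // 2)).astype(np.float32)
bias = rng.standard_normal((H, ROPE_DIM)).astype(np.float32)

# _combine: interleave the two streams into the full ROPE_DIM tile...
out = np.empty((H, ROPE_DIM), dtype=np.float32)
out[:, 0::2] = even
out[:, 1::2] = odd

# ...then perform the add + narrowing cast once over the assembled tile,
# instead of once per matmul chunk inside the loop.
result = (out + bias).astype(np.float16)

assert result.shape == (H, ROPE_DIM)
```

Hoisting the add + cast out of the matmul loop means the narrowing conversion runs once per assembled tile rather than per chunk, which is the stated motivation for the scope split.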
No actionable comments were generated in the recent review. 📒 Files selected for processing: 3
📝 Walkthrough: Three model kernels (Qwen3 32B, DeepSeek V3.2, DeepSeek V4) refactor RoPE tensor indexing from scalar/1D patterns to batched singleton slicing. Changes: RoPE tensor shape refactoring across decode kernels.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Code Review
This pull request updates indexing patterns across the DeepSeek and Qwen3 decode scripts, replacing scalar subscripts with range slices for compatibility with the IR compiler's dimension folding. Additionally, the DeepSeek V4 sparse attention implementation was refactored to use intermediate FP32 buffers for RoPE interleave assembly, consolidating the addition and casting operations to improve hardware utilization. I have no feedback to provide, as the existing review comments were primarily explanatory and did not identify issues or require action.
Summary
- Rewrote scalar-leading-axis subscripts `x[scalar, expr]` to the range-slice form `x[scalar : scalar + 1, expr]` inside two `@pl.function` decode kernels (`deepseek_v3_2_decode_front.py`, `qwen3_32b_decode.py`). Extends the moe_expert fix (Fix: DeepSeek V4 moe_expert scalar-indexed 3-D slice compile failures, #281) from rank-3 weights to rank-2 rope tables and per-batch projections.
- For offsets that were originally `ctx_len - 1`, introduce `pos = ctx_len - 1` first — the form `rope_cos[ctx_len - 1 : ctx_len, ...]` does not compile because pypto IR doesn't fold `(ctx_len) - (ctx_len - 1) = 1` into a static row dim (filed as pypto#1377: [Bug] IR does not fold `i - (i-1) = 1` in range-slice bounds, breaks col_expand_mul static row-dim check).
- `sparse_attn` rope-interleave assemble: split `cfa_proj_rope_assemble` into `_matmul` + `_combine` scopes with FP32 intermediate buffers; the BF16 add + cast now runs once per `H * ROPE_DIM` tile instead of per matmul chunk.

Verified on a2a3:
- `qwen3_32b_decode`: compile + runtime + golden PASS (17.4s)
- `deepseek_v3_2_decode_front`: compile + runtime + golden PASS (4/4 outputs)

Related Issues
Related: hw-native-sys/pypto#1377