Route prefix-cache-hit prefill through sink ASM MHA kernel #1345
Merged
On gfx1250 with ATOM_USE_UNIFIED_ATTN, a prefix-cache hit during prefill fell back to the Triton unified_attention path instead of the sink ASM varlen kernel, because _can_attempt_prefill_sink_asm bailed on has_cached and on max_seqlen_q != max_seqlen_k.

The gfx1250 sink varlen ASM kernel (fmha_fwd_with_sink_varlen_asm) actually handles bottom-right causal for sq < sk (chunked-prefill), and cu_seqlens_q/cu_seqlens_k already carry the per-request new-token vs cached+new lengths.

Verified on gfx1250 against a bottom-right causal + per-head sink reference (single/multi-batch, GQA, sq=1) within bf16 tolerance, and end-to-end on gpt-oss-120b (full-attention layers take the ASM path on a cache hit; the forced-Triton path never gathers).

Changes:
- _can_attempt_prefill_sink_asm: drop the has_cached and max_seqlen_q == max_seqlen_k gates.
- prefill_attention: gather the cached+new KV into a dense packed tensor here, where the ASM varlen kernel consumes it. Each prefill backend now prepares its own KV: the ASM path gathers; the Triton path reads the paged cache directly via block_table and never gathers.
- rope_cache: no longer gathers, so dispatch_backend sees q/k with matching token counts (sq == sk) and _can_use_prefill_sink_asm's shape check stays valid.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
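For readers unfamiliar with the terms, here is a minimal pure-Python sketch of what "bottom-right causal for sq < sk" and the cu_seqlens_q/cu_seqlens_k pair mean on a prefix-cache hit. It is an illustration only, not code from this PR: build_cu_seqlens and bottom_right_causal_mask are hypothetical helper names, and the real kernel works on packed device tensors rather than Python lists.

```python
from itertools import accumulate


def build_cu_seqlens(num_cached, num_new):
    """Cumulative sequence offsets for a packed varlen batch (hypothetical helper).

    Queries cover only the new tokens of each request; keys cover
    cached + new tokens, so cu_k grows faster than cu_q on a cache hit.
    """
    q_lens = list(num_new)
    k_lens = [c + n for c, n in zip(num_cached, num_new)]
    cu_q = [0] + list(accumulate(q_lens))
    cu_k = [0] + list(accumulate(k_lens))
    return cu_q, cu_k


def bottom_right_causal_mask(sq, sk):
    """mask[i][j] is True where query i may attend to key j.

    The causal diagonal is aligned to the bottom-right corner: the last
    query row sees all sk keys, and query i sees keys 0..i + (sk - sq).
    """
    offset = sk - sq
    return [[j <= i + offset for j in range(sk)] for i in range(sq)]


# Request 0 has 4 cached tokens and 2 new ones; request 1 has no cache hit.
cu_q, cu_k = build_cu_seqlens(num_cached=[4, 0], num_new=[2, 3])
mask = bottom_right_causal_mask(sq=2, sk=4)
```

With sq == sk the offset is zero and the mask reduces to the usual lower-triangular causal mask, which is why the same kernel can serve both the cache-miss and cache-hit cases.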
bb24447 to be3520a
valarLip approved these changes on Jun 25, 2026
On gfx1250 with ATOM_USE_UNIFIED_ATTN, a prefix-cache hit during prefill fell back to the Triton unified_attention path instead of the sink ASM varlen kernel, because _can_attempt_prefill_sink_asm bailed on has_cached and on max_seqlen_q != max_seqlen_k.
The gfx1250 sink varlen ASM kernel (fmha_fwd_with_sink_varlen_asm) actually handles bottom-right causal for sq < sk (chunked-prefill); rope_cache already gathers cached+new KV into a dense packed [total_kv, ...] tensor that the kernel consumes, and cu_seqlens_q/cu_seqlens_k carry the per-request new-token vs cached+new lengths. Verified on gfx1250 against a bottom-right causal + per-head sink reference (single/multi-batch, GQA, sq=1) within bf16 tolerance.
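The gather into a dense packed [total_kv, ...] layout can be sketched in pure Python as follows. This is a rough sketch under assumptions, not the repository's implementation: gather_packed_kv, the list-of-blocks cache layout, and the block_size parameter are all hypothetical, and the real code operates on tensors.

```python
def gather_packed_kv(paged_cache, block_table, kv_lens, block_size):
    """Copy each request's cached+new KV entries into one packed list.

    paged_cache: list of blocks, each a list of block_size entries (assumed layout).
    block_table: per-request list of block ids, in logical order.
    kv_lens:     per-request cached+new token count.
    """
    packed = []
    for req, kv_len in enumerate(kv_lens):
        for pos in range(kv_len):
            block_id = block_table[req][pos // block_size]
            packed.append(paged_cache[block_id][pos % block_size])
    return packed


# Two requests with 3 KV tokens each, stored in interleaved 2-token blocks.
cache = [["a0", "a1"], ["b0", "b1"], ["a2", None], ["b2", None]]
packed = gather_packed_kv(cache, block_table=[[0, 2], [1, 3]],
                          kv_lens=[3, 3], block_size=2)
```

In the packed result, request r occupies the slice from cu_seqlens_k[r] to cu_seqlens_k[r + 1], which is the contiguous layout a varlen kernel indexes into.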