[WIP] Add Chunked Shared Attention with Dense Biases #784
base: main
Conversation
Signed-off-by: Konrad Zawora <[email protected]>
Signed-off-by: Konrad Zawora <[email protected]>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Pull request overview
This PR introduces chunked processing for shared blocks in unified attention to address memory issues when handling large numbers of shared blocks. The main motivation is that the full shared bias tensor [query_len, num_shared_blocks, block_size] can become prohibitively large (e.g., 19.53 GiB for 8k query length and 10k shared blocks), leading to out-of-memory errors.
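As a quick sanity check on those figures, the sizes follow directly from the tensor shapes. The snippet below assumes 2-byte elements (e.g. bf16), block_size 128, and the 64-blocks-per-chunk default quoted in the PR description:

```python
# Quick size check for the figures above, assuming 2-byte elements (e.g. bf16).
query_len, num_shared_blocks, block_size, chunk_size = 8192, 10_000, 128, 64

full_bias = query_len * num_shared_blocks * block_size * 2   # full shared bias tensor
chunk_bias = query_len * chunk_size * block_size * 2         # one chunk's bias tensor

print(f"full bias:  {full_bias / 2**30:.2f} GiB")   # ~19.53 GiB
print(f"chunk bias: {chunk_bias / 2**20:.2f} MiB")  # 128.00 MiB
```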
Key changes:
- Chunked attention processing with online softmax merging to reduce memory footprint from ~19.53 GiB to ~128 MiB for large scenarios
- Dense bias generation approach that performs scatter operations on CPU and broadcasts on HPU, avoiding dynamic-length coordinate arrays
- New `SharedBlockChunkedBiasData` dataclass and `_partial_attn_shared_chunked()` function for chunk-wise processing
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_model_runner.py | Updates unified config to check for both regular and chunked shared bias; removes extraneous blank line |
| vllm_gaudi/extension/unified_batch.py | Adds dense bias generator, implements chunked processing logic with configurable chunk size, and adds defensive checks for optional functions |
| vllm_gaudi/extension/unified.py | Implements core chunked attention logic with new helper functions, refactors shared attention to support both full and chunked modes, and optimizes tensor caching |
```python
# With chunked dense generation, we only allocate (target_qlen, target_shared_blocks) for block_usages
# instead of the full (target_qlen, target_shared_blocks, block_size) bias tensor.
# Bias is generated per chunk: (target_qlen, chunk_size, block_size)
default_chunk_size = 32  # Process up to 64 blocks at a time for shared attention
```
Copilot AI · Jan 7, 2026
Comment states '64 blocks' but the variable is set to 32. The comment should match the actual value.
```diff
- default_chunk_size = 32  # Process up to 64 blocks at a time for shared attention
+ default_chunk_size = 32  # Process up to 32 blocks at a time for shared attention
```
```python
continue
num_spec_tokens = len([i for i in spec_tokens if i != -1])
num_scheduled_tokens[idx] = num_spec_tokens + 1
if scheduled_spec_decode_tokens is not None:
```
Copilot AI · Jan 7, 2026
The conditional check for scheduled_spec_decode_tokens has been added, but the indentation of the for-loop starting at line 580 was not updated. This changes the logic: the loop now only executes when scheduled_spec_decode_tokens is not None, whereas previously it always executed. Ensure the indentation is correct based on the intended behavior.
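A minimal, hypothetical sketch of the concern (placeholder names, not the actual model-runner code): whether the loop is indented under the new guard decides if it still runs when scheduled_spec_decode_tokens is None.

```python
# Hypothetical sketch of the indentation concern, not the real hpu_model_runner code.
def count_scheduled_tokens(num_reqs, scheduled_spec_decode_tokens):
    num_scheduled_tokens = [1] * num_reqs
    if scheduled_spec_decode_tokens is not None:
        # Loop indented under the guard: skipped entirely when there are no
        # spec-decode tokens (the new behavior the reviewer is asking about).
        for idx, spec_tokens in enumerate(scheduled_spec_decode_tokens):
            num_spec_tokens = len([i for i in spec_tokens if i != -1])
            num_scheduled_tokens[idx] = num_spec_tokens + 1
    # If the loop instead stayed at this outer level (the previous behavior),
    # it would always execute and would fail when the argument is None.
    return num_scheduled_tokens
```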
```python
if get_cumsum_and_arange is not None:
    cu_num_tokens, _ = get_cumsum_and_arange(num_scheduled_tokens)
    query_start_loc_np = query_start_loc_cpu.numpy()
    query_start_loc_np[0] = 0
    query_start_loc_np[1:num_reqs + 1] = cu_num_tokens
```
Copilot AI · Jan 7, 2026
Lines 630-632 access query_start_loc_np but this is only defined inside the conditional block. If get_cumsum_and_arange is None, these lines will fail with an undefined variable error. The indentation should be corrected so that lines 630-632 are inside the conditional block.
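One way to resolve this, as a sketch that assumes the surrounding code matches the snippet above, is to either indent the later uses into the same branch (as the comment suggests) or compute a fallback so the name always exists; numpy's cumsum should produce the same prefix sums as the helper's first return value.

```python
import numpy as np

# Sketch only; assumes the context shown in the snippet above.
if get_cumsum_and_arange is not None:
    cu_num_tokens, _ = get_cumsum_and_arange(num_scheduled_tokens)
else:
    # Fallback so query_start_loc_np is always populated below.
    cu_num_tokens = np.cumsum(num_scheduled_tokens)

query_start_loc_np = query_start_loc_cpu.numpy()
query_start_loc_np[0] = 0
query_start_loc_np[1:num_reqs + 1] = cu_num_tokens
```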
```python
During chunked attention, we slice block_usages[:, chunk_start:chunk_end] and
generate bias for each chunk on-the-fly.
"""
block_usages: torch.tensor  # Dense: [num_query_tokens, num_shared_blocks]
```
Copilot AI · Jan 7, 2026
Type hint uses lowercase torch.tensor which is not the correct type. Should be torch.Tensor (capitalized) for proper type hinting.
```python
def _partial_attn_shared_core(query: torch.tensor,
                              key: torch.tensor,
                              value: torch.tensor,
                              bias: torch.tensor,
                              fmin: torch.tensor,
```
Copilot AI · Jan 7, 2026
Type hints use lowercase torch.tensor throughout the function signature. Should be torch.Tensor (capitalized) for consistency with Python type hinting conventions.
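For reference, the annotated signature would read as below; torch.Tensor is the tensor class (the correct annotation), while torch.tensor() is a factory function, not a type. The remaining parameters and the function body are omitted here.

```python
import torch

# Corrected type hints for the signature shown above.
def _partial_attn_shared_core(query: torch.Tensor,
                              key: torch.Tensor,
                              value: torch.Tensor,
                              bias: torch.Tensor,
                              fmin: torch.Tensor):
    ...
```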
Signed-off-by: Konrad Zawora <[email protected]>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Signed-off-by: Konrad Zawora <[email protected]>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Signed-off-by: Konrad Zawora <[email protected]>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
🚧 CI Blocked: The main CI workflow was not started for the following reason:
This PR adds chunked processing for shared blocks in unified attention, mainly to deal with memory issues when you have a lot of shared blocks.
The problem was that the full shared bias tensor `[query_len, num_shared_blocks, block_size]` can get huge with many shared blocks (e.g. for 8k query_len and 10k shared blocks, you'd have `(8192*10000*128*2)/2^30 = 19.53 GiB`!!!) and we were hitting OOMs.

The fix is to process shared blocks in chunks:
- Each iteration takes `chunk_size` blocks (default 64), generates bias for just those blocks, computes partial attention, then uses flash-attention-style online softmax to merge results
- That brings it down to `(8192*64*128*2)/2^20 = 128 MiB` of memory needed for biases, rather than the 19.53 GiB you would need initially
- The merge is the usual `max_new = max(max_global, max_chunk)` + rescale dance (see the sketch below)
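A minimal sketch of that chunked merge, under assumed shapes and names; this is illustrative only, and the PR's `_partial_attn_shared_chunked()` may differ in layout, scaling, and masking details.

```python
import torch

def chunked_shared_attention(query, shared_keys, shared_values, make_chunk_bias,
                             num_shared_blocks, block_size, chunk_size=64):
    """Illustrative flash-attention-style online softmax over chunks of shared blocks.

    query:                 [q_len, d]
    shared_keys/values:    [num_shared_blocks * block_size, d]
    make_chunk_bias(s, e): additive bias [q_len, (e - s) * block_size], with a very
                           negative value (fmin) marking masked slots.
    Assumes every query row sees at least one unmasked slot in the first chunk.
    """
    q_len, d = query.shape
    fmin = torch.finfo(query.dtype).min

    acc = torch.zeros(q_len, d, dtype=query.dtype)      # running sum of probs @ V
    denom = torch.zeros(q_len, 1, dtype=query.dtype)    # running softmax denominator
    max_global = torch.full((q_len, 1), fmin, dtype=query.dtype)

    for block_start in range(0, num_shared_blocks, chunk_size):
        block_end = min(block_start + chunk_size, num_shared_blocks)
        kv = slice(block_start * block_size, block_end * block_size)

        scores = (query @ shared_keys[kv].T) * d**-0.5 + make_chunk_bias(block_start, block_end)

        max_chunk = scores.max(dim=-1, keepdim=True).values
        max_new = torch.maximum(max_global, max_chunk)   # max_new = max(max_global, max_chunk)

        scale = torch.exp(max_global - max_new)          # rescale previous accumulators
        probs = torch.exp(scores - max_new)
        acc = acc * scale + probs @ shared_values[kv]
        denom = denom * scale + probs.sum(dim=-1, keepdim=True)
        max_global = max_new

    return acc / denom
```

The 64-blocks-per-chunk default matches the description above; the real kernel also handles per-block usage and causal masking, which this sketch leaves to `make_chunk_bias`.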
Dense bias approach

The other issue we've had here for a long time is shared bias generation. Previously we were passing variable-length coordinate arrays `[token_idx, block_idx, usage]` to the bias generator. We first created per-block bias with block usages, and used token and block coordinates to scatter it to a big `[max_num_batched_tokens, max_num_shared_blocks, block_size]` tensor. It was pretty horrible - not just because of slow scatters, but mostly because of the dynamic coordinate dimension (`max_num_shared_tokens`) that can't be easily derived from bucketed dimensions.

Now we use a "dense scatter on CPU, broadcast on HPU" approach:
- Scatter block usages into a dense `[query_len, num_shared_blocks]` tensor (any shape works on CPU)

So, basically, now we're creating the bias tensor `(target_qlen, target_shared_blocks, block_size)` by comparing & broadcasting the relatively-reasonably-sized `[query_len, num_shared_blocks]` tensor with a `[num_shared_blocks]` tensor, and the scatter operation goes to CPU, where dynamic indexing is (relatively) cheap. And we don't have to deal with `max_num_shared_tokens` that can range all the way from 2 to a bazillion, yay.

I'm not sure if there's a reason not to use the dense biases across the board (even for non-chunked shared attn), so I've enabled it by default. I left a `unified_attn_dense_shared_bias` flag if anyone wants to disable it explicitly and use the old behavior.
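A sketch of the dense path described above, with illustrative names and assumed bias semantics (the PR's `HPUSharedBiasGeneratorDense` may differ in details): the dynamic-length scatter happens on CPU into a fixed `[query_len, num_shared_blocks]` tensor, and the device side only does a broadcast comparison against a `block_size` arange.

```python
import torch

def scatter_block_usages_cpu(token_idx, block_idx, usage, query_len, num_shared_blocks):
    """CPU-side dense scatter: variable-length coordinate lists -> fixed-shape
    [query_len, num_shared_blocks] tensor. Dynamic lengths are fine here because
    this never becomes part of the device graph."""
    block_usages = torch.zeros(query_len, num_shared_blocks, dtype=torch.int32)
    block_usages[token_idx, block_idx] = usage.to(torch.int32)
    return block_usages

def dense_bias_from_usages(block_usages, block_size, fmin):
    """Device-side broadcast: [q_len, n_blocks] -> [q_len, n_blocks, block_size].
    A slot is assumed attendable iff its in-block position is below that block's
    usage; everything else gets fmin, which softmax turns into ~0 weight."""
    positions = torch.arange(block_size, device=block_usages.device)   # [block_size]
    attendable = positions < block_usages.unsqueeze(-1)                # broadcast compare
    zeros = torch.zeros_like(attendable, dtype=torch.float32)
    return torch.where(attendable, zeros, torch.full_like(zeros, fmin))
```

For the chunked path, the same generator would be applied to a slice `block_usages[:, chunk_start:chunk_end]`, so only a `[q_len, chunk_size, block_size]` bias ever materializes on the device.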
What's new

- `SharedBlockChunkedBiasData`: holds dense block_usages for chunk-wise bias generation
- `_partial_attn_shared_chunked()`: chunked processing with online softmax merging
- `_partial_attn_shared_core()`: extracted inner loop for reuse
- `HPUSharedBiasGeneratorDense`: generates bias from dense block_usages via broadcast

Notes
- `unified_attn_softmax_fa2`: I was testing with it off (`VLLM_UNIFIED_ATTN_SOFTMAX_FA2=false`). It works functionally, but gets me to OOM much sooner than I'd expect it to.