Draft: Add FlashAttention online merge in Unified Attention #785
base: main
Conversation
Signed-off-by: Konrad Zawora <[email protected]>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Pull request overview
This PR implements FlashAttention online merge in Unified Attention to reduce memory consumption by performing rescaling incrementally rather than after computing all attention parts. The changes introduce chunked processing for shared blocks with online bias generation, avoiding materialization of large bias tensors.
Key changes:
- Added an online merge algorithm that incrementally combines attention results using flash-attention style rescaling (see the sketch after this list)
- Implemented chunked processing for shared blocks with per-chunk bias generation from dense block_usages
- Introduced dense bias generation path that scatters on CPU and broadcasts on HPU to avoid dynamic shapes
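For context, a minimal sketch of the flash-attention style rescaling that an online merge step like this relies on is shown below. It mirrors the `online_merge_step(acc_attn, acc_max, acc_sum, *part)` call shape visible in the diff, but the body is only an illustration of the standard log-sum-exp merge, not necessarily the PR's exact implementation; it assumes the running max and sum keep a trailing singleton dimension so they broadcast over `head_dim`.

```python
import torch


def online_merge_step(acc_attn, acc_max, acc_sum, part_attn, part_max, part_sum):
    """Fold one partial attention result into the running accumulator.

    Each partial result carries the unnormalized attention output, the
    per-row max of its logits, and the per-row sum of exp(logits - max).
    Rescaling both sides onto a common max lets partial outputs be merged
    as they are produced, instead of keeping every intermediate buffer
    alive until a final merge.
    """
    if acc_attn is None:
        # First partial result: nothing to merge yet.
        return part_attn, part_max, part_sum

    new_max = torch.maximum(acc_max, part_max)
    # Correction factors that bring both sides onto the new max.
    acc_scale = torch.exp(acc_max - new_max)
    part_scale = torch.exp(part_max - new_max)
    new_attn = acc_attn * acc_scale + part_attn * part_scale
    new_sum = acc_sum * acc_scale + part_sum * part_scale
    return new_attn, new_max, new_sum
```

After the last partial result has been merged, the accumulator is normalized once with `acc_attn / acc_sum`, so no per-part output buffers have to be retained.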
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_model_runner.py | Updated unified config to detect both shared_bias and shared_bias_chunked, removed blank line |
| vllm_gaudi/extension/unified_batch.py | Added dense bias generation (see the sketch after this table), chunked processing logic, and new SharedBiasGeneratorDense class
| vllm_gaudi/extension/unified.py | Implemented online merge algorithm, chunked shared attention computation, and updated entry points |
| vllm_gaudi/extension/features.py | Added unified_attn_dense_shared_bias feature flag |
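As a rough illustration of the dense bias path noted above (scatter on CPU, broadcast on the device so shapes stay static), the sketch below shows one way such a generator could be structured. The function name, arguments, and shapes are assumptions made for illustration; they are not the actual SharedBiasGeneratorDense API.

```python
import torch


def dense_shared_bias(block_ids_cpu: torch.Tensor,
                      block_usages_cpu: torch.Tensor,
                      num_blocks: int,
                      block_size: int,
                      device: str = "hpu") -> torch.Tensor:
    """Hypothetical dense bias construction (names are illustrative).

    The scatter into a fixed-size dense tensor happens on CPU, so the
    device-side graph only ever sees statically shaped tensors; on the
    device, the dense usages are broadcast against per-slot positions to
    mask the unused tail of every block with -inf.
    """
    # CPU: scatter the ragged (block_id -> usage) pairs into a dense tensor.
    dense_usage = torch.zeros(num_blocks, dtype=torch.int64)
    dense_usage.scatter_(0, block_ids_cpu.to(torch.int64),
                         block_usages_cpu.to(torch.int64))

    # Device: broadcast against a per-slot arange; no dynamic shapes.
    dense_usage = dense_usage.to(device)
    slot = torch.arange(block_size, device=device)          # [block_size]
    valid = slot.unsqueeze(0) < dense_usage.unsqueeze(1)    # [num_blocks, block_size]
    bias = torch.zeros(num_blocks, block_size, device=device)
    bias.masked_fill_(~valid, float("-inf"))
    return bias.view(-1)                                    # [num_blocks * block_size]
```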
if get_cumsum_and_arange is not None:
    cu_num_tokens, _ = get_cumsum_and_arange(num_scheduled_tokens)
Copilot AI (Jan 7, 2026)
Checking get_cumsum_and_arange for None before calling it is fine, but it introduces a problem: when get_cumsum_and_arange is None, cu_num_tokens is never assigned, yet lines 746-747 still use it, so that path raises a NameError at runtime.
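One hedged way to avoid the NameError (assuming num_scheduled_tokens is a NumPy array, which may not match the runner's actual types) is to give cu_num_tokens a value on both branches; this is a sketch meant to slot into the quoted code, not a tested fix:

```python
import numpy as np

# Illustrative guard: cu_num_tokens is defined on both paths, falling back
# to a plain cumulative sum when the helper is unavailable.
if get_cumsum_and_arange is not None:
    cu_num_tokens, _ = get_cumsum_and_arange(num_scheduled_tokens)
else:
    cu_num_tokens = np.cumsum(num_scheduled_tokens)
```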
if scaled_query_latent is not None:
    shared = partial_attn_shared(query=scaled_query_latent,
Copilot AI (Jan 7, 2026)
The conditional check for scaled_query_latent was moved to wrap the entire partial_attn_shared call rather than using a ternary expression. While this is more readable, it creates code duplication with the else clause at lines 875-876 that sets shared to (None, None, None). Consider extracting this pattern into a helper or maintaining the ternary for consistency with the causal and unique attention calls.
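A hypothetical helper sketching the extraction suggested here; the name and signature are invented for illustration, and a real refactor would follow the surrounding code's conventions:

```python
from typing import Callable, Tuple


def maybe_partial_attn(query, attn_fn: Callable, **kwargs) -> Tuple:
    """Call the partial-attention function only when a query is present;
    otherwise return the empty (attn, max, sum) triple expected by the
    merge step. Keeps the shared, causal and unique call sites uniform."""
    if query is None:
        return (None, None, None)
    return attn_fn(query=query, **kwargs)
```

Usage would then read `shared = maybe_partial_attn(scaled_query_latent, partial_attn_shared, ...)` with the remaining keyword arguments passed through.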
return query_latent.flatten(-2, -1)  # [tokens, num_heads * head_dim]
if use_online_merge:
    acc_attn, acc_max, acc_sum = online_merge_step(acc_attn, acc_max, acc_sum, *unique)
if use_online_merge:
Copilot AI (Jan 7, 2026)
There are two consecutive 'if use_online_merge:' checks at lines 885 and 887. These should be combined into a single conditional block to improve readability and avoid redundant checks.
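A small sketch of the consolidation being suggested; all names are taken as parameters so the fragment stands alone, and whatever the second block did is only indicated by a comment:

```python
def merge_if_online(acc, unique, use_online_merge, online_merge_step):
    # Both online-merge actions sit under a single `if use_online_merge:`
    # block instead of two adjacent checks.
    if use_online_merge:
        acc = online_merge_step(*acc, *unique)
        # ...any further online-merge-only bookkeeping would continue here...
    return acc
```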
Signed-off-by: Konrad Zawora <[email protected]>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Signed-off-by: Konrad Zawora <[email protected]>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Further experiments on top of #784 - I wanted to check if we can avoid some OOMs by performing the FlashAttention rescaling online rather than after computing all the parts, which should save us memory on some intermediate buffers. Accuracy is surprisingly okay-ish, but I haven't tested this too thoroughly.