
Conversation

@slokesha (Contributor) commented Jan 7, 2026

Qwen3-VL vision attention is updated to use FusedSDPA.apply directly when the query sequence length is within the supported fused range (q_len ≤ 65536).
This removes the per-block Q/K/V attention loop and enables the optimized HPU fused SDPA kernel for vision attention.

The change aligns Qwen3-VL with the optimized path already used by Qwen2.5-VL on Gaudi, improving efficiency while preserving identical model outputs.
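For illustration, here is a minimal pure-Python sketch of the idea behind the change. A plain `sdpa` function stands in for the HPU `FusedSDPA.apply` kernel, and all names here (`blockwise_attention`, `fused_attention`, `vision_attention`, `cu_seqlens`, `max_fused_q_len`) are hypothetical, not the actual vLLM/Gaudi code. The sketch shows the core equivalence: the old per-block Q/K/V loop can be replaced by a single fused call over the whole sequence using a block-diagonal attention mask, gated on q_len ≤ 65536, while producing identical outputs.

```python
import math

def sdpa(q, k, v, mask=None):
    """Scaled dot-product attention on Python lists (single head).
    mask[i][j] is 0.0 where query i may attend key j, else a large negative."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        if mask is not None:
            scores = [s + m for s, m in zip(scores, mask[i])]
        mx = max(scores)
        w = [math.exp(s - mx) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * v[j][t] for j in range(len(k))) / z
                    for t in range(len(v[0]))])
    return out

def blockwise_attention(q, k, v, cu_seqlens):
    """Old path: loop over per-image blocks; each Q block attends
    only to its own K/V block."""
    out = []
    for s, e in zip(cu_seqlens, cu_seqlens[1:]):
        out.extend(sdpa(q[s:e], k[s:e], v[s:e]))
    return out

def fused_attention(q, k, v, cu_seqlens):
    """New path: one fused SDPA call over the whole sequence, with a
    block-diagonal additive mask replacing the per-block loop."""
    n = len(q)
    mask = [[-1e9] * n for _ in range(n)]
    for s, e in zip(cu_seqlens, cu_seqlens[1:]):
        for i in range(s, e):
            for j in range(s, e):
                mask[i][j] = 0.0
    return sdpa(q, k, v, mask)

def vision_attention(q, k, v, cu_seqlens, max_fused_q_len=65536):
    """Dispatch: take the fused path when q_len is within the supported
    range, otherwise fall back to the per-block loop."""
    if len(q) <= max_fused_q_len:
        return fused_attention(q, k, v, cu_seqlens)
    return blockwise_attention(q, k, v, cu_seqlens)
```

Because the masked positions receive zero softmax weight, both paths compute the same result, which is why the fused path can preserve identical model outputs.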

@github-actions bot commented Jan 7, 2026

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

@slokesha slokesha force-pushed the slokesha/qwen3_enable_mask branch from c245eb8 to e07204d Compare January 7, 2026 21:07

@slokesha (Contributor, Author) commented Jan 7, 2026

Performance:

FusedSDPA enabled (this PR):
Serving Benchmark Result

Successful requests: 100
Failed requests: 0
Maximum request concurrency: 8
Benchmark duration (s): 336.64
Total input tokens: 5280
Total generated tokens: 11827
Request throughput (req/s): 0.30
Output token throughput (tok/s): 35.13
Peak output token throughput (tok/s): 120.00
Peak concurrent requests: 15.00
Total token throughput (tok/s): 50.82
---------------Time to First Token----------------
Mean TTFT (ms): 10186.15
Median TTFT (ms): 10396.45
P99 TTFT (ms): 28860.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 138.33
Median TPOT (ms): 124.15
P99 TPOT (ms): 246.55
---------------Inter-token Latency----------------
Mean ITL (ms): 139.06
Median ITL (ms): 71.97
P99 ITL (ms): 2633.77

Main (baseline):
Serving Benchmark Result

Successful requests: 100
Failed requests: 0
Maximum request concurrency: 8
Benchmark duration (s): 334.12
Total input tokens: 5280
Total generated tokens: 11855
Request throughput (req/s): 0.30
Output token throughput (tok/s): 35.48
Peak output token throughput (tok/s): 113.00
Peak concurrent requests: 15.00
Total token throughput (tok/s): 51.28
---------------Time to First Token----------------
Mean TTFT (ms): 10806.36
Median TTFT (ms): 11990.35
P99 TTFT (ms): 22304.62
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 131.59
Median TPOT (ms): 111.62
P99 TPOT (ms): 231.90
---------------Inter-token Latency----------------
Mean ITL (ms): 131.67
Median ITL (ms): 73.13
P99 ITL (ms): 2470.87

@github-actions bot commented Jan 7, 2026

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@slokesha slokesha marked this pull request as ready for review January 7, 2026 22:00

@slokesha slokesha changed the title from "Use FuseSDPA for Qwen3_VL" to "Enable HPU Fused SDPA for Qwen3-VL vision attention using attention masks" Jan 7, 2026
