
Conversation

@slokesha (Contributor) commented Jan 7, 2026

Qwen3-VL vision attention is updated to use FusedSDPA.apply directly when the query sequence length is within the supported fused range (q_len ≤ 65536).
This removes the per-block Q/K/V attention loop and enables the optimized HPU fused SDPA kernel for vision attention.

The change aligns Qwen3-VL with the optimized path already used by Qwen2.5-VL on Gaudi, improving efficiency while preserving identical model outputs.
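For illustration, here is a minimal pure-Python sketch of the idea behind the change. A plain `sdpa` function stands in for the HPU `FusedSDPA.apply` kernel, and all names here (`blockwise_attention`, `fused_attention`, `vision_attention`, `cu_seqlens`, `max_fused_q_len`) are hypothetical, not the actual vLLM/Gaudi code. The sketch shows the core equivalence: the old per-block Q/K/V loop can be replaced by a single fused call over the whole sequence using a block-diagonal attention mask, gated on q_len ≤ 65536, while producing identical outputs.

```python
import math

def sdpa(q, k, v, mask=None):
    """Scaled dot-product attention on Python lists (single head).
    mask[i][j] is 0.0 where query i may attend key j, else a large negative."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        if mask is not None:
            scores = [s + m for s, m in zip(scores, mask[i])]
        mx = max(scores)
        w = [math.exp(s - mx) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * v[j][t] for j in range(len(k))) / z
                    for t in range(len(v[0]))])
    return out

def blockwise_attention(q, k, v, cu_seqlens):
    """Old path: loop over per-image blocks; each Q block attends
    only to its own K/V block."""
    out = []
    for s, e in zip(cu_seqlens, cu_seqlens[1:]):
        out.extend(sdpa(q[s:e], k[s:e], v[s:e]))
    return out

def fused_attention(q, k, v, cu_seqlens):
    """New path: one fused SDPA call over the whole sequence, with a
    block-diagonal additive mask replacing the per-block loop."""
    n = len(q)
    mask = [[-1e9] * n for _ in range(n)]
    for s, e in zip(cu_seqlens, cu_seqlens[1:]):
        for i in range(s, e):
            for j in range(s, e):
                mask[i][j] = 0.0
    return sdpa(q, k, v, mask)

def vision_attention(q, k, v, cu_seqlens, max_fused_q_len=65536):
    """Dispatch: take the fused path when q_len is within the supported
    range, otherwise fall back to the per-block loop."""
    if len(q) <= max_fused_q_len:
        return fused_attention(q, k, v, cu_seqlens)
    return blockwise_attention(q, k, v, cu_seqlens)
```

Because the masked positions receive zero softmax weight, both paths compute the same result, which is why the fused path can preserve identical model outputs.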

@github-actions bot commented Jan 7, 2026

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

@slokesha slokesha force-pushed the slokesha/qwen3_enable_mask branch from c245eb8 to e07204d Compare January 7, 2026 21:07

@slokesha (Contributor, Author) commented Jan 7, 2026

Performance:

FusedSDPA enabled (this PR):
Serving Benchmark Result

Successful requests: 100
Failed requests: 0
Maximum request concurrency: 8
Benchmark duration (s): 336.64
Total input tokens: 5280
Total generated tokens: 11827
Request throughput (req/s): 0.30
Output token throughput (tok/s): 35.13
Peak output token throughput (tok/s): 120.00
Peak concurrent requests: 15.00
Total token throughput (tok/s): 50.82
---------------Time to First Token----------------
Mean TTFT (ms): 10186.15
Median TTFT (ms): 10396.45
P99 TTFT (ms): 28860.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 138.33
Median TPOT (ms): 124.15
P99 TPOT (ms): 246.55
---------------Inter-token Latency----------------
Mean ITL (ms): 139.06
Median ITL (ms): 71.97
P99 ITL (ms): 2633.77

Main (baseline):
Serving Benchmark Result

Successful requests: 100
Failed requests: 0
Maximum request concurrency: 8
Benchmark duration (s): 334.12
Total input tokens: 5280
Total generated tokens: 11855
Request throughput (req/s): 0.30
Output token throughput (tok/s): 35.48
Peak output token throughput (tok/s): 113.00
Peak concurrent requests: 15.00
Total token throughput (tok/s): 51.28
---------------Time to First Token----------------
Mean TTFT (ms): 10806.36
Median TTFT (ms): 11990.35
P99 TTFT (ms): 22304.62
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 131.59
Median TPOT (ms): 111.62
P99 TPOT (ms): 231.90
---------------Inter-token Latency----------------
Mean ITL (ms): 131.67
Median ITL (ms): 73.13
P99 ITL (ms): 2470.87

@github-actions bot commented Jan 7, 2026

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@slokesha slokesha marked this pull request as ready for review January 7, 2026 22:00

@slokesha slokesha changed the title from "Use FuseSDPA for Qwen3_VL" to "Enable HPU Fused SDPA for Qwen3-VL vision attention using attention masks" Jan 7, 2026
