
Conversation

@skaulintel (Contributor)

port #759 to v0.14.0 release

libinta and others added 30 commits on December 23, 2025 at 14:11
Also overwrite the qwen3_vl function to use _merge_multimodal_embeddings
with index copy (sketched below).
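For context, a minimal sketch of what an index-copy-based embedding merge looks like. The function name, tensor shapes, and arguments here are illustrative assumptions, not the actual `_merge_multimodal_embeddings` in vllm_gaudi:

```python
import torch

def merge_multimodal_embeddings_sketch(inputs_embeds: torch.Tensor,
                                       mm_embeds: torch.Tensor,
                                       is_mm_token: torch.Tensor) -> torch.Tensor:
    # Overwrite the text-embedding rows at multimodal token positions.
    # Using index_copy_ with precomputed positions avoids the boolean
    # masked_scatter path, which tends to be slow on HPU.
    positions = is_mm_token.nonzero(as_tuple=True)[0]
    inputs_embeds.index_copy_(0, positions, mm_embeds)
    return inputs_embeds
```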
Format update
update for pre-commit error:
vllm_gaudi/models/qwen3_vl.py:62:25: F821 Undefined name `_require_is_multimodal`
Signed-off-by: Libin Tang <litang@habana.ai>
Following the reasoning stated in PR:
vllm-project#616

Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
…ect#837)

Signed-off-by: linoy buchnik <lbuchnik@habana.ai>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>
Adds support for cross-layer KV cache sharing on HPU, enabling models
like Gemma-3n that share KV cache between layers to run on Gaudi.

**Changes**
- hpu_attn.py: Store kv_sharing_target_layer_name and skip KV cache
writes for sharing layers (see the sketch after the Testing section)
- hpu_model_runner.py: Track shared layers, validate config, and set up
tensor sharing during initialization
- test_hpu_model_runner.py: Enable KV sharing unit tests

**Expected Benefits**
- Reduced KV cache memory usage for models with layer sharing
- Lower TTFT for long-context scenarios in supported models (e.g.,
Gemma-3n)

**Testing**
- Unit tests pass
- E2E validation with a KV-sharing model (e.g., Gemma-3n) pending
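As a rough illustration of the skip-write behavior described under Changes, here is a minimal sketch. The class name, method, and cache layout are assumptions for illustration and do not mirror the actual hpu_attn.py code:

```python
import torch

class SharedKVAttentionSketch:
    """Illustrative attention layer with optional cross-layer KV sharing."""

    def __init__(self, kv_sharing_target_layer_name: str | None = None):
        # When set, this layer reuses the KV cache written by the target
        # layer and must never write its own entries.
        self.kv_sharing_target_layer_name = kv_sharing_target_layer_name

    def write_kv(self, kv_cache: tuple[torch.Tensor, torch.Tensor],
                 key: torch.Tensor, value: torch.Tensor,
                 slot_mapping: torch.Tensor) -> None:
        if self.kv_sharing_target_layer_name is not None:
            # Sharing layer: skip the write. The model runner has already
            # aliased this layer's kv_cache to the target layer's tensors.
            return
        # Normal layer: scatter fresh K/V into the paged cache by slot.
        kv_cache[0].index_copy_(0, slot_mapping, key)
        kv_cache[1].index_copy_(0, slot_mapping, value)
```

Because a sharing layer's kv_cache tensors alias the target layer's at initialization, reads hit the shared cache without duplicating memory, which is where the expected savings come from.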

---------

Signed-off-by: jakub-sochacki <jakub.sochacki@intel.com>
Co-authored-by: jakub-sochacki <jakub.sochacki@intel.com>
Signed-off-by: Shiv Kaul <shiv.kaul@intel.com>
from vllm.v1.attention.backends.mla.common import MLACommonImpl
from vllm_gaudi.attention.ops.hpu_paged_attn import (HPUPagedAttention, HPUPagedAttentionMetadata,
HPUPagedAttentionMetadataBuilder)

Contributor
didn't change this file in the original PR

# See https://help.github.com/articles/about-codeowners/
# for more info about CODEOWNERS file

* @kzawora-intel @xuechendi @adobrzyn @mgawarkiewicz-intel @afierka-intel @michalkuligowski @iboiko-habana @kamil-kaczor @ksmusz
Contributor
didn't change this file in the original PR

query, key, value, attn_bias, 0.0, is_causal, scale, softmax_mode, recompute_mode, valid_seq_lengths,
padding_side
]

Contributor
didn't change this file in the original PR

@skaulintel changed the base branch from main to releases/v0.14.0 on January 21, 2026 at 19:13
@skaulintel (Contributor, Author)

moved to #858

@skaulintel closed this on Jan 21, 2026