
Conversation


@hlahkar hlahkar commented Oct 22, 2025

This PR enables GPT OSS for vllm-gaudi. Major features:

  1. Fused MoE with bias
  2. Attention with sinks enabled
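For context, a rough sketch of the sink mechanism as it is usually formulated for gpt-oss: each head carries a learnable sink logit that joins the softmax and is then dropped, so it only absorbs probability mass. The eager-mode math below is illustrative (causal masking omitted), not the fused Gaudi kernel:

```python
import torch

def sdpa_with_sinks(query, key, value, sinks, scale):
    # query/key/value: [batch, heads, seq, head_dim]; sinks: [heads] learnable logits
    scores = torch.matmul(query, key.transpose(-2, -1)) * scale
    # broadcast the per-head sink logit to every query position
    sink = sinks.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    # drop the sink column: it attends to nothing and only soaks up attention mass
    return torch.matmul(probs[..., :-1], value)
```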

@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

else:
    topk_weights = F.softmax(router_logits, dim=1, dtype=torch.float32)
    topk_weights, topk_ids = torch.topk(topk_weights, top_k, dim=-1)
    topk_weights /= topk_weights.sum(dim=-1, keepdim=True)
Contributor

Are we sure that L86 is not needed in the "if" branch of the loop? Please check once more against vllm-fork.

Author

Yes, this is done as per the vllm-fork code changes.

attn = attn.to(value.dtype)
block_sums = attn.sum(dim=-1, keepdim=True)
attn_shape = attn.shape
block_sums = attn.view(-1,attn_shape[-1]).sum(dim=-1, keepdim=True).view(attn_shape[0],attn_shape[1],attn_shape[2],attn_shape[3],1)
Contributor

For the final reshape we are assuming that the original attn shape was 4D and one of the dims was "1". This may not be the case for all models. Can we make this generic so that other models do not break due to this reshape?
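A possible shape-agnostic variant (a sketch only, assuming the flatten-then-restore pattern is what the kernel needs):

```python
# Flatten to 2D for the reduction, then restore the original leading dims
# with a trailing singleton, without hard-coding a 4D layout.
attn_shape = attn.shape
block_sums = (attn.reshape(-1, attn_shape[-1])
                  .sum(dim=-1, keepdim=True)
                  .reshape(*attn_shape[:-1], 1))
```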

block_bias = block_bias.view(key.size(0), 1, 1, -1)
sink = None
if sinks is not None:
    # sink = sinks.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)
Contributor

Delete the commented-out line.


attn_weights = fsdpa_op(*args)
attn_weights = attn_weights.transpose(1, 2)
htcore.mark_step()
Contributor

Can we apply this mark_step only when we have sinks, so that it is applicable only to gpt-oss? Also add a TODO comment that we should check whether we can remove this later.
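Something along these lines (a sketch of the requested change, assuming `sink` is only set on the gpt-oss path):

```python
# TODO: check whether this mark_step can be removed later.
if sink is not None:
    htcore.mark_step()
```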

attn_weights = fsdpa_op(query, key, value, attn_bias, 0.0, is_causal, scale, softmax_mode, recompute_mode,
                        valid_seq_lengths, 'right')
if window_size is not None:
    # causal window sdpa kernel only supports softmax None
Contributor

Add a TODO to check whether we can remove this later. Per the Ops team, softmax_mode = 'fast' should now work with window as well.
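For example (a sketch of where the requested TODO would go, not the final code):

```python
if window_size is not None:
    # TODO: per the Ops team, softmax_mode = 'fast' should now work with
    # windowed SDPA as well; check whether this override can be dropped.
    softmax_mode = 'None'
```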

if window_size is not None:
    # causal window sdpa kernel only supports softmax None
    softmax_mode = 'None'
    # padding_side ='left'
Contributor

Remove the commented-out line.

# Reshape the input keys and values and store them in the cache.
# If kv_cache is not provided, the new key and value tensors are
# not cached. This happens during the initial memory profiling run.
if key.dtype != key_cache.dtype:
Contributor

Why do we need this? It seems to have nothing to do with the gpt-oss changes.

Author

This is needed because the rotary_emb output is in fp32, so we need to cast q and k to bf16.
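A minimal sketch of the cast in question (names taken from the snippet above; the `query` handling and exact placement are assumptions):

```python
# rotary_emb can return fp32 outputs, so align q/k with the bf16 KV cache
# dtype before writing into the cache.
if key.dtype != key_cache.dtype:
    key = key.to(key_cache.dtype)
    query = query.to(key_cache.dtype)
```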

attn_bias = attn_metadata.window_attn_bias
if self.sliding_window:
    window_size = (
        128,
Contributor

Do we need to fix it to 128? It may be different for other models which use a sliding window, such as Gemma.

Author

> Do we need to fix it to 128? It may be different for other models which use a sliding window, such as Gemma.

This size should ideally come from the sliding-window size parameter; since that is not implemented yet, it is fixed to 128 for the time being.
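A possible follow-up, sketched under the assumption that `self.sliding_window` will eventually hold the model's window length:

```python
# Derive the left-context window from the model parameter instead of the
# hard-coded 128 (right context stays 0 for causal attention).
window_len = self.sliding_window if self.sliding_window else 128
window_size = (window_len, 0)
```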

@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

1 similar comment

hlahkar and others added 17 commits October 24, 2025 10:31
Signed-off-by: Himangshu Lahkar <hlahkar@habana.ai>
Signed-off-by: Himangshu Lahkar <hlahkar@habana.ai>
UT reference fix after bucketing changes in
vllm-project#355 and
vllm-project#350

---------

Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Current CI tests don't utilize fixed seeds, resulting in some minor
accuracy fluctuations that can sometimes fall just under the tolerance
threshold (likely due to random sampling). A better way would be to fix
the seeds and always expect the same results.

Signed-off-by: Konrad Zawora <kzawora@habana.ai>
`coordinate_batch_across_dp` does more work than what we need for DP padding here, so just implement the logic in the plugin.

Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: PatrykWo <patryk.wolsza@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…6 + #25103 + #25807 (vllm-project#366)

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: Chendi Xue <Chendi.Xue@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
…-project#370)

Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…ect#382)

Signed-off-by: Tadeusz Lipinski <tlipinski@habana.ai>
Signed-off-by: PatrykWo <patryk.wolsza@intel.com>
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Port from SW-240222. Without this, the latest Ray loses the HPU device.

Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
…rectly while defining model class instead of maintaining model specific M-RoPE implementation in mrope.py (vllm-project#388)

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Currently, linear doesn't work out of the box; this should help.

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
vllm-project#331)

[SW-241908] Fixes a regression in tests caused by invalid buckets generated when VLLM_PROMPT_BS_BUCKET_MAX is set and the number of tokens in the prefill batch exceeds max_num_batched_tokens. The regression is associated with vllm-project#224

This fix checks that the number of tokens in both the current and next token bucket does not exceed max_num_batched_tokens, and resolves the "ValueError: operands could not be broadcast together with shapes" exception

`VLLM_CONTIGUOUS_PA=false VLLM_DECODE_BLOCK_BUCKET_MAX=512 VLLM_USE_V1=1
VLLM_PROMPT_BS_BUCKET_MAX=16 vllm serve meta-llama/Llama-3.1-8B-Instruct
--dtype bfloat16 --tensor-parallel-size 1 --swap-space 16`

(EngineCore_DP0 pid=352574)   File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 4051, in warmup_model
(EngineCore_DP0 pid=352574)     self.warmup_graphs(
(EngineCore_DP0 pid=352574)   File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 3659, in warmup_graphs
(EngineCore_DP0 pid=352574)     self._prepare_dummy_scenario(prompt_cfg, decode_cfg)
(EngineCore_DP0 pid=352574)   File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 3897, in _prepare_dummy_scenario
(EngineCore_DP0 pid=352574)     self._execute_dummy_scenario(requests, scheduled_tokens)
(EngineCore_DP0 pid=352574)   File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 3928, in _execute_dummy_scenario
(EngineCore_DP0 pid=352574)     self.execute_model(sched_output, warmup_mode=True)
(EngineCore_DP0 pid=352574)   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=352574)     return func(*args, **kwargs)
(EngineCore_DP0 pid=352574)   File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 2914, in execute_model
(EngineCore_DP0 pid=352574)     prefill_input_data, decode_input_data = self._prepare_inputs(scheduler_output, num_prefills, num_decodes,
(EngineCore_DP0 pid=352574)   File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 2304, in _prepare_inputs
(EngineCore_DP0 pid=352574)     np.add(self.input_batch.num_computed_tokens_cpu[req_indices], arange, out=positions_np)
(EngineCore_DP0 pid=352574) ValueError: operands could not be broadcast together with shapes (6144,) (6144,) (2048,)

---------

Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
adobrzyn and others added 26 commits October 24, 2025 10:31
- [x] warmup functioning
- [x] no recompiles

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
Co-authored-by: Konrad Zawora <kzawora@habana.ai>
…or KV register and update host_buffer accordingly (vllm-project#411)

Implement for SW-242433

** Root cause for accuracy issue **

nixl_connector's register_kv_cache assumes any input kv_caches are 4D or 5D:
`[2, num_blocks, block_size, num_kv_heads, head_size]` or `[num_blocks, block_size, num_kv_heads, head_size]`

However, the HPU KV cache is a tuple of 3D tensors: `Tuple([num_blocks*block_size, num_kv_heads, head_size], ...)`

=> The different KV layout leads to incorrect num_blocks and data_ptr calculations in nixl_connector => wrong data gets copied.

** Solution **

1. Create a new KV_caches_4D dict for nixl_connector. This 4D KV_caches is a reference to the original KV_cache as a 4D view (same memory address); see the sketch below.
2. Fix `habana_frameworks.torch.utils.experimental._data_ptr`'s inability to fetch the address of a view tensor by using a global map from virtual to physical addresses.
3. Add a new TupleTensor class which treats a tuple as a Tensor in order to return shape, device, and dtype.
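A minimal sketch of step 1 (illustrative names; the actual wiring into nixl_connector is not shown):

```python
import torch

def as_4d_view(kv_cache_3d: torch.Tensor, block_size: int) -> torch.Tensor:
    # Expose an HPU KV cache of shape [num_blocks * block_size, num_kv_heads,
    # head_size] as [num_blocks, block_size, num_kv_heads, head_size] without
    # copying, so the registered region points at the same memory.
    total_tokens, num_kv_heads, head_size = kv_cache_3d.shape
    num_blocks = total_tokens // block_size
    return kv_cache_3d.view(num_blocks, block_size, num_kv_heads, head_size)
```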


** validation **

Tested with both Gaudi2Gaudi and Gaudi2CPU2Gaudi on "Qwen/Qwen3-0.6B" and "deepseek-ai/DeepSeek-V2-Lite-Chat".

All 4 cases get the expected accuracy.

---------

Signed-off-by: Chendi Xue <Chendi.Xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
### Test with multimodal-support for multiple images

- The current CI test for gemma3 only runs a single image per prompt, and its input seq_len is less than the current sliding_window length (1024).
- This new test is designed so that the total input length exceeds the default sliding_window length (1024), to help validate whether the sliding_window mechanism is actually working.

---------

Signed-off-by: Mohit Deopujari <mdeopujari@habana.ai>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…990 (vllm-project#413)

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Porting the FP8 calibration procedure from vllm-hpu-extension:
https://github.com/HabanaAI/vllm-hpu-extension/tree/main/calibration

---------

Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Culprit commit: vllm-project/vllm#27022

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: PatrykWo <patryk.wolsza@intel.com>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Patryk Wolsza <patryk.wolsza@intel.com>
SW-242362

Change DecodeTP=2, PrefillTP=1 for test

Issue:
- HPU attn uses NHD, and vllm upstream only supports prefill / decode heterogeneous TP with HND.

Solution:
 - init:
    hpu_attn with NHD -> host_buffer with HND

 - copy device to host:
    permute kv for req -> copy to host buffer

 - nixl_connector transfer host KV with HND + TP_ratio support

 - copy host to device
    permute kv for req -> copy to device
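A minimal sketch of the per-request permute used in the copy steps above (layout names as in this description, not the actual implementation):

```python
import torch

def nhd_to_hnd(kv_slice: torch.Tensor) -> torch.Tensor:
    # HPU attention keeps KV in NHD ([num_tokens, num_heads, head_size]);
    # the host buffer expects HND ([num_heads, num_tokens, head_size]).
    return kv_slice.permute(1, 0, 2).contiguous()
```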

=====

Validated, accuracy is good

--- 
FYI, no change is needed for MLA (DeepSeek)

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
…llm-project#385)

### Summary

This PR fixes a minor typo in the `installation.md` file where the
installation command incorrectly referenced `install_nixl.sh`. The
correct script name is `install_nixl.py`.

### Changes Made

- Updated `python install_nixl.sh` to `python install_nixl.py` in
installation.md.

### Why This Is Needed

The incorrect script name could lead to confusion or installation errors
for users following the documentation. This change ensures clarity and
accuracy in the setup instructions.

Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…egression (vllm-project#424)

Fix performance regression caused by missing warmup buckets associated
with vllm-project#331

Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
General formatting, fixes, some updates to docs

---------

Signed-off-by: PatrykWo <patryk.wolsza@intel.com>
Fix for "[Performance] Dual stream execution of "shared_experts" and
"selected_experts" inside FusedMoE #26440"
Similar to [Bugfix][CPU] Disable dual stream execution for experts on
CPU #27320

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
The title of this PR sounds like a random set of words put together to sound wise (or something from a passphrase generator, like "rose listen donkey wild function"), but I swear it is not.

Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: Patryk Wolsza <patryk.wolsza@intel.com>
Completed _Executing inference_ section of _Quickstart_ doc.

Signed-off-by: Paweł Olejniczak <polejniczakx@habana.ai>
Co-authored-by: Patryk Wolsza <patryk.wolsza@intel.com>
Set numpy to latest

Cherry-pick from releases/v0.11.0:
vllm-project#443

Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
This is to reduce the cached memory size for DP dispatch/combine when HPU graphs are enabled.

---------

Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: Himangshu Lahkar <hlahkar@habana.ai>
~Depends on vllm-project#226
See last commit added for this PR.

---------

Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Add bucketing from file - experimental use for now!

- [X] Prompt / decode read from file
- [X] Example file with example buckets
- Unified attention from file -> to be done in a different PR
- [X] Filter received buckets?
- [X] Ranges
- [ ] README

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…vllm-project#451)

INC calibration fails with a math domain error when the math.log2 op has an undefined result. The INC calibration script uses max_model_len=128, which results in max_ctx=0, and log2 of zero is undefined.
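One possible guard, sketched against the failing expression shown in the traceback below (this is not necessarily the exact fix in this PR):

```python
import math

# Hypothetical guard: avoid math.log2(0) when max_ctx == 0 (e.g. with the
# calibration script's max_model_len=128).
if max_ctx < 1:
    max_prompt_ctx_limit = 1
elif max_ctx == 1:
    max_prompt_ctx_limit = 2
else:
    max_prompt_ctx_limit = math.ceil(math.log2(max_ctx)) + 1
```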

`PT_HPU_LAZY_MODE=1 ./calibrate_model.sh -m meta-llama/Llama-3.1-70B -d
NeelNanda/pile-10k -o ./inc2 -b 1 -t 2 -l 5`


(Worker_TP1 pid=65456) INFO 10-22 22:53:05 [hpu_worker.py:242] Initializing cache engine took 74.67 GiB of device memory (109 GiB/126.5 GiB used) and -1.967 GiB of host memory (93.07 GiB/1007 GiB used)
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] WorkerProc hit an exception.
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] Traceback (most recent call last):
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/tmp/vllm/vllm/v1/executor/multiproc_executor.py", line 698, in worker_busy_loop
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     output = func(*args, **kwargs)
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/tmp/vllm/vllm/v1/worker/worker_base.py", line 305, in initialize_from_config
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     self.worker.initialize_from_config(kv_cache_config)  # type: ignore
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/tmp/vllm-gaudi/vllm_gaudi/v1/worker/hpu_worker.py", line 243, in initialize_from_config
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     self.compile_or_warm_up_model()
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/tmp/vllm-gaudi/vllm_gaudi/v1/worker/hpu_worker.py", line 249, in compile_or_warm_up_model
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     self.model_runner.warmup_model()
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     return func(*args, **kwargs)
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/tmp/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 4028, in warmup_model
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     self.bucketing_manager.generate_prompt_buckets()
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/tmp/vllm-gaudi/vllm_gaudi/extension/bucketing/common.py", line 110, in generate_prompt_buckets
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     bs_cfg, query_cfg, ctx_cfg = strategy.get_prompt_cfgs(max_num_prefill_seqs=self.max_num_prefill_seqs,
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/tmp/vllm-gaudi/vllm_gaudi/extension/bucketing/exponential.py", line 42, in get_prompt_cfgs
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     max_prompt_ctx_limit = 2 if max_ctx == 1 else math.ceil(math.log2(max_ctx)) + 1
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] ValueError: math domain error

Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
This line of code incorrectly filtered requirements to be installed:
`sed '/^[torch]/d' requirements/build.txt`
- it filters out all packages whose names start with t / o / r / c / h
 
So if requirements were:
`
cmake>=3.26.1 
ninja 
packaging>=24.2 
setuptools>=77.0.3,<80.0.0 
setuptools-scm>=8 
torch==2.8.0 
wheel 
jinja2>=3.1.6 
regex 
build  
`
we were skipping cmake, torch and regex.

`sed '/^torch/d' requirements/build.txt`
this would skip only the torch packages.

Signed-off-by: jakub-sochacki <jakub.sochacki@intel.com>
Signed-off-by: Himangshu Lahkar <hlahkar@habana.ai>
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@hlahkar
Author

hlahkar commented Oct 28, 2025

Tracking through #485

@hlahkar hlahkar closed this Oct 28, 2025