
Conversation


@hlahkar hlahkar commented Oct 22, 2025

This PR enables GPT OSS for vllm-gaudi. Major features:

  1. Fused MoE with bias
  2. Attention with sinks enabled
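For context, a rough sketch of the sink mechanism as it is usually formulated for gpt-oss: each head carries a learnable sink logit that joins the softmax and is then dropped, so it only absorbs probability mass. The eager-mode math below is illustrative (causal masking omitted), not the fused Gaudi kernel:

```python
import torch

def sdpa_with_sinks(query, key, value, sinks, scale):
    # query/key/value: [batch, heads, seq, head_dim]; sinks: [heads] learnable logits
    scores = torch.matmul(query, key.transpose(-2, -1)) * scale
    # broadcast the per-head sink logit to every query position
    sink = sinks.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    # drop the sink column: it attends to nothing and only soaks up attention mass
    return torch.matmul(probs[..., :-1], value)
```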

@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

else:
    topk_weights = F.softmax(router_logits, dim=1, dtype=torch.float32)
    topk_weights, topk_ids = torch.topk(topk_weights, top_k, dim=-1)
    topk_weights /= topk_weights.sum(dim=-1, keepdim=True)
Contributor

Are we sure that L86 is not needed in the "if" branch of the loop? Please check once more against vllm-fork.

Author

Yes, this is done as per the vllm-fork code changes.

attn = attn.to(value.dtype)
block_sums = attn.sum(dim=-1, keepdim=True)
attn_shape = attn.shape
block_sums = attn.view(-1,attn_shape[-1]).sum(dim=-1, keepdim=True).view(attn_shape[0],attn_shape[1],attn_shape[2],attn_shape[3],1)
Contributor

For the final reshape we are assuming that the original attn shape was 4D and one of the dims was "1". This may not be the case for all models. Can we make this generic so that other models do not break due to this reshape?
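A possible shape-agnostic variant (a sketch only, assuming the flatten-then-restore pattern is what the kernel needs):

```python
# Flatten to 2D for the reduction, then restore the original leading dims
# with a trailing singleton, without hard-coding a 4D layout.
attn_shape = attn.shape
block_sums = (attn.reshape(-1, attn_shape[-1])
                  .sum(dim=-1, keepdim=True)
                  .reshape(*attn_shape[:-1], 1))
```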

block_bias = block_bias.view(key.size(0), 1, 1, -1)
sink = None
if sinks is not None:
    # sink = sinks.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)
Contributor

Delete the commented-out line.


attn_weights = fsdpa_op(*args)
attn_weights = attn_weights.transpose(1, 2)
htcore.mark_step()
Contributor

Can we apply this mark_step only when we have sinks, so that it is applicable only to gpt-oss? Also add a TODO comment that we should check whether we can remove this later.
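Something along these lines (a sketch of the requested change, assuming `sink` is only set on the gpt-oss path):

```python
# TODO: check whether this mark_step can be removed later.
if sink is not None:
    htcore.mark_step()
```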

attn_weights = fsdpa_op(query, key, value, attn_bias, 0.0, is_causal, scale, softmax_mode, recompute_mode,
                        valid_seq_lengths, 'right')
if window_size is not None:
    # causal window sdpa kernel only supports softmax None
Contributor

Add a TODO to check whether we can remove this later. Per the Ops team, softmax_mode = 'fast' should now work with window as well.
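For example (a sketch of where the requested TODO would go, not the final code):

```python
if window_size is not None:
    # TODO: per the Ops team, softmax_mode = 'fast' should now work with
    # windowed SDPA as well; check whether this override can be dropped.
    softmax_mode = 'None'
```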

if window_size is not None:
    # causal window sdpa kernel only supports softmax None
    softmax_mode = 'None'
    # padding_side ='left'
Contributor

Remove the commented-out line.

# Reshape the input keys and values and store them in the cache.
# If kv_cache is not provided, the new key and value tensors are
# not cached. This happens during the initial memory profiling run.
if key.dtype != key_cache.dtype:
Contributor

Why do we need this? It seems to have nothing to do with the gpt-oss changes.

Author

This is needed because the rotary_emb output is in fp32, so we need to cast q and k to bf16.
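A minimal sketch of the cast in question (names taken from the snippet above; the `query` handling and exact placement are assumptions):

```python
# rotary_emb can return fp32 outputs, so align q/k with the bf16 KV cache
# dtype before writing into the cache.
if key.dtype != key_cache.dtype:
    key = key.to(key_cache.dtype)
    query = query.to(key_cache.dtype)
```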

attn_bias = attn_metadata.window_attn_bias
if self.sliding_window:
    window_size = (
        128,
Contributor

Do we need to fix it to 128? It may be different for other models which use a sliding window, such as Gemma.

Author

> Do we need to fix it to 128? It may be different for other models which use a sliding window, such as Gemma.

This size should ideally come from the sliding-window size parameter; since that is not implemented yet, it is fixed to 128 for the time being.
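A possible follow-up, sketched under the assumption that `self.sliding_window` will eventually hold the model's window length:

```python
# Derive the left-context window from the model parameter instead of the
# hard-coded 128 (right context stays 0 for causal attention).
window_len = self.sliding_window if self.sliding_window else 128
window_size = (window_len, 0)
```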

@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

1 similar comment

hlahkar and others added 17 commits October 24, 2025 10:31
Signed-off-by: Himangshu Lahkar <hlahkar@habana.ai>
Signed-off-by: Himangshu Lahkar <hlahkar@habana.ai>
UT reference fix after bucketing changes in
vllm-project#355 and
vllm-project#350

---------

Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Current CI tests don't utilize fixed seeds, resulting in some minor
accuracy fluctuations that can sometimes fall just under the tolerance
threshold (likely due to random sampling). A better way would be to fix
the seeds and always expect the same results.

Signed-off-by: Konrad Zawora <kzawora@habana.ai>
`coordinate_batch_across_dp` does more work than what we need for DP padding here, so just implement the logic in the plugin.

Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: PatrykWo <patryk.wolsza@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…6 + #25103 + #25807 (vllm-project#366)

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: Chendi Xue <Chendi.Xue@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
…-project#370)

Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…ect#382)

Signed-off-by: Tadeusz Lipinski <tlipinski@habana.ai>
Signed-off-by: PatrykWo <patryk.wolsza@intel.com>
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Port from SW-240222. Without this, the latest Ray loses the HPU device.

Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
…rectly while defining model class instead of maintaining model specific M-RoPE implementation in mrope.py (vllm-project#388)

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Currently, linear doesn't work out of the box; this should help.

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
vllm-project#331)

[SW-241908] Fixes a regression in tests caused by invalid buckets generated when VLLM_PROMPT_BS_BUCKET_MAX is set and the number of tokens in the prefill batch exceeds max_num_batched_tokens. The regression is associated with vllm-project#224

This fix checks that the number of tokens in both the current and next token bucket does not exceed max_num_batched_tokens, and resolves the "ValueError: operands could not be broadcast together with shapes" exception

`VLLM_CONTIGUOUS_PA=false VLLM_DECODE_BLOCK_BUCKET_MAX=512 VLLM_USE_V1=1
VLLM_PROMPT_BS_BUCKET_MAX=16 vllm serve meta-llama/Llama-3.1-8B-Instruct
--dtype bfloat16 --tensor-parallel-size 1 --swap-space 16`

(EngineCore_DP0 pid=352574)   File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 4051, in warmup_model
(EngineCore_DP0 pid=352574)     self.warmup_graphs(
(EngineCore_DP0 pid=352574)   File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 3659, in warmup_graphs
(EngineCore_DP0 pid=352574)     self._prepare_dummy_scenario(prompt_cfg, decode_cfg)
(EngineCore_DP0 pid=352574)   File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 3897, in _prepare_dummy_scenario
(EngineCore_DP0 pid=352574)     self._execute_dummy_scenario(requests, scheduled_tokens)
(EngineCore_DP0 pid=352574)   File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 3928, in _execute_dummy_scenario
(EngineCore_DP0 pid=352574)     self.execute_model(sched_output, warmup_mode=True)
(EngineCore_DP0 pid=352574)   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=352574)     return func(*args, **kwargs)
(EngineCore_DP0 pid=352574)   File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 2914, in execute_model
(EngineCore_DP0 pid=352574)     prefill_input_data, decode_input_data = self._prepare_inputs(scheduler_output, num_prefills, num_decodes,
(EngineCore_DP0 pid=352574)   File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 2304, in _prepare_inputs
(EngineCore_DP0 pid=352574)     np.add(self.input_batch.num_computed_tokens_cpu[req_indices], arange, out=positions_np)
(EngineCore_DP0 pid=352574) ValueError: operands could not be broadcast together with shapes (6144,) (6144,) (2048,)

---------

Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
adobrzyn and others added 26 commits October 24, 2025 10:31
- [x] warmup functioning
- [x] no recompiles

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
Co-authored-by: Konrad Zawora <kzawora@habana.ai>
…or KV register and update host_buffer accordingly (vllm-project#411)

Implement for SW-242433

** Root cause for accuracy issue **

nixl_connector's register_kv_cache assumes any input kv_caches are 4D or 5D:
`[2, num_blocks, block_size, num_kv_heads, head_size]` or `[num_blocks, block_size, num_kv_heads, head_size]`

However, the HPU KV cache is a tuple of 3D tensors: `Tuple([num_blocks*block_size, num_kv_heads, head_size], ...)`

=> The different KV layout leads to incorrect num_blocks and data_ptr calculations in nixl_connector => wrong data gets copied.

** Solution **

1. Create a new KV_caches_4D dict for nixl_connector. This 4D KV_caches is a reference to the original KV_cache as a 4D view (same memory address); see the sketch below.
2. Fix `habana_frameworks.torch.utils.experimental._data_ptr`'s inability to fetch the address of a view tensor by using a global map from virtual to physical addresses.
3. Add a new TupleTensor class which treats a tuple as a Tensor in order to return shape, device, and dtype.
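A minimal sketch of step 1 (illustrative names; the actual wiring into nixl_connector is not shown):

```python
import torch

def as_4d_view(kv_cache_3d: torch.Tensor, block_size: int) -> torch.Tensor:
    # Expose an HPU KV cache of shape [num_blocks * block_size, num_kv_heads,
    # head_size] as [num_blocks, block_size, num_kv_heads, head_size] without
    # copying, so the registered region points at the same memory.
    total_tokens, num_kv_heads, head_size = kv_cache_3d.shape
    num_blocks = total_tokens // block_size
    return kv_cache_3d.view(num_blocks, block_size, num_kv_heads, head_size)
```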


** validation **

Tested with both Gaudi2Gaudi and Gaudi2CPU2Gaudi on "Qwen/Qwen3-0.6B" and "deepseek-ai/DeepSeek-V2-Lite-Chat".

All 4 cases get the expected accuracy.

---------

Signed-off-by: Chendi Xue <Chendi.Xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
### Test with multimodal-support for multiple images

- The current CI test for gemma3 only runs a single image per prompt, and its input seq_len is less than the current sliding_window length (1024).
- This new test is designed so that the total input length exceeds the default sliding_window length (1024), to help validate whether the sliding_window mechanism is actually working.

---------

Signed-off-by: Mohit Deopujari <mdeopujari@habana.ai>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…990 (vllm-project#413)

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Porting the FP8 calibration procedure from vllm-hpu-extension:
https://github.com/HabanaAI/vllm-hpu-extension/tree/main/calibration

---------

Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Culprit commit: vllm-project/vllm#27022

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: PatrykWo <patryk.wolsza@intel.com>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Patryk Wolsza <patryk.wolsza@intel.com>
SW-242362

Change DecodeTP=2, PrefillTP=1 for test

Issue:
- HPU attn uses NHD, and vllm upstream only supports prefill / decode heterogeneous TP with HND.

Solution:
 - init:
    hpu_attn with NHD -> host_buffer with HND

 - copy device to host:
    permute kv for req -> copy to host buffer

 - nixl_connector transfer host KV with HND + TP_ratio support

 - copy host to device
    permute kv for req -> copy to device
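A minimal sketch of the per-request permute used in the copy steps above (layout names as in this description, not the actual implementation):

```python
import torch

def nhd_to_hnd(kv_slice: torch.Tensor) -> torch.Tensor:
    # HPU attention keeps KV in NHD ([num_tokens, num_heads, head_size]);
    # the host buffer expects HND ([num_heads, num_tokens, head_size]).
    return kv_slice.permute(1, 0, 2).contiguous()
```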

=====

Validated, accuracy is good

--- 
FYI, no change is needed for MLA (DeepSeek)

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
…llm-project#385)

### Summary

This PR fixes a minor typo in the `installation.md` file where the
installation command incorrectly referenced `install_nixl.sh`. The
correct script name is `install_nixl.py`.

### Changes Made

- Updated `python install_nixl.sh` to `python install_nixl.py` in
installation.md.

### Why This Is Needed

The incorrect script name could lead to confusion or installation errors
for users following the documentation. This change ensures clarity and
accuracy in the setup instructions.

Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…egression (vllm-project#424)

Fix performance regression caused by missing warmup buckets associated
with vllm-project#331

Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
General formatting, fixes, some updates to docs

---------

Signed-off-by: PatrykWo <patryk.wolsza@intel.com>
Fix for "[Performance] Dual stream execution of "shared_experts" and
"selected_experts" inside FusedMoE #26440"
Similar to [Bugfix][CPU] Disable dual stream execution for experts on
CPU #27320

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
The title of this PR sounds like a random set of words put together to sound wise (or something from a passphrase generator, like "rose listen donkey wild function"), but I swear it is not.

Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: Patryk Wolsza <patryk.wolsza@intel.com>
Completed _Executing inference_ section of _Quickstart_ doc.

Signed-off-by: Paweł Olejniczak <polejniczakx@habana.ai>
Co-authored-by: Patryk Wolsza <patryk.wolsza@intel.com>
Set numpy to latest

Cherry-pick from releases/v0.11.0:
vllm-project#443

Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
This is to reduce the cached memory size for DP dispatch/combine when HPU graphs are enabled.

---------

Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: Himangshu Lahkar <hlahkar@habana.ai>
~Depends on vllm-project#226
See last commit added for this PR.

---------

Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Add bucketing from file - experimental use for now!

- [X] Prompt / decode read from file
- [X] Example file with example buckets
- Unified attention from file -> to be done in a different PR
- [X] Filter received buckets?
- [X] Ranges
- [ ] README

---------

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…vllm-project#451)

INC calibration fails with a math domain error when the math.log2 op has an undefined result. The INC calibration script uses max_model_len=128, which results in max_ctx=0, and log2 of zero is undefined.
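One possible guard, sketched against the failing expression shown in the traceback below (this is not necessarily the exact fix in this PR):

```python
import math

# Hypothetical guard: avoid math.log2(0) when max_ctx == 0 (e.g. with the
# calibration script's max_model_len=128).
if max_ctx < 1:
    max_prompt_ctx_limit = 1
elif max_ctx == 1:
    max_prompt_ctx_limit = 2
else:
    max_prompt_ctx_limit = math.ceil(math.log2(max_ctx)) + 1
```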

`PT_HPU_LAZY_MODE=1 ./calibrate_model.sh -m meta-llama/Llama-3.1-70B -d
NeelNanda/pile-10k -o ./inc2 -b 1 -t 2 -l 5`


(Worker_TP1 pid=65456) INFO 10-22 22:53:05 [hpu_worker.py:242] Initializing cache engine took 74.67 GiB of device memory (109 GiB/126.5 GiB used) and -1.967 GiB of host memory (93.07 GiB/1007 GiB used)
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] WorkerProc hit an exception.
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] Traceback (most recent call last):
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/tmp/vllm/vllm/v1/executor/multiproc_executor.py", line 698, in worker_busy_loop
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     output = func(*args, **kwargs)
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/tmp/vllm/vllm/v1/worker/worker_base.py", line 305, in initialize_from_config
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     self.worker.initialize_from_config(kv_cache_config)  # type: ignore
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/tmp/vllm-gaudi/vllm_gaudi/v1/worker/hpu_worker.py", line 243, in initialize_from_config
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     self.compile_or_warm_up_model()
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/tmp/vllm-gaudi/vllm_gaudi/v1/worker/hpu_worker.py", line 249, in compile_or_warm_up_model
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     self.model_runner.warmup_model()
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     return func(*args, **kwargs)
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/tmp/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 4028, in warmup_model
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     self.bucketing_manager.generate_prompt_buckets()
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/tmp/vllm-gaudi/vllm_gaudi/extension/bucketing/common.py", line 110, in generate_prompt_buckets
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     bs_cfg, query_cfg, ctx_cfg = strategy.get_prompt_cfgs(max_num_prefill_seqs=self.max_num_prefill_seqs,
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]   File "/tmp/vllm-gaudi/vllm_gaudi/extension/bucketing/exponential.py", line 42, in get_prompt_cfgs
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703]     max_prompt_ctx_limit = 2 if max_ctx == 1 else math.ceil(math.log2(max_ctx)) + 1
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] ValueError: math domain error

Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
This line of code incorrectly filtered requirements to be installed:
`sed '/^[torch]/d' requirements/build.txt`
- it filters out all packages whose names start with t / o / r / c / h
 
So if requirements were:
`
cmake>=3.26.1 
ninja 
packaging>=24.2 
setuptools>=77.0.3,<80.0.0 
setuptools-scm>=8 
torch==2.8.0 
wheel 
jinja2>=3.1.6 
regex 
build  
`
we were skipping cmake, torch and regex.

`sed '/^torch/d' requirements/build.txt`
this would skip only the torch packages.

Signed-off-by: jakub-sochacki <jakub.sochacki@intel.com>
Signed-off-by: Himangshu Lahkar <hlahkar@habana.ai>
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@hlahkar
Author

hlahkar commented Oct 28, 2025

Tracking through #485

@hlahkar hlahkar closed this Oct 28, 2025