Gpt Oss Enablement #441
Conversation
🚧 CI Blocked: The main CI workflow was not started.
    else:
        topk_weights = F.softmax(router_logits, dim=1, dtype=torch.float32)
        topk_weights, topk_ids = torch.topk(topk_weights, top_k, dim=-1)
        topk_weights /= topk_weights.sum(dim=-1, keepdim=True)
Are we sure that L86 is not needed in the "if" branch of the loop? Please check once more against vllm-fork.
Yes, this is done as per the vllm-fork code changes.
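For context, a minimal runnable sketch of the routing pattern under discussion (the function name and the `renormalize` flag are illustrative, not necessarily the exact vllm-fork code):

```python
import torch
import torch.nn.functional as F

def select_experts(router_logits: torch.Tensor, top_k: int, renormalize: bool):
    # Softmax over all experts first, then keep the top-k weights per token.
    topk_weights = F.softmax(router_logits, dim=-1, dtype=torch.float32)
    topk_weights, topk_ids = torch.topk(topk_weights, top_k, dim=-1)
    if renormalize:
        # Re-normalize so the selected weights sum to 1 per token
        # (the line whose placement in the "if" branch is being questioned).
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids

# Example: 4 tokens, 8 experts, 2 experts per token.
weights, ids = select_experts(torch.randn(4, 8), top_k=2, renormalize=True)
print(weights.sum(dim=-1))  # ~1.0 for every token
```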
vllm_gaudi/extension/ops.py (Outdated)
    attn = attn.to(value.dtype)
    block_sums = attn.sum(dim=-1, keepdim=True)
    attn_shape = attn.shape
    block_sums = attn.view(-1, attn_shape[-1]).sum(dim=-1, keepdim=True).view(attn_shape[0], attn_shape[1], attn_shape[2], attn_shape[3], 1)
For the final reshape we are assuming that the original attn shape was 4D and that one of the dims was 1. This may not be the case for all models. Can we make this generic so that other models do not break due to this reshape?
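One possible rank-agnostic variant (a sketch only; the helper name and shapes are illustrative):

```python
import torch

def blockwise_sums(attn: torch.Tensor) -> torch.Tensor:
    # Flatten all leading dims, sum over the last one, then restore the
    # original leading dims -- no assumption that attn is 4D with a dim of 1.
    attn_shape = attn.shape
    sums = attn.view(-1, attn_shape[-1]).sum(dim=-1, keepdim=True)
    return sums.view(*attn_shape[:-1], 1)

print(blockwise_sums(torch.randn(2, 3, 4, 8, 16)).shape)  # torch.Size([2, 3, 4, 8, 1])
print(blockwise_sums(torch.randn(5, 7)).shape)            # torch.Size([5, 1])
```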
vllm_gaudi/extension/ops.py (Outdated)
    block_bias = block_bias.view(key.size(0), 1, 1, -1)
    sink = None
    if sinks is not None:
        # sink = sinks.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)
Delete the commented line.
vllm_gaudi/extension/ops.py (Outdated)
    attn_weights = fsdpa_op(*args)
    attn_weights = attn_weights.transpose(1, 2)
    htcore.mark_step()
Can we apply this mark_step only when we have a sink, so that it is applicable only to gpt-oss? Also add a TODO comment noting that we should check whether we can remove this later.
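A hedged sketch of scoping the graph break as suggested (the import guard is only so the snippet runs off-HPU; `maybe_mark_step` is an illustrative helper, not the actual change):

```python
import torch

try:
    import habana_frameworks.torch.core as htcore  # only present on Gaudi installs
except ImportError:
    htcore = None

def maybe_mark_step(sink) -> None:
    # TODO: check later whether this explicit mark_step can be dropped entirely;
    # for now it is only needed on the attention-sink (gpt-oss) path.
    if sink is not None and htcore is not None:
        htcore.mark_step()

maybe_mark_step(sink=None)            # no-op for models without sinks
maybe_mark_step(sink=torch.zeros(8))  # graph break only when a sink exists
```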
    attn_weights = fsdpa_op(query, key, value, attn_bias, 0.0, is_causal, scale, softmax_mode, recompute_mode,
                            valid_seq_lengths, 'right')
    if window_size is not None:
        # causal window sdpa kernel only supports softmax None
Add a TODO to check whether we can remove this later. As per the Ops team, softmax_mode = 'fast' should now also work with a window.
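A sketch of how the special case and its TODO might read (the helper name is illustrative):

```python
from typing import Optional, Tuple

def pick_softmax_mode(window_size: Optional[Tuple[int, int]], default_mode: str = 'fast') -> str:
    # TODO: re-check whether this special case can be removed -- per the Ops
    # team, softmax_mode='fast' should now also work with a causal window.
    if window_size is not None:
        return 'None'  # causal-window SDPA kernel currently only supports softmax None
    return default_mode

print(pick_softmax_mode(None))       # 'fast'
print(pick_softmax_mode((128, 0)))   # 'None'
```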
vllm_gaudi/extension/ops.py (Outdated)
    if window_size is not None:
        # causal window sdpa kernel only supports softmax None
        softmax_mode = 'None'
        # padding_side ='left'
Remove the commented line.
    # Reshape the input keys and values and store them in the cache.
    # If kv_cache is not provided, the new key and value tensors are
    # not cached. This happens during the initial memory profiling run.
    if key.dtype != key_cache.dtype:
Why do we need this? It seems to have nothing to do with the gpt-oss changes.
This is needed because the rotary_emb output is in fp32, so we need to cast q and k to bf16.
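A minimal sketch of the cast being discussed (the cache layout and helper name are illustrative, not the actual `ops.py` code):

```python
import torch

def write_key_to_cache(key: torch.Tensor, key_cache: torch.Tensor, slots: torch.Tensor):
    # rotary_emb can hand back fp32 while the KV cache is allocated in bf16,
    # so cast before scattering into the cache.
    if key.dtype != key_cache.dtype:
        key = key.to(key_cache.dtype)
    key_cache.index_copy_(0, slots, key)
    return key_cache

cache = torch.zeros(16, 4, 64, dtype=torch.bfloat16)  # [slots, kv_heads, head_dim]
new_key = torch.randn(2, 4, 64, dtype=torch.float32)  # e.g. fp32 out of rotary_emb
write_key_to_cache(new_key, cache, torch.tensor([3, 7]))
```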
    attn_bias = attn_metadata.window_attn_bias
    if self.sliding_window:
        window_size = (
            128,
Do we need to fix it? Maybe different for other models which use sliding window such as GEMMA
This size should ideally come from the sliding-window size parameter; since that is not implemented yet, it is fixed to 128 for the time being.
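A sketch of deriving the tuple from the model config instead of the hard-coded 128 (the helper and the `(left, right)` convention follow the usual SDPA-style window argument and are assumptions here):

```python
from typing import Optional, Tuple

def make_window_size(sliding_window: Optional[int]) -> Optional[Tuple[int, int]]:
    # Use the model's own sliding_window length (e.g. different for Gemma)
    # rather than a fixed 128; None means full attention.
    if not sliding_window:
        return None
    return (sliding_window, 0)  # causal: look back `sliding_window` tokens, none ahead

print(make_window_size(128))    # (128, 0) -- the value used for gpt-oss today
print(make_window_size(4096))   # (4096, 0) -- e.g. a Gemma-style window
print(make_window_size(None))   # None
```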
Signed-off-by: Himangshu Lahkar <hlahkar@habana.ai>
Signed-off-by: Himangshu Lahkar <hlahkar@habana.ai>
UT reference fix after bucketing changes in vllm-project#355 and vllm-project#350 --------- Signed-off-by: Konrad Zawora <kzawora@habana.ai>
…ct#373) Signed-off-by: Chendi Xue <Chendi.Xue@intel.com>
Current CI tests don't utilize fixed seeds, resulting in some minor accuracy fluctuations that can sometimes fall just under the tolerance threshold (likely due to random sampling). A better way would be to fix the seeds and always expect the same results. Signed-off-by: Konrad Zawora <kzawora@habana.ai>
`coordinate_batch_across_dp` does more work than what we need for dp padding here, so just implement the logic in plugin. Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com> Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: PatrykWo <patryk.wolsza@intel.com> Co-authored-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…6 + #25103 + #25807 (vllm-project#366) Signed-off-by: Iryna Boiko <iboiko@habana.ai> Signed-off-by: Chendi Xue <Chendi.Xue@intel.com> Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
…-project#370) Co-authored-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…ect#382) Signed-off-by: Tadeusz Lipinski <tlipinski@habana.ai>
Signed-off-by: PatrykWo <patryk.wolsza@intel.com>
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
Port from SW-240222. The latest Ray will lose the HPU device. Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
…rectly while defining model class instead of maintaining model specific M-RoPE implementation in mrope.py (vllm-project#388) Signed-off-by: Iryna Boiko <iboiko@habana.ai> Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com> Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Currently, out-of-the-box linear doesn't work; this should help. --------- Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai> Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com> Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
vllm-project#331) [SW-241908] Fixes a regression in tests due to invalid buckets generated if VLLM_PROMPT_BS_BUCKET_MAX is set and the number of tokens in the prefill batch exceeds max_num_batched_tokens. The regression is associated with vllm-project#224. This fix checks that the number of tokens in both the current and the next token bucket does not exceed max_num_batched_tokens, and resolves the "ValueError: operands could not be broadcast together with shape" exception.
`VLLM_CONTIGUOUS_PA=false VLLM_DECODE_BLOCK_BUCKET_MAX=512 VLLM_USE_V1=1 VLLM_PROMPT_BS_BUCKET_MAX=16 vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype bfloat16 --tensor-parallel-size 1 --swap-space 16`
(EngineCore_DP0 pid=352574) File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 4051, in warmup_model
(EngineCore_DP0 pid=352574) self.warmup_graphs(
(EngineCore_DP0 pid=352574) File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 3659, in warmup_graphs
(EngineCore_DP0 pid=352574) self._prepare_dummy_scenario(prompt_cfg, decode_cfg)
(EngineCore_DP0 pid=352574) File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 3897, in _prepare_dummy_scenario
(EngineCore_DP0 pid=352574) self._execute_dummy_scenario(requests, scheduled_tokens)
(EngineCore_DP0 pid=352574) File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 3928, in _execute_dummy_scenario
(EngineCore_DP0 pid=352574) self.execute_model(sched_output, warmup_mode=True)
(EngineCore_DP0 pid=352574) File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=352574) return func(*args, **kwargs)
(EngineCore_DP0 pid=352574) File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 2914, in execute_model
(EngineCore_DP0 pid=352574) prefill_input_data, decode_input_data = self._prepare_inputs(scheduler_output, num_prefills, num_decodes,
(EngineCore_DP0 pid=352574) File "/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 2304, in _prepare_inputs
(EngineCore_DP0 pid=352574) np.add(self.input_batch.num_computed_tokens_cpu[req_indices], arange, out=positions_np)
(EngineCore_DP0 pid=352574) ValueError: operands could not be broadcast together with shapes (6144,) (6144,) (2048,)
--------- Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com> Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
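A minimal sketch of the kind of token-budget guard this fix describes (names are illustrative, not the actual bucketing code):

```python
def bucket_fits(batch_size: int, query_len: int, max_num_batched_tokens: int) -> bool:
    # Drop prompt buckets whose total token count would exceed the
    # scheduler's max_num_batched_tokens budget.
    return batch_size * query_len <= max_num_batched_tokens

print(bucket_fits(16, 2048, 8192))  # False -- 32768 tokens would overflow the budget
print(bucket_fits(4, 2048, 8192))   # True
```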
- [x] warmup functioning
- [x] no recompiles
--------- Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai> Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com> Co-authored-by: Konrad Zawora <kzawora@habana.ai>
…or KV register and update host_buffer accordingly (vllm-project#411) Implements SW-242433.
Root cause for the accuracy issue: nixl_connector's register_kv_cache assumes any input kv_caches are 4D or 5D, i.e. `[2, num_blocks, block_size, num_kv_heads, head_size]` or `[num_blocks, block_size, num_kv_heads, head_size]`. However, the HPU KV cache is a 3D tuple: `Tuple([num_blocks*block_size, num_kv_heads, head_size], ...)`. The different KV layout leads to incorrect num_blocks and data_ptr calculation in nixl_connector, which leads to the wrong data being copied.
Solution:
1. Create a new KV_caches_4D dict for nixl_connector. This 4D KV_caches is a reference to the original KV cache with a 4D view (same memory address).
2. Fix the inability of `habana_frameworks.torch.utils.experimental._data_ptr` to fetch the address of a view tensor by using a global map from virtual to physical addresses.
3. Add a new TupleTensor class which treats a tuple as a Tensor to return shape, device, and dtype.
Validation: tested both Gaudi2Gaudi and Gaudi2CPU2Gaudi with "Qwen/Qwen3-0.6B" and "deepseek-ai/DeepSeek-V2-Lite-Chat"; all 4 cases get the expected accuracy.
--------- Signed-off-by: Chendi Xue <Chendi.Xue@intel.com> Signed-off-by: Chendi Xue <chendi.xue@intel.com>
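An illustrative sketch of the zero-copy 4D view described above (shapes are made up; this is not the actual connector code):

```python
import torch

num_blocks, block_size, num_kv_heads, head_size = 8, 16, 4, 64

# HPU stores each KV component flat: [num_blocks * block_size, num_kv_heads, head_size]
hpu_keys = torch.zeros(num_blocks * block_size, num_kv_heads, head_size)

# 4D view in the layout nixl_connector expects; same storage, no copy.
keys_4d = hpu_keys.view(num_blocks, block_size, num_kv_heads, head_size)
assert keys_4d.data_ptr() == hpu_keys.data_ptr()
print(keys_4d.shape)  # torch.Size([8, 16, 4, 64])
```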
### Test with multimodal support for multiple images
- The current CI test for gemma3 only runs a single image per prompt, and its input seq_len is less than the current sliding_window length (1024).
- This new test is designed such that the total input length exceeds the default sliding_window length (1024), to help validate whether the sliding_window mechanism is actually working.
--------- Signed-off-by: Mohit Deopujari <mdeopujari@habana.ai> Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…990 (vllm-project#413) Signed-off-by: Iryna Boiko <iboiko@habana.ai> Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Porting the FP8 calibration procedure from vllm-hpu-extension: https://github.com/HabanaAI/vllm-hpu-extension/tree/main/calibration --------- Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Culprit commit: vllm-project/vllm#27022 --------- Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: PatrykWo <patryk.wolsza@intel.com> Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com> Co-authored-by: Patryk Wolsza <patryk.wolsza@intel.com>
…vllm-project#427) Signed-off-by: Iryna Boiko <iboiko@habana.ai>
SW-242362
Change DecodeTP=2, PrefillTP=1 for test
Issue:
- HPU attn is using NHD, and vllm upstream only supports prefill / decode heterogeneous TP with HND.
Solution:
- init:
hpu_attn with NHD -> host_buffer with HND
- copy device to host:
permute kv for req -> copy to host buffer
- nixl_connector transfer host KV with HND + TP_ratio support
- copy host to device
permute kv for req -> copy to device
=====
Validated, accuracy is good
---
FYI, no change is needed for MLA (DeepSeek)
---------
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
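A toy sketch of the NHD-to-HND permute used when staging KV into the host buffer (shapes and the helper name are illustrative):

```python
import torch

def nhd_to_hnd(kv_nhd: torch.Tensor) -> torch.Tensor:
    # NHD: [num_tokens, num_heads, head_dim] -> HND: [num_heads, num_tokens, head_dim],
    # matching the layout the upstream heterogeneous-TP path expects.
    return kv_nhd.permute(1, 0, 2).contiguous()

kv = torch.randn(256, 8, 128)   # NHD block on the HPU side
print(nhd_to_hnd(kv).shape)     # torch.Size([8, 256, 128]) -- HND host buffer
```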
…llm-project#385)
### Summary
This PR fixes a minor typo in the `installation.md` file where the installation command incorrectly referenced `install_nixl.sh`. The correct script name is `install_nixl.py`.
### Changes Made
- Updated `python install_nixl.sh` to `python install_nixl.py` in installation.md.
### Why This Is Needed
The incorrect script name could lead to confusion or installation errors for users following the documentation. This change ensures clarity and accuracy in the setup instructions.
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…egression (vllm-project#424) Fix performance regression caused by missing warmup buckets associated with vllm-project#331 Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
General formatting, fixes, some updates to docs --------- Signed-off-by: PatrykWo <patryk.wolsza@intel.com>
Fix for "[Performance] Dual stream execution of "shared_experts" and "selected_experts" inside FusedMoE #26440" Similar to [Bugfix][CPU] Disable dual stream execution for experts on CPU #27320 Signed-off-by: Iryna Boiko <iboiko@habana.ai>
the title of this pr sounds like random set of words put together to sound wise (or something from passphrase generator like "rose listen donkey wild function") but i swear it is not Signed-off-by: Konrad Zawora <kzawora@habana.ai> Co-authored-by: Patryk Wolsza <patryk.wolsza@intel.com>
Completed _Executing inference_ section of _Quickstart_ doc. Signed-off-by: Paweł Olejniczak <polejniczakx@habana.ai> Co-authored-by: Patryk Wolsza <patryk.wolsza@intel.com>
Set numpy to latest. Cherry-pick from releases/v0.11.0: vllm-project#443. Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
This is to reduce the cached memory size for DP dispatch/combine when HPU graphs are enabled. --------- Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com> Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: Himangshu Lahkar <hlahkar@habana.ai>
~Depends on vllm-project#226 See last commit added for this PR. --------- Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com> Co-authored-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Add bucketing from file - experimental use for now!
- [X] Prompt / decode read from file
- [X] Example file with example buckets
- Unified attention from file -> to be done in a different PR
- [X] Filter received buckets?
- [X] Ranges
- [ ] README
--------- Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai> Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
…vllm-project#451) INC calibration fails with a math domain error when the math.log2 op has an undefined result during INC calibration. The INC calibration script uses max_model_len=128, which results in max_ctx=0, and log2 of zero is undefined.
`PT_HPU_LAZY_MODE=1 ./calibrate_model.sh -m meta-llama/Llama-3.1-70B -d NeelNanda/pile-10k -o ./inc2 -b 1 -t 2 -l 5`
(Worker_TP1 pid=65456) INFO 10-22 22:53:05 [hpu_worker.py:242] Initializing cache engine took 74.67 GiB of device memory (109 GiB/126.5 GiB used) and -1.967 GiB of host memory (93.07 GiB/1007 GiB used)
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] WorkerProc hit an exception.
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] Traceback (most recent call last):
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] File "/tmp/vllm/vllm/v1/executor/multiproc_executor.py", line 698, in worker_busy_loop
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] output = func(*args, **kwargs)
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] File "/tmp/vllm/vllm/v1/worker/worker_base.py", line 305, in initialize_from_config
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] self.worker.initialize_from_config(kv_cache_config)  # type: ignore
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] File "/tmp/vllm-gaudi/vllm_gaudi/v1/worker/hpu_worker.py", line 243, in initialize_from_config
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] self.compile_or_warm_up_model()
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] File "/tmp/vllm-gaudi/vllm_gaudi/v1/worker/hpu_worker.py", line 249, in compile_or_warm_up_model
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] self.model_runner.warmup_model()
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] return func(*args, **kwargs)
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] File "/tmp/vllm-gaudi/vllm_gaudi/v1/worker/hpu_model_runner.py", line 4028, in warmup_model
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] self.bucketing_manager.generate_prompt_buckets()
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] File "/tmp/vllm-gaudi/vllm_gaudi/extension/bucketing/common.py", line 110, in generate_prompt_buckets
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] bs_cfg, query_cfg, ctx_cfg = strategy.get_prompt_cfgs(max_num_prefill_seqs=self.max_num_prefill_seqs,
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] File "/tmp/vllm-gaudi/vllm_gaudi/extension/bucketing/exponential.py", line 42, in get_prompt_cfgs
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] max_prompt_ctx_limit = 2 if max_ctx == 1 else math.ceil(math.log2(max_ctx)) + 1
(Worker_TP1 pid=65456) ERROR 10-22 22:53:05 [multiproc_executor.py:703] ValueError: math domain error
Signed-off-by: Kavulya, Soila P <soila.p.kavulya@intel.com>
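One possible guard against the `max_ctx == 0` case (a sketch of the idea only, not necessarily the committed fix):

```python
import math

def max_prompt_ctx_limit(max_ctx: int) -> int:
    # math.log2(0) raises a math domain error; treat max_ctx <= 1 (which can
    # happen with very small max_model_len, e.g. 128) as the minimal range.
    if max_ctx <= 1:
        return 2
    return math.ceil(math.log2(max_ctx)) + 1

print(max_prompt_ctx_limit(0))     # 2 -- previously raised ValueError
print(max_prompt_ctx_limit(1))     # 2
print(max_prompt_ctx_limit(4096))  # 13
```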
This line of code incorrectly filtered the requirements to be installed: `sed '/^[torch]/d' requirements/build.txt` filters out all packages whose names start with t, o, r, c, or h. So if the requirements were:
cmake>=3.26.1
ninja
packaging>=24.2
setuptools>=77.0.3,<80.0.0
setuptools-scm>=8
torch==2.8.0
wheel
jinja2>=3.1.6
regex
build
we were skipping cmake, torch, and regex. `sed '/^torch/d' requirements/build.txt` skips only the torch packages.
Signed-off-by: jakub-sochacki <jakub.sochacki@intel.com>
Signed-off-by: Himangshu Lahkar <hlahkar@habana.ai>
a95fee6 to 38170ff
🚧 CI Blocked: The main CI workflow was not started.
Tracking through #485
This PR enables GPT OSS for vllm-gaudi. Major features: