
Conversation

@xwu-intel
Contributor

xwu-intel commented Jan 4, 2026

  • Add a comprehensive profile_run implementation to replace the placeholder
  • Set up dummy KV caches using bind_kv_cache for proper memory initialization
  • Use the existing _prepare_dummy_scenario infrastructure for profiling
  • Support unified attention (see the illustrative sketch below)
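
For orientation, the sketch below outlines the general flow of such a profiling pass: allocate properly shaped dummy KV caches, run one worst-case dummy batch, and discard the caches afterwards. It is a minimal illustrative stand-in, not the code in this PR: the names build_dummy_kv_caches and profile_run_sketch, the shape constants, and the plain torch.zeros allocation are assumptions, while the actual implementation relies on bind_kv_cache and _prepare_dummy_scenario as listed above. On Gaudi the device would be 'hpu'; 'cpu' is used here so the sketch runs anywhere.

# Illustrative sketch only; not the PR's implementation.
import torch

def build_dummy_kv_caches(num_layers, num_blocks, block_size,
                          num_kv_heads, head_size,
                          dtype=torch.bfloat16, device="cpu"):
    # Allocate properly shaped (not empty) dummy K/V tensors per layer.
    shape = (num_blocks, block_size, num_kv_heads, head_size)
    return [(torch.zeros(shape, dtype=dtype, device=device),
             torch.zeros(shape, dtype=dtype, device=device))
            for _ in range(num_layers)]

def profile_run_sketch(model, max_num_batched_tokens, device="cpu"):
    # Run one worst-case dummy batch so peak activation/workspace memory is
    # observed before the real KV-cache budget is decided. In the real runner
    # the caches would be bound to the attention layers (bind_kv_cache); here
    # they only stand in for that allocation.
    kv_caches = build_dummy_kv_caches(num_layers=2, num_blocks=8,
                                      block_size=128, num_kv_heads=8,
                                      head_size=128, device=device)
    dummy_token_ids = torch.zeros(1, max_num_batched_tokens,
                                  dtype=torch.long, device=device)
    with torch.no_grad():
        model(dummy_token_ids)   # forward pass exercises the peak workspace
    del kv_caches                # dummy caches are discarded after profiling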

Copilot AI review requested due to automatic review settings January 4, 2026 09:50
Contributor

Copilot AI left a comment


Pull request overview

This PR implements a comprehensive profile_run method for the HPU model runner to replace the previous placeholder implementation. The main changes initialize proper dummy KV cache tensors with correct shapes and utilize existing dummy scenario infrastructure for profiling.

Key changes:

  • Set up dummy KV caches with proper shapes instead of empty tensors for profiling
  • Implement profile_run logic with support for unified attention scenarios
  • Add dynamic scale tensor creation based on quantization configuration

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Reviewed files:
  • vllm_gaudi/v1/worker/hpu_worker.py: creates properly shaped dummy KV cache tensors with dynamic scale support for profiling, instead of empty tensors
  • vllm_gaudi/v1/worker/hpu_model_runner.py: implements the profile_run method with batch size calculation and scenario preparation for unified and standard attention


@github-actions

github-actions bot commented Jan 4, 2026

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

xwu-intel and others added 10 commits January 5, 2026 04:56
- Add a comprehensive profile_run implementation to replace the placeholder
- Skip the profile run on decode instances, following the hpu_worker.py pattern
- Set up KV caches using bind_kv_cache for proper memory initialization
- Handle multimodal models with vision bucket management
- Use the existing _prepare_dummy_scenario infrastructure for profiling
- Enable proper memory profiling for HPUWorker.determine_num_available_blocks

Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@github-actions

github-actions bot commented Jan 5, 2026

✅ CI Passed

All checks passed successfully against the following vllm commit:
b3a2bdf1ac90748d58bf8c05f8d0095ede5c7eca

Comment on lines +212 to +217
if hpu_v_cache is None:
    hpu_v_scales = None
elif create_dynamic_scales:
    hpu_v_scales = torch.ones(kv_scales_shape, dtype=torch.bfloat16, device='hpu')
else:
    hpu_v_scales = None
Contributor


Suggested change: replace the if/elif/else above with a single expression:

hpu_v_scales = torch.ones(kv_scales_shape, dtype=torch.bfloat16, device='hpu') if (not self.model_config.use_mla and create_dynamic_scales) else None

Contributor Author


@wuxun-zhang Copilot asked me to change your previously suggested code style to the current one ...

Contributor


Okay; anyway, I don't think these if/else branches are good practice...
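
As an editorial aside, one way to address this style concern, assuming the surrounding code looks roughly like the snippet quoted above, is to move the branching into a small helper so the call site reads as a single line. This is a sketch of the reviewer's point, not the code that was merged; the helper name make_dynamic_scales is invented for illustration.

import torch

def make_dynamic_scales(cache, create_dynamic_scales, scales_shape, device="cpu"):
    # Return a ones-filled scale tensor only when a cache exists and dynamic
    # scales are requested; otherwise return None (same semantics as the
    # if/elif/else branching shown above).
    if cache is None or not create_dynamic_scales:
        return None
    return torch.ones(scales_shape, dtype=torch.bfloat16, device=device)

# Hypothetical call site (device would be 'hpu' on Gaudi):
# hpu_v_scales = make_dynamic_scales(hpu_v_cache, create_dynamic_scales,
#                                    kv_scales_shape, device='hpu')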

@xinyu-intel
Contributor

@hlin99 please review and try this one.

@hlin99
Contributor

hlin99 commented Jan 7, 2026

The traditional profile run uses (bs, seq) to estimate the memory footprint. Can we change it to (bs, seq, max ctx), considering chunked prefill/prefix caching is the default way on v1?

@xwu-intel
Contributor Author

> The traditional profile run uses (bs, seq) to estimate the memory footprint. Can we change it to (bs, seq, max ctx), considering chunked prefill/prefix caching is the default way on v1?

The profile run estimates the maximum workspace allocation memory footprint. Setting seq to max model len / max batched tokens already covers the maximum memory consumption, since using context from the KV cache only consumes less memory, right?

@hlin99
Contributor

hlin99 commented Jan 7, 2026

> The traditional profile run uses (bs, seq) to estimate the memory footprint. Can we change it to (bs, seq, max ctx), considering chunked prefill/prefix caching is the default way on v1?
>
> The profile run estimates the maximum workspace allocation memory footprint. Setting seq to max model len / max batched tokens already covers the maximum memory consumption, since using context from the KV cache only consumes less memory, right?

If model_len > max number of batched tokens:
    the traditional way is okay
else:
    the shape needs to be (1, batched tokens, model_len/batched tokens)

In the else path:

  1. When chunked prefill happens, we only allow bs=1 for some other reason, so bs=1 here.
  2. If we go the traditional way, seq_len = model_len may trigger OOM, while seq_len = batched tokens will underestimate the workspace... so in my mind, the right shape is (1, batched tokens, model_len/batched tokens) to estimate the workspace.

@xwu-intel
Contributor Author

xwu-intel commented Jan 9, 2026

> The traditional profile run uses (bs, seq) to estimate the memory footprint. Can we change it to (bs, seq, max ctx), considering chunked prefill/prefix caching is the default way on v1?
>
> The profile run estimates the maximum workspace allocation memory footprint. Setting seq to max model len / max batched tokens already covers the maximum memory consumption, since using context from the KV cache only consumes less memory, right?
>
> If model_len > max number of batched tokens, the traditional way is okay; else the shape needs to be (1, batched tokens, model_len/batched tokens).
>
> In the else path:
>
>   1. When chunked prefill happens, we only allow bs=1 for some other reason, so bs=1 here.
>   2. If we go the traditional way, seq_len = model_len may trigger OOM, while seq_len = batched tokens will underestimate the workspace... so in my mind, the right shape is (1, batched tokens, model_len/batched tokens) to estimate the workspace.

From what I have learned recently, when chunked prefill kicks in on V1, the prefill bs can be >1, according to the env var VLLM_PROMPT_BS_BUCKET_MAX (default 1, but it can be set manually).
What is model_len/batched tokens in the shape here? I thought it should be the number of context blocks?
I also think the workspace should be limited to max_num_batched_tokens here, since in every chunk that is the number of tokens being processed. The later chunks will reuse the workspace, right? We don't need to reserve workspace for the entire max_model_len?

@hlin99
Contributor

hlin99 commented Jan 10, 2026

> The traditional profile run uses (bs, seq) to estimate the memory footprint. Can we change it to (bs, seq, max ctx), considering chunked prefill/prefix caching is the default way on v1?
>
> The profile run estimates the maximum workspace allocation memory footprint. Setting seq to max model len / max batched tokens already covers the maximum memory consumption, since using context from the KV cache only consumes less memory, right?
>
> If model_len > max number of batched tokens, the traditional way is okay; else the shape needs to be (1, batched tokens, model_len/batched tokens).
>
> In the else path:
>
>   1. When chunked prefill happens, we only allow bs=1 for some other reason, so bs=1 here.
>   2. If we go the traditional way, seq_len = model_len may trigger OOM, while seq_len = batched tokens will underestimate the workspace... so in my mind, the right shape is (1, batched tokens, model_len/batched tokens) to estimate the workspace.
>
> From what I have learned recently, when chunked prefill kicks in on V1, the prefill bs can be >1, according to the env var VLLM_PROMPT_BS_BUCKET_MAX (default 1, but it can be set manually). What is model_len/batched tokens in the shape here? I thought it should be the number of context blocks? I also think the workspace should be limited to max_num_batched_tokens here, since in every chunk that is the number of tokens being processed. The later chunks will reuse the workspace, right? We don't need to reserve workspace for the entire max_model_len?

For example, the max batched tokens is 8K and the model len is 128K. The workspace estimated with 8K is not sufficient for the 128K model len, even if 128K is chunked into N x 8K. To estimate the workspace for this case, a shape of (1, 8K, max ctx number) for the profile run might be better.
Regarding bs=1 in chunked prefill, you can refer to #753.
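
To make the trade-off concrete, here is a hedged sketch of the shape-selection rule being debated above, interpreting the third dimension as the number of prior-context blocks (one of the two readings in the thread). The function name, the block size of 128, and the ceiling division are assumptions for illustration, not the logic that landed in this PR.

def choose_profile_shape(max_model_len, max_num_batched_tokens, block_size=128):
    # Pick (batch_size, query_len, num_context_blocks) for the dummy profile
    # run: if a whole sequence fits within one batch of tokens, profile
    # (1, seq) with no prior context; otherwise profile one chunk together
    # with the largest possible prior context, since later chunks attend over
    # up to max_model_len cached tokens.
    if max_model_len <= max_num_batched_tokens:
        return 1, max_model_len, 0
    max_context_tokens = max_model_len - max_num_batched_tokens
    num_context_blocks = -(-max_context_tokens // block_size)  # ceiling division
    return 1, max_num_batched_tokens, num_context_blocks

# With max batched tokens = 8K and model len = 128K, as in the example above,
# this yields bs=1, query_len=8192, and (131072 - 8192) / 128 = 960 context
# blocks, i.e. roughly the (1, 8K, max ctx) shape proposed in the thread.
print(choose_profile_shape(128 * 1024, 8 * 1024))   # -> (1, 8192, 960)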

Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
aa125ecf0edb9cd67656553d11d643aeb444ff9e

Collaborator

afierka-intel left a comment


Just one minor request: remove a redundant comment.

Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

Collaborator

afierka-intel left a comment


LGTM now :)

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
9ea07b41da169f727a2eb7302adec4c724319522

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
66652e8082b69ba7d1e6aca7c234433de55f1b9b

adobrzyn merged commit a40b090 into vllm-project:main Jan 15, 2026
52 checks passed
yeonsily pushed a commit to yeonsily/vllm-gaudi that referenced this pull request Jan 15, 2026
- Add a comprehensive profile_run implementation to replace the placeholder
- Set up dummy KV caches using bind_kv_cache for proper memory initialization
- Use the existing _prepare_dummy_scenario infrastructure for profiling
- Support unified attention

---------

Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>