
Conversation

@xwu-intel
Contributor

xwu-intel commented Jan 4, 2026

  • Add a comprehensive profile_run implementation to replace the placeholder
  • Set up dummy KV caches using bind_kv_cache for proper memory initialization
  • Use the existing _prepare_dummy_scenario infrastructure for profiling
  • Support unified attention (see the illustrative sketch below)
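
For orientation, the sketch below outlines the general flow of such a profiling pass: allocate properly shaped dummy KV caches, run one worst-case dummy batch, and discard the caches afterwards. It is a minimal illustrative stand-in, not the code in this PR: the names build_dummy_kv_caches and profile_run_sketch, the shape constants, and the plain torch.zeros allocation are assumptions, while the actual implementation relies on bind_kv_cache and _prepare_dummy_scenario as listed above. On Gaudi the device would be 'hpu'; 'cpu' is used here so the sketch runs anywhere.

# Illustrative sketch only; not the PR's implementation.
import torch

def build_dummy_kv_caches(num_layers, num_blocks, block_size,
                          num_kv_heads, head_size,
                          dtype=torch.bfloat16, device="cpu"):
    # Allocate properly shaped (not empty) dummy K/V tensors per layer.
    shape = (num_blocks, block_size, num_kv_heads, head_size)
    return [(torch.zeros(shape, dtype=dtype, device=device),
             torch.zeros(shape, dtype=dtype, device=device))
            for _ in range(num_layers)]

def profile_run_sketch(model, max_num_batched_tokens, device="cpu"):
    # Run one worst-case dummy batch so peak activation/workspace memory is
    # observed before the real KV-cache budget is decided. In the real runner
    # the caches would be bound to the attention layers (bind_kv_cache); here
    # they only stand in for that allocation.
    kv_caches = build_dummy_kv_caches(num_layers=2, num_blocks=8,
                                      block_size=128, num_kv_heads=8,
                                      head_size=128, device=device)
    dummy_token_ids = torch.zeros(1, max_num_batched_tokens,
                                  dtype=torch.long, device=device)
    with torch.no_grad():
        model(dummy_token_ids)   # forward pass exercises the peak workspace
    del kv_caches                # dummy caches are discarded after profiling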

Copilot AI review requested due to automatic review settings January 4, 2026 09:50
Contributor

Copilot AI left a comment


Pull request overview

This PR implements a comprehensive profile_run method for the HPU model runner to replace the previous placeholder implementation. The main changes initialize proper dummy KV cache tensors with correct shapes and utilize existing dummy scenario infrastructure for profiling.

Key changes:

  • Set up dummy KV caches with proper shapes instead of empty tensors for profiling
  • Implement profile_run logic with support for unified attention scenarios
  • Add dynamic scale tensor creation based on quantization configuration

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Reviewed files:
  • vllm_gaudi/v1/worker/hpu_worker.py: creates properly shaped dummy KV cache tensors with dynamic scale support for profiling, instead of empty tensors
  • vllm_gaudi/v1/worker/hpu_model_runner.py: implements the profile_run method with batch size calculation and scenario preparation for unified and standard attention


@github-actions

github-actions bot commented Jan 4, 2026

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

xwu-intel and others added 10 commits January 5, 2026 04:56
- Add a comprehensive profile_run implementation to replace the placeholder
- Skip the profile run on decode instances, following the hpu_worker.py pattern
- Set up KV caches using bind_kv_cache for proper memory initialization
- Handle multimodal models with vision bucket management
- Use the existing _prepare_dummy_scenario infrastructure for profiling
- Enable proper memory profiling for HPUWorker.determine_num_available_blocks

Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@github-actions

github-actions bot commented Jan 5, 2026

✅ CI Passed

All checks passed successfully against the following vllm commit:
b3a2bdf1ac90748d58bf8c05f8d0095ede5c7eca

Comment on lines +212 to +217
if hpu_v_cache is None:
    hpu_v_scales = None
elif create_dynamic_scales:
    hpu_v_scales = torch.ones(kv_scales_shape, dtype=torch.bfloat16, device='hpu')
else:
    hpu_v_scales = None
Contributor


Suggested change: replace the if/elif/else above with a single expression:

hpu_v_scales = torch.ones(kv_scales_shape, dtype=torch.bfloat16, device='hpu') if (not self.model_config.use_mla and create_dynamic_scales) else None

Contributor Author


@wuxun-zhang Copilot asked me to change your previously suggested code style to the current one ...

Contributor


Okay; anyway, I don't think these if/else branches are good practice...
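
As an editorial aside, one way to address this style concern, assuming the surrounding code looks roughly like the snippet quoted above, is to move the branching into a small helper so the call site reads as a single line. This is a sketch of the reviewer's point, not the code that was merged; the helper name make_dynamic_scales is invented for illustration.

import torch

def make_dynamic_scales(cache, create_dynamic_scales, scales_shape, device="cpu"):
    # Return a ones-filled scale tensor only when a cache exists and dynamic
    # scales are requested; otherwise return None (same semantics as the
    # if/elif/else branching shown above).
    if cache is None or not create_dynamic_scales:
        return None
    return torch.ones(scales_shape, dtype=torch.bfloat16, device=device)

# Hypothetical call site (device would be 'hpu' on Gaudi):
# hpu_v_scales = make_dynamic_scales(hpu_v_cache, create_dynamic_scales,
#                                    kv_scales_shape, device='hpu')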

@xinyu-intel
Contributor

@hlin99 please review and try this one.

@hlin99
Contributor

hlin99 commented Jan 7, 2026

The traditional profile run uses (bs, seq) to estimate the memory footprint. Can we change it to (bs, seq, max ctx), considering chunked prefill/prefix caching is the default way on v1?

@xwu-intel
Contributor Author

> The traditional profile run uses (bs, seq) to estimate the memory footprint. Can we change it to (bs, seq, max ctx), considering chunked prefill/prefix caching is the default way on v1?

The profile run estimates the maximum workspace allocation memory footprint. Setting seq to max model len / max batched tokens already covers the maximum memory consumption, since using context from the KV cache only consumes less memory, right?

@hlin99
Contributor

hlin99 commented Jan 7, 2026

> The traditional profile run uses (bs, seq) to estimate the memory footprint. Can we change it to (bs, seq, max ctx), considering chunked prefill/prefix caching is the default way on v1?
>
> The profile run estimates the maximum workspace allocation memory footprint. Setting seq to max model len / max batched tokens already covers the maximum memory consumption, since using context from the KV cache only consumes less memory, right?

If model_len > max number of batched tokens:
    the traditional way is okay
else:
    the shape needs to be (1, batched tokens, model_len/batched tokens)

In the else path:

  1. When chunked prefill happens, we only allow bs=1 for some other reason, so bs=1 here.
  2. If we go the traditional way, seq_len = model_len may trigger OOM, while seq_len = batched tokens will underestimate the workspace... so in my mind, the right shape is (1, batched tokens, model_len/batched tokens) to estimate the workspace.

@xwu-intel
Contributor Author

xwu-intel commented Jan 9, 2026

> The traditional profile run uses (bs, seq) to estimate the memory footprint. Can we change it to (bs, seq, max ctx), considering chunked prefill/prefix caching is the default way on v1?
>
> The profile run estimates the maximum workspace allocation memory footprint. Setting seq to max model len / max batched tokens already covers the maximum memory consumption, since using context from the KV cache only consumes less memory, right?
>
> If model_len > max number of batched tokens, the traditional way is okay; else the shape needs to be (1, batched tokens, model_len/batched tokens).
>
> In the else path:
>
>   1. When chunked prefill happens, we only allow bs=1 for some other reason, so bs=1 here.
>   2. If we go the traditional way, seq_len = model_len may trigger OOM, while seq_len = batched tokens will underestimate the workspace... so in my mind, the right shape is (1, batched tokens, model_len/batched tokens) to estimate the workspace.

From what I have learned recently, when chunked prefill kicks in on V1, the prefill bs can be >1, according to the env var VLLM_PROMPT_BS_BUCKET_MAX (default 1, but it can be set manually).
What is model_len/batched tokens in the shape here? I thought it should be the number of context blocks?
I also think the workspace should be limited to max_num_batched_tokens here, since in every chunk that is the number of tokens being processed. The later chunks will reuse the workspace, right? We don't need to reserve workspace for the entire max_model_len?

@hlin99
Contributor

hlin99 commented Jan 10, 2026

> The traditional profile run uses (bs, seq) to estimate the memory footprint. Can we change it to (bs, seq, max ctx), considering chunked prefill/prefix caching is the default way on v1?
>
> The profile run estimates the maximum workspace allocation memory footprint. Setting seq to max model len / max batched tokens already covers the maximum memory consumption, since using context from the KV cache only consumes less memory, right?
>
> If model_len > max number of batched tokens, the traditional way is okay; else the shape needs to be (1, batched tokens, model_len/batched tokens).
>
> In the else path:
>
>   1. When chunked prefill happens, we only allow bs=1 for some other reason, so bs=1 here.
>   2. If we go the traditional way, seq_len = model_len may trigger OOM, while seq_len = batched tokens will underestimate the workspace... so in my mind, the right shape is (1, batched tokens, model_len/batched tokens) to estimate the workspace.
>
> From what I have learned recently, when chunked prefill kicks in on V1, the prefill bs can be >1, according to the env var VLLM_PROMPT_BS_BUCKET_MAX (default 1, but it can be set manually). What is model_len/batched tokens in the shape here? I thought it should be the number of context blocks? I also think the workspace should be limited to max_num_batched_tokens here, since in every chunk that is the number of tokens being processed. The later chunks will reuse the workspace, right? We don't need to reserve workspace for the entire max_model_len?

For example, the max batched tokens is 8K and the model len is 128K. The workspace estimated with 8K is not sufficient for the 128K model len, even if 128K is chunked into N x 8K. To estimate the workspace for this case, a shape of (1, 8K, max ctx number) for the profile run might be better.
Regarding bs=1 in chunked prefill, you can refer to #753.
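
To make the trade-off concrete, here is a hedged sketch of the shape-selection rule being debated above, interpreting the third dimension as the number of prior-context blocks (one of the two readings in the thread). The function name, the block size of 128, and the ceiling division are assumptions for illustration, not the logic that landed in this PR.

def choose_profile_shape(max_model_len, max_num_batched_tokens, block_size=128):
    # Pick (batch_size, query_len, num_context_blocks) for the dummy profile
    # run: if a whole sequence fits within one batch of tokens, profile
    # (1, seq) with no prior context; otherwise profile one chunk together
    # with the largest possible prior context, since later chunks attend over
    # up to max_model_len cached tokens.
    if max_model_len <= max_num_batched_tokens:
        return 1, max_model_len, 0
    max_context_tokens = max_model_len - max_num_batched_tokens
    num_context_blocks = -(-max_context_tokens // block_size)  # ceiling division
    return 1, max_num_batched_tokens, num_context_blocks

# With max batched tokens = 8K and model len = 128K, as in the example above,
# this yields bs=1, query_len=8192, and (131072 - 8192) / 128 = 960 context
# blocks, i.e. roughly the (1, 8K, max ctx) shape proposed in the thread.
print(choose_profile_shape(128 * 1024, 8 * 1024))   # -> (1, 8192, 960)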

Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
aa125ecf0edb9cd67656553d11d643aeb444ff9e

Collaborator

afierka-intel left a comment


Just one minor request: remove a redundant comment.

Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

Collaborator

afierka-intel left a comment


LGTM now :)

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
9ea07b41da169f727a2eb7302adec4c724319522

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
66652e8082b69ba7d1e6aca7c234433de55f1b9b

adobrzyn merged commit a40b090 into vllm-project:main Jan 15, 2026
52 checks passed
yeonsily pushed a commit to yeonsily/vllm-gaudi that referenced this pull request Jan 15, 2026
- Add a comprehensive profile_run implementation to replace the placeholder
- Set up dummy KV caches using bind_kv_cache for proper memory initialization
- Use the existing _prepare_dummy_scenario infrastructure for profiling
- Support unified attention

---------

Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>