Implement profile_run method in HPU model runner #775
Conversation
Pull request overview
This PR implements a comprehensive profile_run method for the HPU model runner to replace the previous placeholder implementation. The main changes initialize proper dummy KV cache tensors with correct shapes and utilize existing dummy scenario infrastructure for profiling.
Key changes:
- Setup dummy KV caches with proper shapes instead of empty tensors for profiling
- Implement profile_run logic with support for unified attention scenarios
- Add dynamic scale tensor creation based on quantization configuration (see the sketch after this list)
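A minimal sketch of the dummy KV cache and scale setup described above; the shapes, dtype choices, and helper name below are illustrative assumptions, not the exact code in hpu_worker.py:

    # Illustrative sketch only; shapes, dtypes and names are assumptions,
    # not the actual vllm_gaudi implementation.
    import torch

    num_blocks, block_size, num_kv_heads, head_size = 128, 128, 8, 128
    kv_cache_shape = (num_blocks, block_size, num_kv_heads, head_size)
    kv_scales_shape = (num_blocks, block_size)  # assumed per-token scale layout

    def make_dummy_kv_layer(create_dynamic_scales: bool, device: str = 'hpu'):
        # Properly shaped dummy K/V tensors instead of empty placeholders,
        # so the profile run exercises realistic allocations.
        k_cache = torch.zeros(kv_cache_shape, dtype=torch.bfloat16, device=device)
        v_cache = torch.zeros(kv_cache_shape, dtype=torch.bfloat16, device=device)
        # Scale tensors are only materialized when the quantization config
        # asks for dynamic scales; otherwise they stay None.
        if create_dynamic_scales:
            k_scales = torch.ones(kv_scales_shape, dtype=torch.bfloat16, device=device)
            v_scales = torch.ones(kv_scales_shape, dtype=torch.bfloat16, device=device)
        else:
            k_scales = v_scales = None
        return k_cache, v_cache, k_scales, v_scales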
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_worker.py | Creates properly shaped dummy KV cache tensors with dynamic scale support for profiling instead of empty tensors |
| vllm_gaudi/v1/worker/hpu_model_runner.py | Implements profile_run method with batch size calculation and scenario preparation for unified and standard attention |
- Add comprehensive profile_run implementation to replace placeholder
- Skip profile run on decode instances following hpu_worker.py pattern
- Setup KV caches using bind_kv_cache for proper memory initialization
- Handle multimodal models with vision bucket management
- Use existing _prepare_dummy_scenario infrastructure for profiling
- Enables proper memory profiling for HPUWorker.determine_num_available_blocks

Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
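A rough, hypothetical skeleton of the flow this commit message describes; the class and attribute names below are placeholders, not the real HPUModelRunner API, and the bind_kv_cache and multimodal steps are reduced to comments:

    # Hypothetical skeleton of the profiling flow outlined above; names are
    # placeholders, not the actual HPUModelRunner implementation.
    class ProfileRunSketch:
        def __init__(self, is_decode_instance: bool):
            self.is_decode_instance = is_decode_instance
            self.kv_caches: list = []

        def _prepare_dummy_scenario(self, batch_size: int, seq_len: int) -> None:
            # Stand-in for the existing dummy-scenario infrastructure that
            # builds and runs a synthetic batch for profiling.
            print(f"profiling dummy scenario: bs={batch_size}, seq={seq_len}")

        def profile_run(self, dummy_kv_caches: list, batch_size: int, seq_len: int) -> None:
            # Decode-only instances skip profiling, following hpu_worker.py.
            if self.is_decode_instance:
                return
            # The PR binds the dummy KV caches (via vLLM's bind_kv_cache) so the
            # forward pass sees real allocations; here we just attach them.
            self.kv_caches = dummy_kv_caches
            # Multimodal models would additionally set up vision buckets here.
            self._prepare_dummy_scenario(batch_size, seq_len)

    # Example: ProfileRunSketch(is_decode_instance=False).profile_run([], 1, 8192)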
Force-pushed from 3f1862f to c23cf37.
Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
The code under review:

    if hpu_v_cache is None:
        hpu_v_scales = None
    elif create_dynamic_scales:
        hpu_v_scales = torch.ones(kv_scales_shape, dtype=torch.bfloat16, device='hpu')
    else:
        hpu_v_scales = None
Suggested change (replacing the if/elif/else above with a single expression):

    hpu_v_scales = torch.ones(kv_scales_shape, dtype=torch.bfloat16, device='hpu') if (not self.model_config.use_mla and create_dynamic_scales) else None
@wuxun-zhang Copilot asked me to change your previously suggested code style to the current one ...
Okay; anyway, I don't think these if-else branches are good practice...
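For illustration only (not code from this PR), one way to avoid repeating the branch structure is to factor the scale creation into a small helper; kv_scales_shape and create_dynamic_scales are assumed to be defined as in the snippet above:

    # Illustrative alternative, not part of this PR: factor the branching
    # into a helper so the intent reads in one place.
    import torch

    def make_dummy_scales(cache, create_dynamic_scales, kv_scales_shape):
        # A scales tensor is only needed when the cache exists and dynamic
        # scales are requested; otherwise return None.
        if cache is None or not create_dynamic_scales:
            return None
        return torch.ones(kv_scales_shape, dtype=torch.bfloat16, device='hpu')

    # hpu_v_scales = make_dummy_scales(hpu_v_cache, create_dynamic_scales, kv_scales_shape)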
@hlin99 please review and try this one.
Traditional profile run is (bs, seq) to estimate memory footprint. Can we change to (bs, seq, max ctx), considering chunked prefill/prefix caching is the default way on v1?
The profile run estimates the max workspace allocation memory footprint. Setting seq to max model len / max batched tokens already covers the max memory consumption, since using context from the kv-cache only consumes less memory?
if model_len > max number batched tokens: in the else path:
From what I got recently, when chunked prefill kicks in on V1, the prefill bs can be > 1, according to the env var VLLM_PROMPT_BS_BUCKET_MAX (defaults to 1, but can be set manually).
I.e., if max batched tokens is 8K and model len is 128K, the workspace estimated with 8K is not sufficient for the 128K model len even if the 128K prompt is chunked into N x 8K. To estimate the workspace for this case, a shape of (1, 8K, max ctx number) for the profile run might be better.
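A rough sketch of the arithmetic in the comment above; the numbers, block size, and variable names are assumptions for illustration, not values from this PR:

    # Rough sketch of the profiling-shape arithmetic discussed above.
    # All names and numbers are illustrative assumptions, not code from this PR.
    max_num_batched_tokens = 8 * 1024   # 8K tokens per chunked-prefill step
    max_model_len = 128 * 1024          # 128K-token model length
    block_size = 128                    # assumed KV-cache block size

    # Traditional profile shape: (bs, seq) with seq capped at the batched-token budget.
    traditional_shape = (1, max_num_batched_tokens)

    # Proposed shape for chunked prefill: the last 8K chunk of a 128K prompt still
    # attends over up to 128K of context held in the KV cache, so the profile run
    # would also account for the maximum number of context blocks.
    max_ctx_blocks = max_model_len // block_size        # 1024 blocks in this example
    chunked_prefill_shape = (1, max_num_batched_tokens, max_ctx_blocks)

    print(traditional_shape, chunked_prefill_shape)     # (1, 8192) (1, 8192, 1024)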
Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
afierka-intel left a comment:
I have just one minor request: remove the redundant comment.
afierka-intel left a comment:
LGTM now :)
- Add comprehensive profile_run implementation to replace placeholder
- Setup dummy KV caches using bind_kv_cache for proper memory initialization
- Use existing _prepare_dummy_scenario infrastructure for profiling
- Support unified attention

---------

Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>