Introduce absolute and relative padding limits to the linear bucketing #762
Pull request overview
This PR introduces configurable absolute and relative padding limits to the linear bucketing algorithm to better balance warmup time and runtime performance. The change replaces exponential bucketing as the default strategy.
- Adds new environment variables for controlling padding limits (`PAD_MAX` and `PAD_PERCENT`) across all bucket dimensions
- Implements a new `warmup_range_with_limits` function that generates buckets respecting these padding constraints
- Changes the default bucketing strategy from exponential to linear with limits
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vllm_gaudi/extension/features.py | Adds new environment variables for padding limits and switches default bucketing strategy to linear |
| vllm_gaudi/extension/bucketing/linear.py | Implements padding-aware bucket generation with new warmup_range_with_limits function and updated configuration handling |
| vllm_gaudi/extension/bucketing/common.py | Simplifies bucketing strategy selection and adds debug logging for bucket ranges |
| tests/unit_tests/test_bucketing.py | Updates tests to accommodate new padding parameters in bucket configuration |
| docs/configuration/env_variables.md | Documents new padding-related environment variables and updated defaults |
Comments suppressed due to low confidence (1)
vllm_gaudi/extension/bucketing/linear.py:1
- The `BUCKET_PAD_PERCENT` environment variables are defined as `int` type, but they represent percentages. This could lead to confusion, as the documentation shows `25` meaning 25%, but users might expect values like `0.25`. Consider using a float type or clearly documenting that the value should be specified as an integer percentage (0-100).
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
vllm_gaudi/extension/bucketing/linear.py:1
- The `PAD_PERCENT` parameter is stored as an integer but represents a percentage value (0-50). Consider using a float type or renaming it to indicate that it is in integer percentage points, to avoid confusion.
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Submitted #780 to solve the OOM issue in CI.
✅ CI Passed: All checks passed successfully against the following vllm commit:
Signed-off-by: Youlei Yang <[email protected]>
Motivation
Exponential bucketing was introduced to significantly reduce the number of buckets. For example, with `max_num_batched_tokens=8192`, `max_model_len=32768`, `max_num_seqs=256` and `# hpu blocks: 4127`, exponential bucketing generates 120 prompt buckets and 81 decode buckets, whereas linear bucketing generates 14368 prompt buckets and 4042 decode buckets. The exponential buckets are filtered combinations of the per-dimension ranges.

Without limits, the max absolute padding (`max(bucket[i]-bucket[i-1]-1)`) is proportional to the bucket max, and the max relative padding (`(bucket[i]-bucket[i-1]-1)/bucket[i]`) approaches 50% for a large bucket max. This large padding causes significant overhead, especially for cases with long sequences.

We need a bucketing algorithm that balances the number of buckets (warmup time) against runtime performance (padding overhead).
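The two padding metrics can be checked numerically with a minimal sketch. The helper name `padding_stats` and both example ranges are assumptions made for illustration; in particular, the pure power-of-two range only approximates exponential growth and is not the bucket list that vllm-gaudi generates.

```python
def padding_stats(buckets):
    """Return (max absolute padding, max relative padding) of a sorted range.

    Uses the formulas from the description: a request of size
    bucket[i-1] + 1 is padded up to bucket[i], wasting
    bucket[i] - bucket[i-1] - 1 slots.
    """
    max_abs = 0
    max_rel = 0.0
    for prev, cur in zip(buckets, buckets[1:]):
        pad = cur - prev - 1
        max_abs = max(max_abs, pad)
        max_rel = max(max_rel, pad / cur)
    return max_abs, max_rel


# Hypothetical ranges, chosen only to illustrate the trend.
doubling = [2**i for i in range(6, 16)]      # 64, 128, ..., 32768
linear = list(range(128, 32768 + 1, 128))    # 128, 256, ..., 32768

doubling_abs, doubling_rel = padding_stats(doubling)
linear_abs, linear_rel = padding_stats(linear)
```

For the doubling range, the worst absolute padding comes from the last gap (16384 to 32768) and grows with the bucket max, while the fixed-step range keeps it at `step - 1` at the cost of far more buckets.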
Changes
- Change `warmup_range` in linear bucketing to `warmup_range_with_limits`, which generates a range that ensures the absolute and relative padding do not exceed the specified limits.
- Add `VLLM_{phase}_{dim}_BUCKET_PAD_MAX` and `VLLM_{phase}_{dim}_BUCKET_PAD_PERCENT` to set the absolute and relative padding limits, respectively.

For the above example with default settings, this results in 284 prompt buckets and 222 decode buckets with much less padding.
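One way such a range could be generated is sketched below. This is not the PR's `warmup_range_with_limits`; the function name `range_with_limits`, its parameters, and the rule of aligning each bucket down to a multiple of `step` are all assumptions made for illustration. The next bucket `n` after `cur` is chosen as the largest aligned value with `n - cur - 1 <= pad_max` and `(n - cur - 1) / n <= pad_percent / 100`, while always advancing by at least one `step`.

```python
def range_with_limits(bmin, step, bmax, pad_max, pad_percent):
    """Hypothetical sketch: bucket range bounded by padding limits.

    pad_max     -- absolute padding limit (integer)
    pad_percent -- relative padding limit in integer percent, 0 <= p < 100
    """
    if not 0 <= pad_percent < 100:
        raise ValueError("pad_percent must be in [0, 100)")
    buckets = [bmin]
    cur = bmin
    while cur < bmax:
        # Largest n with n - cur - 1 <= pad_max.
        by_abs = cur + 1 + pad_max
        # Largest n with (n - cur - 1) / n <= pad_percent / 100.
        by_rel = (cur + 1) * 100 // (100 - pad_percent)
        nxt = min(by_abs, by_rel)
        # Align down to the step, but never advance by less than one step.
        nxt = max(nxt // step * step, cur + step)
        nxt = min(nxt, bmax)
        buckets.append(nxt)
        cur = nxt
    return buckets


# With a 0% relative limit this sketch degenerates to a fixed-step range.
fixed_step = range_with_limits(128, 128, 1024, pad_max=10**9, pad_percent=0)
# With a 50% relative limit and a very large absolute limit it doubles.
doubling = range_with_limits(128, 128, 1024, pad_max=10**9, pad_percent=50)
# A mixed setting: both limits active.
mixed = range_with_limits(128, 128, 2048, pad_max=256, pad_percent=25)
```

In this sketch, `step` acts as a floor on the granularity: when the limits would demand a gap smaller than one step, the range simply advances by `step`, so the limits are only guaranteed for gaps larger than that.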
Benefits
- [...] `VLLM_{phase}_{dim}_BUCKET_PAD_MAX` and setting `VLLM_{phase}_{dim}_BUCKET_PAD_PERCENT=50`.
- [...] `VLLM_{phase}_{dim}_BUCKET_PAD_PERCENT=0`.
- Setting `VLLM_{phase}_{dim}_BUCKET_PAD_MAX` to a multiple of `PT_HPU_SDPA_BR_FACTOR` and `PT_HPU_SDPA_BC_FACTOR` could generate buckets that align with the slicing chunk size and give better performance.

Minor changes