
Conversation

@yangulei (Collaborator) commented Dec 26, 2025

Motivation

Exponential bucketing was introduced to significantly reduce the number of buckets. For an example with max_num_batched_tokens=8192, max_model_len=32768, max_num_seqs=256, and 4127 HPU blocks, exponential bucketing generates 120 prompt buckets and 81 decode buckets, while linear bucketing generates 14368 prompt buckets and 4042 decode buckets. The exponential buckets are the filtered combinations of the following ranges:

Prompt query range: [128, 256, 384, 512, 640, 1792, 2816, 3968, 4992, 6144, 7168, 8192]
Prompt context range: [0, 1, 3, 8, 22, 56, 90, 124, 158, 192]
Decode BS range: [1, 2, 4, 8, 14, 24, 42, 78, 140, 256]
Decode context range: [1, 256, 512, 768, 1024, 1280, 1536, 1792, 2304, 2816, 3584, 4352]

The max absolute padding (max(bucket[i]-bucket[i-1]-1)) grows in proportion to the bucket maximum without any limit, and the max relative padding ((bucket[i]-bucket[i-1]-1)/bucket[i]) approaches 50% for a large bucket maximum. Such large padding causes significant overhead, especially for long sequences.
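To make these metrics concrete, a small helper (hypothetical, not part of this PR) can compute the worst-case absolute and relative padding of a bucket range:

```python
def padding_stats(buckets):
    """Worst-case padding of a sorted bucket range: a length that falls just
    above buckets[i-1] must be padded up to buckets[i], wasting
    buckets[i] - buckets[i-1] - 1 slots in the worst case."""
    abs_pads = [b - a - 1 for a, b in zip(buckets, buckets[1:])]
    rel_pads = [(b - a - 1) / b for a, b in zip(buckets, buckets[1:])]
    return max(abs_pads), max(rel_pads)

# Exponential prompt query range from the example above:
query_range = [128, 256, 384, 512, 640, 1792, 2816, 3968, 4992, 6144, 7168, 8192]
max_abs, max_rel = padding_stats(query_range)
print(max_abs, round(max_rel, 3))  # 1151 0.642
```

The worst case here is the jump from 640 to 1792: up to 1151 wasted tokens, about 64% relative padding.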

We need a bucketing algorithm to balance the bucket number (warmup time) and the runtime performance (padding overhead).

Changes

  • Enhance warmup_range in linear bucketing into warmup_range_with_limits, which generates a range that ensures the absolute and relative padding do not exceed the specified limits.
  • Introduce new environment variables named VLLM_{phase}_{dim}_BUCKET_PAD_MAX and VLLM_{phase}_{dim}_BUCKET_PAD_PERCENT to set the absolute and relative padding limits, respectively.
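The core of such a generator might look like the following minimal sketch (hypothetical signature and rounding details; the actual warmup_range_with_limits in vllm_gaudi/extension/bucketing/linear.py may differ):

```python
def warmup_range_with_limits(bmin, bmax, step, pad_max, pad_percent):
    """Generate an increasing bucket range from bmin to bmax where the gap to
    the next bucket respects both an absolute padding limit (pad_max) and a
    relative padding limit (pad_percent, integer percent). Sketch only."""
    buckets = [bmin]
    cur = bmin
    while cur < bmax:
        if pad_percent >= 100:
            rel_cap = bmax
        else:
            # largest nxt with (nxt - cur - 1) / nxt <= pad_percent / 100
            rel_cap = int((cur + 1) * 100 / (100 - pad_percent))
        nxt = min(cur + pad_max + 1, rel_cap)
        # round down to a multiple of step; step is the floor granularity,
        # so progress by at least one step even if that exceeds rel_cap
        nxt = max(cur + step, nxt - nxt % step)
        nxt = min(nxt, bmax)
        buckets.append(nxt)
        cur = nxt
    return buckets

# PAD_PERCENT=0 degenerates to the plain linear range:
print(warmup_range_with_limits(128, 1024, 128, 512, 0))
# [128, 256, 384, 512, 640, 768, 896, 1024]
```

With a small pad_percent the range stays linear at first (the relative cap dominates) and the gaps widen as the buckets grow, which is the intended balance between bucket count and padding overhead.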

For the above example with default settings:

Prompt query range: [128, 256, 384, 512, 640, 768, 1024, 1280, 1664, 2176, 2816, 3712, 4864, 6400, 8192]
Prompt context range: [0, 1, 2, 4, 6, 8, 12, 16, 22, 30, 40, 54, 64, 86, 116, 128, 172, 192, 255]
Decode BS range: [1, 2, 4, 6, 8, 12, 16, 22, 30, 32, 44, 60, 64, 86, 96, 128, 160, 192, 224, 256]
Decode context range: [128, 256, 384, 512, 640, 768, 1024, 1280, 1664, 2176, 2816, 3712, 4127]

This results in 284 prompt buckets and 222 decode buckets with much less padding.

Benefits

  • The exponential bucketing can be approximated by setting a large VLLM_{phase}_{dim}_BUCKET_PAD_MAX and VLLM_{phase}_{dim}_BUCKET_PAD_PERCENT=50.
  • The original linear bucketing can be restored by setting VLLM_{phase}_{dim}_BUCKET_PAD_PERCENT=0.
  • Users can further tune the absolute and relative padding limits to balance warmup time against runtime performance.
  • Setting VLLM_{phase}_{dim}_BUCKET_PAD_MAX to a multiple of PT_HPU_SDPA_BR_FACTOR and PT_HPU_SDPA_BC_FACTOR generates buckets aligned with the slicing chunk size, which can give better performance.

Minor changes

  • Use linear bucketing with limits by default instead of exponential bucketing.
  • Update the tests and documentation.

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces configurable absolute and relative padding limits to the linear bucketing algorithm to better balance warmup time and runtime performance. The change replaces exponential bucketing as the default strategy.

  • Adds new environment variables for controlling padding limits (PAD_MAX and PAD_PERCENT) across all bucket dimensions
  • Implements a new warmup_range_with_limits function that generates buckets respecting these padding constraints
  • Changes the default bucketing strategy from exponential to linear with limits

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| vllm_gaudi/extension/features.py | Adds new environment variables for padding limits and switches the default bucketing strategy to linear |
| vllm_gaudi/extension/bucketing/linear.py | Implements padding-aware bucket generation with the new warmup_range_with_limits function and updated configuration handling |
| vllm_gaudi/extension/bucketing/common.py | Simplifies bucketing strategy selection and adds debug logging for bucket ranges |
| tests/unit_tests/test_bucketing.py | Updates tests to accommodate new padding parameters in bucket configuration |
| docs/configuration/env_variables.md | Documents new padding-related environment variables and updated defaults |
Comments suppressed due to low confidence (1)

vllm_gaudi/extension/bucketing/linear.py:1

  • The BUCKET_PAD_PERCENT environment variables are defined as int type, but they represent percentages. This could lead to confusion as the documentation shows 25 meaning 25%, but users might expect values like 0.25. Consider using a float type or clearly documenting that the value should be specified as an integer percentage (0-100).
import os


@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@yangulei force-pushed the linear_limit_main branch 3 times, most recently from 0e742bc to 575e7d1 on December 30, 2025 01:51
@yangulei requested a review from Copilot on December 30, 2025 02:43
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

vllm_gaudi/extension/bucketing/linear.py:1

  • The PAD_PERCENT parameter is stored as an integer but represents a percentage value (0-50). Consider using a float type or renaming to indicate it's in integer percentage points to avoid confusion.
import os


@yangulei requested a review from Copilot on December 30, 2025 03:08
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.



@yangulei (Collaborator, Author) commented Jan 7, 2026

Submitted #780 to solve the OOM issue in CI.


github-actions bot commented Jan 7, 2026

✅ CI Passed

All checks passed successfully against the following vllm commit:
b3a2bdf1ac90748d58bf8c05f8d0095ede5c7eca


github-actions bot commented Jan 8, 2026

✅ CI Passed

All checks passed successfully against the following vllm commit:
cddbc2b4b2547c681d1bdb876fdd6a7b8e0ec58d

@yangulei force-pushed the linear_limit_main branch 2 times, most recently from e4eabf6 to fefd207 on January 15, 2026 01:35
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
66652e8082b69ba7d1e6aca7c234433de55f1b9b

Signed-off-by: Youlei Yang <[email protected]>
@yangulei force-pushed the linear_limit_main branch 2 times, most recently from 4b01722 to 8c7c399 on January 15, 2026 17:44
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
4c1c501a7ee1d5efbad945ea62a702ce5cefb799
