
Conversation

@yangulei (Collaborator) commented Dec 26, 2025

Motivation

Exponential bucketing was introduced to significantly reduce the number of buckets. For an example with max_num_batched_tokens=8192, max_model_len=32768, max_num_seqs=256, and 4127 HPU blocks, exponential bucketing generates 120 prompt buckets and 81 decode buckets, while linear bucketing generates 14368 prompt buckets and 4042 decode buckets. The exponential buckets are the filtered combinations of the following ranges:

Prompt query range: [128, 256, 384, 512, 640, 1792, 2816, 3968, 4992, 6144, 7168, 8192]
Prompt context range: [0, 1, 3, 8, 22, 56, 90, 124, 158, 192]
Decode BS range: [1, 2, 4, 8, 14, 24, 42, 78, 140, 256]
Decode context range: [1, 256, 512, 768, 1024, 1280, 1536, 1792, 2304, 2816, 3584, 4352]

The max absolute padding (max(bucket[i]-bucket[i-1]-1)) grows in proportion to the bucket maximum without any limit, and the max relative padding ((bucket[i]-bucket[i-1]-1)/bucket[i]) approaches 50% for a large bucket maximum. Such large padding causes significant overhead, especially for long sequences.
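To make these metrics concrete, a small helper (hypothetical, not part of this PR) can compute the worst-case absolute and relative padding of a bucket range:

```python
def padding_stats(buckets):
    """Worst-case padding of a sorted bucket range: a length that falls just
    above buckets[i-1] must be padded up to buckets[i], wasting
    buckets[i] - buckets[i-1] - 1 slots in the worst case."""
    abs_pads = [b - a - 1 for a, b in zip(buckets, buckets[1:])]
    rel_pads = [(b - a - 1) / b for a, b in zip(buckets, buckets[1:])]
    return max(abs_pads), max(rel_pads)

# Exponential prompt query range from the example above:
query_range = [128, 256, 384, 512, 640, 1792, 2816, 3968, 4992, 6144, 7168, 8192]
max_abs, max_rel = padding_stats(query_range)
print(max_abs, round(max_rel, 3))  # 1151 0.642
```

The worst case here is the jump from 640 to 1792: up to 1151 wasted tokens, about 64% relative padding.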

We need a bucketing algorithm to balance the bucket number (warmup time) and the runtime performance (padding overhead).

Changes

  • Enhance warmup_range in linear bucketing into warmup_range_with_limits, which generates a range that ensures the absolute and relative padding do not exceed the specified limits.
  • Introduce new environment variables named VLLM_{phase}_{dim}_BUCKET_PAD_MAX and VLLM_{phase}_{dim}_BUCKET_PAD_PERCENT to set the absolute and relative padding limits, respectively.
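The core of such a generator might look like the following minimal sketch (hypothetical signature and rounding details; the actual warmup_range_with_limits in vllm_gaudi/extension/bucketing/linear.py may differ):

```python
def warmup_range_with_limits(bmin, bmax, step, pad_max, pad_percent):
    """Generate an increasing bucket range from bmin to bmax where the gap to
    the next bucket respects both an absolute padding limit (pad_max) and a
    relative padding limit (pad_percent, integer percent). Sketch only."""
    buckets = [bmin]
    cur = bmin
    while cur < bmax:
        if pad_percent >= 100:
            rel_cap = bmax
        else:
            # largest nxt with (nxt - cur - 1) / nxt <= pad_percent / 100
            rel_cap = int((cur + 1) * 100 / (100 - pad_percent))
        nxt = min(cur + pad_max + 1, rel_cap)
        # round down to a multiple of step; step is the floor granularity,
        # so progress by at least one step even if that exceeds rel_cap
        nxt = max(cur + step, nxt - nxt % step)
        nxt = min(nxt, bmax)
        buckets.append(nxt)
        cur = nxt
    return buckets

# PAD_PERCENT=0 degenerates to the plain linear range:
print(warmup_range_with_limits(128, 1024, 128, 512, 0))
# [128, 256, 384, 512, 640, 768, 896, 1024]
```

With a small pad_percent the range stays linear at first (the relative cap dominates) and the gaps widen as the buckets grow, which is the intended balance between bucket count and padding overhead.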

For the above example with default settings:

Prompt query range: [128, 256, 384, 512, 640, 768, 1024, 1280, 1664, 2176, 2816, 3712, 4864, 6400, 8192]
Prompt context range: [0, 1, 2, 4, 6, 8, 12, 16, 22, 30, 40, 54, 64, 86, 116, 128, 172, 192, 255]
Decode BS range: [1, 2, 4, 6, 8, 12, 16, 22, 30, 32, 44, 60, 64, 86, 96, 128, 160, 192, 224, 256]
Decode context range: [128, 256, 384, 512, 640, 768, 1024, 1280, 1664, 2176, 2816, 3712, 4127]

This results in 284 prompt buckets and 222 decode buckets with much less padding.

Benefits

  • The exponential bucketing can be approximated by setting a large VLLM_{phase}_{dim}_BUCKET_PAD_MAX and VLLM_{phase}_{dim}_BUCKET_PAD_PERCENT=50.
  • The original linear bucketing can be restored by setting VLLM_{phase}_{dim}_BUCKET_PAD_PERCENT=0.
  • Users can further tune the absolute and relative padding limits to balance warmup time against runtime performance.
  • Setting VLLM_{phase}_{dim}_BUCKET_PAD_MAX to a multiple of PT_HPU_SDPA_BR_FACTOR and PT_HPU_SDPA_BC_FACTOR generates buckets aligned with the slicing chunk size, which can give better performance.

Minor changes

  • Use linear bucketing with limits by default instead of exponential bucketing.
  • Update the tests and documentation.

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces configurable absolute and relative padding limits to the linear bucketing algorithm to better balance warmup time and runtime performance. The change replaces exponential bucketing as the default strategy.

  • Adds new environment variables for controlling padding limits (PAD_MAX and PAD_PERCENT) across all bucket dimensions
  • Implements a new warmup_range_with_limits function that generates buckets respecting these padding constraints
  • Changes the default bucketing strategy from exponential to linear with limits

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| vllm_gaudi/extension/features.py | Adds new environment variables for padding limits and switches the default bucketing strategy to linear |
| vllm_gaudi/extension/bucketing/linear.py | Implements padding-aware bucket generation with the new warmup_range_with_limits function and updated configuration handling |
| vllm_gaudi/extension/bucketing/common.py | Simplifies bucketing strategy selection and adds debug logging for bucket ranges |
| tests/unit_tests/test_bucketing.py | Updates tests to accommodate new padding parameters in bucket configuration |
| docs/configuration/env_variables.md | Documents new padding-related environment variables and updated defaults |
Comments suppressed due to low confidence (1)

vllm_gaudi/extension/bucketing/linear.py:1

  • The BUCKET_PAD_PERCENT environment variables are defined as int type, but they represent percentages. This could lead to confusion as the documentation shows 25 meaning 25%, but users might expect values like 0.25. Consider using a float type or clearly documenting that the value should be specified as an integer percentage (0-100).
import os


@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@yangulei force-pushed the linear_limit_main branch 3 times, most recently from 0e742bc to 575e7d1 on December 30, 2025 01:51
@yangulei requested a review from Copilot on December 30, 2025 02:43
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

vllm_gaudi/extension/bucketing/linear.py:1

  • The PAD_PERCENT parameter is stored as an integer but represents a percentage value (0-50). Consider using a float type or renaming to indicate it's in integer percentage points to avoid confusion.
import os


@yangulei requested a review from Copilot on December 30, 2025 03:08
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.



@yangulei (Collaborator, Author) commented Jan 7, 2026

Submitted #780 to solve the OOM issue in CI.


github-actions bot commented Jan 7, 2026

✅ CI Passed

All checks passed successfully against the following vllm commit:
b3a2bdf1ac90748d58bf8c05f8d0095ede5c7eca


github-actions bot commented Jan 8, 2026

✅ CI Passed

All checks passed successfully against the following vllm commit:
cddbc2b4b2547c681d1bdb876fdd6a7b8e0ec58d

@yangulei force-pushed the linear_limit_main branch 2 times, most recently from e4eabf6 to fefd207 on January 15, 2026 01:35
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
66652e8082b69ba7d1e6aca7c234433de55f1b9b

Signed-off-by: Youlei Yang <[email protected]>
@yangulei force-pushed the linear_limit_main branch 2 times, most recently from 4b01722 to 8c7c399 on January 15, 2026 17:44
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
4c1c501a7ee1d5efbad945ea62a702ce5cefb799
