55 changes: 32 additions & 23 deletions docs/configuration/env_variables.md
@@ -20,7 +20,7 @@ This document lists the supported diagnostic and profiling, as well as performan
| Parameter name | Description | Default value |
| ---------------------------- | ------------------------------------------------------------- | ------------- |
| `VLLM_GRAPH_RESERVED_MEM` | Percentage of memory dedicated to HPUGraph capture. | `0.1` |
- | `VLLM_EXPONENTIAL_BUCKETING` | Enables exponential bucket spacing instead of linear spacing. | `true` |
+ | `VLLM_EXPONENTIAL_BUCKETING` | Enables exponential bucket spacing instead of linear spacing. | `false` |
| `VLLM_BUCKETING_FROM_FILE` | Enables reading bucket configuration from file | `None` |
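With the new `false` default, exponential spacing must now be opted into explicitly. A minimal sketch of opting in, assuming the flag is read from the process environment at engine startup:

```python
import os

# Opt back in to exponential bucket spacing, which this change
# no longer enables by default.
os.environ["VLLM_EXPONENTIAL_BUCKETING"] = "true"
```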

## Developer Mode Parameters
@@ -52,29 +52,38 @@ HPU PyTorch bridge environment variables impacting vLLM execution:

`VLLM_{phase}_{dim}_BUCKET_{param}` is a collection of environment variables configuring the ranges of the linear bucketing mechanism, where:

- - `{phase}` is either `PROMPT` or `DECODE`
- - `{dim}` is either `BS`, `SEQ` or `BLOCK`
- - `{param}` is either `MIN`, `STEP` or `MAX`
+ - `{phase}` is in `['PROMPT', 'DECODE']`.
+ - `{dim}` is in `['BS', 'QUERY', 'CTX']` for the `PROMPT` phase, or in `['BS', 'BLOCK']` for the `DECODE` phase.
+ - `{param}` is in `['MIN', 'STEP', 'MAX', 'PAD_MAX', 'PAD_PERCENT']`.

The following table lists the available variables with their default values:

- | Phase | Variable name | Default value |
- | ------ | ------------------------------------------------- | -------------------------------------------- |
- | Prompt | batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`) | `1` |
- | Prompt | batch size step (`VLLM_PROMPT_BS_BUCKET_STEP`) | `1` |
- | Prompt | batch size max (`VLLM_PROMPT_BS_BUCKET_MAX`) | `max_num_prefill_seqs` |
- | Prompt | query length min (`VLLM_PROMPT_SEQ_BUCKET_MIN`) | `block_size` |
- | Prompt | query length step (`VLLM_PROMPT_SEQ_BUCKET_STEP`) | `block_size` |
- | Prompt | query length max (`VLLM_PROMPT_SEQ_BUCKET_MAX`) | `max_num_batched_tokens` |
- | Prompt | sequence ctx min (`VLLM_PROMPT_CTX_BUCKET_MIN`) | `0` |
- | Prompt | sequence ctx step (`VLLM_PROMPT_CTX_BUCKET_STEP`) | `1` |
- | Prompt | sequence ctx max (`VLLM_PROMPT_CTX_BUCKET_MAX`) | `(max_model_len - block_size) // block_size` |
- | Decode | batch size min (`VLLM_DECODE_BS_BUCKET_MIN`) | `1` |
- | Decode | batch size step (`VLLM_DECODE_BS_BUCKET_STEP`) | `32` |
- | Decode | batch size max (`VLLM_DECODE_BS_BUCKET_MAX`) | `max_num_seqs` |
- | Decode | block size min (`VLLM_DECODE_BLOCK_BUCKET_MIN`) | `1` |
- | Decode | block size step (`VLLM_DECODE_BLOCK_BUCKET_STEP`) | `block_size` |
- | Decode | block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`) | `max_model_len * max_num_seqs // block_size` <br> by default or `max_blocks` <br> if `VLLM_CONTIGUOUS_PA = True` |
+ | Phase | Variable Name | Default Value |
+ |--------|------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
+ | **Prompt** | **Batch size min** (`VLLM_PROMPT_BS_BUCKET_MIN`) | `1` |
+ | | **Batch size step** (`VLLM_PROMPT_BS_BUCKET_STEP`) | `2` |
+ | | **Batch size max** (`VLLM_PROMPT_BS_BUCKET_MAX`) | `max_num_prefill_seqs` |
+ | | **Batch size max abs padding** (`VLLM_PROMPT_BS_BUCKET_PAD_MAX`) | `16` |
+ | | **Batch size max padding %** (`VLLM_PROMPT_BS_BUCKET_PAD_PERCENT`)| `25` |
+ | | **Query length min** (`VLLM_PROMPT_QUERY_BUCKET_MIN`) | `block_size` |
+ | | **Query length step** (`VLLM_PROMPT_QUERY_BUCKET_STEP`) | `block_size` |
+ | | **Query length max** (`VLLM_PROMPT_QUERY_BUCKET_MAX`) | `max_num_batched_tokens` |
+ | | **Query length max abs padding** (`VLLM_PROMPT_QUERY_BUCKET_PAD_MAX`) | `max_num_batched_tokens` |
+ | | **Query length max padding %** (`VLLM_PROMPT_QUERY_BUCKET_PAD_PERCENT`)| `25` |
+ | | **Sequence ctx min** (`VLLM_PROMPT_CTX_BUCKET_MIN`) | `0` |
+ | | **Sequence ctx step** (`VLLM_PROMPT_CTX_BUCKET_STEP`) | `2` |
+ | | **Sequence ctx max** (`VLLM_PROMPT_CTX_BUCKET_MAX`) | `(max_model_len - block_size) // block_size` |
+ | | **Sequence ctx max abs padding** (`VLLM_PROMPT_CTX_BUCKET_PAD_MAX`)| `max_num_batched_tokens // block_size` |
+ | | **Sequence ctx max padding %** (`VLLM_PROMPT_CTX_BUCKET_PAD_PERCENT`)| `25` |
+ | **Decode** | **Batch size min** (`VLLM_DECODE_BS_BUCKET_MIN`) | `1` |
+ | | **Batch size step** (`VLLM_DECODE_BS_BUCKET_STEP`) | `2` |
+ | | **Batch size max** (`VLLM_DECODE_BS_BUCKET_MAX`) | `max_num_seqs` |
+ | | **Batch size max abs padding** (`VLLM_DECODE_BS_BUCKET_PAD_MAX`) | `32` |
+ | | **Batch size max padding %** (`VLLM_DECODE_BS_BUCKET_PAD_PERCENT`)| `25` |
+ | | **Block size min** (`VLLM_DECODE_BLOCK_BUCKET_MIN`) | `block_size` |
+ | | **Block size step** (`VLLM_DECODE_BLOCK_BUCKET_STEP`) | `block_size` |
+ | | **Block size max** (`VLLM_DECODE_BLOCK_BUCKET_MAX`) | `max_model_len * max_num_seqs // block_size` (default) <br> or `max_blocks` if `VLLM_CONTIGUOUS_PA=True` |
+ | | **Block size max abs padding** (`VLLM_DECODE_BLOCK_BUCKET_PAD_MAX`)| `max_num_batched_tokens * max_num_seqs // block_size` |
+ | | **Block size max padding %** (`VLLM_DECODE_BLOCK_BUCKET_PAD_PERCENT`)| `25` |
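To illustrate the naming scheme and defaults above, here is a hedged example of overriding a few of these knobs; the values are illustrative, not recommendations:

```python
import os

# Illustrative values only: cap decode batch-size buckets at 64 and
# raise the smallest prompt query-length bucket to 256. Set these
# before the engine starts so bucket generation can pick them up.
os.environ["VLLM_DECODE_BS_BUCKET_MAX"] = "64"
os.environ["VLLM_PROMPT_QUERY_BUCKET_MIN"] = "256"
```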

When a deployed workload does not use the full context a model can handle, we
recommend limiting the maximum values upfront, based on the expected input
@@ -88,7 +97,7 @@ unnecessary and you can limit the values upfront. It reduces the startup time
and warm-up. Recommended settings for this case are:

- `--max_model_len`: `3072`, which is the sum of input and output sequences (1+2)*1024.
- - `VLLM_PROMPT_SEQ_BUCKET_MAX`: `1024`, which is the maximum input token size that you expect to handle.
+ - `VLLM_PROMPT_QUERY_BUCKET_MAX`: `1024`, which is the maximum input token size that you expect to handle.

!!! note
If the model config specifies a high `max_model_len`, set it to the sum of `input_tokens` and `output_tokens`, rounded up to a multiple of `block_size` according to actual requirements.
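As a quick arithmetic check of the note above (assuming an illustrative `block_size` of 128):

```python
block_size = 128                       # illustrative value
input_tokens, output_tokens = 1024, 2048
needed = input_tokens + output_tokens  # 3072
# Round up to the next multiple of block_size.
max_model_len = -(-needed // block_size) * block_size
assert max_model_len == 3072           # already block-aligned here
```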
18 changes: 10 additions & 8 deletions tests/unit_tests/test_bucketing.py
@@ -24,24 +24,26 @@ def test_read_bucket_settings(monkeypatch):
monkeypatch.setenv("VLLM_PROMPT_BS_BUCKET_MIN", "1")
monkeypatch.setenv("VLLM_PROMPT_BS_BUCKET_STEP", "16")
monkeypatch.setenv("VLLM_PROMPT_BS_BUCKET_MAX", "64")
config = linear.read_bucket_settings("prompt", "bs", min=1, step=32, max=128)
assert config == [1, 16, 64]
monkeypatch.setenv("VLLM_PROMPT_BS_BUCKET_PAD_MAX", "32")
monkeypatch.setenv("VLLM_PROMPT_BS_BUCKET_PAD_PERCENT", "25")
config = linear.read_bucket_settings("prompt", "bs", min=1, step=32, max=128, pad_max=64, pad_percent=10)
assert config == [1, 16, 64, 32, 25]


def test_read_bucket_settings_empty_flags():
-   config = linear.read_bucket_settings("prompt", "bs", min=1, step=32, max=128)
-   assert config == [1, 32, 128]
+   config = linear.read_bucket_settings("prompt", "bs", min=1, step=32, max=128, pad_max=64, pad_percent=10)
+   assert config == [1, 32, 128, 64, 10]
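Taken together, these two tests pin down the override/fallback behavior. A minimal sketch consistent with them, assuming each parameter falls back to its keyword default when the corresponding variable is unset (`read_bucket_settings_sketch` is a hypothetical stand-in, not the real `linear.read_bucket_settings`):

```python
import os

def read_bucket_settings_sketch(phase: str, dim: str, **defaults) -> list[int]:
    """Hypothetical re-implementation for illustration only."""
    params = ["min", "step", "max", "pad_max", "pad_percent"]
    # Variables follow VLLM_{PHASE}_{DIM}_BUCKET_{PARAM}; unset ones
    # fall back to the keyword defaults.
    return [
        int(os.environ.get(f"VLLM_{phase}_{dim}_BUCKET_{p}".upper(), defaults[p]))
        for p in params
    ]
```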


def test_warmup_range():
-   config = (2, 64, 128)
-   result = linear.warmup_range(config)
+   config = (2, 64, 128, 64, 25)
+   result = linear.warmup_range_with_limits(config)
    assert result == [2, 4, 8, 16, 32, 64, 128]


def test_warmup_range_with_one():
-   config = (1, 64, 128)
-   result = linear.warmup_range(config)
+   config = (1, 64, 128, 64, 25)
+   result = linear.warmup_range_with_limits(config)
    assert result == [1, 2, 4, 8, 16, 32, 64, 128]
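One rule that reproduces both asserts, as a sketch under assumptions rather than the actual `warmup_range_with_limits`: ramp by powers of two from `min` until reaching `step`, then step linearly up to `max`; the pad fields are ignored here.

```python
def warmup_range_sketch(config: tuple) -> list[int]:
    """Hypothetical sketch; illustration only."""
    bmin, bstep, bmax, _pad_max, _pad_percent = config
    ramp_up = []                  # power-of-two ramp until the step size
    value = bmin
    while value < bstep:
        ramp_up.append(value)
        value *= 2
    stable = list(range(bstep, bmax + 1, bstep))  # linear region
    return ramp_up + stable

assert warmup_range_sketch((2, 64, 128, 64, 25)) == [2, 4, 8, 16, 32, 64, 128]
assert warmup_range_sketch((1, 64, 128, 64, 25)) == [1, 2, 4, 8, 16, 32, 64, 128]
```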


12 changes: 8 additions & 4 deletions vllm_gaudi/extension/bucketing/common.py
@@ -91,11 +91,8 @@ def read_from_file(self, is_prompt):
    def get_bucketing_strategy(self):
        strategy = None
        # TODO - we can use different strategies for decode and prompt
-       use_exponential_bucketing = True if \
-           get_config().VLLM_EXPONENTIAL_BUCKETING == None else \
-           get_config().VLLM_EXPONENTIAL_BUCKETING

-       if use_exponential_bucketing:
+       if get_config().VLLM_EXPONENTIAL_BUCKETING:
            from vllm_gaudi.extension.bucketing.exponential import (ExponentialBucketingStrategy)
            strategy = ExponentialBucketingStrategy()
        else:
@@ -152,6 +149,9 @@ def generate_prompt_buckets(self):
        bs_range = strategy.get_range(bs_cfg)
        query_range = strategy.get_range(query_cfg)
        ctx_range = strategy.get_range(ctx_cfg)
+       logger().debug(f"Prompt BS range: {bs_range}")
+       logger().debug(f"Prompt query range: {query_range}")
+       logger().debug(f"Prompt context range: {ctx_range}")

        self.prompt_buckets = generate_buckets(bs_range, query_range, ctx_range, True, self.max_model_len,
                                               self.max_num_seqs, self.max_num_prefill_seqs,
@@ -195,6 +195,10 @@ def generate_decode_buckets(self):
        if get_config().use_contiguous_pa and ctx_range[-1] < self.num_hpu_blocks:
            ctx_range.append(self.num_hpu_blocks)

+       logger().debug(f"Decode BS range: {bs_range}")
+       logger().debug(f"Decode query range: {query_range}")
+       logger().debug(f"Decode context range: {ctx_range}")
+
        self.decode_buckets = generate_buckets(bs_range, query_range, ctx_range, False, self.max_model_len,
                                               self.max_num_seqs, self.max_num_prefill_seqs,
                                               self.max_num_batched_tokens, self.block_size, self.num_hpu_blocks,