55 changes: 32 additions & 23 deletions docs/configuration/env_variables.md
@@ -20,7 +20,7 @@ This document lists the supported diagnostic and profiling, as well as performan
| Parameter name | Description | Default value |
| ---------------------------- | ------------------------------------------------------------- | ------------- |
| `VLLM_GRAPH_RESERVED_MEM` | Percentage of memory dedicated to HPUGraph capture. | `0.1` |
- | `VLLM_EXPONENTIAL_BUCKETING` | Enables exponential bucket spacing instead of linear spacing. | `true` |
+ | `VLLM_EXPONENTIAL_BUCKETING` | Enables exponential bucket spacing instead of linear spacing. | `false` |
| `VLLM_BUCKETING_FROM_FILE` | Enables reading bucket configuration from file | `None` |
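With the new `false` default, exponential spacing must now be opted into explicitly. A minimal sketch of opting in, assuming the flag is read from the process environment at engine startup:

```python
import os

# Opt back in to exponential bucket spacing, which this change
# no longer enables by default.
os.environ["VLLM_EXPONENTIAL_BUCKETING"] = "true"
```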

## Developer Mode Parameters
@@ -52,29 +52,38 @@ HPU PyTorch bridge environment variables impacting vLLM execution:

`VLLM_{phase}_{dim}_BUCKET_{param}` is a collection of environment variables configuring the ranges of the linear bucketing mechanism, where:

- - `{phase}` is either `PROMPT` or `DECODE`
- - `{dim}` is either `BS`, `SEQ` or `BLOCK`
- - `{param}` is either `MIN`, `STEP` or `MAX`
+ - `{phase}` is in `['PROMPT', 'DECODE']`.
+ - `{dim}` is in `['BS', 'QUERY', 'CTX']` for the `PROMPT` phase, or in `['BS', 'BLOCK']` for the `DECODE` phase.
+ - `{param}` is in `['MIN', 'STEP', 'MAX', 'PAD_MAX', 'PAD_PERCENT']`.

The following table lists the available variables with their default values:

- | Phase | Variable name | Default value |
- | ------ | ------------------------------------------------- | -------------------------------------------- |
- | Prompt | batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`) | `1` |
- | Prompt | batch size step (`VLLM_PROMPT_BS_BUCKET_STEP`) | `1` |
- | Prompt | batch size max (`VLLM_PROMPT_BS_BUCKET_MAX`) | `max_num_prefill_seqs` |
- | Prompt | query length min (`VLLM_PROMPT_SEQ_BUCKET_MIN`) | `block_size` |
- | Prompt | query length step (`VLLM_PROMPT_SEQ_BUCKET_STEP`) | `block_size` |
- | Prompt | query length max (`VLLM_PROMPT_SEQ_BUCKET_MAX`) | `max_num_batched_tokens` |
- | Prompt | sequence ctx min (`VLLM_PROMPT_CTX_BUCKET_MIN`) | `0` |
- | Prompt | sequence ctx step (`VLLM_PROMPT_CTX_BUCKET_STEP`) | `1` |
- | Prompt | sequence ctx max (`VLLM_PROMPT_CTX_BUCKET_MAX`) | `(max_model_len - block_size) // block_size` |
- | Decode | batch size min (`VLLM_DECODE_BS_BUCKET_MIN`) | `1` |
- | Decode | batch size step (`VLLM_DECODE_BS_BUCKET_STEP`) | `32` |
- | Decode | batch size max (`VLLM_DECODE_BS_BUCKET_MAX`) | `max_num_seqs` |
- | Decode | block size min (`VLLM_DECODE_BLOCK_BUCKET_MIN`) | `1` |
- | Decode | block size step (`VLLM_DECODE_BLOCK_BUCKET_STEP`) | `block_size` |
- | Decode | block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`) | `max_model_len * max_num_seqs // block_size` <br> by default or `max_blocks` <br> if `VLLM_CONTIGUOUS_PA = True` |
+ | Phase | Variable Name | Default Value |
+ |--------|------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
+ | **Prompt** | **Batch size min** (`VLLM_PROMPT_BS_BUCKET_MIN`) | `1` |
+ | | **Batch size step** (`VLLM_PROMPT_BS_BUCKET_STEP`) | `2` |
+ | | **Batch size max** (`VLLM_PROMPT_BS_BUCKET_MAX`) | `max_num_prefill_seqs` |
+ | | **Batch size max abs padding** (`VLLM_PROMPT_BS_BUCKET_PAD_MAX`) | `16` |
+ | | **Batch size max padding %** (`VLLM_PROMPT_BS_BUCKET_PAD_PERCENT`)| `25` |
+ | | **Query length min** (`VLLM_PROMPT_QUERY_BUCKET_MIN`) | `block_size` |
+ | | **Query length step** (`VLLM_PROMPT_QUERY_BUCKET_STEP`) | `block_size` |
+ | | **Query length max** (`VLLM_PROMPT_QUERY_BUCKET_MAX`) | `max_num_batched_tokens` |
+ | | **Query length max abs padding** (`VLLM_PROMPT_QUERY_BUCKET_PAD_MAX`) | `max_num_batched_tokens` |
+ | | **Query length max padding %** (`VLLM_PROMPT_QUERY_BUCKET_PAD_PERCENT`)| `25` |
+ | | **Sequence ctx min** (`VLLM_PROMPT_CTX_BUCKET_MIN`) | `0` |
+ | | **Sequence ctx step** (`VLLM_PROMPT_CTX_BUCKET_STEP`) | `2` |
+ | | **Sequence ctx max** (`VLLM_PROMPT_CTX_BUCKET_MAX`) | `(max_model_len - block_size) // block_size` |
+ | | **Sequence ctx max abs padding** (`VLLM_PROMPT_CTX_BUCKET_PAD_MAX`)| `max_num_batched_tokens // block_size` |
+ | | **Sequence ctx max padding %** (`VLLM_PROMPT_CTX_BUCKET_PAD_PERCENT`)| `25` |
+ | **Decode** | **Batch size min** (`VLLM_DECODE_BS_BUCKET_MIN`) | `1` |
+ | | **Batch size step** (`VLLM_DECODE_BS_BUCKET_STEP`) | `2` |
+ | | **Batch size max** (`VLLM_DECODE_BS_BUCKET_MAX`) | `max_num_seqs` |
+ | | **Batch size max abs padding** (`VLLM_DECODE_BS_BUCKET_PAD_MAX`) | `32` |
+ | | **Batch size max padding %** (`VLLM_DECODE_BS_BUCKET_PAD_PERCENT`)| `25` |
+ | | **Block size min** (`VLLM_DECODE_BLOCK_BUCKET_MIN`) | `block_size` |
+ | | **Block size step** (`VLLM_DECODE_BLOCK_BUCKET_STEP`) | `block_size` |
+ | | **Block size max** (`VLLM_DECODE_BLOCK_BUCKET_MAX`) | `max_model_len * max_num_seqs // block_size` (default) <br> or `max_blocks` if `VLLM_CONTIGUOUS_PA=True` |
+ | | **Block size max abs padding** (`VLLM_DECODE_BLOCK_BUCKET_PAD_MAX`)| `max_num_batched_tokens * max_num_seqs // block_size` |
+ | | **Block size max padding %** (`VLLM_DECODE_BLOCK_BUCKET_PAD_PERCENT`)| `25` |
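To illustrate the naming scheme and defaults above, here is a hedged example of overriding a few of these knobs; the values are illustrative, not recommendations:

```python
import os

# Illustrative values only: cap decode batch-size buckets at 64 and
# raise the smallest prompt query-length bucket to 256. Set these
# before the engine starts so bucket generation can pick them up.
os.environ["VLLM_DECODE_BS_BUCKET_MAX"] = "64"
os.environ["VLLM_PROMPT_QUERY_BUCKET_MIN"] = "256"
```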

When a deployed workload does not use the full context a model can handle, we
recommend limiting the maximum values upfront, based on the expected input
@@ -88,7 +97,7 @@ unnecessary and you can limit the values upfront. It reduces the startup time
and warm-up. Recommended settings for this case are:

- `--max_model_len`: `3072`, which is the sum of input and output sequences (1+2)*1024.
- - `VLLM_PROMPT_SEQ_BUCKET_MAX`: `1024`, which is the maximum input token size that you expect to handle.
+ - `VLLM_PROMPT_QUERY_BUCKET_MAX`: `1024`, which is the maximum input token size that you expect to handle.

!!! note
If the model config specifies a high `max_model_len`, set it to the sum of `input_tokens` and `output_tokens`, rounded up to a multiple of `block_size` according to actual requirements.
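As a quick arithmetic check of the note above (assuming an illustrative `block_size` of 128):

```python
block_size = 128                       # illustrative value
input_tokens, output_tokens = 1024, 2048
needed = input_tokens + output_tokens  # 3072
# Round up to the next multiple of block_size.
max_model_len = -(-needed // block_size) * block_size
assert max_model_len == 3072           # already block-aligned here
```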
18 changes: 10 additions & 8 deletions tests/unit_tests/test_bucketing.py
@@ -24,24 +24,26 @@ def test_read_bucket_settings(monkeypatch):
monkeypatch.setenv("VLLM_PROMPT_BS_BUCKET_MIN", "1")
monkeypatch.setenv("VLLM_PROMPT_BS_BUCKET_STEP", "16")
monkeypatch.setenv("VLLM_PROMPT_BS_BUCKET_MAX", "64")
config = linear.read_bucket_settings("prompt", "bs", min=1, step=32, max=128)
assert config == [1, 16, 64]
monkeypatch.setenv("VLLM_PROMPT_BS_BUCKET_PAD_MAX", "32")
monkeypatch.setenv("VLLM_PROMPT_BS_BUCKET_PAD_PERCENT", "25")
config = linear.read_bucket_settings("prompt", "bs", min=1, step=32, max=128, pad_max=64, pad_percent=10)
assert config == [1, 16, 64, 32, 25]


def test_read_bucket_settings_empty_flags():
-   config = linear.read_bucket_settings("prompt", "bs", min=1, step=32, max=128)
-   assert config == [1, 32, 128]
+   config = linear.read_bucket_settings("prompt", "bs", min=1, step=32, max=128, pad_max=64, pad_percent=10)
+   assert config == [1, 32, 128, 64, 10]
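Taken together, these two tests pin down the override/fallback behavior. A minimal sketch consistent with them, assuming each parameter falls back to its keyword default when the corresponding variable is unset (`read_bucket_settings_sketch` is a hypothetical stand-in, not the real `linear.read_bucket_settings`):

```python
import os

def read_bucket_settings_sketch(phase: str, dim: str, **defaults) -> list[int]:
    """Hypothetical re-implementation for illustration only."""
    params = ["min", "step", "max", "pad_max", "pad_percent"]
    # Variables follow VLLM_{PHASE}_{DIM}_BUCKET_{PARAM}; unset ones
    # fall back to the keyword defaults.
    return [
        int(os.environ.get(f"VLLM_{phase}_{dim}_BUCKET_{p}".upper(), defaults[p]))
        for p in params
    ]
```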


def test_warmup_range():
-   config = (2, 64, 128)
-   result = linear.warmup_range(config)
+   config = (2, 64, 128, 64, 25)
+   result = linear.warmup_range_with_limits(config)
    assert result == [2, 4, 8, 16, 32, 64, 128]


def test_warmup_range_with_one():
-   config = (1, 64, 128)
-   result = linear.warmup_range(config)
+   config = (1, 64, 128, 64, 25)
+   result = linear.warmup_range_with_limits(config)
    assert result == [1, 2, 4, 8, 16, 32, 64, 128]
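One rule that reproduces both asserts, as a sketch under assumptions rather than the actual `warmup_range_with_limits`: ramp by powers of two from `min` until reaching `step`, then step linearly up to `max`; the pad fields are ignored here.

```python
def warmup_range_sketch(config: tuple) -> list[int]:
    """Hypothetical sketch; illustration only."""
    bmin, bstep, bmax, _pad_max, _pad_percent = config
    ramp_up = []                  # power-of-two ramp until the step size
    value = bmin
    while value < bstep:
        ramp_up.append(value)
        value *= 2
    stable = list(range(bstep, bmax + 1, bstep))  # linear region
    return ramp_up + stable

assert warmup_range_sketch((2, 64, 128, 64, 25)) == [2, 4, 8, 16, 32, 64, 128]
assert warmup_range_sketch((1, 64, 128, 64, 25)) == [1, 2, 4, 8, 16, 32, 64, 128]
```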


12 changes: 8 additions & 4 deletions vllm_gaudi/extension/bucketing/common.py
@@ -91,11 +91,8 @@ def read_from_file(self, is_prompt):
    def get_bucketing_strategy(self):
        strategy = None
        # TODO - we can use different strategies for decode and prompt
-       use_exponential_bucketing = True if \
-           get_config().VLLM_EXPONENTIAL_BUCKETING == None else \
-           get_config().VLLM_EXPONENTIAL_BUCKETING

-       if use_exponential_bucketing:
+       if get_config().VLLM_EXPONENTIAL_BUCKETING:
            from vllm_gaudi.extension.bucketing.exponential import (ExponentialBucketingStrategy)
            strategy = ExponentialBucketingStrategy()
        else:
@@ -152,6 +149,9 @@ def generate_prompt_buckets(self):
        bs_range = strategy.get_range(bs_cfg)
        query_range = strategy.get_range(query_cfg)
        ctx_range = strategy.get_range(ctx_cfg)
+       logger().debug(f"Prompt BS range: {bs_range}")
+       logger().debug(f"Prompt query range: {query_range}")
+       logger().debug(f"Prompt context range: {ctx_range}")

        self.prompt_buckets = generate_buckets(bs_range, query_range, ctx_range, True, self.max_model_len,
                                               self.max_num_seqs, self.max_num_prefill_seqs,
@@ -195,6 +195,10 @@ def generate_decode_buckets(self):
        if get_config().use_contiguous_pa and ctx_range[-1] < self.num_hpu_blocks:
            ctx_range.append(self.num_hpu_blocks)

+       logger().debug(f"Decode BS range: {bs_range}")
+       logger().debug(f"Decode query range: {query_range}")
+       logger().debug(f"Decode context range: {ctx_range}")
+
        self.decode_buckets = generate_buckets(bs_range, query_range, ctx_range, False, self.max_model_len,
                                               self.max_num_seqs, self.max_num_prefill_seqs,
                                               self.max_num_batched_tokens, self.block_size, self.num_hpu_blocks,