diff --git a/docs/configuration/long_context.md b/docs/configuration/long_context.md
index 0ec99f46..64cef30b 100644
--- a/docs/configuration/long_context.md
+++ b/docs/configuration/long_context.md
@@ -15,17 +15,14 @@ Set the following environment variables to avoid OOM/functional issues. Additio

 - `VLLM_ENGINE_ITERATION_TIMEOUT_S=3600`
 - `VLLM_RPC_TIMEOUT=100000`
-- `VLLM_PROMPT_USE_FUSEDSDPA=1`
-- `PT_HPU_ENABLE_LAZY_COLLECTIVES=true`
-- `PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1`
 - `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1`

-**32K context length flags examples:**
+## Warmup bucket preparation
+The exponential bucketing mechanism automatically prepares buckets for long context. The linear bucketing mechanism requires manual flag settings.
+
+**32K context length flag examples for linear warmup:**

 - `VLLM_GRAPH_RESERVED_MEM`: The value depends on the model and context length settings. Use `VLLM_GRAPH_RESERVED_MEM=0.02` for Llama3.1-8B or `VLLM_GRAPH_RESERVED_MEM=0.1` for Llama3.1-70B.
-- `VLLM_PROMPT_BS_BUCKET_MIN=1`: Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs.
-- `VLLM_PROMPT_BS_BUCKET_STEP=16`: Suggested value, depends on the model. Increasing the step value results in fewer buckets. If an OOM error occurs, the value should be increased.
-- `VLLM_PROMPT_BS_BUCKET_MAX=16`: Suggested value, depends on the model. You can increase it until you reach an OOM error or decrease it if OOM occurs.
 - `VLLM_PROMPT_SEQ_BUCKET_MIN=24576`: Suggested value, depends on warmup results.
 - `VLLM_PROMPT_SEQ_BUCKET_STEP=2048`: Suggested value, depends on warmup results. It is recommended to increase it to a higher value for faster warmup. `VLLM_PROMPT_SEQ_BUCKET_STEP=16384` - Suggested value for Intel Gaudi 3.
 - `VLLM_PROMPT_SEQ_BUCKET_MAX=32768`: Value for context length of 32K. Use 16384 for 16K.
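
Taken together, the post-patch recommendations compose into a launch script along these lines. This is a minimal sketch for linear warmup at a 32K context length: the model name and the `vllm serve` invocation are illustrative assumptions rather than part of this patch, and `VLLM_GRAPH_RESERVED_MEM=0.02` follows the Llama3.1-8B suggestion above.

```bash
#!/usr/bin/env bash
# Sketch: serving Llama3.1-8B with a 32K context using linear warmup buckets.
# The model name and serve command are illustrative; tune values per the notes above.

# General long-context settings
export VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
export VLLM_RPC_TIMEOUT=100000
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

# Linear-warmup bucket flags (exponential bucketing prepares buckets automatically)
export VLLM_GRAPH_RESERVED_MEM=0.02        # 0.1 suggested for Llama3.1-70B
export VLLM_PROMPT_SEQ_BUCKET_MIN=24576
export VLLM_PROMPT_SEQ_BUCKET_STEP=2048    # 16384 suggested for Intel Gaudi 3
export VLLM_PROMPT_SEQ_BUCKET_MAX=32768    # use 16384 for a 16K context

vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 32768
```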