docs/features/warmup.md

```text
INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi
```

## Sampler Warm-Up
The sampler converts model logits into next-token selections using the configured decoding strategy (greedy or probabilistic). Its warm-up phase prepares compiled graph variants (or internal code paths) for a representative set of batch sizes and sampling-parameter combinations, so that the first real user requests avoid extra compilation and setup latency.

### How the Sampler Warm-Up Works

Implemented in `warmup_sampler`, the routine systematically exercises the sampling stack across a Cartesian set of (batch size, temperature, top-p, top-k) patterns, plus a flag that signals whether the batch size changed. Key steps:

1. Build a list of test batch sizes: it prepends `[0, 1]` to the distinct decode bucket batch sizes, since these must always be warmed up.
2. Define a list of sampling configurations (12 total) covering:
* Greedy (temperature=0.0)
* Typical random sampling (temperature=1.0)
* Creative settings (0.7/0.9/top-k=50)
* Conservative (0.3/0.95/top-k=20)
* High temperature (1.2/0.8/top-k=100)
* Top-p only variants (e.g. 0.8/0.85/top-k=0)
Each appears twice: once with `batch_changed=True` and once with `batch_changed=False`, to exercise any internal fast-path or cache-invalidation logic tied to batch resizing.
3. For every batch size:
* Create a dummy hidden state tensor shaped `(batch_size, hidden_size)` and compute logits via `model.compute_logits`.
* Instantiate dummy request objects (at least one) with placeholder prompt tokens and a single KV block.
4. For each sampling configuration:
* Update each request's `SamplingParams` (temperature, top_p, top_k).
* Mark the request as greedy or random (separate sets) to test branching.
* Populate `req_output_token_ids` with padded placeholders and refresh internal sampling metadata.
* Invoke `_run_sampling` passing `batch_changed` so both changed/unchanged batch-size code paths get compiled/exercised.
* Reset per-iteration sampler bookkeeping sets/lists.
5. After finishing all sampling configs for a batch size, clear request maps and continue.
6. Perform an HPU synchronize and log success.
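The sweep described in the steps above can be sketched as follows. This is a hedged illustration: `sampler_warmup_plan` and `decode_bucket_batch_sizes` are hypothetical stand-ins for the internal code, not the actual API.

```python
# Hedged sketch of the warm-up sweep; names here are illustrative
# stand-ins, not the real internal API of warmup_sampler.

def sampler_warmup_plan(decode_bucket_batch_sizes):
    # Step 1: batch sizes 0 and 1 are always warmed up, followed by the
    # distinct decode bucket batch sizes.
    batch_sizes = [0, 1] + sorted(set(decode_bucket_batch_sizes) - {0, 1})

    # Step 2: six (temperature, top_p, top_k) presets, each run once with
    # batch_changed=True and once with False -> 12 configurations total.
    presets = [
        (0.0, 1.0, 0),    # greedy
        (1.0, 1.0, 0),    # typical random sampling
        (0.7, 0.9, 50),   # creative
        (0.3, 0.95, 20),  # conservative
        (1.2, 0.8, 100),  # high temperature
        (0.8, 0.85, 0),   # top-p only
    ]
    configs = [(t, p, k, changed)
               for changed in (True, False)
               for (t, p, k) in presets]

    # Steps 3-5: every (batch size, config) pair is exercised once.
    return [(bs, cfg) for bs in batch_sizes for cfg in configs]
```

For the batch sizes shown in the log below (`[0, 1, 138]`), this yields 36 (batch size, config) combinations.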

### What the Logs Look Like

Typical sequence:

```text
INFO 09-22 16:39:42 [hpu_model_runner.py:3347] Warming up sampler with batch sizes: [0, 1, 138] and following configs:
INFO 09-22 16:39:42 [hpu_model_runner.py:3349] temp=0.0, top_p=1.0, top_k=0, batch_changed=True
INFO 09-22 16:39:42 [hpu_model_runner.py:3349] temp=1.0, top_p=1.0, top_k=0, batch_changed=True
INFO 09-22 16:39:42 [hpu_model_runner.py:3349] temp=0.7, top_p=0.9, top_k=50, batch_changed=True
INFO 09-22 16:39:42 [hpu_model_runner.py:3349] temp=0.3, top_p=0.95, top_k=20, batch_changed=True
INFO 09-22 16:39:42 [hpu_model_runner.py:3349] temp=1.2, top_p=0.8, top_k=100, batch_changed=True
INFO 09-22 16:39:42 [hpu_model_runner.py:3349] temp=0.8, top_p=0.85, top_k=0, batch_changed=True
INFO 09-22 16:39:42 [hpu_model_runner.py:3349] temp=0.0, top_p=1.0, top_k=0, batch_changed=False
INFO 09-22 16:39:42 [hpu_model_runner.py:3349] temp=1.0, top_p=1.0, top_k=0, batch_changed=False
INFO 09-22 16:39:42 [hpu_model_runner.py:3349] temp=0.7, top_p=0.9, top_k=50, batch_changed=False
INFO 09-22 16:39:42 [hpu_model_runner.py:3349] temp=0.3, top_p=0.95, top_k=20, batch_changed=False
INFO 09-22 16:39:42 [hpu_model_runner.py:3349] temp=1.2, top_p=0.8, top_k=100, batch_changed=False
INFO 09-22 16:39:42 [hpu_model_runner.py:3349] temp=0.8, top_p=0.85, top_k=0, batch_changed=False
INFO 09-22 16:39:42 [hpu_model_runner.py:3350] Starting sampler warmup...
INFO 09-22 16:39:43 [hpu_model_runner.py:3411] Sampler warmup completed successfully
```

If warm-up is globally skipped ([see below](#how-to-turn-it-off)), none of these lines appear.

### Why We Warm Up the Sampler (and Risks If We Do Not)

Without sampler warm-up:
* The first real request that uses a new combination (e.g., the first high-temperature + top-k path, or the first batch size seen after scaling up load) may trigger graph compilation, adding latency to that user request.
* Tail latency variance increases: early heterogeneous workloads cause multiple staggered compilations.
* Batch-size transition logic (paths where `batch_changed=True`) may pay initialization cost during live traffic.

With warm-up:
* Common sampling hyperparameter mixes are compiled ahead-of-time.
* Greedy vs random branching and metadata refresh code paths are stabilized.
* Batch growth/shrink handling is already exercised, smoothing later scaling behavior.

Skipping the sampler warm-up does not affect correctness; it only changes the latency profile of the earliest varied sampling requests.

### How to Turn It Off

There is no dedicated flag for the sampler alone. It participates in the global warm-up sequence and is skipped when:

* `VLLM_SKIP_WARMUP=true` is set.
* The engine is configured to enforce eager execution in a mode where no graph capture/compilation is desired (sampler still runs the first time on demand, but without a separate warm-up call).
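For example, to skip the entire warm-up sequence (and with it the sampler pass), set the environment variable before launching the server:

```shell
# Disables all warm-up phases, including the sampler warm-up.
export VLLM_SKIP_WARMUP=true
```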

### Related Notes & Environment Variables

* `VLLM_SKIP_WARMUP` – Disables sampler warm-up along with other warm-up phases.
* Decode bucket configuration env vars indirectly influence the set of batch sizes the sampler warms up (since it derives test batch sizes from decode buckets).

> [!NOTE]
> If you introduce new sampling behaviors (e.g., new nucleus filtering, penalties, or speculative metadata), extend `sampling_configs` in `warmup_sampler` so their graph paths are primed.

## Defragmenter Warm-Up

The defragmenter reclaims and compacts sparse KV-cache block usage at runtime by swapping sparsely used high-index blocks into lower free indices. Its warm-up phase pre-compiles the small swap graphs so that later online defragmentation can execute with near-zero graph-compile latency.

### How the Defragmenter Warm-Up Works

During the main warm-up (`warmup_model`) we call an internal method (`warmup_defragmenter`) after the KV caches and defragmenter have been initialized. The routine:

1. Verifies the feature is enabled (the defragmenter only runs when unified attention is enabled) and that swap utilities (`cache_utils`) are prepared.
2. Determines the list of padding thresholds: `[8, 16, 32, 64, 128, 256, 512]`.
3. Chooses a minimal valid swap pair `[(1, 0)]` (two distinct block IDs). Only two real blocks are required; internally each swap call is padded up to the current threshold length so that a compiled graph for that exact padded size is produced.
4. Iterates through each threshold and invokes a swap. This captures/compiles (depending on execution mode) the swap graph for that padded size.
5. If the number of thresholds is odd, performs one extra swap with the first threshold so that the sequence of swaps returns the KV cache to its original state (net zero logical change).
6. Logs completion.

Because every future real defragmentation swap request will round/pad to one of these known thresholds, all operational swap sizes hit a pre-compiled path and avoid on-demand compilation latency.
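The padding scheme from the steps above can be sketched as follows. This is a hedged illustration: `pad_swaps` and `warmup_swap_sizes` are hypothetical names, and the real swap kernel operates on the KV caches directly.

```python
# Hedged sketch of the threshold padding; the real swap kernel and its
# block-pair layout are internal details.
PADDING_THRESHOLDS = [8, 16, 32, 64, 128, 256, 512]

def pad_swaps(swaps, threshold):
    """Pad a list of (dst, src) block pairs up to `threshold` with no-op
    pairs so each call lands on a pre-compiled graph of that exact size."""
    assert len(swaps) <= threshold
    noop = (0, 0)  # assumed no-op padding pair
    return swaps + [noop] * (threshold - len(swaps))

def warmup_swap_sizes(thresholds=PADDING_THRESHOLDS):
    """Padded sizes exercised during warm-up. If the threshold count is
    odd, one extra swap at the first threshold is issued so that paired
    swaps cancel out and the KV cache is left logically unchanged."""
    sizes = [len(pad_swaps([(1, 0)], t)) for t in thresholds]
    if len(thresholds) % 2 == 1:
        sizes.append(thresholds[0])
    return sizes
```

With the seven default thresholds, the odd count means an eighth swap is issued at size 8, returning the cache to its original state.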

### What the Logs Look Like

You will typically see one of two flows. If there are at least two KV-cache blocks available:

```text
INFO 09-22 16:26:24 [hpu_model_runner.py:3428] Warming up defragmenter with thresholds: [8, 16, 32, 64, 128, 256, 512]
INFO 09-22 16:26:27 [hpu_model_runner.py:3452] Defragmenter warmup completed successfully
```

If insufficient blocks exist (e.g., an extremely small test configuration or an allocation failure), the warm-up is skipped gracefully:

```text
INFO 09-22 16:26:24 [hpu_model_runner.py:3428] Warming up defragmenter with thresholds: [8, 16, 32, 64, 128, 256, 512]
WARNING hh:mm:ss hpu_model_runner.py:#### Skipping defragmenter warmup, insufficient blocks (1)
```

Add `VLLM_DEBUG=defrag` to the environment to emit fine-grained debug messages during live defragmentation (not just the minimal warm-up swaps), such as the number of blocks swapped and post-compaction statistics.

### Why We Warm Up (and What Happens If We Do Not)

Defragmentation may be triggered mid-serving when the highest allocated block index drifts far above the actual number of in-use blocks (fragmentation). The operation itself is a sequence of swap kernels over the key and value caches. Without warm-up:

* The first fragmentation event that requires a new (previously unseen) padded swap size would incur graph capture/compilation in the critical path.
* That added latency can surface as a sudden tail-latency spike for a user request.
* Multiple different first-seen swap sizes across processes could each trigger separate compilations.

With warm-up, all representative padded sizes are compiled ahead-of-time via a deterministic, tiny swap, so online defragmentation becomes a predictable, low-latency maintenance task.

Skipping only the defragmenter warm-up does not break correctness; it only risks sporadic latency when fragmentation first crosses a threshold that mandates compaction.
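The trigger condition can be pictured roughly as follows. This is a hedged sketch under stated assumptions: `should_defragment` is a hypothetical name, and the real heuristic in the model runner may weigh things differently.

```python
def should_defragment(highest_used_block: int, num_used_blocks: int,
                      threshold: int = 32) -> bool:
    """Hypothetical trigger: compact when the highest allocated block
    index drifts more than `threshold` blocks above the number of blocks
    actually in use. The default of 32 mirrors the documented default of
    VLLM_DEFRAG_THRESHOLD; the real condition is an internal detail."""
    return highest_used_block - num_used_blocks > threshold
```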

### How to Turn It Off

You can disable (a) the warm-up step itself or (b) the entire defragmentation feature:

* Disable all warm-up phases (including defragmenter) by setting `VLLM_SKIP_WARMUP=true`.
* Run without unified attention: the defragmenter is tied to unified attention, so when unified attention is disabled, defrag is never enabled and its warm-up is a no-op. In this version there is no separate environment flag to force-enable or disable defrag beyond unified attention.
* Avoid graph compilation for defragmenter swaps by setting `VLLM_DEFRAG_WITH_GRAPHS=false` (falls back to regular execution; warm-up will still exercise swaps but without graph capture), if supported by the execution mode.

Related environment variables:

* `VLLM_DEFRAG_THRESHOLD` – Fragmentation trigger heuristic (default 32). Lower values make compaction more aggressive.
* `VLLM_DEFRAG_WITH_GRAPHS` – Whether swap paths are compiled/graphed (defaults to `bridge_mode == eager`).
* `VLLM_DEBUG=defrag` – Enables verbose defragmentation debug logging.
* `VLLM_SKIP_WARMUP` – Disables all warm-up stages including this one.
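A minimal environment sketch combining the variables above; the values are for illustration, not recommendations:

```shell
export VLLM_DEFRAG_THRESHOLD=16       # compact more aggressively than the default 32
export VLLM_DEFRAG_WITH_GRAPHS=false  # run swap paths without graph capture
export VLLM_DEBUG=defrag              # verbose defragmentation logging
```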

> [!NOTE]
> Disabling defragmenter warm-up does not disable defragmentation itself (unless unified attention/the feature is off). It only removes ahead-of-time graph preparation, potentially pushing compile cost into the first live fragmentation event.