vllm metrics docs #1662
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,6 +2,7 @@ | |
| "title": "Checkpointing and Logging", | ||
| "pages": [ | ||
| "checkpointing", | ||
| - "logging" | ||
| + "logging", | ||
| + "vllm-metrics" | ||
| ] | ||
| } | ||
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,72 @@ | ||||||||
| --- | ||||||||
| title: "vLLM Engine Metrics" | ||||||||
| --- | ||||||||
|
|
||||||||
| SkyRL can route vLLM's engine-level metrics (queue depth, KV cache usage, | ||||||||
| throughput, latency, prefix-cache hit rate) through Ray's per-node Prometheus | ||||||||
| metrics agents. A small fixed subset is also scraped once per training step | ||||||||
| and merged into the trainer's wandb payload. | ||||||||
|
|
||||||||
| ## Enabling | ||||||||
|
|
||||||||
| Engine metrics export is **on by default**. To disable it: | ||||||||
|
|
||||||||
| ```yaml | ||||||||
| generator: | ||||||||
| inference_engine: | ||||||||
| enable_ray_prometheus_stats: false | ||||||||
| ``` | ||||||||
|
|
||||||||
| When enabled, vLLM's `RayPrometheusStatLogger` is installed on every engine. Each | ||||||||
| engine reports its stats through `ray.util.metrics`, and Ray's per-node | ||||||||
| metrics agent exposes them at `http://<node-ip>:<MetricsExportPort>/metrics` | ||||||||
| in Prometheus text format. On Anyscale this feeds the hosted Prometheus + | ||||||||
| Grafana stack with no extra setup. | ||||||||
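To make the endpoint description above concrete, here is a minimal sketch (not SkyRL's scraper) that fetches one node's `/metrics` page and parses the Prometheus text format into samples. The helper names `parse_prometheus_text`, `fetch_metrics_text`, and `filter_by_name` are hypothetical. The parser covers only the common `name{labels} value [timestamp]` line shape, and you must supply the node IP and port yourself.

```python
import re
import urllib.request
from typing import Dict, List, Tuple

# (metric name, labels, value)
Sample = Tuple[str, Dict[str, str], float]

# Metric name, optional {labels}, value, optional integer timestamp.
_LINE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# One label pair, e.g. engine="0". Escape sequences are left as-is.
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_prometheus_text(text: str) -> List[Sample]:
    """Parse Prometheus text-format lines, skipping comments and bad lines."""
    samples: List[Sample] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a # HELP / # TYPE comment
        match = _LINE_RE.match(line)
        if match is None:
            continue  # line shape this simplified parser does not handle
        name, label_blob, value = match.groups()
        labels = dict(_LABEL_RE.findall(label_blob or ""))
        try:
            samples.append((name, labels, float(value)))
        except ValueError:
            continue  # value was not a number
    return samples


def fetch_metrics_text(url: str, timeout_s: float = 5.0) -> str:
    """Download the raw metrics page from a URL the caller provides."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


def filter_by_name(samples: List[Sample], needle: str) -> List[Sample]:
    """Keep samples whose metric name contains the given substring."""
    return [s for s in samples if needle in s[0]]


# Usage (needs a reachable node; replace the placeholders first):
#   text = fetch_metrics_text("http://<node-ip>:<MetricsExportPort>/metrics")
#   for name, labels, value in filter_by_name(parse_prometheus_text(text), "vllm"):
#       print(name, labels, value)
```

Filtering by a name substring is one way to find a particular family of series, such as KV-cache-related ones. The exact exported metric names depend on the vLLM and Ray versions in use, so none are assumed here.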
|
Comment on lines +23 to +24
Member
Suggested change
|
||||||||
|
|
||||||||
| ## Inference path support | ||||||||
|
|
||||||||
| | Inference path | Supported | | ||||||||
| | ----------------------------------------------- | --------- | | ||||||||
| | New inference (`_SKYRL_USE_NEW_INFERENCE=1`, default) | Yes | | ||||||||
| | Old inference + `generator.async_engine=true` | Yes | | ||||||||
| | Old inference + `generator.async_engine=false` | **No** | | ||||||||
|
|
||||||||
| The new inference path ([vllm_server_actor.py:329-339](skyrl/backends/skyrl_train/inference_servers/vllm_server_actor.py#L329-L339)) | ||||||||
|
Contributor
The link path |
||||||||
| always uses `AsyncLLMEngine` and wires the stat logger unconditionally. | ||||||||
|
|
||||||||
| The legacy path supports engine metrics only when `generator.async_engine=true` | ||||||||
| ([vllm_engine.py:359-370](skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L359-L370)). | ||||||||
|
Contributor
||||||||
| The synchronous `VLLMInferenceEngine` pops the flag and emits a warning | ||||||||
| ([vllm_engine.py:240-247](skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L240-L247)): | ||||||||
|
Contributor
||||||||
| vLLM's sync `LLM` class doesn't accept `stat_loggers`. Set | ||||||||
| `generator.async_engine=true` if you need engine metrics on the legacy path. | ||||||||
|
|
||||||||
| ## Metrics logged to wandb | ||||||||
|
|
||||||||
| When the flag is on, the trainer constructs a `VLLMMetricsScraper` | ||||||||
| ([trainer.py:122-124](skyrl/train/trainer.py#L122-L124)) that scrapes every | ||||||||
|
Contributor
||||||||
| alive Ray node's metrics endpoint once per training step and merges its | ||||||||
| output into the trainer's log payload. This is the same payload used for | ||||||||
| training metrics, so the keys appear under whichever logger backend is | ||||||||
| configured (`wandb`, `mlflow`, `swanlab`, `tensorboard`, or `console`). | ||||||||
|
|
||||||||
| Both `Trainer` and `FullyAsyncTrainer` log these: | ||||||||
|
|
||||||||
| | Key | Source | Aggregation | | ||||||||
| | ---------------------------------- | ---------------------------- | -------------------------- | | ||||||||
| | `vllm/num_requests_running` | gauge | sum across replicas | | ||||||||
| | `vllm/num_requests_waiting` | gauge | sum across replicas | | ||||||||
| | `vllm/kv_cache_usage_perc` | gauge | mean across replicas | | ||||||||
| | `vllm/generation_throughput_tok_s` | counter delta / Δt | summed before differencing | | ||||||||
| | `vllm/prompt_throughput_tok_s` | counter delta / Δt | summed before differencing | | ||||||||
| | `vllm/prefix_cache_hit_rate` | hits Δ / queries Δ | summed before ratio | | ||||||||
| | `vllm/ttft_seconds_avg` | histogram sum Δ / count Δ | summed before ratio | | ||||||||
| | `vllm/tpot_seconds_avg` | histogram sum Δ / count Δ | summed before ratio | | ||||||||
|
|
||||||||
| Rate- and ratio-style metrics need two consecutive samples to take a delta, | ||||||||
| so they appear starting from the **second** training step. Counter resets | ||||||||
| (e.g. engine restart) are skipped rather than reported as negative rates. | ||||||||
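The delta-based metrics in the table can be sketched as follows. This is an illustration of the arithmetic described above, not the `VLLMMetricsScraper` code. The function names are hypothetical, and returning `None` for a zero denominator or non-positive interval is an assumption rather than documented behavior.

```python
from typing import Iterable, Optional


def sum_across_replicas(values: Iterable[float]) -> float:
    """Sum one counter over all engine replicas before taking any delta."""
    return float(sum(values))


def rate_per_second(prev_total: float, curr_total: float, dt_s: float) -> Optional[float]:
    """Counter delta divided by elapsed seconds (throughput-style metrics)."""
    delta = curr_total - prev_total
    if dt_s <= 0 or delta < 0:
        # A negative delta means a counter reset (e.g. engine restart): skip it.
        return None
    return delta / dt_s


def delta_ratio(prev_num: float, curr_num: float,
                prev_den: float, curr_den: float) -> Optional[float]:
    """Numerator delta over denominator delta.

    Covers hit rate (hits / queries) and histogram averages (sum / count).
    """
    d_num = curr_num - prev_num
    d_den = curr_den - prev_den
    if d_num < 0 or d_den <= 0:
        # Reset, or no new observations in this interval: nothing to report.
        return None
    return d_num / d_den


# Two replicas sampled at consecutive steps, 10 seconds apart.
prev_tokens = sum_across_replicas([100.0, 200.0])
curr_tokens = sum_across_replicas([250.0, 350.0])
throughput = rate_per_second(prev_tokens, curr_tokens, dt_s=10.0)
```

Since the first step has no previous sample to difference against, a scraper built this way has nothing to emit for these keys until the second step.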
|
|
||||||||
| The full set of vLLM metrics is still available via the Prometheus endpoints | ||||||||
| themselves; only this curated subset is forwarded to the trainer's logger. The selection | ||||||||
| lives in [vllm_metrics_scraper.py:27-51](skyrl/train/utils/vllm_metrics_scraper.py#L27-L51). | ||||||||
|
Contributor
Comment on lines +70 to +72
Member
Can you provide an example for querying KV Cache Residency metrics like Lifetime here? |
||||||||