NovaSky-AI · hao-aaron · May 13, 2026 · SumanthRH · May 14, 2026 · SumanthRH
diff --git a/docs/content/docs/checkpointing-logging/meta.json b/docs/content/docs/checkpointing-logging/meta.json
@@ -2,6 +2,7 @@
   "title": "Checkpointing and Logging",
-  "title": "Checkpointing and Logging",
+  "title": "Checkpointing and Observability",
   "pages": [
     "checkpointing",
-    "logging"
+    "logging",
+    "vllm-metrics"
   ]
 }
diff --git a/docs/content/docs/checkpointing-logging/vllm-metrics.mdx b/docs/content/docs/checkpointing-logging/vllm-metrics.mdx
@@ -0,0 +1,72 @@
+---
+title: "vLLM Engine Metrics"
+---
+
+SkyRL can route vLLM's engine-level metrics (queue depth, KV cache usage,
+throughput, latency, prefix-cache hit rate) through Ray's per-node Prometheus
+metrics agents. A small fixed subset is also scraped once per training step
+and merged into the trainer's wandb payload.
+
+## Enabling
+
+This is **on by default**. To disable it:
+
+```yaml
+generator:
+  inference_engine:
+    enable_ray_prometheus_stats: false
+```
+
+When enabled, vLLM's `RayPrometheusStatLogger` is installed on every engine. Each
+engine reports its stats through `ray.util.metrics`, and Ray's per-node
+metrics agent exposes them at `http://<node-ip>:<MetricsExportPort>/metrics`
+in Prometheus text format. On Anyscale this feeds the hosted Prometheus +
+Grafana stack with no extra setup.
-in Prometheus text format. On Anyscale this feeds the hosted Prometheus +
-Grafana stack with no extra setup.
+in Prometheus text format. 
+
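Because the endpoint serves plain Prometheus text, a scrape can be inspected with a few lines of Python. The following is a minimal sketch that parses a hard-coded sample instead of a live endpoint. The metric names and labels in `SAMPLE` are invented for illustration (the names Ray actually exports may differ), and `parse_samples` is a hypothetical helper, not SkyRL's scraper.

```python
from collections import defaultdict

# Hypothetical scrape output; real metric names and labels may differ.
SAMPLE = """\
# HELP example_num_requests_running Requests currently running.
# TYPE example_num_requests_running gauge
example_num_requests_running{replica="0"} 3.0
example_num_requests_running{replica="1"} 5.0
example_kv_cache_usage_perc{replica="0"} 0.25
example_kv_cache_usage_perc{replica="1"} 0.75
"""


def parse_samples(text: str) -> dict[str, list[float]]:
    """Group sample values by metric name, ignoring labels and comment lines."""
    values: dict[str, list[float]] = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumes no trailing timestamp, so the value is the last token.
        head, raw_value = line.rsplit(None, 1)
        name = head.split("{", 1)[0]
        values[name].append(float(raw_value))
    return dict(values)


samples = parse_samples(SAMPLE)
```

Each key in `samples` maps to one value per labelled series, which is the shape needed for the per-replica aggregation described under "Metrics logged to wandb".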
+## Inference path support
+
+| Inference path                                        | Supported |
+| ----------------------------------------------------- | --------- |
+| New inference (`_SKYRL_USE_NEW_INFERENCE=1`, default) | Yes       |
+| Old inference + `generator.async_engine=true`         | Yes       |
+| Old inference + `generator.async_engine=false`        | **No**    |
+
+The new inference path ([vllm_server_actor.py:329-339](skyrl/backends/skyrl_train/inference_servers/vllm_server_actor.py#L329-L339))
+always uses `AsyncLLMEngine` and wires the stat logger unconditionally.
+
+The legacy path supports the stat logger only when `async_engine=true`
+([vllm_engine.py:359-370](skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L359-L370)).
+The synchronous `VLLMInferenceEngine` pops the flag and emits a warning
+([vllm_engine.py:240-247](skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L240-L247))
+because vLLM's sync `LLM` class doesn't accept `stat_loggers`. Set
+`generator.async_engine=true` if you need engine metrics on the legacy path.
+
+## Metrics logged to wandb
+
+When the flag is on, the trainer constructs a `VLLMMetricsScraper`
+([trainer.py:122-124](skyrl/train/trainer.py#L122-L124)) that scrapes every
+alive Ray node's metrics endpoint once per training step and merges its
+output into the wandb log payload. This is the same payload used for
+training metrics, so the keys appear under whichever logger backend is
+configured (`wandb`, `mlflow`, `swanlab`, `tensorboard`, or `console`).
+
+Both `Trainer` and `FullyAsyncTrainer` log these:
+
+| Key                                | Source                       | Aggregation                |
+| ---------------------------------- | ---------------------------- | -------------------------- |
+| `vllm/num_requests_running`        | gauge                        | sum across replicas        |
+| `vllm/num_requests_waiting`        | gauge                        | sum across replicas        |
+| `vllm/kv_cache_usage_perc`         | gauge                        | mean across replicas       |
+| `vllm/generation_throughput_tok_s` | counter delta / Δt           | summed before differencing |
+| `vllm/prompt_throughput_tok_s`     | counter delta / Δt           | summed before differencing |
+| `vllm/prefix_cache_hit_rate`       | hits Δ / queries Δ           | summed before ratio        |
+| `vllm/ttft_seconds_avg`            | histogram sum Δ / count Δ    | summed before ratio        |
+| `vllm/tpot_seconds_avg`            | histogram sum Δ / count Δ    | summed before ratio        |
+
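One reading of "summed before ratio" in the table above is that per-replica deltas are pooled first and divided once, rather than averaging per-replica averages. The sketch below contrasts the two under that assumption. The numbers are made up, and `pooled_average` and `mean_of_averages` are hypothetical names, not functions from the scraper.

```python
def pooled_average(sum_deltas: list[float], count_deltas: list[float]) -> float:
    """Sum numerators and denominators across replicas, then divide once."""
    return sum(sum_deltas) / sum(count_deltas)


def mean_of_averages(sum_deltas: list[float], count_deltas: list[float]) -> float:
    """Average each replica separately, then take the unweighted mean."""
    per_replica = [s / c for s, c in zip(sum_deltas, count_deltas)]
    return sum(per_replica) / len(per_replica)


# Illustrative TTFT histogram deltas for two replicas over one training step.
ttft_sum_deltas = [2.0, 3.0]     # seconds accumulated on each replica
ttft_count_deltas = [10.0, 2.0]  # requests observed on each replica

pooled = pooled_average(ttft_sum_deltas, ttft_count_deltas)   # 5.0 s over 12 requests
naive = mean_of_averages(ttft_sum_deltas, ttft_count_deltas)  # mean of 0.2 s and 1.5 s
```

Pooling weights every request equally, so a lightly loaded replica with a few slow requests does not dominate the reported average.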
+Rate- and ratio-style metrics need two consecutive samples to take a delta,
+so they appear starting from the **second** training step. Counter resets
+(e.g. engine restart) are skipped rather than reported as negative rates.
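The two-sample requirement and the reset handling described above can be sketched as a small stateful helper. This is an illustration under assumptions, not the scraper's code: `RateTracker` is a hypothetical class, and treating a non-positive time interval as "skip" is an added assumption.

```python
from typing import Optional


class RateTracker:
    """Hypothetical helper: turns a cumulative counter into a per-second rate."""

    def __init__(self) -> None:
        # Last (timestamp, value) pair seen, or None before the first sample.
        self._prev: Optional[tuple[float, float]] = None

    def update(self, timestamp: float, value: float) -> Optional[float]:
        prev, self._prev = self._prev, (timestamp, value)
        if prev is None:
            return None  # first sample: nothing to difference against
        dt = timestamp - prev[0]
        delta = value - prev[1]
        if dt <= 0 or delta < 0:
            return None  # counter went backwards (reset) or bad interval: skip
        return delta / dt


tracker = RateTracker()
rates = [
    tracker.update(0.0, 1000.0),   # step 1: no previous sample
    tracker.update(10.0, 2500.0),  # step 2: 1500 tokens over 10 s
    tracker.update(20.0, 200.0),   # counter dropped, as after an engine restart
    tracker.update(30.0, 900.0),   # differencing resumes from the post-reset value
]
```

Here `rates` holds `None` for the first step and for the reset step, so those steps simply have no rate to log instead of a misleading negative value.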
+
+The full set of vLLM metrics is still available via the Prometheus endpoints
+themselves; only this curated subset is forwarded to wandb. The selection
+lives in [vllm_metrics_scraper.py:27-51](skyrl/train/utils/vllm_metrics_scraper.py#L27-L51).