
Add vLLM's health and metrics endpoints #263

Merged 1 commit into MeetKai:main on Sep 4, 2024

Conversation

QwertyJack commented:

Fix #260

  • /health: reuse vLLM's endpoint by reassigning the variable it reads.
  • /metrics: reuse vLLM's endpoint via mount_metrics (see the sketch below).
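
A minimal sketch of the idea (not the PR's exact diff), assuming vLLM's OpenAI server module exposes mount_metrics and that an engine handle is available; import paths and the example model name vary by vLLM version and deployment:

```python
# Sketch only, not the PR's exact diff: reuse vLLM's /health and /metrics
# on a custom FastAPI server. Import paths vary across vLLM versions.
from fastapi import FastAPI, Response
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.entrypoints.openai.api_server import mount_metrics

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="meetkai/functionary-small-v2.5")  # example model
)

# /metrics: mount_metrics attaches the Prometheus client's ASGI app,
# exposing the vllm:* gauges, counters, and histograms for scraping.
mount_metrics(app)

# /health: same behavior as vLLM's endpoint, backed by our engine handle
# (the PR achieves the reuse by reassigning the variable vLLM's route reads).
@app.get("/health")
async def health() -> Response:
    await engine.check_health()  # raises if the engine is unhealthy
    return Response(status_code=200)
```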

@jeffreymeetkai (Collaborator) left a comment:

LGTM! Thanks for this! 🚀

@jeffreymeetkai merged commit 2ade257 into MeetKai:main on Sep 4, 2024
3 checks passed
@nskumz commented Oct 18, 2024:

But the metrics were not enabled; everything was commented out with `#`. To get these into Prometheus, they need to be enabled:

# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
# HELP vllm:num_requests_swapped Number of requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
# HELP vllm:cpu_prefix_cache_hit_rate CPU prefix cache block hit rate.
# TYPE vllm:cpu_prefix_cache_hit_rate gauge
# HELP vllm:gpu_prefix_cache_hit_rate GPU prefix cache block hit rate.
# TYPE vllm:gpu_prefix_cache_hit_rate gauge
# HELP vllm:num_preemptions_total Cumulative number of preemption from the engine.
# TYPE vllm:num_preemptions_total counter
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
# HELP vllm:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE vllm:time_per_output_token_seconds histogram
# HELP vllm:e2e_request_latency_seconds Histogram of end to end request latency in seconds.
# TYPE vllm:e2e_request_latency_seconds histogram
# HELP vllm:request_prompt_tokens Number of prefill tokens processed.
# TYPE vllm:request_prompt_tokens histogram
# HELP vllm:request_generation_tokens Number of generation tokens processed.
# TYPE vllm:request_generation_tokens histogram
# HELP vllm:request_params_best_of Histogram of the best_of request parameter.
# TYPE vllm:request_params_best_of histogram
# HELP vllm:request_params_n Histogram of the n request parameter.
# TYPE vllm:request_params_n histogram
# HELP vllm:request_success_total Count of successfully processed requests.
# TYPE vllm:request_success_total counter

@QwertyJack (Author) replied:

If you see nothing right after a fresh start, send a first request and check again; the vllm:* series are only populated once the engine has processed requests.
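
For example, a quick check along these lines (hypothetical localhost:8000 base URL and placeholder model name; assumes the OpenAI-compatible completions route is served by the same app):

```python
# Quick check: send one completion request, then scrape /metrics.
# Base URL and model name are placeholders for your deployment.
import json
import urllib.request

BASE = "http://localhost:8000"

payload = json.dumps({
    "model": "my-model",  # placeholder model name
    "prompt": "Hello",
    "max_tokens": 8,
}).encode()

req = urllib.request.Request(
    f"{BASE}/v1/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req).read()  # the first request populates the stats

metrics = urllib.request.urlopen(f"{BASE}/metrics").read().decode()
# Sample lines (not just "# HELP" / "# TYPE") should now appear:
print("\n".join(l for l in metrics.splitlines() if l.startswith("vllm:")))
```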

@nskumz commented Oct 18, 2024:

It worked, thanks!

Merging this pull request closes issue #260 (vLLM server capabilities).