67 changes: 29 additions & 38 deletions README.md
@@ -40,51 +40,42 @@ It provides detailed insights into model serving performance, offering both a us
- 📝 **Rich Logs**: Automatically flushed to both terminal and file upon experiment completion.
- 📈 **Experiment Analyzer**: Generates comprehensive Excel reports with pricing and raw metrics data, plus flexible plot configurations (default 2x4 grid) that visualize key performance metrics including throughput, latency (TTFT, E2E, TPOT), error rates, and RPS across different traffic scenarios and concurrency levels. Supports custom plot layouts and multi-line comparisons.

## How to Start
## Installation

Please check [User Guide](https://docs.sglang.ai/genai-bench/user-guide/) and [CONTRIBUTING.md](https://docs.sglang.ai/genai-bench/development/contributing/) for how to install and use genai-bench.
**Quick Start**: Install with `pip install genai-bench`.
Alternatively, see the [Installation Guide](https://docs.sglang.ai/genai-bench/getting-started/installation) for other options.

## Benchmark Metrics Definition
## How to Use

This section brings together the standard metrics required for LLM serving performance analysis. We classify metrics into two types: **single-request level metrics**, which are collected from a single request, and **aggregated level metrics**, which summarize the single-request metrics from one run (with a specific traffic scenario and concurrency level).
### Quick Start

**NOTE**:
1. **Run a benchmark** against your model:
```bash
genai-bench benchmark --api-backend openai \
--api-base "http://localhost:8080" \
--api-key "your-api-key" \
--api-model-name "your-model" \
--task text-to-text \
--max-time-per-run 5 \
--max-requests-per-run 100
```

- Each single-request metric includes standard statistics: **percentile**, **min**, **max**, **stddev**, and **mean**.
- The following metrics cover **input**, **output**, and **end-to-end (e2e)** stages. For *chat* tasks, all stages are relevant for evaluation. For *embedding* tasks, where there is no output stage, output metrics will be set to 0. For details about output metrics collection, please check out `OUTPUT_METRICS_FIELDS` in [metrics.py](genai_bench/metrics/metrics.py).
2. **Generate Excel reports** from your results:
```bash
genai-bench excel --experiment-folder ./experiments/your_experiment \
--excel-name results --metric-percentile mean
```

### Single Request Level Metrics
3. **Create visualizations**:
```bash
genai-bench plot --experiments-folder ./experiments \
--group-key traffic_scenario --preset 2x4_default
```

The following metrics capture token-level performance, providing insights into server efficiency for each individual request.
### Next Steps

| Glossary | Meaning | Calculation Formula | Units |
|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------|---------------|
| TTFT                    | Time to First Token. The initial response time, measured when the first output token is generated. <br/> This is also known as the latency of the input stage.            | `TTFT = time_at_first_token - start_time`                       | seconds       |
| End-to-End Latency | End-to-End latency. This metric indicates how long it takes from submitting a query to receiving the full response, including network latencies. | `e2e_latency = end_time - start_time` | seconds |
| TPOT | Time Per Output Token. The average time between two subsequent generated tokens. | `TPOT = (e2e_latency - TTFT) / (num_output_tokens - 1)` | seconds |
| Output Latency | Output latency. This metric indicates how long it takes to receive the full response after the first token is generated. | `output_latency = e2e_latency - TTFT` | seconds |
| Output Inference Speed | The rate of how many tokens the model can generate per second for a single request. | `inference_speed = 1 / TPOT` | tokens/second |
| Num of Input Tokens     | Number of prompt tokens.                                                                                                                                                    | `num_input_tokens = len(tokenizer.encode(prompt))`              | tokens        |
| Num of Output Tokens | Number of output tokens. | `num_output_tokens = num_completion_tokens` | tokens |
| Num of Request Tokens | Total number of tokens processed in one request. | `num_request_tokens = num_input_tokens + num_output_tokens` | tokens |
| Input Throughput        | The overall throughput of input processing for a single request.                                                                                                            | `input_throughput = num_input_tokens / TTFT`                    | tokens/second |
| Output Throughput | The throughput of output (output generation) for a single request. | `output_throughput = (num_output_tokens - 1) / output_latency` | tokens/second |
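
The formulas above map directly onto per-request timing data. Below is a minimal Python sketch of how these single-request metrics could be computed from hypothetical timestamps and token counts; the field and function names are illustrative and are not taken from genai-bench's actual code.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request timing data; names are illustrative."""
    start_time: float            # when the request was submitted (seconds)
    time_at_first_token: float   # when the first output token arrived
    end_time: float              # when the full response arrived
    num_input_tokens: int
    num_output_tokens: int


def single_request_metrics(r: RequestTiming) -> dict:
    """Compute single-request metrics following the formulas in the table above."""
    ttft = r.time_at_first_token - r.start_time
    e2e_latency = r.end_time - r.start_time
    output_latency = e2e_latency - ttft
    # TPOT is undefined when fewer than two output tokens are generated.
    tpot = output_latency / (r.num_output_tokens - 1) if r.num_output_tokens > 1 else None
    return {
        "ttft": ttft,
        "e2e_latency": e2e_latency,
        "output_latency": output_latency,
        "tpot": tpot,
        "output_inference_speed": (1 / tpot) if tpot else None,
        "num_request_tokens": r.num_input_tokens + r.num_output_tokens,
        "input_throughput": r.num_input_tokens / ttft if ttft > 0 else None,
        "output_throughput": (r.num_output_tokens - 1) / output_latency
        if output_latency > 0 and r.num_output_tokens > 1 else None,
    }
```
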
For detailed instructions, advanced configuration options, and comprehensive examples, check out the [User Guide](https://docs.sglang.ai/genai-bench/user-guide/).

### Aggregated Metrics
## Development

This set of metrics summarizes the single-request metrics under a specific traffic load pattern, defined by the traffic scenario and the number of concurrent requests. It provides insights into server capacity and performance under load.

| Glossary | Meaning | Calculation Formula | Units |
|---------------------------|------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|---------------|
| Mean Input Throughput      | The average rate at which the model processes input tokens in one run with multiple concurrent requests.                      | `mean_input_throughput = sum(input_tokens_for_all_requests) / run_duration`                  | tokens/second |
| Mean Output Throughput     | The average rate at which the model generates output tokens in one run with multiple concurrent requests.                     | `mean_output_throughput = sum(output_tokens_for_all_requests) / run_duration`                | tokens/second |
| Total Tokens Throughput    | The average rate at which the model processes tokens, including both input and output tokens.                                 | `mean_total_tokens_throughput = all_requests["total_tokens"]["sum"] / run_duration`          | tokens/second |
| Total Chars Per Hour[^1]   | The average number of characters the model can process per hour.                                                              | `total_chars_per_hour = total_tokens_throughput * dataset_chars_to_token_ratio * 3600`       | chars/hour    |
| Requests Per Minute        | The number of requests completed by the model per minute.                                                                     | `num_completed_requests_per_min = num_completed_requests / (end_time - start_time) * 60`     | requests/min  |
| Error Codes to Frequency   | A map from each returned error status code to its frequency.                                                                  |                                                                                               |               |
| Error Rate | The rate of error requests over total requests. | `error_rate = num_error_requests / num_requests` | |
| Num of Error Requests | The number of error requests in one load. | <pre><code>if requests.status_code != '200': <br/> num_error_requests += 1</code></pre> | |
| Num of Completed Requests | The number of completed requests in one load. | <pre><code>if requests.status_code == '200': <br/> num_completed_requests += 1</code></pre> | |
| Num of Requests | The total number of requests processed for one load. | `total_requests = num_completed_requests + num_error_requests` | |

[^1]: *Total Chars Per Hour* is derived from a character-to-token ratio based on sonnet.txt and the model’s tokenizer. This metric aids in pricing decisions for an LLM serving solution. For tasks with multi-modal inputs, non-text tokens are converted to an equivalent character count using the same character-to-token ratio.
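
To make the aggregation concrete, here is a minimal Python sketch of how these run-level metrics could be derived from a list of per-request results. The data structure, field names, and `chars_to_token_ratio` argument are hypothetical and do not reflect genai-bench's internal implementation.

```python
from collections import Counter


def aggregate_metrics(requests: list[dict], run_duration: float,
                      chars_to_token_ratio: float) -> dict:
    """Aggregate per-request results for one run (one traffic scenario + concurrency level).

    Each entry in `requests` is assumed to carry 'status_code', 'num_input_tokens',
    and 'num_output_tokens'; these names are illustrative. `chars_to_token_ratio`
    stands in for the dataset character-to-token ratio described in the footnote.
    """
    completed = [r for r in requests if r["status_code"] == 200]
    errored = [r for r in requests if r["status_code"] != 200]

    input_tokens = sum(r["num_input_tokens"] for r in completed)
    output_tokens = sum(r["num_output_tokens"] for r in completed)
    total_tokens_throughput = (input_tokens + output_tokens) / run_duration

    return {
        "mean_input_throughput": input_tokens / run_duration,
        "mean_output_throughput": output_tokens / run_duration,
        "mean_total_tokens_throughput": total_tokens_throughput,
        "total_chars_per_hour": total_tokens_throughput * chars_to_token_ratio * 3600,
        # run_duration stands in for (end_time - start_time) of the run.
        "requests_per_minute": len(completed) / run_duration * 60,
        "error_codes_to_frequency": Counter(r["status_code"] for r in errored),
        "error_rate": len(errored) / len(requests) if requests else 0.0,
        "num_error_requests": len(errored),
        "num_completed_requests": len(completed),
        "num_requests": len(requests),
    }
```
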
If you are interested in contributing to GenAI-Bench, you can use the [Development Guide](https://docs.sglang.ai/genai-bench/development/).
4 changes: 2 additions & 2 deletions docs/.config/mkdocs-gh-pages.yml
@@ -116,14 +116,14 @@ nav:
- getting-started/index.md
- Installation: getting-started/installation.md
- Task Definition: getting-started/task-definition.md
- Command Guidelines: getting-started/command-guidelines.md
- Entrypoints: getting-started/entrypoints.md
- Metrics Definition: getting-started/metrics-definition.md
- User Guide:
- user-guide/index.md
- Run Benchmark: user-guide/run-benchmark.md
- Traffic Scenarios: user-guide/scenario-definition.md
- Multi-Cloud Authentication: user-guide/multi-cloud-auth-storage.md
- Quick Reference: user-guide/multi-cloud-quick-reference.md
- Multi-Cloud Quick Reference: user-guide/multi-cloud-quick-reference.md
- Docker Deployment: user-guide/run-benchmark-using-docker.md
- Excel Reports: user-guide/generate-excel-sheet.md
- Visualizations: user-guide/generate-plot.md
9 changes: 3 additions & 6 deletions docs/.config/mkdocs.yml
@@ -123,15 +123,12 @@
- Run Benchmark: user-guide/run-benchmark.md
- Traffic Scenarios: user-guide/scenario-definition.md
- Multi-Cloud Authentication: user-guide/multi-cloud-auth-storage.md
- Quick Reference: user-guide/multi-cloud-quick-reference.md
- Multi-Cloud Quick Reference: user-guide/multi-cloud-quick-reference.md
Collaborator: Why do we have a quick reference for multi-cloud and an authentication page?

Collaborator (Author): The quick reference is more concise and is filled with very easy-to-use examples, while the multi-cloud auth page is much more thorough and much longer. We could combine these in theory, but I think the resulting page would be incredibly long and a bit hard to navigate. I didn't make any content changes to those pages; I just renamed the page from 'Quick Reference' to 'Multi-Cloud Quick Reference' in the navigation bar, since calling it 'Quick Reference' was misleading for what the page showed.

- Docker Deployment: user-guide/run-benchmark-using-docker.md
- Excel Reports: user-guide/generate-excel-sheet.md
- Visualizations: user-guide/generate-plot.md
- Upload Results: user-guide/upload-benchmark-result.md
- Examples:
- examples/index.md
- Development:
- development/index.md
- Contributing: development/contributing.md
- API Reference:
- api/index.md
- Adding New Features: development/adding-new-features.md
- API Reference: development/api-reference.md
98 changes: 0 additions & 98 deletions docs/api/index.md

This file was deleted.
