1 change: 1 addition & 0 deletions .gitignore
@@ -47,3 +47,4 @@ genai_bench*.log
# MkDocs
site/
.cache/
together_text-to-text_*/
53 changes: 53 additions & 0 deletions README.md
@@ -40,10 +40,63 @@ It provides detailed insights into model serving performance, offering both a us
- 📝 **Rich Logs**: Automatically flushed to both terminal and file upon experiment completion.
- 📈 **Experiment Analyzer**: Generates comprehensive Excel reports with pricing and raw metrics data, plus flexible plot configurations (default 2x4 grid) that visualize key performance metrics including throughput, latency (TTFT, E2E, TPOT), error rates, and RPS across different traffic scenarios and concurrency levels. Supports custom plot layouts and multi-line comparisons.

- 🧪 **Synthetic Tore-style prompts (optional)**: Generate synthetic requests that mimic tore-speed’s synthetic dataset prep, including a cached prefix region and exact input/output token counts for precise performance experiments.

### Open-loop QPS mode (non-Locust)

- Enable with `--non-locust` to use an open-loop arrival process (tore-speed style). Arrivals are scheduled globally by inter-arrival intervals; completions may lag depending on server speed.
- Use `--qps-level` (repeatable; floats allowed) to specify QPS levels and `--qps-distribution` (uniform|exponential|constant) for inter-arrival sampling.
- The duration of each QPS level comes from `--max-time-per-run` (in minutes; floats allowed), which is converted to seconds internally.
- Example (tore-speed-compatible synthetic run):

```bash
genai-bench benchmark \
--non-locust \
--qps-level 0.1 --qps-level 0.3 \
--qps-distribution uniform \
--max-requests-per-run 1500 --max-time-per-run 2 \
--api-backend together --api-base https://api.together.xyz \
--api-model-name <model> --model-tokenizer <hf-tokenizer> \
--task text-to-text \
--traffic-scenario "D(10000,825)" \
--synthetic --synthetic-cached-input-length 3000
```

Notes:
- Arrival rate (QPS) is the planned schedule; observed RPS depends on how many requests complete within the time window (a scheduler sketch follows the notes).
- In synthetic mode, dataset file loading is skipped; prompts are constructed to exact token counts with a cached prefix region matching tore-speed semantics.
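
A minimal sketch of the open-loop idea (illustrative only: `submit_request_async` is a hypothetical dispatcher, and the uniform branch assumes a 0 to 2x mean-gap window; neither is the genai-bench implementation):

```python
import random
import time


def inter_arrival(qps: float, distribution: str) -> float:
    """Sample one inter-arrival gap in seconds for a given QPS level."""
    mean_gap = 1.0 / qps
    if distribution == "constant":
        return mean_gap
    if distribution == "uniform":
        return random.uniform(0.0, 2.0 * mean_gap)  # mean gap is still 1/qps
    if distribution == "exponential":
        return random.expovariate(qps)  # Poisson arrival process
    raise ValueError(f"unknown distribution: {distribution}")


def submit_request_async() -> None:
    """Placeholder for a non-blocking request dispatch (thread/task/async)."""


qps, duration_s = 0.3, 120  # e.g. --qps-level 0.3, --max-time-per-run 2 (minutes)
start = time.monotonic()
while time.monotonic() - start < duration_s:
    # Open loop: the next send is never delayed by in-flight requests.
    submit_request_async()
    time.sleep(inter_arrival(qps, distribution="uniform"))
```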

## How to Start

Please check [User Guide](https://docs.sglang.ai/genai-bench/user-guide/) and [CONTRIBUTING.md](https://docs.sglang.ai/genai-bench/development/contributing/) for how to install and use genai-bench.

### Synthetic data mode (tore-speed compatible)

Genai-bench can synthesize prompts in the style of tore-speed’s `--dataset_type synthetic`, with a fixed-size cached prefix and exact token counts enforced at the tokenizer level.

- Enable with the `--synthetic` flag and provide a deterministic traffic scenario for input/output tokens (e.g., `D(10000,825)`).
- Specify the cached prefix size (in tokens) with `--synthetic-cached-input-length`.

Example (concurrency mode):

```bash
genai-bench benchmark \
--api-backend together \
--api-base https://api.together.xyz \
--api-model-name <model> \
--model-tokenizer <hf-tokenizer> \
--task text-to-text \
--traffic-scenario "D(10000,825)" \
--max-requests-per-run 1500 --max-time-per-run 2 \
--num-concurrency 128 --spawn-rate 128 \
--synthetic --synthetic-cached-input-length 3000 \
--additional-request-params '{"stream": true}'
```

Notes:
- The sampler ensures the prompt contains exactly the requested number of input tokens. The leading `--synthetic-cached-input-length` tokens are filled with a repeated base phrase to emulate a cacheable prefix; a unique marker and a long instruction are appended to the uncached suffix region.
- This is useful for cache stress tests and apples-to-apples comparisons with tore-speed’s synthetic mode (a rough sketch of the construction follows below).
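
A rough sketch of that layout, assuming a Hugging Face tokenizer (the helper names and base phrase are illustrative, not the actual sampler, and exact counts can shift slightly when a decoded string is re-encoded):

```python
from transformers import AutoTokenizer


def tile(ids: list[int], n: int) -> list[int]:
    """Tile a token-id list to exactly n tokens."""
    return (ids * (n // len(ids) + 1))[:n]


def build_synthetic_prompt(tok, input_tokens: int, cached_tokens: int, request_id: int) -> str:
    base = tok.encode("hello world. ", add_special_tokens=False)
    prefix = tile(base, cached_tokens)  # shared, cacheable region
    # A unique marker makes each request's suffix uncacheable.
    marker = tok.encode(f" request {request_id}: ", add_special_tokens=False)
    suffix = tile(base, input_tokens - cached_tokens - len(marker))
    ids = prefix + marker + suffix
    assert len(ids) == input_tokens
    return tok.decode(ids)


tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for <hf-tokenizer>
prompt = build_synthetic_prompt(tok, input_tokens=10000, cached_tokens=3000, request_id=0)
```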

## Benchmark Metrics Definition

This section collects the standard metrics required for LLM serving performance analysis. We classify metrics into two types: **single-request-level metrics**, collected from a single request, and **aggregated-level metrics**, which summarize the single-request metrics from one run (with a specific traffic scenario and concurrency level).
138 changes: 138 additions & 0 deletions TOGETHER_AI_INTEGRATION.md
@@ -0,0 +1,138 @@
# Together AI Integration

This document describes the Together AI backend integration for genai-bench.

## Overview

The Together AI backend has been fully integrated into genai-bench, allowing you to benchmark models hosted on Together AI's platform.

## Features

- **Chat Completions**: Support for text-to-text and image-text-to-text tasks
- **Embeddings**: Support for text-to-embeddings tasks
- **Streaming**: Full support for streaming responses
- **Authentication**: API key-based authentication

## Usage

### Basic Usage

```bash
genai-bench benchmark \
--api-backend together \
--api-base https://api.together.xyz \
--api-key YOUR_TOGETHER_API_KEY \
--api-model-name meta-llama/Llama-2-7b-chat-hf \
--task text-to-text \
--num-concurrency 1,2,4,8 \
--batch-size 1,2,4 \
--dataset-path /path/to/your/dataset.json
```

### Environment Variables

You can also set the API key via an environment variable:

```bash
export TOGETHER_API_KEY=your_api_key_here
genai-bench benchmark \
--api-backend together \
--api-base https://api.together.xyz \
--api-model-name meta-llama/Llama-2-7b-chat-hf \
--task text-to-text \
# ... other options
```

### Supported Models

Together AI supports a wide range of models. Some popular options include:

- `meta-llama/Llama-2-7b-chat-hf`
- `meta-llama/Llama-2-13b-chat-hf`
- `meta-llama/Llama-2-70b-chat-hf`
- `mistralai/Mistral-7B-Instruct-v0.1`
- `togethercomputer/RedPajama-INCITE-Chat-3B-v1`
- And many more...

### Supported Tasks

- `text-to-text`: Standard chat completions
- `image-text-to-text`: Multimodal chat with images
- `text-to-embeddings`: Text embedding generation

## Implementation Details

### Files Added/Modified

1. **User Implementation**: `genai_bench/user/together_user.py`
- Implements `TogetherUser` class extending `BaseUser`
- Supports chat completions and embeddings
- Handles streaming responses

2. **Authentication**: `genai_bench/auth/together/`
- `auth.py`: Basic Together AI authentication
- `model_auth_adapter.py`: Adapter for model authentication (a sketch of the provider's shape follows this list)

3. **CLI Integration**:
- Added "together" to supported backends in `option_groups.py`
- Added together backend handling in `cli.py`
- Added TogetherUser to validation mapping
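
As a rough sketch of the auth provider's shape (the `get_headers` method here is an assumption for illustration; the actual class in `genai_bench/auth/together/auth.py` may differ):

```python
class TogetherAuth:
    """API-key auth for Together AI, mirroring the OpenAI auth provider pattern."""

    def __init__(self, api_key: str):
        self.api_key = api_key

    def get_headers(self) -> dict[str, str]:
        # Together's OpenAI-compatible endpoints use Bearer-token auth.
        return {"Authorization": f"Bearer {self.api_key}"}


# Hypothetical usage via the factory (see genai_bench/auth/factory.py):
# auth = AuthFactory.create_together_auth(os.environ["TOGETHER_API_KEY"])
```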

### API Compatibility

The Together AI backend uses OpenAI-compatible API endpoints:
- Chat completions: `/v1/chat/completions`
- Embeddings: `/v1/embeddings`

This ensures compatibility with existing benchmarking scenarios and metrics collection.
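
For reference, a minimal raw call against the compatible endpoint (a sketch using `requests`; genai-bench drives these endpoints through its own user classes):

```python
import os

import requests

resp = requests.post(
    "https://api.together.xyz/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Say hello."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```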

## Example Commands

### Text-to-Text Benchmarking

```bash
genai-bench benchmark \
--api-backend together \
--api-base https://api.together.xyz \
--api-key $TOGETHER_API_KEY \
--api-model-name meta-llama/Llama-2-7b-chat-hf \
--task text-to-text \
--num-concurrency 1,2,4,8,16 \
--batch-size 1,2,4,8 \
--dataset-path examples/dataset_configs/huggingface_simple.json
```

### Embeddings Benchmarking

```bash
genai-bench benchmark \
--api-backend together \
--api-base https://api.together.xyz \
--api-key $TOGETHER_API_KEY \
--api-model-name togethercomputer/RedPajama-INCITE-Chat-3B-v1 \
--task text-to-embeddings \
--num-concurrency 1,2,4,8 \
--batch-size 1,2,4,8 \
--dataset-path examples/dataset_configs/huggingface_simple.json
```

### Multimodal Benchmarking

```bash
genai-bench benchmark \
--api-backend together \
--api-base https://api.together.xyz \
--api-key $TOGETHER_API_KEY \
--api-model-name meta-llama/Llama-2-7b-chat-hf \
--task image-text-to-text \
--num-concurrency 1,2,4 \
--batch-size 1,2 \
--dataset-path examples/dataset_configs/config_llava-bench-in-the-wild.json
```

## Notes

- The Together AI backend requires a valid API key from [Together AI](https://together.ai)
- All standard genai-bench features are supported (metrics collection, reporting, etc.)
- The implementation follows the same patterns as other backends for consistency
- Streaming responses are fully supported for accurate latency measurements (see the TTFT sketch below)
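
To illustrate why streaming matters for latency metrics, here is a sketch of measuring time-to-first-token over the SSE stream (illustrative; genai-bench records TTFT internally):

```python
import os
import time

import requests

start = time.monotonic()
ttft = None
with requests.post(
    "https://api.together.xyz/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Tell me a short story."}],
        "stream": True,
    },
    stream=True,
    timeout=120,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-style SSE: each event line starts with "data: ".
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        if ttft is None:
            ttft = time.monotonic() - start  # first streamed token chunk
print(f"TTFT: {ttft:.3f}s" if ttft else "no tokens received")
```
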
39 changes: 39 additions & 0 deletions docs/examples/index.md
@@ -3,6 +3,27 @@
This section provides practical examples and configurations for GenAI Bench.

## Quick Examples

### Open-loop QPS (non-Locust) — tore-speed style

Use an open-loop arrival process that schedules requests by inter-arrival times.

```bash
genai-bench benchmark \
--non-locust \
--qps-level 0.1 --qps-level 0.3 \
--qps-distribution uniform \
--max-requests-per-run 1500 --max-time-per-run 2 \
--api-backend together --api-base https://api.together.xyz \
--api-model-name <model> --model-tokenizer <hf-tokenizer> \
--task text-to-text \
--traffic-scenario "D(10000,825)" \
--synthetic --synthetic-cached-input-length 3000
```

Notes:
- `--max-time-per-run` is in minutes (floats allowed); internally converted to seconds. It also drives the open-loop schedule duration per level.
- Arrival rate (QPS) sets the schedule; completion-based metrics (RPS) reflect how many requests finished within the window (a quick arithmetic check follows below).
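
As a back-of-envelope check (plain arithmetic, not a genai-bench API), the planned arrival count per level is QPS multiplied by the window length:

```python
qps_levels = [0.1, 0.3]                 # --qps-level values
max_time_per_run_min = 2.0              # --max-time-per-run (minutes)
duration_s = max_time_per_run_min * 60  # converted to seconds internally
for qps in qps_levels:
    print(f"QPS {qps}: ~{qps * duration_s:.0f} planned arrivals over {duration_s:.0f}s")
```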


### OpenAI GPT-4 Benchmark

@@ -75,6 +96,24 @@ GenAI Bench supports various traffic patterns:
- `N(480,240)/(300,150)` - Normal distribution
- `U(50,100)/(200,250)` - Uniform distribution

### Synthetic Tore-style Prompts

To mimic tore-speed’s synthetic dataset with a cached prefix and exact token counts:

```bash
genai-bench benchmark \
--api-backend together \
--api-base https://api.together.xyz \
--api-model-name <model> \
--model-tokenizer <hf-tokenizer> \
--task text-to-text \
--traffic-scenario "D(10000,825)" \
--synthetic --synthetic-cached-input-length 3000 \
--max-requests-per-run 1500 --max-time-per-run 2
```

This constructs prompts with a leading 3000-token cacheable region and a unique uncached suffix, matching tore-speed synthetic behavior.

### Embedding Scenarios

- `E(64)` - 64 tokens per document
36 changes: 19 additions & 17 deletions genai_bench/analysis/excel_report.py
@@ -107,8 +107,9 @@ def _create_sheet_with_common_layout(
sheet.append(row)
num_rows += 1

-# Merge GPU Type column cells
-merge_cells(sheet, 2, num_rows, 1)
+# Merge GPU Type column cells only when there is at least one data row
+if num_rows >= 2:
+    merge_cells(sheet, 2, num_rows, 1)

apply_number_format(sheet, exclude_columns=["A", "B", "C"])
column_width_autofit(sheet)
@@ -418,9 +419,9 @@ def create_aggregated_metrics_sheet(
metrics: AggregatedMetrics = run_data[scenario][iteration][ # type: ignore[call-overload, assignment]
"aggregated_metrics"
]
-assert isinstance(
-    metrics, AggregatedMetrics
-), f"Expected AggregatedMetrics, got {type(metrics)}"
+assert isinstance(metrics, AggregatedMetrics), (
+    f"Expected AggregatedMetrics, got {type(metrics)}"
+)
metrics_dict = metrics.model_dump()
row = []
for field in metadata_headers:
@@ -490,18 +491,19 @@ def create_single_request_metrics_sheet(
sheet.append(row)
rows_for_scenario += 1
row_for_iteration += 1
-merge_cells(
-    sheet,
-    start_row_iteration,
-    row_for_iteration + start_row_iteration - 1,
-    1,
-)
-merge_cells(
-    sheet,
-    start_row_iteration,
-    row_for_iteration + start_row_iteration - 1,
-    2,
-)
+if row_for_iteration >= 1:
+    merge_cells(
+        sheet,
+        start_row_iteration,
+        row_for_iteration + start_row_iteration - 1,
+        1,
+    )
+    merge_cells(
+        sheet,
+        start_row_iteration,
+        row_for_iteration + start_row_iteration - 1,
+        2,
+    )
start_row_iteration += row_for_iteration

start_row += rows_for_scenario
12 changes: 6 additions & 6 deletions genai_bench/analysis/flexible_plot_report.py
@@ -889,7 +889,7 @@ def validate_plot_config_with_data(
plot_spec.x_field,
sample_agg_metrics, # type: ignore[arg-type]
):
errors.append(f"Plot {i+1}: Invalid x_field '{plot_spec.x_field}'")
errors.append(f"Plot {i + 1}: Invalid x_field '{plot_spec.x_field}'")

# Validate Y field paths (single or multiple)
try:
@@ -901,22 +901,22 @@
):
if len(y_field_specs) == 1:
errors.append(
f"Plot {i+1}: Invalid y_field '{y_field_spec.field}'"
f"Plot {i + 1}: Invalid y_field '{y_field_spec.field}'"
)
else:
errors.append(
f"Plot {i+1}: Invalid y_fields[{j}] '{y_field_spec.field}'"
f"Plot {i + 1}: Invalid y_fields[{j}] '{y_field_spec.field}'"
)
except Exception as e:
errors.append(f"Plot {i+1}: Error validating Y-fields: {e}")
errors.append(f"Plot {i + 1}: Error validating Y-fields: {e}")

# Validate position bounds
layout = config.layout
row, col = plot_spec.position
if row >= layout.rows or col >= layout.cols:
errors.append(
f"Plot {i+1}: Position ({row}, {col}) exceeds layout bounds "
f"({layout.rows-1}, {layout.cols-1})"
f"Plot {i + 1}: Position ({row}, {col}) exceeds layout bounds "
f"({layout.rows - 1}, {layout.cols - 1})"
)

return errors
13 changes: 13 additions & 0 deletions genai_bench/auth/factory.py
@@ -8,6 +8,7 @@
from genai_bench.auth.oci.session import OCISessionAuth
from genai_bench.auth.oci.user_principal import OCIUserPrincipalAuth
from genai_bench.auth.openai.auth import OpenAIAuth
from genai_bench.auth.together.auth import TogetherAuth


class AuthFactory:
@@ -25,6 +26,18 @@ def create_openai_auth(api_key: str) -> OpenAIAuth:
"""
return OpenAIAuth(api_key=api_key)

@staticmethod
def create_together_auth(api_key: str) -> TogetherAuth:
"""Create Together authentication provider.

Args:
api_key (str): Together API key

Returns:
TogetherAuth: Together AI auth provider
"""
return TogetherAuth(api_key=api_key)

@staticmethod
def create_oci_auth(
auth_type: str,
Empty file.