1 change: 1 addition & 0 deletions .gitignore
@@ -47,3 +47,4 @@ genai_bench*.log
# MkDocs
site/
.cache/
together_text-to-text_*/
53 changes: 53 additions & 0 deletions README.md
@@ -40,10 +40,63 @@ It provides detailed insights into model serving performance, offering both a us
- 📝 **Rich Logs**: Automatically flushed to both terminal and file upon experiment completion.
- 📈 **Experiment Analyzer**: Generates comprehensive Excel reports with pricing and raw metrics data, plus flexible plot configurations (default 2x4 grid) that visualize key performance metrics including throughput, latency (TTFT, E2E, TPOT), error rates, and RPS across different traffic scenarios and concurrency levels. Supports custom plot layouts and multi-line comparisons.

- 🧪 **Synthetic Tore-style prompts (optional)**: Generate synthetic requests that mimic tore-speed’s synthetic dataset prep, including a cached prefix region and exact input/output token counts for precise performance experiments.

### Open-loop QPS mode (non-Locust)

- Enable with `--non-locust` to use an open-loop arrival process (tore-speed style). Arrivals are scheduled globally by inter-arrival intervals; completions may lag depending on server speed.
- Use `--qps-level` (repeatable; floats allowed) to specify QPS levels and `--qps-distribution` (uniform|exponential|constant) for inter-arrival sampling.
- The duration of each QPS level comes from `--max-time-per-run` (in minutes; floats allowed), which is converted to seconds internally.
- Example (tore-speed-compatible synthetic run):

```bash
genai-bench benchmark \
--non-locust \
--qps-level 0.1 --qps-level 0.3 \
--qps-distribution uniform \
--max-requests-per-run 1500 --max-time-per-run 2 \
--api-backend together --api-base https://api.together.xyz \
--api-model-name <model> --model-tokenizer <hf-tokenizer> \
--task text-to-text \
--traffic-scenario "D(10000,825)" \
--synthetic --synthetic-cached-input-length 3000
```

Notes:
- Arrival rate (QPS) is the planned schedule; observed RPS depends on how many requests complete within the time window (a scheduler sketch follows the notes).
- In synthetic mode, dataset file loading is skipped; prompts are constructed to exact token counts with a cached prefix region matching tore-speed semantics.
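
A minimal sketch of the open-loop idea (illustrative only: `submit_request_async` is a hypothetical dispatcher, and the uniform branch assumes a 0 to 2x mean-gap window; neither is the genai-bench implementation):

```python
import random
import time


def inter_arrival(qps: float, distribution: str) -> float:
    """Sample one inter-arrival gap in seconds for a given QPS level."""
    mean_gap = 1.0 / qps
    if distribution == "constant":
        return mean_gap
    if distribution == "uniform":
        return random.uniform(0.0, 2.0 * mean_gap)  # mean gap is still 1/qps
    if distribution == "exponential":
        return random.expovariate(qps)  # Poisson arrival process
    raise ValueError(f"unknown distribution: {distribution}")


def submit_request_async() -> None:
    """Placeholder for a non-blocking request dispatch (thread/task/async)."""


qps, duration_s = 0.3, 120  # e.g. --qps-level 0.3, --max-time-per-run 2 (minutes)
start = time.monotonic()
while time.monotonic() - start < duration_s:
    # Open loop: the next send is never delayed by in-flight requests.
    submit_request_async()
    time.sleep(inter_arrival(qps, distribution="uniform"))
```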

## How to Start

Please check [User Guide](https://docs.sglang.ai/genai-bench/user-guide/) and [CONTRIBUTING.md](https://docs.sglang.ai/genai-bench/development/contributing/) for how to install and use genai-bench.

### Synthetic data mode (tore-speed compatible)

Genai-bench can synthesize prompts in the style of tore-speed’s `--dataset_type synthetic`, with a fixed-size cached prefix and exact token counts enforced at the tokenizer level.

- Enable with the `--synthetic` flag and provide a deterministic traffic scenario for input/output tokens (e.g., `D(10000,825)`).
- Specify the cached prefix size (in tokens) with `--synthetic-cached-input-length`.

Example (concurrency mode):

```bash
genai-bench benchmark \
--api-backend together \
--api-base https://api.together.xyz \
--api-model-name <model> \
--model-tokenizer <hf-tokenizer> \
--task text-to-text \
--traffic-scenario "D(10000,825)" \
--max-requests-per-run 1500 --max-time-per-run 2 \
--num-concurrency 128 --spawn-rate 128 \
--synthetic --synthetic-cached-input-length 3000 \
--additional-request-params '{"stream": true}'
```

Notes:
- The sampler ensures the prompt contains exactly the requested number of input tokens. The leading `--synthetic-cached-input-length` tokens are filled with a repeated base phrase to emulate a cacheable prefix; a unique marker and a long instruction are appended to the uncached suffix region.
- This is useful for cache stress tests and apples-to-apples comparisons with tore-speed’s synthetic mode (a rough sketch of the construction follows below).
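
A rough sketch of that layout, assuming a Hugging Face tokenizer (the helper names and base phrase are illustrative, not the actual sampler, and exact counts can shift slightly when a decoded string is re-encoded):

```python
from transformers import AutoTokenizer


def tile(ids: list[int], n: int) -> list[int]:
    """Tile a token-id list to exactly n tokens."""
    return (ids * (n // len(ids) + 1))[:n]


def build_synthetic_prompt(tok, input_tokens: int, cached_tokens: int, request_id: int) -> str:
    base = tok.encode("hello world. ", add_special_tokens=False)
    prefix = tile(base, cached_tokens)  # shared, cacheable region
    # A unique marker makes each request's suffix uncacheable.
    marker = tok.encode(f" request {request_id}: ", add_special_tokens=False)
    suffix = tile(base, input_tokens - cached_tokens - len(marker))
    ids = prefix + marker + suffix
    assert len(ids) == input_tokens
    return tok.decode(ids)


tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for <hf-tokenizer>
prompt = build_synthetic_prompt(tok, input_tokens=10000, cached_tokens=3000, request_id=0)
```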

## Benchmark Metrics Definition

This section collects the standard metrics required for LLM serving performance analysis. We classify metrics into two types: **single-request-level metrics**, collected from a single request, and **aggregated-level metrics**, which summarize the single-request metrics from one run (with a specific traffic scenario and concurrency level).
138 changes: 138 additions & 0 deletions TOGETHER_AI_INTEGRATION.md
@@ -0,0 +1,138 @@
# Together AI Integration

This document describes the Together AI backend integration for genai-bench.

## Overview

The Together AI backend has been fully integrated into genai-bench, allowing you to benchmark models hosted on Together AI's platform.

## Features

- **Chat Completions**: Support for text-to-text and image-text-to-text tasks
- **Embeddings**: Support for text-to-embeddings tasks
- **Streaming**: Full support for streaming responses
- **Authentication**: API key-based authentication

## Usage

### Basic Usage

```bash
genai-bench benchmark \
--api-backend together \
--api-base https://api.together.xyz \
--api-key YOUR_TOGETHER_API_KEY \
--api-model-name meta-llama/Llama-2-7b-chat-hf \
--task text-to-text \
--num-concurrency 1,2,4,8 \
--batch-size 1,2,4 \
--dataset-path /path/to/your/dataset.json
```

### Environment Variables

You can also set the API key via an environment variable:

```bash
export TOGETHER_API_KEY=your_api_key_here
genai-bench benchmark \
--api-backend together \
--api-base https://api.together.xyz \
--api-model-name meta-llama/Llama-2-7b-chat-hf \
--task text-to-text \
# ... other options
```

### Supported Models

Together AI supports a wide range of models. Some popular options include:

- `meta-llama/Llama-2-7b-chat-hf`
- `meta-llama/Llama-2-13b-chat-hf`
- `meta-llama/Llama-2-70b-chat-hf`
- `mistralai/Mistral-7B-Instruct-v0.1`
- `togethercomputer/RedPajama-INCITE-Chat-3B-v1`
- And many more...

### Supported Tasks

- `text-to-text`: Standard chat completions
- `image-text-to-text`: Multimodal chat with images
- `text-to-embeddings`: Text embedding generation

## Implementation Details

### Files Added/Modified

1. **User Implementation**: `genai_bench/user/together_user.py`
- Implements `TogetherUser` class extending `BaseUser`
- Supports chat completions and embeddings
- Handles streaming responses

2. **Authentication**: `genai_bench/auth/together/`
- `auth.py`: Basic Together AI authentication
- `model_auth_adapter.py`: Adapter for model authentication (a sketch of the provider's shape follows this list)

3. **CLI Integration**:
- Added "together" to supported backends in `option_groups.py`
- Added together backend handling in `cli.py`
- Added TogetherUser to validation mapping
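
As a rough sketch of the auth provider's shape (the `get_headers` method here is an assumption for illustration; the actual class in `genai_bench/auth/together/auth.py` may differ):

```python
class TogetherAuth:
    """API-key auth for Together AI, mirroring the OpenAI auth provider pattern."""

    def __init__(self, api_key: str):
        self.api_key = api_key

    def get_headers(self) -> dict[str, str]:
        # Together's OpenAI-compatible endpoints use Bearer-token auth.
        return {"Authorization": f"Bearer {self.api_key}"}


# Hypothetical usage via the factory (see genai_bench/auth/factory.py):
# auth = AuthFactory.create_together_auth(os.environ["TOGETHER_API_KEY"])
```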

### API Compatibility

The Together AI backend uses OpenAI-compatible API endpoints:
- Chat completions: `/v1/chat/completions`
- Embeddings: `/v1/embeddings`

This ensures compatibility with existing benchmarking scenarios and metrics collection.
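
For reference, a minimal raw call against the compatible endpoint (a sketch using `requests`; genai-bench drives these endpoints through its own user classes):

```python
import os

import requests

resp = requests.post(
    "https://api.together.xyz/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Say hello."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```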

## Example Commands

### Text-to-Text Benchmarking

```bash
genai-bench benchmark \
--api-backend together \
--api-base https://api.together.xyz \
--api-key $TOGETHER_API_KEY \
--api-model-name meta-llama/Llama-2-7b-chat-hf \
--task text-to-text \
--num-concurrency 1,2,4,8,16 \
--batch-size 1,2,4,8 \
--dataset-path examples/dataset_configs/huggingface_simple.json
```

### Embeddings Benchmarking

```bash
genai-bench benchmark \
--api-backend together \
--api-base https://api.together.xyz \
--api-key $TOGETHER_API_KEY \
--api-model-name togethercomputer/RedPajama-INCITE-Chat-3B-v1 \
--task text-to-embeddings \
--num-concurrency 1,2,4,8 \
--batch-size 1,2,4,8 \
--dataset-path examples/dataset_configs/huggingface_simple.json
```

### Multimodal Benchmarking

```bash
genai-bench benchmark \
--api-backend together \
--api-base https://api.together.xyz \
--api-key $TOGETHER_API_KEY \
--api-model-name meta-llama/Llama-2-7b-chat-hf \
--task image-text-to-text \
--num-concurrency 1,2,4 \
--batch-size 1,2 \
--dataset-path examples/dataset_configs/config_llava-bench-in-the-wild.json
```

## Notes

- The Together AI backend requires a valid API key from [Together AI](https://together.ai)
- All standard genai-bench features are supported (metrics collection, reporting, etc.)
- The implementation follows the same patterns as other backends for consistency
- Streaming responses are fully supported for accurate latency measurements (see the TTFT sketch below)
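
To illustrate why streaming matters for latency metrics, here is a sketch of measuring time-to-first-token over the SSE stream (illustrative; genai-bench records TTFT internally):

```python
import os
import time

import requests

start = time.monotonic()
ttft = None
with requests.post(
    "https://api.together.xyz/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Tell me a short story."}],
        "stream": True,
    },
    stream=True,
    timeout=120,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-style SSE: each event line starts with "data: ".
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        if ttft is None:
            ttft = time.monotonic() - start  # first streamed token chunk
print(f"TTFT: {ttft:.3f}s" if ttft else "no tokens received")
```
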
39 changes: 39 additions & 0 deletions docs/examples/index.md
@@ -3,6 +3,27 @@
This section provides practical examples and configurations for GenAI Bench.

## Quick Examples

### Open-loop QPS (non-Locust) — tore-speed style

Use an open-loop arrival process that schedules requests by inter-arrival times.

```bash
genai-bench benchmark \
--non-locust \
--qps-level 0.1 --qps-level 0.3 \
--qps-distribution uniform \
--max-requests-per-run 1500 --max-time-per-run 2 \
--api-backend together --api-base https://api.together.xyz \
--api-model-name <model> --model-tokenizer <hf-tokenizer> \
--task text-to-text \
--traffic-scenario "D(10000,825)" \
--synthetic --synthetic-cached-input-length 3000
```

Notes:
- `--max-time-per-run` is in minutes (floats allowed); internally converted to seconds. It also drives the open-loop schedule duration per level.
- Arrival rate (QPS) sets the schedule; completion-based metrics (RPS) reflect how many requests finished within the window (a quick arithmetic check follows below).
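
As a back-of-envelope check (plain arithmetic, not a genai-bench API), the planned arrival count per level is QPS multiplied by the window length:

```python
qps_levels = [0.1, 0.3]                 # --qps-level values
max_time_per_run_min = 2.0              # --max-time-per-run (minutes)
duration_s = max_time_per_run_min * 60  # converted to seconds internally
for qps in qps_levels:
    print(f"QPS {qps}: ~{qps * duration_s:.0f} planned arrivals over {duration_s:.0f}s")
```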


### OpenAI GPT-4 Benchmark

@@ -75,6 +96,24 @@ GenAI Bench supports various traffic patterns:
- `N(480,240)/(300,150)` - Normal distribution
- `U(50,100)/(200,250)` - Uniform distribution

### Synthetic Tore-style Prompts

To mimic tore-speed’s synthetic dataset with a cached prefix and exact token counts:

```bash
genai-bench benchmark \
--api-backend together \
--api-base https://api.together.xyz \
--api-model-name <model> \
--model-tokenizer <hf-tokenizer> \
--task text-to-text \
--traffic-scenario "D(10000,825)" \
--synthetic --synthetic-cached-input-length 3000 \
--max-requests-per-run 1500 --max-time-per-run 2
```

This constructs prompts with a leading 3000-token cacheable region and a unique uncached suffix, matching tore-speed synthetic behavior.

### Embedding Scenarios

- `E(64)` - 64 tokens per document
36 changes: 19 additions & 17 deletions genai_bench/analysis/excel_report.py
@@ -107,8 +107,9 @@ def _create_sheet_with_common_layout(
sheet.append(row)
num_rows += 1

-# Merge GPU Type column cells
-merge_cells(sheet, 2, num_rows, 1)
+# Merge GPU Type column cells only when there is at least one data row
+if num_rows >= 2:
+    merge_cells(sheet, 2, num_rows, 1)

apply_number_format(sheet, exclude_columns=["A", "B", "C"])
column_width_autofit(sheet)
@@ -418,9 +419,9 @@ def create_aggregated_metrics_sheet(
metrics: AggregatedMetrics = run_data[scenario][iteration][ # type: ignore[call-overload, assignment]
"aggregated_metrics"
]
-assert isinstance(
-    metrics, AggregatedMetrics
-), f"Expected AggregatedMetrics, got {type(metrics)}"
+assert isinstance(metrics, AggregatedMetrics), (
+    f"Expected AggregatedMetrics, got {type(metrics)}"
+)
metrics_dict = metrics.model_dump()
row = []
for field in metadata_headers:
@@ -490,18 +491,19 @@ def create_single_request_metrics_sheet(
sheet.append(row)
rows_for_scenario += 1
row_for_iteration += 1
-merge_cells(
-    sheet,
-    start_row_iteration,
-    row_for_iteration + start_row_iteration - 1,
-    1,
-)
-merge_cells(
-    sheet,
-    start_row_iteration,
-    row_for_iteration + start_row_iteration - 1,
-    2,
-)
+if row_for_iteration >= 1:
+    merge_cells(
+        sheet,
+        start_row_iteration,
+        row_for_iteration + start_row_iteration - 1,
+        1,
+    )
+    merge_cells(
+        sheet,
+        start_row_iteration,
+        row_for_iteration + start_row_iteration - 1,
+        2,
+    )
start_row_iteration += row_for_iteration

start_row += rows_for_scenario
12 changes: 6 additions & 6 deletions genai_bench/analysis/flexible_plot_report.py
@@ -889,7 +889,7 @@ def validate_plot_config_with_data(
plot_spec.x_field,
sample_agg_metrics, # type: ignore[arg-type]
):
errors.append(f"Plot {i+1}: Invalid x_field '{plot_spec.x_field}'")
errors.append(f"Plot {i + 1}: Invalid x_field '{plot_spec.x_field}'")

# Validate Y field paths (single or multiple)
try:
@@ -901,22 +901,22 @@
):
if len(y_field_specs) == 1:
errors.append(
f"Plot {i+1}: Invalid y_field '{y_field_spec.field}'"
f"Plot {i + 1}: Invalid y_field '{y_field_spec.field}'"
)
else:
errors.append(
f"Plot {i+1}: Invalid y_fields[{j}] '{y_field_spec.field}'"
f"Plot {i + 1}: Invalid y_fields[{j}] '{y_field_spec.field}'"
)
except Exception as e:
errors.append(f"Plot {i+1}: Error validating Y-fields: {e}")
errors.append(f"Plot {i + 1}: Error validating Y-fields: {e}")

# Validate position bounds
layout = config.layout
row, col = plot_spec.position
if row >= layout.rows or col >= layout.cols:
errors.append(
f"Plot {i+1}: Position ({row}, {col}) exceeds layout bounds "
f"({layout.rows-1}, {layout.cols-1})"
f"Plot {i + 1}: Position ({row}, {col}) exceeds layout bounds "
f"({layout.rows - 1}, {layout.cols - 1})"
)

return errors
13 changes: 13 additions & 0 deletions genai_bench/auth/factory.py
@@ -8,6 +8,7 @@
from genai_bench.auth.oci.session import OCISessionAuth
from genai_bench.auth.oci.user_principal import OCIUserPrincipalAuth
from genai_bench.auth.openai.auth import OpenAIAuth
from genai_bench.auth.together.auth import TogetherAuth


class AuthFactory:
@@ -25,6 +26,18 @@ def create_openai_auth(api_key: str) -> OpenAIAuth:
"""
return OpenAIAuth(api_key=api_key)

@staticmethod
def create_together_auth(api_key: str) -> TogetherAuth:
"""Create Together authentication provider.

Args:
api_key (str): Together API key

Returns:
TogetherAuth: Together AI auth provider
"""
return TogetherAuth(api_key=api_key)

@staticmethod
def create_oci_auth(
auth_type: str,
Empty file.