
Commit 9065116

feat(dataset): add spec_al_* acceptance-length benchmark datasets (#1046)
Signed-off-by: Aaron Batilo <AaronBatilo@gmail.com>
1 parent 0ce4f12 commit 9065116

9 files changed

Lines changed: 441 additions & 18 deletions


docs/cli-options.md

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -359,7 +359,7 @@ Path to file or directory containing benchmark dataset. Required when using `--c
359359
#### `--public-dataset` `<str>`
360360

361361
Pre-configured public dataset to download and use for benchmarking (e.g., `sharegpt`). AIPerf automatically downloads and parses these datasets. Mutually exclusive with `--custom-dataset-type`. Run `aiperf plugins public_dataset_loader` to list available datasets. Use `--hf-subset` to override the HuggingFace subset/config for HF-backed datasets.
362-
<br/>_Choices: [`sharegpt`, `aimo`, `mmstar`, `mmvu`, `vision_arena`, `llava_onevision`, `aimo_aime`, `aimo_numina_cot`, `aimo_numina_1_5`, `spec_bench`, `instruct_coder`, `blazedit_5k`, `blazedit_10k`, `librispeech`, `voxpopuli`, `gigaspeech`, `ami`, `spgispeech`]_
362+
<br/>_Choices: [`sharegpt`, `aimo`, `mmstar`, `mmvu`, `vision_arena`, `llava_onevision`, `aimo_aime`, `aimo_numina_cot`, `aimo_numina_1_5`, `spec_bench`, `spec_al_gsm8k`, `spec_al_math500`, `spec_al_humaneval`, `spec_al_mbpp`, `spec_al_mtbench`, `instruct_coder`, `blazedit_5k`, `blazedit_10k`, `librispeech`, `voxpopuli`, `gigaspeech`, `ami`, `spgispeech`]_
363363

364364
#### `--hf-subset` `<str>`
365365

@@ -1688,7 +1688,7 @@ Path to file or directory containing benchmark dataset. Required when using `--c
16881688
#### `--public-dataset` `<str>`
16891689

16901690
Pre-configured public dataset to download and use for benchmarking (e.g., `sharegpt`). AIPerf automatically downloads and parses these datasets. Mutually exclusive with `--custom-dataset-type`. Run `aiperf plugins public_dataset_loader` to list available datasets. Use `--hf-subset` to override the HuggingFace subset/config for HF-backed datasets.
1691-
<br/>_Choices: [`sharegpt`, `aimo`, `mmstar`, `mmvu`, `vision_arena`, `llava_onevision`, `aimo_aime`, `aimo_numina_cot`, `aimo_numina_1_5`, `spec_bench`, `instruct_coder`, `blazedit_5k`, `blazedit_10k`, `librispeech`, `voxpopuli`, `gigaspeech`, `ami`, `spgispeech`]_
1691+
<br/>_Choices: [`sharegpt`, `aimo`, `mmstar`, `mmvu`, `vision_arena`, `llava_onevision`, `aimo_aime`, `aimo_numina_cot`, `aimo_numina_1_5`, `spec_bench`, `spec_al_gsm8k`, `spec_al_math500`, `spec_al_humaneval`, `spec_al_mbpp`, `spec_al_mtbench`, `instruct_coder`, `blazedit_5k`, `blazedit_10k`, `librispeech`, `voxpopuli`, `gigaspeech`, `ami`, `spgispeech`]_
16921692

16931693
#### `--hf-subset` `<str>`
16941694

docs/tutorials/speed-bench.md

Lines changed: 82 additions & 1 deletion
@@ -187,7 +187,7 @@ aiperf speed-bench-report ./artifacts/ --format both
187187
This produces a CSV (`speed_bench_report.csv`) and console table:
188188

189189
```
190-
SPEED-Bench Acceptance Length Report
190+
Acceptance Length Report
191191
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
192192
┃ Model ┃ coding ┃ humanities ┃ math ┃ writing ┃ Overall ┃
193193
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
@@ -209,6 +209,87 @@ aiperf speed-bench-report ./artifacts/ --metric throughput
209209

210210
---
211211

212+
## Literature Acceptance-Length Datasets (GSM8K, MT-Bench, MATH-500, HumanEval, MBPP)
213+
214+
The speculative-decoding literature overwhelmingly reports acceptance length against five standard benchmarks. AIPerf registers each as a public dataset that is auto-downloaded from HuggingFace at runtime, so there is no prepare-data step: just select one with `--public-dataset` and run the same `aiperf speed-bench-report` workflow shown above.
215+
216+
| Dataset Name | HuggingFace Source | Prompts | Turns | License |
217+
|---|---|---|---|---|
218+
| `spec_al_gsm8k` | `openai/gsm8k` (`main`, `test`) | 1,319 | single | MIT |
219+
| `spec_al_math500` | `HuggingFaceH4/MATH-500` (`test`) | 500 | single | MIT |
220+
| `spec_al_humaneval` | `openai/openai_humaneval` (`test`) | 164 | single | MIT |
221+
| `spec_al_mbpp` | `google-research-datasets/mbpp` (`full`, `test`) | 500 | single | CC-BY-4.0 |
222+
| `spec_al_mtbench` | `HuggingFaceH4/mt_bench_prompts` (`train`) | 80 | two-turn | Apache-2.0 |
223+
224+
Prompts are emitted verbatim (the raw question/problem/prompt field); the served model's chat template wraps them at request time via `--endpoint-type chat`. HumanEval and MBPP are text-completion tasks in the spec-decode literature, so chat-wrapping them keeps the matrix uniform but shifts their acceptance length somewhat from the papers' headline numbers. Acceptance length does not depend on answer correctness, but it does depend on the sampling settings, so use greedy decoding (`--extra-inputs temperature:0`) to match the headline numbers reported in the literature. `--osl` does not apply to public datasets; cap generation with `--extra-inputs max_tokens:N` instead. `spec_al_mtbench` is multi-turn: AIPerf dispatches both turns per session and feeds the live assistant reply back as conversation history between them. Size it with `--num-conversations` rather than `--request-count` (see below).
225+
226+
### Run All Five with a Matrix Report
227+
228+
```bash
229+
MODEL="meta/llama-3.1-8b-instruct"
230+
ART=./artifacts/spec-al # dedicated root so this matrix never merges with speed_bench_* runs
231+
232+
# Single-turn datasets: size each run to the full dataset with --request-count.
233+
for pair in spec_al_gsm8k:1319 spec_al_math500:500 spec_al_humaneval:164 spec_al_mbpp:500; do
234+
ds="${pair%%:*}"; count="${pair##*:}"
235+
echo "=== Running dataset: $ds ($count requests) ==="
236+
aiperf profile \
237+
--model "$MODEL" \
238+
--endpoint-type chat \
239+
--streaming \
240+
--url localhost:8000 \
241+
--public-dataset "$ds" \
242+
--server-metrics http://localhost:8000/metrics \
243+
--request-count "$count" \
244+
--extra-inputs temperature:0 max_tokens:4096 \
245+
--concurrency 16 \
246+
--output-artifact-dir "$ART/$ds"
247+
done
248+
249+
# MT-Bench is multi-turn (80 two-turn conversations). Size it with
250+
# --num-conversations so every session runs exactly once; --request-count
251+
# recycles the 80 sessions to reach the count and would dispatch each prompt
252+
# more than once.
253+
aiperf profile \
254+
--model "$MODEL" \
255+
--endpoint-type chat \
256+
--streaming \
257+
--url localhost:8000 \
258+
--public-dataset spec_al_mtbench \
259+
--server-metrics http://localhost:8000/metrics \
260+
--num-conversations 80 \
261+
--extra-inputs temperature:0 max_tokens:4096 \
262+
--concurrency 16 \
263+
--output-artifact-dir "$ART/spec_al_mtbench"
264+
265+
# Assemble the acceptance-length matrix (one column per dataset)
266+
aiperf speed-bench-report "$ART" --metric accept_length --format both
267+
```
268+
269+
> Size each run to the full dataset — without an explicit count AIPerf defaults
270+
> to 10 requests. Single-turn datasets use `--request-count`; the multi-turn
271+
> `spec_al_mtbench` uses `--num-conversations 80` (one run per conversation),
272+
> since `--request-count` recycles its 80 sessions to reach the count. Cap
273+
> generation with `--extra-inputs max_tokens:N` (`--osl` is ignored for public
274+
> datasets), and keep these runs in their own artifacts directory so
275+
> `speed-bench-report` does not average them into an unrelated `speed_bench_*`
276+
> matrix.
277+
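The sizing rule in the note above can be sketched in Python as a small command builder. This is only an illustration under stated assumptions: `DATASETS`, `sizing_args`, and `profile_command` are hypothetical helper names that are not part of AIPerf, the prompt counts come from the dataset table, and the flags are the ones used in the bash example.

```python
from __future__ import annotations

import shlex

# (prompt count, is multi-turn), taken from the dataset table in this tutorial.
DATASETS: dict[str, tuple[int, bool]] = {
    "spec_al_gsm8k": (1319, False),
    "spec_al_math500": (500, False),
    "spec_al_humaneval": (164, False),
    "spec_al_mbpp": (500, False),
    "spec_al_mtbench": (80, True),
}


def sizing_args(dataset: str) -> list[str]:
    """Pick the sizing flag: conversations for multi-turn, requests otherwise."""
    count, multi_turn = DATASETS[dataset]
    flag = "--num-conversations" if multi_turn else "--request-count"
    return [flag, str(count)]


def profile_command(dataset: str, model: str, artifact_root: str) -> list[str]:
    """Build the argv for one run (hypothetical helper mirroring the bash loop)."""
    return [
        "aiperf", "profile",
        "--model", model,
        "--endpoint-type", "chat",
        "--streaming",
        "--url", "localhost:8000",
        "--public-dataset", dataset,
        "--server-metrics", "http://localhost:8000/metrics",
        *sizing_args(dataset),
        "--extra-inputs", "temperature:0", "max_tokens:4096",
        "--concurrency", "16",
        "--output-artifact-dir", f"{artifact_root}/{dataset}",
    ]


if __name__ == "__main__":
    for name in DATASETS:
        cmd = profile_command(name, "meta/llama-3.1-8b-instruct", "./artifacts/spec-al")
        print(shlex.join(cmd))
```

Keeping the count and the turn structure in one table means the multi-turn case cannot accidentally be sized with `--request-count`.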
278+
The report recognizes these runs the same way it recognizes the `speed_bench_*` runs, producing one matrix column per dataset:
279+
280+
```
281+
Acceptance Length Report
282+
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┓
283+
┃ Model ┃ gsm8k ┃ math500 ┃ mtbench ┃ humaneval ┃ mbpp ┃ Overall ┃
284+
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━┩
285+
│ meta/llama-3.1-8b-instruct │ 2.40 │ 2.31 │ 1.95 │ 2.62 │ 2.55 │ 2.37 │
286+
└────────────────────────────┴───────┴─────────┴─────────┴───────────┴──────┴─────────┘
287+
```
288+
289+
The `accept_rate` and `throughput` metrics work identically (`aiperf speed-bench-report ./artifacts/spec-al --metric accept_rate`).
290+
291+
---
292+
212293
## Profile with Aggregate Qualitative Split
213294

214295
To run all 880 prompts in a single benchmark (without per-category breakdown):

src/aiperf/analysis/speed_bench_report.py

Lines changed: 29 additions & 12 deletions
@@ -43,6 +43,15 @@ class SpeedBenchReportError(Exception):
4343

4444
THROUGHPUT_TIERS = ["low_entropy", "mixed", "high_entropy"]
4545

46+
# spec_al_* acceptance-length benchmarks, in a curated order so the report
47+
# columns read math -> chat -> code rather than alphabetically.
48+
SPEC_AL_BENCHMARKS = ["gsm8k", "math500", "mtbench", "humaneval", "mbpp"]
49+
50+
# Dataset-selector prefixes that mark an acceptance-length benchmark run. The
51+
# category is the selector value with the prefix stripped (e.g.
52+
# "speed_bench_coding" -> "coding", "spec_al_gsm8k" -> "gsm8k").
53+
CATEGORY_PREFIXES = ("speed_bench_", "spec_al_")
54+
4655
# Server metric names that represent acceptance length, in priority order.
4756
# Different engines expose this under different names.
4857
ACCEPT_LENGTH_METRICS = [
@@ -108,11 +117,14 @@ def load_server_metrics(run_dir: Path) -> dict | None:
108117

109118

110119
def extract_category(profile: dict) -> str | None:
111-
"""Extract the SPEED-Bench category from the input config.
112-
113-
The exporter writes ``input_config`` as a dump of the v2 ``BenchmarkConfig``,
114-
so the public dataset enum lives on ``datasets[].dataset``. Returns the
115-
suffix of the first entry whose value starts with ``speed_bench_``.
120+
"""Extract the acceptance-length benchmark category from the input config.
121+
122+
The exporter writes ``input_config`` as a dump of the v2 ``BenchmarkConfig``.
123+
Custom/file datasets (e.g. SPEED-Bench) serialize their selector under
124+
``datasets[].format``; public datasets (e.g. the spec_al_* HuggingFace
125+
benchmarks) serialize it under ``datasets[].dataset``. Returns the suffix of
126+
the first entry whose selector starts with a recognized prefix
127+
(see ``CATEGORY_PREFIXES``).
116128
"""
117129
try:
118130
datasets = profile["input_config"]["datasets"]
@@ -123,9 +135,12 @@ def extract_category(profile: dict) -> str | None:
123135
for entry in datasets:
124136
if not isinstance(entry, dict):
125137
continue
126-
name = entry.get("format")
127-
if isinstance(name, str) and name.startswith("speed_bench_"):
128-
return name.removeprefix("speed_bench_")
138+
name = entry.get("format") or entry.get("dataset")
139+
if not isinstance(name, str):
140+
continue
141+
for prefix in CATEGORY_PREFIXES:
142+
if name.startswith(prefix):
143+
return name.removeprefix(prefix)
129144
return None
130145

131146

@@ -294,6 +309,8 @@ def detect_columns(results: dict[str, dict[str, float | None]]) -> list[str]:
294309
return [c for c in QUALITATIVE_CATEGORIES if c in all_cats]
295310
if all_cats <= set(THROUGHPUT_TIERS):
296311
return [c for c in THROUGHPUT_TIERS if c in all_cats]
312+
if all_cats <= set(SPEC_AL_BENCHMARKS):
313+
return [c for c in SPEC_AL_BENCHMARKS if c in all_cats]
297314
return sorted(all_cats)
298315

299316

@@ -331,12 +348,12 @@ def print_table(
331348
from rich.table import Table
332349

333350
title_map = {
334-
"accept_length": "SPEED-Bench Acceptance Length Report",
335-
"accept_rate": "SPEED-Bench Acceptance Rate Report",
336-
"throughput": "SPEED-Bench Throughput Report (tokens/sec)",
351+
"accept_length": "Acceptance Length Report",
352+
"accept_rate": "Acceptance Rate Report",
353+
"throughput": "Throughput Report (tokens/sec)",
337354
}
338355
table = Table(
339-
title=title_map.get(metric_type, "SPEED-Bench Report"),
356+
title=title_map.get(metric_type, "Speculative Decoding Report"),
340357
show_header=True,
341358
header_style="bold magenta",
342359
)
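The category extraction and column ordering added in this file can be exercised in isolation with the minimal sketch below. The two constants are copied from the diff; `category_of` and `order_columns` are hypothetical stand-ins, not the module's real functions. Two deliberate differences from the committed code: `category_of` checks the `format` and `dataset` keys independently (the commit uses `entry.get("format") or entry.get("dataset")`, where a non-matching `format` string would take precedence), and `order_columns` covers only the new `spec_al` branch of `detect_columns`.

```python
from __future__ import annotations

# Constants copied from the diff above.
SPEC_AL_BENCHMARKS = ["gsm8k", "math500", "mtbench", "humaneval", "mbpp"]
CATEGORY_PREFIXES = ("speed_bench_", "spec_al_")


def category_of(entry: dict) -> str | None:
    """Return the selector suffix for one datasets[] entry, or None."""
    # Variation on the committed code: inspect both keys on their own.
    for key in ("format", "dataset"):
        name = entry.get(key)
        if not isinstance(name, str):
            continue
        for prefix in CATEGORY_PREFIXES:
            if name.startswith(prefix):
                return name.removeprefix(prefix)  # Python 3.9+
    return None


def order_columns(categories: set[str]) -> list[str]:
    """Curated order when every category is a spec_al benchmark, else sorted."""
    if categories <= set(SPEC_AL_BENCHMARKS):
        return [c for c in SPEC_AL_BENCHMARKS if c in categories]
    return sorted(categories)


if __name__ == "__main__":
    entries = [
        {"dataset": "spec_al_mbpp"},
        {"format": "speed_bench_coding"},
        {"format": "single_turn", "dataset": "spec_al_gsm8k"},  # hypothetical entry
        {"dataset": "sharegpt"},
    ]
    print([category_of(e) for e in entries])  # ['mbpp', 'coding', 'gsm8k', None]
    print(order_columns({"mbpp", "gsm8k", "mtbench"}))  # ['gsm8k', 'mtbench', 'mbpp']
```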

src/aiperf/config/schema/aiperf-config.schema.json

Lines changed: 5 additions & 0 deletions
@@ -9479,6 +9479,11 @@
94799479
"aimo_numina_cot",
94809480
"aimo_numina_1_5",
94819481
"spec_bench",
9482+
"spec_al_gsm8k",
9483+
"spec_al_math500",
9484+
"spec_al_humaneval",
9485+
"spec_al_mbpp",
9486+
"spec_al_mtbench",
94829487
"instruct_coder",
94839488
"blazedit_5k",
94849489
"blazedit_10k",
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
1+
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2+
# SPDX-License-Identifier: Apache-2.0
3+
4+
from __future__ import annotations
5+
6+
from typing import Any
7+
8+
from aiperf.common.exceptions import DatasetLoaderError
9+
from aiperf.common.models import Conversation, Text, Turn
10+
from aiperf.dataset.loader.base_hf_dataset import BaseHFDatasetLoader
11+
12+
13+
class MTBenchDatasetLoader(BaseHFDatasetLoader):
14+
"""HuggingFace loader for MT-Bench prompts (HuggingFaceH4/mt_bench_prompts).
15+
16+
Each row's prompt column is a list of strings (one per user turn, usually
17+
two), so each row becomes one multi-turn Conversation of bare user Turns.
18+
AIPerf's UserSession dispatches the turns sequentially and, under the default
19+
DELTAS_WITHOUT_RESPONSES context mode, feeds the live assistant reply back as
20+
history between turns - the FastChat / Spec-Bench MT-Bench protocol.
21+
22+
Example plugins.yaml entry::
23+
24+
spec_al_mtbench:
25+
class: aiperf.dataset.loader.mt_bench:MTBenchDatasetLoader
26+
metadata:
27+
hf_dataset_name: HuggingFaceH4/mt_bench_prompts
28+
hf_split: train
29+
"""
30+
31+
# mt_bench_prompts stores the per-turn prompt list under this column.
32+
PROMPT_COLUMN = "prompt"
33+
34+
async def convert_to_conversations(
35+
self, data: dict[str, Any]
36+
) -> list[Conversation]:
37+
"""Convert each MT-Bench row into a multi-turn Conversation."""
38+
dataset = data["dataset"]
39+
conversations: list[Conversation] = []
40+
skipped = 0
41+
max_conversations = self._max_conversations()
42+
43+
column_validated = False
44+
for row in dataset:
45+
if (
46+
max_conversations is not None
47+
and len(conversations) >= max_conversations
48+
):
49+
break
50+
51+
if not column_validated:
52+
column_validated = True
53+
if self.PROMPT_COLUMN not in row:
54+
raise DatasetLoaderError(
55+
f"Column '{self.PROMPT_COLUMN}' not found in dataset "
56+
f"'{self.hf_dataset_name}'. Available columns: "
57+
f"{list(row.keys())}."
58+
)
59+
60+
turns_raw = row.get(self.PROMPT_COLUMN)
61+
if not isinstance(turns_raw, list):
62+
skipped += 1
63+
continue
64+
65+
conv_turns: list[Turn] = []
66+
for t in turns_raw:
67+
text = str(t).strip() if t else ""
68+
if text:
69+
conv_turns.append(Turn(texts=[Text(contents=[text])]))
70+
if not conv_turns:
71+
skipped += 1
72+
continue
73+
74+
conversations.append(
75+
Conversation(
76+
session_id=self.session_id_generator.next(),
77+
turns=conv_turns,
78+
)
79+
)
80+
81+
if skipped > 0 and not conversations:
82+
self.warning(
83+
f"All {skipped} rows skipped - no conversations loaded. "
84+
f"Check that '{self.PROMPT_COLUMN}' holds non-empty prompt lists."
85+
)
86+
self.debug(
87+
lambda: f"Converted {len(conversations)} MT-Bench rows (skipped {skipped})"
88+
)
89+
return conversations
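The row-to-conversation conversion above can be tried without AIPerf or HuggingFace installed. The following is a minimal sketch, assuming plain dictionaries for rows: `SketchConversation` and `rows_to_conversations` are hypothetical names that stand in for AIPerf's `Conversation`/`Turn` models and the loader method, session IDs are simple counters, and the first-row column validation is omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SketchConversation:
    """Stand-in for AIPerf's Conversation: one session with bare user turns."""

    session_id: str
    turns: list[str] = field(default_factory=list)


def rows_to_conversations(
    rows: list[dict],
    prompt_column: str = "prompt",
    max_conversations: int | None = None,
) -> tuple[list[SketchConversation], int]:
    """Turn each row's prompt list into one multi-turn conversation."""
    conversations: list[SketchConversation] = []
    skipped = 0
    for row in rows:
        if max_conversations is not None and len(conversations) >= max_conversations:
            break
        turns_raw = row.get(prompt_column)
        if not isinstance(turns_raw, list):
            skipped += 1  # the prompt column must hold a list of turn strings
            continue
        # Drop empty or whitespace-only turns, as the loader does.
        turns = [str(t).strip() for t in turns_raw if t and str(t).strip()]
        if not turns:
            skipped += 1
            continue
        conversations.append(
            SketchConversation(session_id=f"session-{len(conversations)}", turns=turns)
        )
    return conversations, skipped


if __name__ == "__main__":
    rows = [
        {"prompt": ["Write a haiku.", "  Now translate it. "]},
        {"prompt": "not a list"},
        {"prompt": ["", None]},
        {"prompt": ["Only one turn"]},
    ]
    convs, skipped = rows_to_conversations(rows)
    print(len(convs), skipped)  # 2 2
    print(convs[0].turns)  # ['Write a haiku.', 'Now translate it.']
```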

src/aiperf/plugin/enums.py

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@
6363

6464
PublicDatasetTypeStr: TypeAlias = str
6565
PublicDatasetType = plugins.create_enum(PluginType.PUBLIC_DATASET_LOADER, "PublicDatasetType", module=__name__)
66-
"""Dynamic enum for public dataset loader. Example: PublicDatasetType.AIMO, PublicDatasetType.LIBRISPEECH, PublicDatasetType.VOXPOPULI"""
66+
"""Dynamic enum for public dataset loader. Example: PublicDatasetType.AIMO, PublicDatasetType.MMSTAR, PublicDatasetType.VOXPOPULI"""
6767

6868
EndpointTypeStr: TypeAlias = str
6969
EndpointType = plugins.create_enum(PluginType.ENDPOINT, "EndpointType", module=__name__)

src/aiperf/plugin/plugins.yaml

Lines changed: 68 additions & 0 deletions
@@ -1754,6 +1754,74 @@ public_dataset_loader:
17541754
SpecBench speculative decoding benchmark dataset from GitHub (hemingkx).
17551755
Loads single-turn questions for evaluating speculative decoding methods.
17561756
1757+
# ---------------------------------------------------------------------------
1758+
# Speculative-decoding acceptance-length benchmarks (auto-downloaded from HF).
1759+
# These are the datasets the spec-decode literature reports acceptance length
1760+
# against (GSM8K, MT-Bench, MATH-500, HumanEval, MBPP). Run each with
1761+
# `--public-dataset spec_al_<name>` then `aiperf speed-bench-report ...
1762+
# --metric accept_length`. Prompts are emitted bare; set greedy decoding with
1763+
# `--extra-inputs temperature:0` to match the literature's headline numbers.
1764+
# ---------------------------------------------------------------------------
1765+
spec_al_gsm8k:
1766+
class: aiperf.dataset.loader.hf_instruction_response:HFInstructionResponseDatasetLoader
1767+
description: |
1768+
GSM8K grade-school math word problems from HuggingFace (openai/gsm8k, main
1769+
config, test split, 1319 prompts). Single-turn prompts for speculative
1770+
decoding acceptance-length measurement. License: MIT.
1771+
metadata:
1772+
hf_dataset_name: openai/gsm8k
1773+
hf_subset: main
1774+
hf_split: test
1775+
prompt_column: question
1776+
1777+
spec_al_math500:
1778+
class: aiperf.dataset.loader.hf_instruction_response:HFInstructionResponseDatasetLoader
1779+
description: |
1780+
MATH-500 competition math problems from HuggingFace (HuggingFaceH4/MATH-500,
1781+
test split, 500 prompts). Single-turn prompts for speculative decoding
1782+
acceptance-length measurement. License: MIT (inherited from Hendrycks MATH).
1783+
metadata:
1784+
hf_dataset_name: HuggingFaceH4/MATH-500
1785+
hf_split: test
1786+
prompt_column: problem
1787+
1788+
spec_al_humaneval:
1789+
class: aiperf.dataset.loader.hf_instruction_response:HFInstructionResponseDatasetLoader
1790+
description: |
1791+
HumanEval Python code-completion prompts from HuggingFace
1792+
(openai/openai_humaneval, test split, 164 prompts). Single-turn prompts for
1793+
speculative decoding acceptance-length measurement. The literature runs
1794+
HumanEval as raw text completion; serving it through a chat template here
1795+
shifts acceptance length away from the papers' headline numbers (MBPP shares
1796+
this caveat to a lesser degree). License: MIT.
1797+
metadata:
1798+
hf_dataset_name: openai/openai_humaneval
1799+
hf_split: test
1800+
prompt_column: prompt
1801+
1802+
spec_al_mbpp:
1803+
class: aiperf.dataset.loader.hf_instruction_response:HFInstructionResponseDatasetLoader
1804+
description: |
1805+
MBPP basic Python programming tasks from HuggingFace
1806+
(google-research-datasets/mbpp, full config, test split, 500 prompts).
1807+
Single-turn prompts for speculative decoding acceptance-length measurement.
1808+
License: CC-BY-4.0.
1809+
metadata:
1810+
hf_dataset_name: google-research-datasets/mbpp
1811+
hf_subset: full
1812+
hf_split: test
1813+
prompt_column: text
1814+
1815+
spec_al_mtbench:
1816+
class: aiperf.dataset.loader.mt_bench:MTBenchDatasetLoader
1817+
description: |
1818+
MT-Bench two-turn chat prompts from HuggingFace (HuggingFaceH4/mt_bench_prompts,
1819+
train split, 80 prompts). Multi-turn conversations for speculative decoding
1820+
acceptance-length measurement. License: Apache-2.0.
1821+
metadata:
1822+
hf_dataset_name: HuggingFaceH4/mt_bench_prompts
1823+
hf_split: train
1824+
17571825
instruct_coder:
17581826
class: aiperf.dataset.loader.hf_instruction_response:HFInstructionResponseDatasetLoader
17591827
description: |
