feat(dataset): add spec_al_* acceptance-length benchmark datasets (#1046)

abatilo · web-flow · commit 90651162ae0f · 2026-06-12T16:45:39.000-07:00
Signed-off-by: Aaron Batilo <AaronBatilo@gmail.com>
diff --git a/docs/cli-options.md b/docs/cli-options.md
@@ -359,7 +359,7 @@ Path to file or directory containing benchmark dataset. Required when using `--c
 #### `--public-dataset` `<str>`
 
 Pre-configured public dataset to download and use for benchmarking (e.g., `sharegpt`). AIPerf automatically downloads and parses these datasets. Mutually exclusive with `--custom-dataset-type`. Run `aiperf plugins public_dataset_loader` to list available datasets. Use `--hf-subset` to override the HuggingFace subset/config for HF-backed datasets.
-<br/>_Choices: [`sharegpt`, `aimo`, `mmstar`, `mmvu`, `vision_arena`, `llava_onevision`, `aimo_aime`, `aimo_numina_cot`, `aimo_numina_1_5`, `spec_bench`, `instruct_coder`, `blazedit_5k`, `blazedit_10k`, `librispeech`, `voxpopuli`, `gigaspeech`, `ami`, `spgispeech`]_
+<br/>_Choices: [`sharegpt`, `aimo`, `mmstar`, `mmvu`, `vision_arena`, `llava_onevision`, `aimo_aime`, `aimo_numina_cot`, `aimo_numina_1_5`, `spec_bench`, `spec_al_gsm8k`, `spec_al_math500`, `spec_al_humaneval`, `spec_al_mbpp`, `spec_al_mtbench`, `instruct_coder`, `blazedit_5k`, `blazedit_10k`, `librispeech`, `voxpopuli`, `gigaspeech`, `ami`, `spgispeech`]_
 
 #### `--hf-subset` `<str>`
 
@@ -1688,7 +1688,7 @@ Path to file or directory containing benchmark dataset. Required when using `--c
 #### `--public-dataset` `<str>`
 
 Pre-configured public dataset to download and use for benchmarking (e.g., `sharegpt`). AIPerf automatically downloads and parses these datasets. Mutually exclusive with `--custom-dataset-type`. Run `aiperf plugins public_dataset_loader` to list available datasets. Use `--hf-subset` to override the HuggingFace subset/config for HF-backed datasets.
-<br/>_Choices: [`sharegpt`, `aimo`, `mmstar`, `mmvu`, `vision_arena`, `llava_onevision`, `aimo_aime`, `aimo_numina_cot`, `aimo_numina_1_5`, `spec_bench`, `instruct_coder`, `blazedit_5k`, `blazedit_10k`, `librispeech`, `voxpopuli`, `gigaspeech`, `ami`, `spgispeech`]_
+<br/>_Choices: [`sharegpt`, `aimo`, `mmstar`, `mmvu`, `vision_arena`, `llava_onevision`, `aimo_aime`, `aimo_numina_cot`, `aimo_numina_1_5`, `spec_bench`, `spec_al_gsm8k`, `spec_al_math500`, `spec_al_humaneval`, `spec_al_mbpp`, `spec_al_mtbench`, `instruct_coder`, `blazedit_5k`, `blazedit_10k`, `librispeech`, `voxpopuli`, `gigaspeech`, `ami`, `spgispeech`]_
 
 #### `--hf-subset` `<str>`
 
diff --git a/docs/tutorials/speed-bench.md b/docs/tutorials/speed-bench.md
@@ -187,7 +187,7 @@ aiperf speed-bench-report ./artifacts/ --format both
 This produces a CSV (`speed_bench_report.csv`) and console table:
 
 ```
-                         SPEED-Bench Acceptance Length Report
+                         Acceptance Length Report
 ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
 ┃ Model                      ┃ coding ┃ humanities ┃ math ┃ writing ┃ Overall ┃
 ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
@@ -209,6 +209,87 @@ aiperf speed-bench-report ./artifacts/ --metric throughput
 
 ---
 
+## Literature Acceptance-Length Datasets (GSM8K, MT-Bench, MATH-500, HumanEval, MBPP)
+
+The speculative-decoding literature overwhelmingly reports acceptance length against five standard benchmarks. AIPerf registers each as a public dataset that is auto-downloaded from HuggingFace at runtime, so there is no prepare-data step: just select one with `--public-dataset` and run the same `aiperf speed-bench-report` workflow shown above.
+
+| Dataset Name | HuggingFace Source | Prompts | Turns | License |
+|---|---|---|---|---|
+| `spec_al_gsm8k` | `openai/gsm8k` (`main`, `test`) | 1,319 | single | MIT |
+| `spec_al_math500` | `HuggingFaceH4/MATH-500` (`test`) | 500 | single | MIT |
+| `spec_al_humaneval` | `openai/openai_humaneval` (`test`) | 164 | single | MIT |
+| `spec_al_mbpp` | `google-research-datasets/mbpp` (`full`, `test`) | 500 | single | CC-BY-4.0 |
+| `spec_al_mtbench` | `HuggingFaceH4/mt_bench_prompts` (`train`) | 80 | two-turn | Apache-2.0 |
+
+Prompts are emitted verbatim (the raw `question`, `problem`, `prompt`, or `text` field, depending on the dataset); the served model's chat template wraps them at request time via `--endpoint-type chat`. HumanEval and MBPP are text-completion tasks in the spec-decode literature, so chat-wrapping them keeps the matrix uniform but shifts their acceptance length somewhat from the papers' headline numbers. Acceptance length does not depend on whether the answers are correct, so no scoring step is needed; use greedy decoding (`--extra-inputs temperature:0`) to match the headline numbers reported in the literature. Note that `--osl` does not apply to public datasets, so cap generation with `--extra-inputs max_tokens:N` instead. `spec_al_mtbench` is multi-turn: AIPerf dispatches both turns per session and feeds the live assistant reply back as conversation history between them. Size it with `--num-conversations` rather than `--request-count` (see below).
+
+### Run All Five with a Matrix Report
+
+```bash
+MODEL="meta/llama-3.1-8b-instruct"
+ART=./artifacts/spec-al   # dedicated root so this matrix never merges with speed_bench_* runs
+
+# Single-turn datasets: size each run to the full dataset with --request-count.
+for pair in spec_al_gsm8k:1319 spec_al_math500:500 spec_al_humaneval:164 spec_al_mbpp:500; do
+  ds="${pair%%:*}"; count="${pair##*:}"
+  echo "=== Running dataset: $ds ($count requests) ==="
+  aiperf profile \
+      --model "$MODEL" \
+      --endpoint-type chat \
+      --streaming \
+      --url localhost:8000 \
+      --public-dataset "$ds" \
+      --server-metrics http://localhost:8000/metrics \
+      --request-count "$count" \
+      --extra-inputs temperature:0 max_tokens:4096 \
+      --concurrency 16 \
+      --output-artifact-dir "$ART/$ds"
+done
+
+# MT-Bench is multi-turn (80 two-turn conversations). Size it with
+# --num-conversations so every session runs exactly once; --request-count
+# recycles the 80 sessions to reach the count and would dispatch each prompt
+# more than once.
+aiperf profile \
+    --model "$MODEL" \
+    --endpoint-type chat \
+    --streaming \
+    --url localhost:8000 \
+    --public-dataset spec_al_mtbench \
+    --server-metrics http://localhost:8000/metrics \
+    --num-conversations 80 \
+    --extra-inputs temperature:0 max_tokens:4096 \
+    --concurrency 16 \
+    --output-artifact-dir "$ART/spec_al_mtbench"
+
+# Assemble the acceptance-length matrix (one column per dataset)
+aiperf speed-bench-report "$ART" --metric accept_length --format both
+```
+
+> Size each run to the full dataset: without an explicit count, AIPerf defaults
+> to 10 requests. Single-turn datasets use `--request-count`; the multi-turn
+> `spec_al_mtbench` uses `--num-conversations 80` (one run per conversation),
+> since `--request-count` recycles its 80 sessions to reach the count. Cap
+> generation with `--extra-inputs max_tokens:N` (`--osl` is ignored for public
+> datasets), and keep these runs in their own artifacts directory so
+> `speed-bench-report` does not average them into an unrelated `speed_bench_*`
+> matrix.
+
+The report recognizes these runs the same way it recognizes the `speed_bench_*` runs, producing one matrix column per dataset:
+
+```
+                         Acceptance Length Report
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┓
+┃ Model                      ┃ gsm8k ┃ math500 ┃ mtbench ┃ humaneval ┃ mbpp ┃ Overall ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━┩
+│ meta/llama-3.1-8b-instruct │  2.40 │    2.31 │    1.95 │      2.62 │ 2.55 │    2.37 │
+└────────────────────────────┴───────┴─────────┴─────────┴───────────┴──────┴─────────┘
+```
+
+The `accept_rate` and `throughput` metrics work identically (`aiperf speed-bench-report ./artifacts/spec-al --metric accept_rate`).
+
+---
+
 ## Profile with Aggregate Qualitative Split
 
 To run all 880 prompts in a single benchmark (without per-category breakdown):
diff --git a/src/aiperf/analysis/speed_bench_report.py b/src/aiperf/analysis/speed_bench_report.py
@@ -43,6 +43,15 @@ class SpeedBenchReportError(Exception):
 
 THROUGHPUT_TIERS = ["low_entropy", "mixed", "high_entropy"]
 
+# spec_al_* acceptance-length benchmarks, in a curated order so the report
+# columns read math -> chat -> code rather than alphabetically.
+SPEC_AL_BENCHMARKS = ["gsm8k", "math500", "mtbench", "humaneval", "mbpp"]
+
+# Dataset-selector prefixes that mark an acceptance-length benchmark run. The
+# category is the selector value with the prefix stripped (e.g.
+# "speed_bench_coding" -> "coding", "spec_al_gsm8k" -> "gsm8k").
+CATEGORY_PREFIXES = ("speed_bench_", "spec_al_")
+
 # Server metric names that represent acceptance length, in priority order.
 # Different engines expose this under different names.
 ACCEPT_LENGTH_METRICS = [
@@ -108,11 +117,14 @@ def load_server_metrics(run_dir: Path) -> dict | None:
 
 
 def extract_category(profile: dict) -> str | None:
-    """Extract the SPEED-Bench category from the input config.
-
-    The exporter writes ``input_config`` as a dump of the v2 ``BenchmarkConfig``,
-    so the public dataset enum lives on ``datasets[].dataset``. Returns the
-    suffix of the first entry whose value starts with ``speed_bench_``.
+    """Extract the acceptance-length benchmark category from the input config.
+
+    The exporter writes ``input_config`` as a dump of the v2 ``BenchmarkConfig``.
+    Custom/file datasets (e.g. SPEED-Bench) serialize their selector under
+    ``datasets[].format``; public datasets (e.g. the spec_al_* HuggingFace
+    benchmarks) serialize it under ``datasets[].dataset``. Returns the suffix of
+    the first entry whose selector starts with a recognized prefix
+    (see ``CATEGORY_PREFIXES``).
     """
     try:
         datasets = profile["input_config"]["datasets"]
@@ -123,9 +135,12 @@ def extract_category(profile: dict) -> str | None:
     for entry in datasets:
         if not isinstance(entry, dict):
             continue
-        name = entry.get("format")
-        if isinstance(name, str) and name.startswith("speed_bench_"):
-            return name.removeprefix("speed_bench_")
+        name = entry.get("format") or entry.get("dataset")
+        if not isinstance(name, str):
+            continue
+        for prefix in CATEGORY_PREFIXES:
+            if name.startswith(prefix):
+                return name.removeprefix(prefix)
     return None
 
 
@@ -294,6 +309,8 @@ def detect_columns(results: dict[str, dict[str, float | None]]) -> list[str]:
         return [c for c in QUALITATIVE_CATEGORIES if c in all_cats]
     if all_cats <= set(THROUGHPUT_TIERS):
         return [c for c in THROUGHPUT_TIERS if c in all_cats]
+    if all_cats <= set(SPEC_AL_BENCHMARKS):
+        return [c for c in SPEC_AL_BENCHMARKS if c in all_cats]
     return sorted(all_cats)
 
 
@@ -331,12 +348,12 @@ def print_table(
         from rich.table import Table
 
         title_map = {
-            "accept_length": "SPEED-Bench Acceptance Length Report",
-            "accept_rate": "SPEED-Bench Acceptance Rate Report",
-            "throughput": "SPEED-Bench Throughput Report (tokens/sec)",
+            "accept_length": "Acceptance Length Report",
+            "accept_rate": "Acceptance Rate Report",
+            "throughput": "Throughput Report (tokens/sec)",
         }
         table = Table(
-            title=title_map.get(metric_type, "SPEED-Bench Report"),
+            title=title_map.get(metric_type, "Speculative Decoding Report"),
             show_header=True,
             header_style="bold magenta",
         )
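
Note on the `speed_bench_report.py` changes above: the following is a minimal standalone sketch of how the prefix stripping and the curated column order interact. It copies the two new constants from the diff; the helper names `category_of` and `ordered_columns` and the sample entries are hypothetical, and the sketch leaves out the `QUALITATIVE_CATEGORIES` and `THROUGHPUT_TIERS` branches that `detect_columns` checks first.

```python
from __future__ import annotations

# Constants copied from the diff above.
SPEC_AL_BENCHMARKS = ["gsm8k", "math500", "mtbench", "humaneval", "mbpp"]
CATEGORY_PREFIXES = ("speed_bench_", "spec_al_")


def category_of(entry: dict) -> str | None:
    """Mirror the per-entry selector logic of extract_category (hypothetical helper)."""
    name = entry.get("format") or entry.get("dataset")
    if not isinstance(name, str):
        return None
    for prefix in CATEGORY_PREFIXES:
        if name.startswith(prefix):
            return name.removeprefix(prefix)
    return None


def ordered_columns(categories: set[str]) -> list[str]:
    """Curated order when every category is a spec_al_* benchmark, else alphabetical."""
    if categories <= set(SPEC_AL_BENCHMARKS):
        return [c for c in SPEC_AL_BENCHMARKS if c in categories]
    return sorted(categories)


# Hand-written entries for illustration only (not real exporter output).
entries = [
    {"dataset": "spec_al_mbpp"},
    {"format": "speed_bench_coding"},
    {"dataset": "sharegpt"},
    {"dataset": "spec_al_gsm8k"},
]
cats = [category_of(e) for e in entries]
print(cats)
print(ordered_columns({"mbpp", "gsm8k", "mtbench"}))
```

Because the curated order applies only when every detected category is in `SPEC_AL_BENCHMARKS`, a root that mixes `spec_al_*` and `speed_bench_*` runs falls back to alphabetical columns in this sketch, which is one more reason to keep the two matrices in separate artifact directories.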
diff --git a/src/aiperf/config/schema/aiperf-config.schema.json b/src/aiperf/config/schema/aiperf-config.schema.json
@@ -9479,6 +9479,11 @@
         "aimo_numina_cot",
         "aimo_numina_1_5",
         "spec_bench",
+        "spec_al_gsm8k",
+        "spec_al_math500",
+        "spec_al_humaneval",
+        "spec_al_mbpp",
+        "spec_al_mtbench",
         "instruct_coder",
         "blazedit_5k",
         "blazedit_10k",
diff --git a/src/aiperf/dataset/loader/mt_bench.py b/src/aiperf/dataset/loader/mt_bench.py
@@ -0,0 +1,89 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from __future__ import annotations
+
+from typing import Any
+
+from aiperf.common.exceptions import DatasetLoaderError
+from aiperf.common.models import Conversation, Text, Turn
+from aiperf.dataset.loader.base_hf_dataset import BaseHFDatasetLoader
+
+
+class MTBenchDatasetLoader(BaseHFDatasetLoader):
+    """HuggingFace loader for MT-Bench prompts (HuggingFaceH4/mt_bench_prompts).
+
+    Each row's prompt column is a list of strings (one per user turn, usually
+    two), so each row becomes one multi-turn Conversation of bare user Turns.
+    AIPerf's UserSession dispatches the turns sequentially and, under the default
+    DELTAS_WITHOUT_RESPONSES context mode, feeds the live assistant reply back as
+    history between turns - the FastChat / Spec-Bench MT-Bench protocol.
+
+    Example plugins.yaml entry::
+
+        spec_al_mtbench:
+          class: aiperf.dataset.loader.mt_bench:MTBenchDatasetLoader
+          metadata:
+            hf_dataset_name: HuggingFaceH4/mt_bench_prompts
+            hf_split: train
+    """
+
+    # mt_bench_prompts stores the per-turn prompt list under this column.
+    PROMPT_COLUMN = "prompt"
+
+    async def convert_to_conversations(
+        self, data: dict[str, Any]
+    ) -> list[Conversation]:
+        """Convert each MT-Bench row into a multi-turn Conversation."""
+        dataset = data["dataset"]
+        conversations: list[Conversation] = []
+        skipped = 0
+        max_conversations = self._max_conversations()
+
+        column_validated = False
+        for row in dataset:
+            if (
+                max_conversations is not None
+                and len(conversations) >= max_conversations
+            ):
+                break
+
+            if not column_validated:
+                column_validated = True
+                if self.PROMPT_COLUMN not in row:
+                    raise DatasetLoaderError(
+                        f"Column '{self.PROMPT_COLUMN}' not found in dataset "
+                        f"'{self.hf_dataset_name}'. Available columns: "
+                        f"{list(row.keys())}."
+                    )
+
+            turns_raw = row.get(self.PROMPT_COLUMN)
+            if not isinstance(turns_raw, list):
+                skipped += 1
+                continue
+
+            conv_turns: list[Turn] = []
+            for t in turns_raw:
+                text = str(t).strip() if t else ""
+                if text:
+                    conv_turns.append(Turn(texts=[Text(contents=[text])]))
+            if not conv_turns:
+                skipped += 1
+                continue
+
+            conversations.append(
+                Conversation(
+                    session_id=self.session_id_generator.next(),
+                    turns=conv_turns,
+                )
+            )
+
+        if skipped > 0 and not conversations:
+            self.warning(
+                f"All {skipped} rows skipped - no conversations loaded. "
+                f"Check that '{self.PROMPT_COLUMN}' holds non-empty prompt lists."
+            )
+        self.debug(
+            lambda: f"Converted {len(conversations)} MT-Bench rows (skipped {skipped})"
+        )
+        return conversations
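
Note on `MTBenchDatasetLoader` above: the row-filtering rules can be exercised without AIPerf or HuggingFace installed. The sketch below reproduces only the per-row logic with plain lists of strings in place of `Turn` and `Conversation` objects; `row_to_turn_texts` and the sample rows are hypothetical and exist only for illustration.

```python
from __future__ import annotations

from typing import Any

# Same column name the loader uses.
PROMPT_COLUMN = "prompt"


def row_to_turn_texts(row: dict[str, Any]) -> list[str] | None:
    """Return cleaned user-turn texts for one row, or None when the row is skipped."""
    turns_raw = row.get(PROMPT_COLUMN)
    if not isinstance(turns_raw, list):
        return None  # the loader counts this row as skipped
    # Drop falsy items, strip whitespace, then drop anything left empty.
    stripped = [str(t).strip() for t in turns_raw if t]
    texts = [t for t in stripped if t]
    return texts or None  # a row with no usable turns is also skipped


# Hand-written rows for illustration only (not real dataset content).
rows = [
    {"prompt": ["Write a haiku.", "  Now translate it.  "]},
    {"prompt": "not a list"},
    {"prompt": ["", None, "   "]},
    {"prompt": ["Only one turn"]},
]
results = [row_to_turn_texts(r) for r in rows]
print(results)
```

Rows whose prompt column is not a list, or whose turns are all empty after stripping, produce no conversation; a row with a single usable turn still yields a one-turn conversation.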
diff --git a/src/aiperf/plugin/enums.py b/src/aiperf/plugin/enums.py
@@ -63,7 +63,7 @@
 
 PublicDatasetTypeStr: TypeAlias = str
 PublicDatasetType = plugins.create_enum(PluginType.PUBLIC_DATASET_LOADER, "PublicDatasetType", module=__name__)
-"""Dynamic enum for public dataset loader. Example: PublicDatasetType.AIMO, PublicDatasetType.LIBRISPEECH, PublicDatasetType.VOXPOPULI"""
+"""Dynamic enum for public dataset loader. Example: PublicDatasetType.AIMO, PublicDatasetType.MMSTAR, PublicDatasetType.VOXPOPULI"""
 
 EndpointTypeStr: TypeAlias = str
 EndpointType = plugins.create_enum(PluginType.ENDPOINT, "EndpointType", module=__name__)
diff --git a/src/aiperf/plugin/plugins.yaml b/src/aiperf/plugin/plugins.yaml
@@ -1754,6 +1754,74 @@ public_dataset_loader:
       SpecBench speculative decoding benchmark dataset from GitHub (hemingkx).
       Loads single-turn questions for evaluating speculative decoding methods.
 
+  # ---------------------------------------------------------------------------
+  # Speculative-decoding acceptance-length benchmarks (auto-downloaded from HF).
+  # These are the datasets the spec-decode literature reports acceptance length
+  # against (GSM8K, MT-Bench, MATH-500, HumanEval, MBPP). Run each with
+  # `--public-dataset spec_al_<name>` then `aiperf speed-bench-report ...
+  # --metric accept_length`. Prompts are emitted bare; set greedy decoding with
+  # `--extra-inputs temperature:0` to match the literature's headline numbers.
+  # ---------------------------------------------------------------------------
+  spec_al_gsm8k:
+    class: aiperf.dataset.loader.hf_instruction_response:HFInstructionResponseDatasetLoader
+    description: |
+      GSM8K grade-school math word problems from HuggingFace (openai/gsm8k, main
+      config, test split, 1319 prompts). Single-turn prompts for speculative
+      decoding acceptance-length measurement. License: MIT.
+    metadata:
+      hf_dataset_name: openai/gsm8k
+      hf_subset: main
+      hf_split: test
+      prompt_column: question
+
+  spec_al_math500:
+    class: aiperf.dataset.loader.hf_instruction_response:HFInstructionResponseDatasetLoader
+    description: |
+      MATH-500 competition math problems from HuggingFace (HuggingFaceH4/MATH-500,
+      test split, 500 prompts). Single-turn prompts for speculative decoding
+      acceptance-length measurement. License: MIT (inherited from Hendrycks MATH).
+    metadata:
+      hf_dataset_name: HuggingFaceH4/MATH-500
+      hf_split: test
+      prompt_column: problem
+
+  spec_al_humaneval:
+    class: aiperf.dataset.loader.hf_instruction_response:HFInstructionResponseDatasetLoader
+    description: |
+      HumanEval Python code-completion prompts from HuggingFace
+      (openai/openai_humaneval, test split, 164 prompts). Single-turn prompts for
+      speculative decoding acceptance-length measurement. The literature runs
+      HumanEval as raw text completion; serving it through a chat template here
+      shifts acceptance length away from the papers' headline numbers (MBPP shares
+      this caveat to a lesser degree). License: MIT.
+    metadata:
+      hf_dataset_name: openai/openai_humaneval
+      hf_split: test
+      prompt_column: prompt
+
+  spec_al_mbpp:
+    class: aiperf.dataset.loader.hf_instruction_response:HFInstructionResponseDatasetLoader
+    description: |
+      MBPP basic Python programming tasks from HuggingFace
+      (google-research-datasets/mbpp, full config, test split, 500 prompts).
+      Single-turn prompts for speculative decoding acceptance-length measurement.
+      License: CC-BY-4.0.
+    metadata:
+      hf_dataset_name: google-research-datasets/mbpp
+      hf_subset: full
+      hf_split: test
+      prompt_column: text
+
+  spec_al_mtbench:
+    class: aiperf.dataset.loader.mt_bench:MTBenchDatasetLoader
+    description: |
+      MT-Bench two-turn chat prompts from HuggingFace (HuggingFaceH4/mt_bench_prompts,
+      train split, 80 prompts). Multi-turn conversations for speculative decoding
+      acceptance-length measurement. License: Apache-2.0.
+    metadata:
+      hf_dataset_name: HuggingFaceH4/mt_bench_prompts
+      hf_split: train
+
   instruct_coder:
     class: aiperf.dataset.loader.hf_instruction_response:HFInstructionResponseDatasetLoader
     description: |
diff --git a/tests/unit/analysis/test_speed_bench_report.py b/tests/unit/analysis/test_speed_bench_report.py
diff --git a/tests/unit/dataset/loader/test_mt_bench.py b/tests/unit/dataset/loader/test_mt_bench.py