Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/accuracy/accuracy-benchmarking.md
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,7 @@ system message).
| `mmlu` | `multiple_choice` | 5 | `lighteval/mmlu` (57 subjects) |
| `aime` | `math` | 8 | `Maxwell-Jia/AIME_2024` (trt-llm reference, 8-shot CoT) |
| `hellaswag` | `exact_match` | 10 | `Rowan/hellaswag` (trt-llm/DeepEval reference; one few-shot per unique activity_label) |
| `bigbench` | `exact_match` | 3 | `lukaemon/bbh` (trt-llm/DeepEval reference; 27 subtasks, canonical CoT/non-CoT prompt files) |

## CLI Flags

Expand Down
29 changes: 15 additions & 14 deletions docs/accuracy/accuracy_stubs.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@

This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected.

**Status summary:** With the HellaSwag loader landing on top of AIP-874, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, and `HellaSwagBenchmark` are fully implemented; the remaining benchmarks (`bigbench`, `aime24`, `aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
**Status summary:** With the BigBench-Hard loader landing on top of the HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, and `BigBenchBenchmark` are fully implemented; the remaining benchmarks (`aime24`, `aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
Comment thread
coderabbitai[bot] marked this conversation as resolved.

## Table of Contents

Expand Down Expand Up @@ -172,17 +172,17 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method.
| 1 | `MMLUBenchmark` | `benchmarks/mmlu.py` | `mmlu` | `multiple_choice` | 5 | **IMPLEMENTED in PR #815** — canonical reference for new benchmarks. Downloads via HuggingFace datasets, handles few-shot formatting and CoT. |
| 2 | `AIMEBenchmark` | `benchmarks/aime.py` | `aime` | `math` | 8 | **IMPLEMENTED.** Loads `Maxwell-Jia/AIME_2024`, instructs the model to wrap its final integer in `\boxed{}`, supports few-shot priming and chain-of-thought. `default_enable_cot=true`. |
| 3 | `HellaSwagBenchmark` | `benchmarks/hellaswag.py` | `hellaswag` | `exact_match` | 10 | **IMPLEMENTED.** Loads `Rowan/hellaswag` (validation split filtered per task by `activity_label`; train split feeds the "one few-shot per unique activity_label" rule). Prompt rendering delegates to `deepeval.benchmarks.HellaSwag`'s `HellaSwagTemplate.generate_output`, so output is byte-equal to the trt-llm recipe's DeepEval-backed path. Pairs with `exact_match` for strict `Scorer.exact_match_score` semantics. Requires the `[accuracy]` extras (deepeval). |
| 4 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 | **IMPLEMENTED.** Loads `lukaemon/bbh` (27 BBH subtasks). Prompt rendering delegates to `deepeval.benchmarks.BigBenchHard`'s `BigBenchHardTemplate.generate_output`, which reads the 27 canonical CoT/shot prompt files DeepEval ships as package data. Pairs with `exact_match` for the recipe's strict `Scorer.exact_match_score` semantics. `default_n_shots=3`, `default_enable_cot=true`. Requires the `[accuracy]` extras (deepeval). |

### Still Stubbed

| # | Class | File | Plugin Key | Default Grader | Default N-Shots |
|---|-------|------|------------|----------------|-----------------|
| 1 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 |
| 2 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `math` | 0 |
| 3 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
| 4 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
| 5 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
| 6 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
| 1 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `math` | 0 |
| 2 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
| 3 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
| 4 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
| 5 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |

**Each benchmark has 1 method to implement:**

Expand Down Expand Up @@ -308,26 +308,27 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu

| Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods |
|-----------|-------------|---------------|------------------|-------------------|
| Graders | 1 (`MultipleChoiceGrader`) | 3 | 2 (`grade`, `extract_answer`) | 6 |
| Benchmarks | 1 (`MMLUBenchmark`) | 8 | 1 (`load_problems`) | 8 |
| Graders | 7 (all) | 0 | — | 0 |
| Benchmarks | 4 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`) | 5 | 1 (`load_problems`) | 5 |
| Record Processor | 1 (`AccuracyRecordProcessor`) | 0 | — | 0 |
| Results Processor | 1 (`AccuracyResultsProcessor`) | 0 | — | 0 |
| Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 | — | 0 |
| Data Exporter | 1 (`AccuracyDataExporter`) | 0 | — | 0 |
| Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 |
| **Total** | **6** | **13** | | **15** |
| **Total** | **15** | **6** | | **6** |

### Self-Disabling Pattern

Processors and exporters raise their `Disabled` exception **in `__init__`** when accuracy is off. The existing framework catches these and silently skips the plugin. No code changes needed to support this — it uses the same pattern as `RawRecordWriterProcessor` and `ServerMetricsCsvExporter`.

### Suggested Implementation Order

The processors, exporters, and one grader/benchmark pair are already wired end-to-end. Start from the already-working pipeline:
The processors, exporters, all graders, and four benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`) are already wired end-to-end. The remaining work is the five stub benchmarks; mirror the existing loader whose grader matches:

1. **Graders** — use `MultipleChoiceGrader` as reference; implement `ExactMatchGrader` next (simplest), then `MathGrader`
2. **Benchmarks** — use `MMLUBenchmark` as reference; implement dataset loading for each remaining benchmark
3. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` when a benchmark or grader moves from stubbed to supported
1. **`aime24`, `aime25`, `math_500`** — mirror `AIMEBenchmark` (`benchmarks/aime.py`); pair with the `math` grader.
2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `multiple_choice` grader.
3. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
4. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.

### Key Files for Reference

Expand Down
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -219,6 +219,7 @@ markers = [
"server_unit: marks tests as unit tests for the mock server",
"fern: marks tests that validate Fern documentation (requires fern CLI)",
"network: marks tests that require network access",
"requires_deepeval: tests that need the real deepeval install (i.e. the [accuracy] extras) — skipped when only the fake-deepeval harness is registered",
]
# Better console output
console_output_style = "progress"
Expand Down
215 changes: 206 additions & 9 deletions src/aiperf/accuracy/benchmarks/bigbench.py
Original file line number Diff line number Diff line change
@@ -1,31 +1,228 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""BigBench-Hard benchmark loader, aligned with the trt-llm DeepEval reference.

The trt-llm benchmark recipe routes ``bigbench`` through DeepEval's
``deepeval.benchmarks.BigBenchHard`` class
(``trt-llm-benchmark-recipe/src/tools/acc_benchmark.py:338-356``). This
loader produces prompts byte-equal to what DeepEval's
``BigBenchHardTemplate.generate_output`` produces, by importing and
calling that template directly. The 27 canonical CoT and non-CoT
prompt files (one per BBH subtask) ship inside DeepEval as package
data — DeepEval's template reads them via ``importlib.resources`` at
load time.

Pair with ``ExactMatchGrader`` for strict ``pred.strip() ==
gold.strip()`` semantics matching DeepEval's
``Scorer.exact_match_score``.

Reference:
deepeval/benchmarks/big_bench_hard/big_bench_hard.py
deepeval/benchmarks/big_bench_hard/template.py
deepeval/benchmarks/big_bench_hard/cot_prompts/*.txt (27 files)
deepeval/benchmarks/big_bench_hard/shot_prompts/*.txt (27 files)
trt-llm-benchmark-recipe/src/tools/acc_benchmark.py:338
"""

from __future__ import annotations

from typing import TYPE_CHECKING
import asyncio
from typing import TYPE_CHECKING, Any

from aiperf.accuracy.models import BenchmarkProblem
from datasets import Dataset, load_dataset
Comment thread
coderabbitai[bot] marked this conversation as resolved.

from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem
from aiperf.common.mixins import AIPerfLoggerMixin

if TYPE_CHECKING:
from aiperf.config.resolution.plan import BenchmarkRun

try:
from deepeval.benchmarks.big_bench_hard.big_bench_hard import (
bbh_confinement_statements_dict,
)
from deepeval.benchmarks.big_bench_hard.task import BigBenchHardTask
from deepeval.benchmarks.big_bench_hard.template import (
BigBenchHardTemplate,
)

_HAS_DEEPEVAL = True
except ImportError: # pragma: no cover - exercised only without optional dep
_HAS_DEEPEVAL = False
BigBenchHardTask = None # type: ignore[assignment]
BigBenchHardTemplate = None # type: ignore[assignment]
bbh_confinement_statements_dict = None # type: ignore[assignment]


_MISSING_DEEPEVAL_HINT = (
"deepeval is not installed; BigBench-Hard's prompt templates and "
"the per-task confinement dict (the trt-llm reference) cannot be "
"loaded. Install with: uv pip install 'aiperf[accuracy]'."
)

DATASET_NAME = "lukaemon/bbh"
TASK_NAME = "bigbench"

# DeepEval's BigBenchHard caps n_shots at 3 (the canonical CoT files
# only contain 3 worked examples each). We mirror both bounds.
DEFAULT_N_SHOTS = 3
MAX_N_SHOTS = 3

# DeepEval's BigBenchHard default is ``enable_cot=True``.
DEFAULT_ENABLE_COT = True

# CoT solutions can run several hundred tokens; non-CoT answers are
# typically a single bare token. 1024 covers both with headroom.
DEFAULT_GENERATION_SIZE = 1024

# Schema field names in lukaemon/bbh.
INPUT_FIELD = "input"
TARGET_FIELD = "target"


def _resolve_tasks(tasks: list[str] | None) -> list[Any]:
"""Convert ``--accuracy-tasks`` strings to ``BigBenchHardTask`` enums.

DeepEval evaluates one task at a time. Aiperf accepts either:
- ``None`` / empty / ``["all"]`` (case-insensitive) → every
BigBenchHardTask enum (27 subtasks).
- Lower-snake-case strings matching the enum's ``value``
(e.g. ``"boolean_expressions"``).
- Upper-snake-case enum names (e.g. ``"BOOLEAN_EXPRESSIONS"``)
for parity with the recipe's ``getattr(BigBenchHardTask,
task_name.upper(), None)`` lookup.

Mixing ``"all"`` with other task names is rejected so typos like
``["all", "NOT_A_TASK"]`` don't silently bypass validation — that
used to slip through and return every task while swallowing the
invalid entry (the parallel HellaSwag bug fixed in AIP-877).

Unknown names raise ``ValueError`` with the full valid list so
typos fail loudly.
"""
if not tasks:
return list(BigBenchHardTask)
lowered = [t.lower() for t in tasks]
if "all" in lowered:
if lowered == ["all"]:
return list(BigBenchHardTask)
raise ValueError(
"'all' cannot be mixed with other task names. Pass 'all' "
"by itself (or omit --accuracy-tasks) to select every BBH "
f"subtask, or list specific subtasks. Got: {tasks!r}"
)
valid_values = {t.value for t in BigBenchHardTask}
resolved: list[Any] = []
unknown: list[str] = []
for name in tasks:
if name in valid_values:
resolved.append(next(t for t in BigBenchHardTask if t.value == name))
continue
enum_member = getattr(BigBenchHardTask, name.upper(), None)
if enum_member is not None:
resolved.append(enum_member)
else:
unknown.append(name)
if unknown:
raise ValueError(
f"Unknown BBH subtask(s): {unknown}. Valid subtasks: {sorted(valid_values)}"
)
return resolved


class BigBenchBenchmark(AIPerfLoggerMixin):
"""Registered placeholder for a future BigBench loader.
"""BigBench-Hard benchmark loader, byte-equal to DeepEval's prompts.

`load_problems()` intentionally raises NotImplementedError in this release;
use the MMLU benchmark when a working accuracy loader is required.
Iterates the requested BBH subtasks and renders each problem's
prompt via ``BigBenchHardTemplate.generate_output`` (which reads
DeepEval's bundled CoT/shot prompt files). Pair with
``ExactMatchGrader`` for the recipe's strict equality scoring.
"""

def __init__(self, run: BenchmarkRun, **kwargs) -> None:
def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
super().__init__(**kwargs)
if not _HAS_DEEPEVAL:
raise RuntimeError(_MISSING_DEEPEVAL_HINT)
self.run = run

async def load_problems(
self, tasks: list[str] | None, n_shots: int, enable_cot: bool
) -> list[BenchmarkProblem]:
raise NotImplementedError(
"bigbench benchmark is not yet implemented; only 'mmlu' is available in this release."
)
"""Load BBH problems and format them DeepEval-style.

Args:
tasks: Subtask names (lower-snake-case enum values like
``boolean_expressions`` or upper-snake-case enum names
like ``BOOLEAN_EXPRESSIONS``). ``None`` / ``["all"]``
selects every subtask. Unknown names raise.
n_shots: 0..3 (DeepEval asserts ``n_shots <= 3`` because
the canonical prompt files ship exactly 3 examples).
enable_cot: When True (the DeepEval default), use the
bundled CoT prompt files; when False, use the non-CoT
``shot_prompts/`` files.

Returns:
One ``BenchmarkProblem`` per row across all selected
subtasks. ``task`` is the subtask name so results
aggregate per-subtask.
"""
if n_shots > MAX_N_SHOTS:
raise ValueError(
f"BBH supports at most {MAX_N_SHOTS} few-shot examples "
f"(got {n_shots}); DeepEval asserts ``n_shots <= 3`` "
f"because the canonical prompt files ship exactly "
f"{MAX_N_SHOTS} worked examples per subtask."
)
task_enums = _resolve_tasks(tasks)
problems: list[BenchmarkProblem] = []
for task in task_enums:
ds: Dataset = await asyncio.to_thread(
load_dataset, DATASET_NAME, task.value
)
sub_problems = await asyncio.to_thread(
self._build_subtask_problems,
ds["test"],
task,
n_shots,
enable_cot,
)
problems.extend(sub_problems)
return problems

def _build_subtask_problems(
self,
ds: Any,
task: Any,
n_shots: int,
enable_cot: bool,
) -> list[BenchmarkProblem]:
problems: list[BenchmarkProblem] = []
for row in ds:
template_prompt = BigBenchHardTemplate.generate_output(
input=row[INPUT_FIELD],
task=task,
n_shots=n_shots,
enable_cot=enable_cot,
)
prompt = f"{template_prompt}{bbh_confinement_statements_dict[task]}"
messages: list[AccuracyChatMessage] = [{"role": "user", "content": prompt}]
problems.append(
BenchmarkProblem(
prompt=prompt,
# ``BenchmarkProblem.ground_truth`` is typed ``str`` in
# strict mode; the upstream BBH schema stores targets
# as strings today, but coerce defensively so a future
# numeric column doesn't break the loader. Mirrors
# DeepEval's ``str(expected_output)`` in its grader.
ground_truth=str(row[TARGET_FIELD]),
task=task.value,
metadata={
"bbh_task": task.value,
"confinement": bbh_confinement_statements_dict.get(task, ""),
"generation_size": DEFAULT_GENERATION_SIZE,
},
raw_messages=messages,
)
)
return problems
11 changes: 8 additions & 3 deletions src/aiperf/plugin/plugins.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -1240,12 +1240,17 @@ accuracy_benchmark:
bigbench:
class: aiperf.accuracy.benchmarks.bigbench:BigBenchBenchmark
description: |
BigBench benchmark for diverse language understanding tasks spanning
linguistics, reasoning, and world knowledge.
BigBench-Hard benchmark, aligned with the trt-llm benchmark recipe's
DeepEval-backed configuration. Prompts are byte-equal to
``deepeval.benchmarks.BigBenchHard`` (n_shots=3, enable_cot=True,
using DeepEval's bundled per-subtask CoT prompt files). Pairs with
``exact_match`` for the recipe's strict ``Scorer.exact_match_score``
semantics. Requires the ``[accuracy]`` install (deepeval ships the
27 canonical prompt files).
metadata:
default_grader: exact_match
default_n_shots: 3
is_implemented: false
default_enable_cot: true

aime24:
class: aiperf.accuracy.benchmarks.aime24:AIME24Benchmark
Expand Down
Loading
Loading