diff --git a/docs/accuracy/accuracy-benchmarking.md b/docs/accuracy/accuracy-benchmarking.md
index d64ad7fab..c58b7ac96 100644
--- a/docs/accuracy/accuracy-benchmarking.md
+++ b/docs/accuracy/accuracy-benchmarking.md
@@ -73,6 +73,7 @@ system message).
| `mmlu` | `multiple_choice` | 5 | `lighteval/mmlu` (57 subjects) |
| `aime` | `math` | 8 | `Maxwell-Jia/AIME_2024` (trt-llm reference, 8-shot CoT) |
| `hellaswag` | `exact_match` | 10 | `Rowan/hellaswag` (trt-llm/DeepEval reference; one few-shot per unique activity_label) |
+| `bigbench` | `exact_match` | 3 | `lukaemon/bbh` (trt-llm/DeepEval reference; 27 subtasks, canonical CoT/non-CoT prompt files) |
## CLI Flags
diff --git a/docs/accuracy/accuracy_stubs.md b/docs/accuracy/accuracy_stubs.md
index ea85ea2e7..ad8e8c71c 100644
--- a/docs/accuracy/accuracy_stubs.md
+++ b/docs/accuracy/accuracy_stubs.md
@@ -7,7 +7,7 @@
This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected.
-**Status summary:** With the HellaSwag loader landing on top of AIP-874, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, and `HellaSwagBenchmark` are fully implemented; the remaining benchmarks (`bigbench`, `aime24`, `aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
+**Status summary:** With the BigBench-Hard loader landing on top of the HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, and `BigBenchBenchmark` are fully implemented; the remaining benchmarks (`aime24`, `aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
## Table of Contents
@@ -172,17 +172,17 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method.
| 1 | `MMLUBenchmark` | `benchmarks/mmlu.py` | `mmlu` | `multiple_choice` | 5 | **IMPLEMENTED in PR #815** — canonical reference for new benchmarks. Downloads via HuggingFace datasets, handles few-shot formatting and CoT. |
| 2 | `AIMEBenchmark` | `benchmarks/aime.py` | `aime` | `math` | 8 | **IMPLEMENTED.** Loads `Maxwell-Jia/AIME_2024`, instructs the model to wrap its final integer in `\boxed{}`, supports few-shot priming and chain-of-thought. `default_enable_cot=true`. |
| 3 | `HellaSwagBenchmark` | `benchmarks/hellaswag.py` | `hellaswag` | `exact_match` | 10 | **IMPLEMENTED.** Loads `Rowan/hellaswag` (validation split filtered per task by `activity_label`; train split feeds the "one few-shot per unique activity_label" rule). Prompt rendering delegates to `deepeval.benchmarks.HellaSwag`'s `HellaSwagTemplate.generate_output`, so output is byte-equal to the trt-llm recipe's DeepEval-backed path. Pairs with `exact_match` for strict `Scorer.exact_match_score` semantics. Requires the `[accuracy]` extras (deepeval). |
+| 4 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 | **IMPLEMENTED.** Loads `lukaemon/bbh` (27 BBH subtasks). Prompt rendering delegates to `deepeval.benchmarks.BigBenchHard`'s `BigBenchHardTemplate.generate_output`, which reads the 27 canonical CoT/shot prompt files DeepEval ships as package data. Pairs with `exact_match` for the recipe's strict `Scorer.exact_match_score` semantics. `default_n_shots=3`, `default_enable_cot=true`. Requires the `[accuracy]` extras (deepeval). |
### Still Stubbed
| # | Class | File | Plugin Key | Default Grader | Default N-Shots |
|---|-------|------|------------|----------------|-----------------|
-| 1 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 |
-| 2 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `math` | 0 |
-| 3 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
-| 4 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
-| 5 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
-| 6 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
+| 1 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `math` | 0 |
+| 2 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
+| 3 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
+| 4 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
+| 5 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
**Each benchmark has 1 method to implement:**
@@ -308,14 +308,14 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu
| Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods |
|-----------|-------------|---------------|------------------|-------------------|
-| Graders | 1 (`MultipleChoiceGrader`) | 3 | 2 (`grade`, `extract_answer`) | 6 |
-| Benchmarks | 1 (`MMLUBenchmark`) | 8 | 1 (`load_problems`) | 8 |
+| Graders | 7 (all) | 0 | — | 0 |
+| Benchmarks | 4 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`) | 5 | 1 (`load_problems`) | 5 |
| Record Processor | 1 (`AccuracyRecordProcessor`) | 0 | — | 0 |
| Results Processor | 1 (`AccuracyResultsProcessor`) | 0 | — | 0 |
| Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 | — | 0 |
| Data Exporter | 1 (`AccuracyDataExporter`) | 0 | — | 0 |
| Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 |
-| **Total** | **6** | **13** | | **15** |
+| **Total** | **15** | **6** | | **6** |
### Self-Disabling Pattern
@@ -323,11 +323,12 @@ Processors and exporters raise their `Disabled` exception **in `__init__`** when
### Suggested Implementation Order
-The processors, exporters, and one grader/benchmark pair are already wired end-to-end. Start from the already-working pipeline:
+The processors, exporters, all graders, and four benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`) are already wired end-to-end. The remaining work is the five stub benchmarks; mirror the existing loader whose grader matches:
-1. **Graders** — use `MultipleChoiceGrader` as reference; implement `ExactMatchGrader` next (simplest), then `MathGrader`
-2. **Benchmarks** — use `MMLUBenchmark` as reference; implement dataset loading for each remaining benchmark
-3. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` when a benchmark or grader moves from stubbed to supported
+1. **`aime24`, `aime25`, `math_500`** — mirror `AIMEBenchmark` (`benchmarks/aime.py`); pair with the `math` grader.
+2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `multiple_choice` grader.
+3. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
+4. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.
### Key Files for Reference
diff --git a/pyproject.toml b/pyproject.toml
index b4fb39ed4..643211651 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -219,6 +219,7 @@ markers = [
"server_unit: marks tests as unit tests for the mock server",
"fern: marks tests that validate Fern documentation (requires fern CLI)",
"network: marks tests that require network access",
+ "requires_deepeval: tests that need the real deepeval install (i.e. the [accuracy] extras) — skipped when only the fake-deepeval harness is registered",
]
# Better console output
console_output_style = "progress"
diff --git a/src/aiperf/accuracy/benchmarks/bigbench.py b/src/aiperf/accuracy/benchmarks/bigbench.py
index cf838f35f..07ede45ad 100644
--- a/src/aiperf/accuracy/benchmarks/bigbench.py
+++ b/src/aiperf/accuracy/benchmarks/bigbench.py
@@ -1,31 +1,228 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
+"""BigBench-Hard benchmark loader, aligned with the trt-llm DeepEval reference.
+
+The trt-llm benchmark recipe routes ``bigbench`` through DeepEval's
+``deepeval.benchmarks.BigBenchHard`` class
+(``trt-llm-benchmark-recipe/src/tools/acc_benchmark.py:338-356``). This
+loader produces prompts byte-equal to what DeepEval's
+``BigBenchHardTemplate.generate_output`` produces, by importing and
+calling that template directly. The 27 canonical CoT and non-CoT
+prompt files (one per BBH subtask) ship inside DeepEval as package
+data — DeepEval's template reads them via ``importlib.resources`` at
+load time.
+
+Pair with ``ExactMatchGrader`` for strict ``pred.strip() ==
+gold.strip()`` semantics matching DeepEval's
+``Scorer.exact_match_score``.
+
+Reference:
+ deepeval/benchmarks/big_bench_hard/big_bench_hard.py
+ deepeval/benchmarks/big_bench_hard/template.py
+ deepeval/benchmarks/big_bench_hard/cot_prompts/*.txt (27 files)
+ deepeval/benchmarks/big_bench_hard/shot_prompts/*.txt (27 files)
+ trt-llm-benchmark-recipe/src/tools/acc_benchmark.py:338
+"""
+
from __future__ import annotations
-from typing import TYPE_CHECKING
+import asyncio
+from typing import TYPE_CHECKING, Any
-from aiperf.accuracy.models import BenchmarkProblem
+from datasets import Dataset, load_dataset
+
+from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem
from aiperf.common.mixins import AIPerfLoggerMixin
if TYPE_CHECKING:
from aiperf.config.resolution.plan import BenchmarkRun
+try:
+ from deepeval.benchmarks.big_bench_hard.big_bench_hard import (
+ bbh_confinement_statements_dict,
+ )
+ from deepeval.benchmarks.big_bench_hard.task import BigBenchHardTask
+ from deepeval.benchmarks.big_bench_hard.template import (
+ BigBenchHardTemplate,
+ )
+
+ _HAS_DEEPEVAL = True
+except ImportError: # pragma: no cover - exercised only without optional dep
+ _HAS_DEEPEVAL = False
+ BigBenchHardTask = None # type: ignore[assignment]
+ BigBenchHardTemplate = None # type: ignore[assignment]
+ bbh_confinement_statements_dict = None # type: ignore[assignment]
+
+
+_MISSING_DEEPEVAL_HINT = (
+ "deepeval is not installed; BigBench-Hard's prompt templates and "
+ "the per-task confinement dict (the trt-llm reference) cannot be "
+ "loaded. Install with: uv pip install 'aiperf[accuracy]'."
+)
+
+DATASET_NAME = "lukaemon/bbh"
+TASK_NAME = "bigbench"
+
+# DeepEval's BigBenchHard caps n_shots at 3 (the canonical CoT files
+# only contain 3 worked examples each). We mirror both bounds.
+DEFAULT_N_SHOTS = 3
+MAX_N_SHOTS = 3
+
+# DeepEval's BigBenchHard default is ``enable_cot=True``.
+DEFAULT_ENABLE_COT = True
+
+# CoT solutions can run several hundred tokens; non-CoT answers are
+# typically a single bare token. 1024 covers both with headroom.
+DEFAULT_GENERATION_SIZE = 1024
+
+# Schema field names in lukaemon/bbh.
+INPUT_FIELD = "input"
+TARGET_FIELD = "target"
+
+
+def _resolve_tasks(tasks: list[str] | None) -> list[Any]:
+ """Convert ``--accuracy-tasks`` strings to ``BigBenchHardTask`` enums.
+
+ DeepEval evaluates one task at a time. Aiperf accepts either:
+ - ``None`` / empty / ``["all"]`` (case-insensitive) → every
+ BigBenchHardTask enum (27 subtasks).
+ - Lower-snake-case strings matching the enum's ``value``
+ (e.g. ``"boolean_expressions"``).
+ - Upper-snake-case enum names (e.g. ``"BOOLEAN_EXPRESSIONS"``)
+ for parity with the recipe's ``getattr(BigBenchHardTask,
+ task_name.upper(), None)`` lookup.
+
+ Mixing ``"all"`` with other task names is rejected so typos like
+ ``["all", "NOT_A_TASK"]`` don't silently bypass validation — that
+ used to slip through and return every task while swallowing the
+ invalid entry (the parallel HellaSwag bug fixed in AIP-877).
+
+ Unknown names raise ``ValueError`` with the full valid list so
+ typos fail loudly.
+ """
+ if not tasks:
+ return list(BigBenchHardTask)
+ lowered = [t.lower() for t in tasks]
+ if "all" in lowered:
+ if lowered == ["all"]:
+ return list(BigBenchHardTask)
+ raise ValueError(
+ "'all' cannot be mixed with other task names. Pass 'all' "
+ "by itself (or omit --accuracy-tasks) to select every BBH "
+ f"subtask, or list specific subtasks. Got: {tasks!r}"
+ )
+ valid_values = {t.value for t in BigBenchHardTask}
+ resolved: list[Any] = []
+ unknown: list[str] = []
+ for name in tasks:
+ if name in valid_values:
+ resolved.append(next(t for t in BigBenchHardTask if t.value == name))
+ continue
+ enum_member = getattr(BigBenchHardTask, name.upper(), None)
+ if enum_member is not None:
+ resolved.append(enum_member)
+ else:
+ unknown.append(name)
+ if unknown:
+ raise ValueError(
+ f"Unknown BBH subtask(s): {unknown}. Valid subtasks: {sorted(valid_values)}"
+ )
+ return resolved
+
class BigBenchBenchmark(AIPerfLoggerMixin):
- """Registered placeholder for a future BigBench loader.
+ """BigBench-Hard benchmark loader, byte-equal to DeepEval's prompts.
- `load_problems()` intentionally raises NotImplementedError in this release;
- use the MMLU benchmark when a working accuracy loader is required.
+ Iterates the requested BBH subtasks and renders each problem's
+ prompt via ``BigBenchHardTemplate.generate_output`` (which reads
+ DeepEval's bundled CoT/shot prompt files). Pair with
+ ``ExactMatchGrader`` for the recipe's strict equality scoring.
"""
- def __init__(self, run: BenchmarkRun, **kwargs) -> None:
+ def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
super().__init__(**kwargs)
+ if not _HAS_DEEPEVAL:
+ raise RuntimeError(_MISSING_DEEPEVAL_HINT)
self.run = run
async def load_problems(
self, tasks: list[str] | None, n_shots: int, enable_cot: bool
) -> list[BenchmarkProblem]:
- raise NotImplementedError(
- "bigbench benchmark is not yet implemented; only 'mmlu' is available in this release."
- )
+ """Load BBH problems and format them DeepEval-style.
+
+ Args:
+ tasks: Subtask names (lower-snake-case enum values like
+ ``boolean_expressions`` or upper-snake-case enum names
+ like ``BOOLEAN_EXPRESSIONS``). ``None`` / ``["all"]``
+ selects every subtask. Unknown names raise.
+ n_shots: 0..3 (DeepEval asserts ``n_shots <= 3`` because
+ the canonical prompt files ship exactly 3 examples).
+ enable_cot: When True (the DeepEval default), use the
+ bundled CoT prompt files; when False, use the non-CoT
+ ``shot_prompts/`` files.
+
+ Returns:
+ One ``BenchmarkProblem`` per row across all selected
+ subtasks. ``task`` is the subtask name so results
+ aggregate per-subtask.
+ """
+ if n_shots > MAX_N_SHOTS:
+ raise ValueError(
+ f"BBH supports at most {MAX_N_SHOTS} few-shot examples "
+ f"(got {n_shots}); DeepEval asserts ``n_shots <= 3`` "
+ f"because the canonical prompt files ship exactly "
+ f"{MAX_N_SHOTS} worked examples per subtask."
+ )
+ task_enums = _resolve_tasks(tasks)
+ problems: list[BenchmarkProblem] = []
+ for task in task_enums:
+ ds: Dataset = await asyncio.to_thread(
+ load_dataset, DATASET_NAME, task.value
+ )
+ sub_problems = await asyncio.to_thread(
+ self._build_subtask_problems,
+ ds["test"],
+ task,
+ n_shots,
+ enable_cot,
+ )
+ problems.extend(sub_problems)
+ return problems
+
+ def _build_subtask_problems(
+ self,
+ ds: Any,
+ task: Any,
+ n_shots: int,
+ enable_cot: bool,
+ ) -> list[BenchmarkProblem]:
+ problems: list[BenchmarkProblem] = []
+ for row in ds:
+ template_prompt = BigBenchHardTemplate.generate_output(
+ input=row[INPUT_FIELD],
+ task=task,
+ n_shots=n_shots,
+ enable_cot=enable_cot,
+ )
+ prompt = f"{template_prompt}{bbh_confinement_statements_dict[task]}"
+ messages: list[AccuracyChatMessage] = [{"role": "user", "content": prompt}]
+ problems.append(
+ BenchmarkProblem(
+ prompt=prompt,
+ # ``BenchmarkProblem.ground_truth`` is typed ``str`` in
+ # strict mode; the upstream BBH schema stores targets
+ # as strings today, but coerce defensively so a future
+ # numeric column doesn't break the loader. Mirrors
+ # DeepEval's ``str(expected_output)`` in its grader.
+ ground_truth=str(row[TARGET_FIELD]),
+ task=task.value,
+ metadata={
+ "bbh_task": task.value,
+ "confinement": bbh_confinement_statements_dict.get(task, ""),
+ "generation_size": DEFAULT_GENERATION_SIZE,
+ },
+ raw_messages=messages,
+ )
+ )
+ return problems
diff --git a/src/aiperf/plugin/plugins.yaml b/src/aiperf/plugin/plugins.yaml
index c2c35f3f3..b11c97a2d 100644
--- a/src/aiperf/plugin/plugins.yaml
+++ b/src/aiperf/plugin/plugins.yaml
@@ -1240,12 +1240,17 @@ accuracy_benchmark:
bigbench:
class: aiperf.accuracy.benchmarks.bigbench:BigBenchBenchmark
description: |
- BigBench benchmark for diverse language understanding tasks spanning
- linguistics, reasoning, and world knowledge.
+ BigBench-Hard benchmark, aligned with the trt-llm benchmark recipe's
+ DeepEval-backed configuration. Prompts are byte-equal to
+ ``deepeval.benchmarks.BigBenchHard`` (n_shots=3, enable_cot=True,
+ using DeepEval's bundled per-subtask CoT prompt files). Pairs with
+ ``exact_match`` for the recipe's strict ``Scorer.exact_match_score``
+ semantics. Requires the ``[accuracy]`` install (deepeval ships the
+ 27 canonical prompt files).
metadata:
default_grader: exact_match
default_n_shots: 3
- is_implemented: false
+ default_enable_cot: true
aime24:
class: aiperf.accuracy.benchmarks.aime24:AIME24Benchmark
diff --git a/tests/harness/fake_deepeval.py b/tests/harness/fake_deepeval.py
new file mode 100644
index 000000000..830f2e60d
--- /dev/null
+++ b/tests/harness/fake_deepeval.py
@@ -0,0 +1,141 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""Minimal stand-in for the ``deepeval.benchmarks.big_bench_hard`` subtree.
+
+The ``[accuracy]`` extras (deepeval, lighteval, torch, transformers, ...) add
+roughly 1 GiB to the install footprint. Installing them in the default CI
+matrix would dominate setup time, so the unit-test job runs without them and
+every test that touches a real-deepeval-only contract is opt-in.
+
+This module re-creates just enough of deepeval's surface for the BigBench
+loader tests to exercise their logic against a synthetic but deterministic
+prompt template. Tests that pin byte-equality against deepeval's bundled
+CoT/shot ``.txt`` files still need the real install and are marked
+``@pytest.mark.requires_deepeval``.
+
+Wiring: ``tests/unit/accuracy/conftest.py`` patches the bigbench loader's
+module-level deepeval names with these fakes per-test (autouse, function
+scope) when the real deepeval isn't importable. We deliberately do *not*
+inject into ``sys.modules`` so adjacent tests like HellaSwag's continue
+to use their own ``pytest.importorskip("deepeval")`` skip mechanism
+without interference.
+
+Three names need to be present:
+
+- ``BigBenchHardTask`` — enum of 27 BBH subtasks. Mirrors the real values
+ one-for-one so resolver tests using ``BOOLEAN_EXPRESSIONS`` /
+ ``boolean_expressions`` continue to work.
+- ``BigBenchHardTemplate.generate_output(input, task, n_shots, enable_cot)``
+ — returns a synthetic prompt. The structure is deliberately
+ *not* byte-equal to the real upstream output; tests that need that
+ contract are marked ``requires_deepeval``. The format does honour
+ ``n_shots`` (longer prompt with more shots) and ``enable_cot`` (CoT
+ prompts contain "Let's think step by step.") so
+ ``test_more_shots_make_longer_prompt`` and similar loader-behavior tests
+ pass against the fake.
+- ``bbh_confinement_statements_dict`` — task→confinement-string mapping,
+ mirrored from upstream so the per-task confinement assertions stay
+ meaningful without needing the real install.
+"""
+
+from __future__ import annotations
+
+import enum
+
+
+class BigBenchHardTask(enum.Enum):
+ """27 BBH subtasks. Values mirror the real deepeval enum exactly."""
+
+ BOOLEAN_EXPRESSIONS = "boolean_expressions"
+ CAUSAL_JUDGEMENT = "causal_judgement"
+ DATE_UNDERSTANDING = "date_understanding"
+ DISAMBIGUATION_QA = "disambiguation_qa"
+ DYCK_LANGUAGES = "dyck_languages"
+ FORMAL_FALLACIES = "formal_fallacies"
+ GEOMETRIC_SHAPES = "geometric_shapes"
+ HYPERBATON = "hyperbaton"
+ LOGICAL_DEDUCTION_FIVE_OBJECTS = "logical_deduction_five_objects"
+ LOGICAL_DEDUCTION_SEVEN_OBJECTS = "logical_deduction_seven_objects"
+ LOGICAL_DEDUCTION_THREE_OBJECTS = "logical_deduction_three_objects"
+ MOVIE_RECOMMENDATION = "movie_recommendation"
+ MULTISTEP_ARITHMETIC_TWO = "multistep_arithmetic_two"
+ NAVIGATE = "navigate"
+ OBJECT_COUNTING = "object_counting"
+ PENGUINS_IN_A_TABLE = "penguins_in_a_table"
+ REASONING_ABOUT_COLORED_OBJECTS = "reasoning_about_colored_objects"
+ RUIN_NAMES = "ruin_names"
+ SALIENT_TRANSLATION_ERROR_DETECTION = "salient_translation_error_detection"
+ SNARKS = "snarks"
+ SPORTS_UNDERSTANDING = "sports_understanding"
+ TEMPORAL_SEQUENCES = "temporal_sequences"
+ TRACKING_SHUFFLED_OBJECTS_FIVE_OBJECTS = "tracking_shuffled_objects_five_objects"
+ TRACKING_SHUFFLED_OBJECTS_SEVEN_OBJECTS = "tracking_shuffled_objects_seven_objects"
+ TRACKING_SHUFFLED_OBJECTS_THREE_OBJECTS = "tracking_shuffled_objects_three_objects"
+ WEB_OF_LIES = "web_of_lies"
+ WORD_SORTING = "word_sorting"
+
+
+# Mirrored verbatim from the real ``bbh_confinement_statements_dict``.
+# Stable upstream data; resync if deepeval ever changes a string.
+bbh_confinement_statements_dict: dict[BigBenchHardTask, str] = {
+ BigBenchHardTask.BOOLEAN_EXPRESSIONS: "\n\nOutput 'True' or 'False'. Full answer not needed.",
+ BigBenchHardTask.CAUSAL_JUDGEMENT: "\n\nOutput 'Yes' or 'No'. Full answer not needed.",
+ BigBenchHardTask.DATE_UNDERSTANDING: "\n\nOutput '(A)', '(B)', '(C)', '(D)', '(E)', or '(F)'. Full answer not needed.",
+ BigBenchHardTask.DISAMBIGUATION_QA: "\n\nOutput '(A)', '(B)', or '(C)'. Full answer not needed.",
+ BigBenchHardTask.DYCK_LANGUAGES: "\n\nOutput only the sequence of parentheses characters separated by white space. Full answer not needed.",
+ BigBenchHardTask.FORMAL_FALLACIES: "\n\nOutput 'invalid' or 'valid'. Full answer not needed.",
+ BigBenchHardTask.GEOMETRIC_SHAPES: "\n\nOutput '(A)', '(B)', '(C)', '(D)', '(E)', '(F)', '(G)', '(H)', '(I)', '(J)', or '(K)'. Full answer not needed.",
+ BigBenchHardTask.HYPERBATON: "\n\nOutput '(A)' or'(B)'. Full answer not needed.",
+ BigBenchHardTask.LOGICAL_DEDUCTION_FIVE_OBJECTS: "\n\nOutput '(A)', '(B)', '(C)', '(D)', or '(E)'. Full answer not needed.",
+ BigBenchHardTask.LOGICAL_DEDUCTION_SEVEN_OBJECTS: "\n\nOutput '(A)', '(B)', '(C)', '(D)', '(E)', '(F)', or '(G)'. Full answer not needed.",
+ BigBenchHardTask.LOGICAL_DEDUCTION_THREE_OBJECTS: "\n\nOutput '(A)', '(B)', or '(C)'. Full answer not needed.",
+ BigBenchHardTask.MOVIE_RECOMMENDATION: "\n\nOutput '(A)', '(B)', '(C)', '(D)', or '(E)'. Full answer not needed.",
+ BigBenchHardTask.MULTISTEP_ARITHMETIC_TWO: "\n\nOutput the numerical answer. Full answer not needed.",
+ BigBenchHardTask.NAVIGATE: "\n\nOutput 'Yes' or 'No'. Full answer not needed.",
+ BigBenchHardTask.OBJECT_COUNTING: "\n\nOutput the numerical answer. Full answer not needed.",
+ BigBenchHardTask.PENGUINS_IN_A_TABLE: "\n\nOutput '(A)', '(B)', '(C)', '(D)', or '(E)'. Full answer not needed.",
+ BigBenchHardTask.REASONING_ABOUT_COLORED_OBJECTS: "\n\nOutput '(A)', '(B)', '(C)', '(D)', '(E)', '(F)', '(G)', '(H)', '(I)', '(J)', '(K)', '(L)', '(M)', '(N)', '(O)', '(P)', '(Q)', or '(R)'. Full answer not needed.",
+ BigBenchHardTask.RUIN_NAMES: "\n\nOutput '(A)', '(B)', '(C)', or '(D)'. Full answer not needed.",
+ BigBenchHardTask.SALIENT_TRANSLATION_ERROR_DETECTION: "\n\nOutput '(A)', '(B)', '(C)', '(D)', '(E)', or '(F)'. Full answer not needed.",
+ BigBenchHardTask.SNARKS: "\n\nOutput '(A)' or'(B)'. Full answer not needed.",
+ BigBenchHardTask.SPORTS_UNDERSTANDING: "\n\nOutput 'yes' or 'no'. Full answer not needed.",
+ BigBenchHardTask.TEMPORAL_SEQUENCES: "\n\nOutput '(A)', '(B)', '(C)', or '(D)'. Full answer not needed.",
+ BigBenchHardTask.TRACKING_SHUFFLED_OBJECTS_FIVE_OBJECTS: "\n\nOutput '(A)', '(B)', '(C)', '(D)', or '(E)'. Full answer not needed.",
+ BigBenchHardTask.TRACKING_SHUFFLED_OBJECTS_SEVEN_OBJECTS: "\n\nOutput '(A)', '(B)', '(C)', '(D)', '(E)', '(F)', or '(G)'. Full answer not needed.",
+ BigBenchHardTask.TRACKING_SHUFFLED_OBJECTS_THREE_OBJECTS: "\n\nOutput '(A)', '(B)', or '(C)'. Full answer not needed.",
+ BigBenchHardTask.WEB_OF_LIES: "\n\nOutput 'Yes' or 'No'. Full answer not needed.",
+ BigBenchHardTask.WORD_SORTING: "\n\nOutput only the sequence of words separated by white space. Full answer not needed.",
+}
+
+
+class BigBenchHardTemplate:
+ """Synthetic stand-in for ``deepeval``'s prompt template.
+
+ Output structure is deliberately *not* byte-equal to upstream. Tests
+ that need byte-equality are tagged ``requires_deepeval`` and skip
+ without the real install. The fake honours these contracts so
+ loader-behavior tests still pass:
+
+ - More ``n_shots`` produces a strictly longer prompt.
+ - ``enable_cot=True`` produces a prompt containing ``"step by step"``;
+ ``enable_cot=False`` does not.
+ - The trailing ``"Q: {input}\\nA: "`` matches the real template
+ well enough for the "query is at the end" assertion shape.
+ """
+
+ @classmethod
+ def generate_output(
+ cls,
+ input: str, # noqa: A002 - mirrors upstream kw name
+ task: BigBenchHardTask,
+ n_shots: int,
+ enable_cot: bool,
+ ) -> str:
+ header = f"Task description: [fake] subtask={task.value}."
+ shot_marker = (
+ "\n[fake CoT shot] Let's think step by step.\n"
+ if enable_cot
+ else "\n[fake shot]\n"
+ )
+ shots = shot_marker * n_shots
+ return f"{header}{shots}\n\nQ: {input}\nA: "
diff --git a/tests/unit/accuracy/conftest.py b/tests/unit/accuracy/conftest.py
new file mode 100644
index 000000000..2a7c39481
--- /dev/null
+++ b/tests/unit/accuracy/conftest.py
@@ -0,0 +1,92 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""Accuracy-scoped fixtures.
+
+Carries the fake-deepeval wiring used by the BigBench loader tests so they
+can run in CI without the ``[accuracy]`` extras (which add ~1 GiB and are
+not installed in the default unit-test job).
+
+Two pieces:
+
+- ``_patch_bigbench_deepeval_names`` is an autouse fixture that swaps the
+ bigbench loader's deepeval-imported module attributes for the fake
+ stand-ins. Active only when the real deepeval isn't importable, so the
+ real install wins locally / in any job that opts into ``[accuracy]``.
+ Scoped per-test (function-scope ``monkeypatch``) so it doesn't leak
+ into adjacent tests like HellaSwag, which still use the existing
+ ``pytest.importorskip("deepeval")`` skip mechanism.
+- ``pytest_collection_modifyitems`` skips tests tagged
+ ``@pytest.mark.requires_deepeval`` when only the fake is available —
+ used for byte-equal-prompt assertions that depend on deepeval's
+ bundled ``.txt`` prompt files which the fake doesn't reproduce.
+"""
+
+from __future__ import annotations
+
+import pytest
+
+from tests.harness import fake_deepeval
+
+
+def _real_deepeval_available() -> bool:
+ """True iff the real deepeval (with bundled CoT/shot prompt files) is
+ importable. The fake harness does not satisfy this — it lives under
+ ``tests.harness``."""
+ try:
+ import deepeval.benchmarks.big_bench_hard.template as _t # noqa: F401
+
+ return True
+ except ImportError:
+ return False
+
+
+def pytest_collection_modifyitems(config, items):
+ """Skip ``@pytest.mark.requires_deepeval`` items when the real
+ deepeval install isn't available."""
+ if _real_deepeval_available():
+ return
+ skip_mark = pytest.mark.skip(
+ reason="requires the real deepeval install ([accuracy] extras); "
+ "the fake-deepeval harness cannot reproduce upstream prompt bytes."
+ )
+ for item in items:
+ if "requires_deepeval" in item.keywords:
+ item.add_marker(skip_mark)
+
+
+@pytest.fixture(autouse=True)
+def _patch_bigbench_deepeval_names(request, monkeypatch):
+ """Swap ``bigbench.py``'s deepeval-imported names for the fake when
+ the real install isn't present.
+
+ ``bigbench.py``'s top-level ``try / except ImportError`` already
+ binds the four affected names (``_HAS_DEEPEVAL``, ``BigBenchHardTask``,
+ ``BigBenchHardTemplate``, ``bbh_confinement_statements_dict``) to
+ ``False`` / ``None`` when deepeval is missing. This fixture patches
+ them per-test to the harness fakes so loader tests can run.
+
+ Skipped (no patching) when the real deepeval is importable so the
+ real upstream behavior is exercised end-to-end in ``[accuracy]``
+ environments.
+ """
+ if _real_deepeval_available():
+ return
+ try:
+ import aiperf.accuracy.benchmarks.bigbench as bigbench_mod
+ except ImportError:
+ # bigbench.py couldn't load at all — nothing to patch. Tests
+ # that need it will fail loudly on import, which is what we
+ # want.
+ return
+ monkeypatch.setattr(bigbench_mod, "_HAS_DEEPEVAL", True)
+ monkeypatch.setattr(
+ bigbench_mod, "BigBenchHardTask", fake_deepeval.BigBenchHardTask
+ )
+ monkeypatch.setattr(
+ bigbench_mod, "BigBenchHardTemplate", fake_deepeval.BigBenchHardTemplate
+ )
+ monkeypatch.setattr(
+ bigbench_mod,
+ "bbh_confinement_statements_dict",
+ fake_deepeval.bbh_confinement_statements_dict,
+ )
diff --git a/tests/unit/accuracy/test_accuracy_config.py b/tests/unit/accuracy/test_accuracy_config.py
index 509565d26..5e78125ce 100644
--- a/tests/unit/accuracy/test_accuracy_config.py
+++ b/tests/unit/accuracy/test_accuracy_config.py
@@ -23,7 +23,6 @@
# This branch (AIP-874) implements ``aime``, ``math``, and ``code_execution``,
# so those names are absent from the stub lists.
STUB_BENCHMARKS = (
- "bigbench",
"aime24",
"aime25",
"math_500",
@@ -87,8 +86,8 @@ def test_accuracyconfig_with_uppercase_stub_name_raises_validationerror(
) -> None:
"""Case-insensitive enum lookup must not bypass the validator."""
with pytest.raises(ValidationError) as exc:
- AccuracyConfig(benchmark="BIGBENCH")
- assert "bigbench" in str(exc.value)
+ AccuracyConfig(benchmark="LCB_CODEGENERATION")
+ assert "lcb_codegeneration" in str(exc.value)
class TestRejectsStubGrader:
diff --git a/tests/unit/accuracy/test_bigbench_benchmark.py b/tests/unit/accuracy/test_bigbench_benchmark.py
new file mode 100644
index 000000000..c45a16b8a
--- /dev/null
+++ b/tests/unit/accuracy/test_bigbench_benchmark.py
@@ -0,0 +1,626 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""Unit tests for ``BigBenchBenchmark`` after DeepEval alignment.
+
+Pins:
+1. Prompt is byte-equal to ``deepeval.benchmarks.BigBenchHard``'s
+ ``BigBenchHardTemplate.generate_output`` output (which itself
+ reads the canonical CoT/non-CoT prompt files DeepEval ships).
+2. ``ground_truth`` is the bare ``target`` string from
+ ``lukaemon/bbh`` (DeepEval's convention for exact_match_score).
+3. ``confinement`` carried in metadata maps per-task to the right
+ "Output 'X' or 'Y'..." string.
+4. Per-task task field so the accuracy CSV breaks down per BBH
+ subtask.
+
+Most tests in this file run against ``tests.harness.fake_deepeval`` — a
+small stand-in that mirrors the 27-task enum and confinement dict
+exactly but generates a synthetic (non-byte-equal) prompt template.
+Tests that pin the real upstream prompt bytes are marked
+``@pytest.mark.requires_deepeval`` and skip when only the fake is
+registered. The fake is wired in ``tests/unit/accuracy/conftest.py``
+(autouse, function scope) so the ``aiperf[accuracy]`` extras are no
+longer a hard prerequisite for running this file.
+"""
+
+from __future__ import annotations
+
+from collections.abc import Callable
+from typing import TYPE_CHECKING, Any
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from aiperf.accuracy.benchmarks.bigbench import (
+ DEFAULT_ENABLE_COT,
+ DEFAULT_GENERATION_SIZE,
+ DEFAULT_N_SHOTS,
+ MAX_N_SHOTS,
+ BigBenchBenchmark,
+ _resolve_tasks,
+)
+from aiperf.plugin.enums import AccuracyBenchmarkType, EndpointType
+from tests.unit.conftest import make_benchmark_run
+
+if TYPE_CHECKING:
+ from aiperf.config import BenchmarkRun
+
+
+def _make_run() -> BenchmarkRun:
+ return make_benchmark_run(
+ model_names=["test-model"],
+ endpoint_type=EndpointType.COMPLETIONS,
+ streaming=False,
+ accuracy={"benchmark": AccuracyBenchmarkType.BIGBENCH},
+ )
+
+
+def _make_row(input_text: str = "What is 2+2?", target: str = "4") -> dict[str, Any]:
+ return {"input": input_text, "target": target}
+
+
+def _make_fake_dataset(rows: list[dict[str, Any]]) -> dict[str, Any]:
+ """Mock ``load_dataset`` return value (a dict-like with split keys)."""
+ test_split = MagicMock()
+ test_split.__iter__ = MagicMock(side_effect=lambda: iter(rows))
+ test_split.__len__ = MagicMock(return_value=len(rows))
+ test_split.__getitem__ = MagicMock(side_effect=lambda i: rows[i])
+ return {"test": test_split}
+
+
+def _per_task_loader(
+ per_task: dict[str, list[dict[str, Any]]],
+) -> Callable[..., dict[str, Any]]:
+ """``load_dataset`` patch that dispatches by task name."""
+
+ def loader(
+ _dataset_name: str,
+ task_name: str | None = None,
+ **_kwargs: Any,
+ ) -> dict[str, Any]:
+ return _make_fake_dataset(
+ per_task.get(task_name, []) if task_name is not None else []
+ )
+
+ return loader
+
+
+class TestDefaultsMatchDeepEval:
+ """Defaults mirror ``deepeval.benchmarks.BigBenchHard``."""
+
+ def test_default_n_shots_is_3(self) -> None:
+ assert DEFAULT_N_SHOTS == 3
+
+ def test_max_n_shots_is_3(self) -> None:
+ """DeepEval asserts ``n_shots <= 3`` because the bundled prompt
+ files only contain 3 worked examples."""
+ assert MAX_N_SHOTS == 3
+
+ def test_default_enable_cot_is_true(self) -> None:
+ assert DEFAULT_ENABLE_COT is True
+
+ def test_default_generation_size_is_1024(self) -> None:
+ assert DEFAULT_GENERATION_SIZE == 1024
+
+
+class TestResolveTasks:
+ def test_none_returns_all_27_subtasks(self) -> None:
+ result = _resolve_tasks(None)
+ assert len(result) == 27
+
+ def test_all_returns_all_27_subtasks(self) -> None:
+ result = _resolve_tasks(["all"])
+ assert len(result) == 27
+
+ def test_lower_snake_case_value_resolves(self) -> None:
+ result = _resolve_tasks(["boolean_expressions"])
+ assert len(result) == 1
+ assert result[0].value == "boolean_expressions"
+
+ def test_upper_snake_case_enum_name_resolves(self) -> None:
+ result = _resolve_tasks(["BOOLEAN_EXPRESSIONS"])
+ assert len(result) == 1
+ assert result[0].value == "boolean_expressions"
+
+ def test_unknown_subtask_raises(self) -> None:
+ with pytest.raises(ValueError, match="Unknown BBH subtask"):
+ _resolve_tasks(["not_a_real_task"])
+
+ def test_unknown_subtask_lists_valid(self) -> None:
+ with pytest.raises(ValueError) as exc_info:
+ _resolve_tasks(["not_a_real_task"])
+ # All 27 should appear in the error.
+ assert "boolean_expressions" in str(exc_info.value)
+ assert "navigate" in str(exc_info.value)
+ assert "object_counting" in str(exc_info.value)
+
+
+@pytest.mark.requires_deepeval
+class TestPromptByteEqualWithDeepEval:
+ """The flat prompt must be byte-equal to what
+ ``BigBenchHardTemplate.generate_output`` produces — same template,
+ same CoT files, same n_shots, same enable_cot.
+
+ These assertions read specific strings out of DeepEval's bundled CoT
+ and shot prompt ``.txt`` files (e.g. ``"Task description: Evaluate
+ the result of a random Boolean expression."``). The fake harness
+ cannot reproduce those bytes, so the class is tagged
+ ``requires_deepeval``; the marker skips it when only the fake is
+ registered (i.e. when the ``[accuracy]`` extras aren't installed).
+ """
+
+ @pytest.mark.asyncio
+ async def test_cot_prompt_starts_with_task_description(self) -> None:
+ per_task = {"boolean_expressions": [_make_row("True and False is", "False")]}
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["boolean_expressions"],
+ n_shots=3,
+ enable_cot=True,
+ )
+ prompt = problems[0].prompt
+ # DeepEval's template prepends "Task description: " then the
+ # canonical first paragraph. For boolean_expressions that
+ # paragraph is "Evaluate the result of a random Boolean expression."
+ assert prompt.startswith(
+ "Task description: Evaluate the result of a random Boolean expression."
+ )
+
+ @pytest.mark.asyncio
+ async def test_query_appended_before_confinement(self) -> None:
+ per_task = {"boolean_expressions": [_make_row("True and False is", "False")]}
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["boolean_expressions"],
+ n_shots=3,
+ enable_cot=True,
+ )
+ prompt = problems[0].prompt
+ # DeepEval's template appends "\n\nQ: \nA: " at the end of
+ # its output. The loader then appends the per-task confinement
+ # statement so the LLM sees the constraint as part of the prompt
+ # (matches the trt-llm benchmark recipe's flow). For
+ # boolean_expressions that confinement starts with "\n\nOutput
+ # 'True' or 'False'." so the Q/A pair sits immediately before it.
+ assert "Q: True and False is\nA: \n\nOutput 'True' or 'False'." in prompt
+ assert prompt.endswith("Full answer not needed.")
+
+ @pytest.mark.asyncio
+ async def test_cot_vs_no_cot_use_different_prompt_files(self) -> None:
+ per_task = {"navigate": [_make_row("Walk forward 5 steps.", "No")]}
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ cot = await bench.load_problems(
+ tasks=["navigate"], n_shots=3, enable_cot=True
+ )
+ no_cot = await bench.load_problems(
+ tasks=["navigate"], n_shots=3, enable_cot=False
+ )
+ # CoT version has "Let's think step by step." worked examples;
+ # non-CoT has bare Q/A pairs.
+ assert "step by step" in cot[0].prompt.lower() or "Let's" in cot[0].prompt
+ assert cot[0].prompt != no_cot[0].prompt
+
+ @pytest.mark.asyncio
+ async def test_zero_shot_takes_only_task_description(self) -> None:
+ """``n_shots=0`` should emit just ``"Task description: "`` followed by the test query — no worked examples."""
+ per_task = {"boolean_expressions": [_make_row("True and True is", "True")]}
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["boolean_expressions"],
+ n_shots=0,
+ enable_cot=True,
+ )
+ prompt = problems[0].prompt
+ # Only the task description and the query, no worked examples
+ # (the CoT files use "Let's think step by step." in shot
+ # examples; with n_shots=0 that phrase shouldn't appear).
+ assert "Q: True and True is\nA: " in prompt
+ # The 0-shot vs 3-shot length comparison lives in
+ # ``TestNShotsAffectsPromptLength`` below.
+
+
+class TestNShotsAffectsPromptLength:
+ @pytest.mark.asyncio
+ async def test_more_shots_make_longer_prompt(self) -> None:
+ per_task = {"boolean_expressions": [_make_row("True is", "True")]}
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ zero = await bench.load_problems(
+ tasks=["boolean_expressions"], n_shots=0, enable_cot=True
+ )
+ three = await bench.load_problems(
+ tasks=["boolean_expressions"], n_shots=3, enable_cot=True
+ )
+ assert len(three[0].prompt) > len(zero[0].prompt)
+
+
+class TestNShotsCap:
+ @pytest.mark.asyncio
+ async def test_n_shots_above_3_raises(self) -> None:
+ bench = BigBenchBenchmark(run=_make_run())
+ with pytest.raises(ValueError, match="at most 3"):
+ await bench.load_problems(tasks=None, n_shots=4, enable_cot=True)
+
+
+class TestGroundTruthIsBareTarget:
+ @pytest.mark.asyncio
+ async def test_ground_truth_is_target_string(self) -> None:
+ per_task = {
+ "navigate": [
+ _make_row("Walk left, then right.", "No"),
+ _make_row("Walk forward 5 steps.", "Yes"),
+ ]
+ }
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["navigate"], n_shots=3, enable_cot=True
+ )
+ assert [p.ground_truth for p in problems] == ["No", "Yes"]
+
+
+class TestConfinementInMetadata:
+ """The per-task confinement string is carried in metadata so callers
+ that need DeepEval's structured-fallback shape (or want to log it)
+ can read it."""
+
+ @pytest.mark.asyncio
+ async def test_boolean_expressions_confinement(self) -> None:
+ per_task = {"boolean_expressions": [_make_row("Q?", "True")]}
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["boolean_expressions"], n_shots=3, enable_cot=True
+ )
+ assert "True" in problems[0].metadata["confinement"]
+ assert "False" in problems[0].metadata["confinement"]
+
+ @pytest.mark.asyncio
+ async def test_navigate_confinement(self) -> None:
+ per_task = {"navigate": [_make_row("Q?", "Yes")]}
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["navigate"], n_shots=3, enable_cot=True
+ )
+ assert "Yes" in problems[0].metadata["confinement"]
+ assert "No" in problems[0].metadata["confinement"]
+
+
+class TestPerTaskAggregation:
+ @pytest.mark.asyncio
+ async def test_task_field_is_subtask_name(self) -> None:
+ per_task = {
+ "navigate": [_make_row("Q1", "Yes")],
+ "object_counting": [_make_row("Q2", "5")],
+ }
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["navigate", "object_counting"],
+ n_shots=3,
+ enable_cot=True,
+ )
+ tasks = {p.task for p in problems}
+ assert tasks == {"navigate", "object_counting"}
+
+
+class TestPathologicalDatasetRows:
+ @pytest.mark.asyncio
+ async def test_empty_subtask_returns_empty(self) -> None:
+ per_task = {"navigate": []}
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["navigate"], n_shots=3, enable_cot=True
+ )
+ assert problems == []
+
+ @pytest.mark.asyncio
+ async def test_unicode_in_target_preserved(self) -> None:
+ per_task = {"navigate": [_make_row("Q?", "café")]}
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["navigate"], n_shots=3, enable_cot=True
+ )
+ assert problems[0].ground_truth == "café"
+
+ @pytest.mark.asyncio
+ async def test_chat_message_is_single_user(self) -> None:
+ per_task = {"navigate": [_make_row("Q?", "Yes")]}
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["navigate"], n_shots=3, enable_cot=True
+ )
+ msgs = problems[0].raw_messages
+ assert msgs is not None
+ assert len(msgs) == 1
+ assert msgs[0]["role"] == "user"
+
+
+class TestResolveTasksAdversarial:
+ """Edge cases on ``--accuracy-tasks`` parsing not covered by
+ ``TestResolveTasks``."""
+
+ def test_empty_list_returns_all_27_subtasks(self) -> None:
+ """A bare ``--accuracy-tasks`` with no values reaches the resolver
+ as ``[]`` (falsy) — equivalent to ``None`` / ``["all"]``."""
+ assert len(_resolve_tasks([])) == 27
+
+ def test_mixed_case_all_returns_all_27_subtasks(self) -> None:
+ """``"All"`` / ``"ALL"`` must match case-insensitively. The
+ docstring promises this; pin it so a future case-sensitive
+ refactor breaks the test loudly."""
+ assert len(_resolve_tasks(["All"])) == 27
+ assert len(_resolve_tasks(["ALL"])) == 27
+
+ def test_all_mixed_with_typo_raises(self) -> None:
+ """``["all", "NOT_A_REAL_TASK"]`` used to silently return every
+ subtask and swallow the typo (the parallel HellaSwag bug AIP-877
+ fixed). Must now raise so a user typo fails loudly instead of
+ running the whole 27-task benchmark."""
+ with pytest.raises(ValueError, match="'all' cannot be mixed"):
+ _resolve_tasks(["all", "not_a_real_task"])
+
+ def test_all_mixed_with_valid_name_also_raises(self) -> None:
+ """Even when both names would individually be accepted, mixing
+ ``"all"`` with anything else is ambiguous and must fail."""
+ with pytest.raises(ValueError, match="'all' cannot be mixed"):
+ _resolve_tasks(["all", "navigate"])
+
+ def test_whitespace_in_task_name_raises(self) -> None:
+ """A whitespace-bearing name is not silently trimmed — pin the
+ loud-failure mode so accidental YAML spacing is caught."""
+ with pytest.raises(ValueError, match="Unknown BBH subtask"):
+ _resolve_tasks([" boolean_expressions "])
+
+ def test_hyphenated_task_name_raises(self) -> None:
+ """Hyphens aren't normalized. ``"boolean-expressions"`` upper-
+ cases to ``"BOOLEAN-EXPRESSIONS"`` which is not a valid enum
+ attribute, so the resolver raises."""
+ with pytest.raises(ValueError, match="Unknown BBH subtask"):
+ _resolve_tasks(["boolean-expressions"])
+
+ def test_mixed_valid_and_invalid_lists_only_invalid(self) -> None:
+ """When some names resolve and others don't, the unknown-list
+ portion of the error must contain only the unknown name — no
+ false positive on the valid one."""
+ with pytest.raises(ValueError) as exc_info:
+ _resolve_tasks(["navigate", "not_a_real"])
+ msg = str(exc_info.value)
+ assert "not_a_real" in msg
+ # The error also lists the full valid set after "Valid subtasks:"
+ # for guidance, so narrow the check to the unknown-list portion.
+ unknown_portion = msg.split("Valid subtasks:")[0]
+ assert "'navigate'" not in unknown_portion
+
+ def test_duplicate_task_names_resolve_to_duplicate_enums(self) -> None:
+ """The resolver does not deduplicate. Passing the same task
+ twice yields two entries and will trigger ``load_dataset``
+ twice for the same subtask — pin the behavior so callers know
+ the cost."""
+ result = _resolve_tasks(["navigate", "navigate"])
+ assert len(result) == 2
+ assert result[0] is result[1]
+
+
+class TestConstructorWithoutDeepEval:
+ """The constructor refuses to build when the ``[accuracy]`` extras
+ aren't available — otherwise downstream ``BigBenchHardTemplate``
+ calls would crash with an unhelpful ``NameError``."""
+
+ def test_missing_deepeval_raises_with_install_hint(self) -> None:
+ with (
+ patch("aiperf.accuracy.benchmarks.bigbench._HAS_DEEPEVAL", False),
+ pytest.raises(RuntimeError, match=r"aiperf\[accuracy\]"),
+ ):
+ BigBenchBenchmark(run=_make_run())
+
+
+class TestOutputInvariants:
+ """Per-problem fields that should always agree do agree."""
+
+ @pytest.mark.asyncio
+ async def test_prompt_equals_first_chat_message_content(self) -> None:
+ """``prompt`` (the flat completions string) and the lone chat
+ message's ``content`` must be byte-equal — drift here would
+ mean completions vs chat endpoints render different prompts
+ for the same problem."""
+ per_task = {"navigate": [_make_row("Q?", "Yes")]}
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["navigate"], n_shots=3, enable_cot=True
+ )
+ msgs = problems[0].raw_messages
+ assert msgs is not None
+ assert problems[0].prompt == msgs[0]["content"]
+
+ @pytest.mark.asyncio
+ async def test_metadata_bbh_task_matches_task_field(self) -> None:
+ """``problem.task`` and ``problem.metadata['bbh_task']`` must
+ match for every problem — the accuracy CSV reads the former,
+ downstream tooling the latter, and both refer to the same
+ subtask."""
+ per_task = {
+ "navigate": [_make_row("Q1", "Yes")],
+ "object_counting": [_make_row("Q2", "5")],
+ }
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["navigate", "object_counting"],
+ n_shots=3,
+ enable_cot=True,
+ )
+ for p in problems:
+ assert p.task == p.metadata["bbh_task"]
+
+ @pytest.mark.asyncio
+ async def test_generation_size_is_plumbed_through_metadata(self) -> None:
+ """``DEFAULT_GENERATION_SIZE=1024`` is carried in per-problem
+ metadata so request-level overrides can read it without
+ round-tripping the module constant."""
+ per_task = {"navigate": [_make_row("Q?", "Yes")]}
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["navigate"], n_shots=3, enable_cot=True
+ )
+ assert problems[0].metadata["generation_size"] == DEFAULT_GENERATION_SIZE
+
+ @pytest.mark.asyncio
+ async def test_multitask_order_preserves_task_input_order(self) -> None:
+ """When ``tasks=[A, B]``, every problem for A precedes every
+ problem for B in the output list. The accuracy CSV's per-task
+ grouping depends on this contiguity."""
+ per_task = {
+ "navigate": [_make_row("nav-q", "Yes")],
+ "object_counting": [
+ _make_row("oc-q1", "1"),
+ _make_row("oc-q2", "2"),
+ ],
+ }
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["navigate", "object_counting"],
+ n_shots=3,
+ enable_cot=True,
+ )
+ assert [p.task for p in problems] == [
+ "navigate",
+ "object_counting",
+ "object_counting",
+ ]
+
+
+class TestLoadDatasetInvocation:
+ """Pin the ``load_dataset(DATASET_NAME, task.value)`` call shape so
+ a rename of ``DATASET_NAME`` or an accidental kwarg/positional
+ reorder is caught."""
+
+ @pytest.mark.asyncio
+ async def test_load_dataset_called_once_per_task_with_canonical_args(
+ self,
+ ) -> None:
+ per_task = {
+ "navigate": [_make_row("Q1", "Yes")],
+ "object_counting": [_make_row("Q2", "1")],
+ }
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ) as mock_load:
+ bench = BigBenchBenchmark(run=_make_run())
+ await bench.load_problems(
+ tasks=["navigate", "object_counting"],
+ n_shots=3,
+ enable_cot=True,
+ )
+ # Two requested tasks → exactly two load_dataset calls, each
+ # with the canonical dataset name positional and the subtask
+ # value positional. Asserting via call_args_list catches both a
+ # rename of DATASET_NAME and a future swap to kwargs.
+ assert [c.args for c in mock_load.call_args_list] == [
+ ("lukaemon/bbh", "navigate"),
+ ("lukaemon/bbh", "object_counting"),
+ ]
+
+
+class TestPathologicalRowContent:
+ """Hostile row content the upstream dataset could theoretically
+ ship."""
+
+ @pytest.mark.asyncio
+ async def test_empty_input_string_still_renders_prompt(self) -> None:
+ """A blank ``input`` is unusual but shouldn't crash — DeepEval's
+ template just appends it verbatim. Pin the passthrough so we
+ notice if DeepEval ever rejects empty inputs."""
+ per_task = {"navigate": [_make_row(input_text="", target="Yes")]}
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["navigate"], n_shots=3, enable_cot=True
+ )
+ # Prompt still rendered — the task description and shot
+ # examples are always present even when the input is empty.
+ assert len(problems) == 1
+ assert len(problems[0].prompt) > 0
+
+ @pytest.mark.asyncio
+ async def test_numeric_target_coerced_to_string(self) -> None:
+ """A numeric ``target`` (e.g. an int from a future BBH schema
+ change) is coerced to ``str`` by the loader before constructing
+ the ``BenchmarkProblem``. ``BenchmarkProblem.ground_truth`` is
+ in strict mode, so the loader's defensive ``str(...)`` is the
+ contract callers rely on for string equality in graders."""
+ per_task = {"object_counting": [{"input": "Count items", "target": 42}]}
+ with patch(
+ "aiperf.accuracy.benchmarks.bigbench.load_dataset",
+ side_effect=_per_task_loader(per_task),
+ ):
+ bench = BigBenchBenchmark(run=_make_run())
+ problems = await bench.load_problems(
+ tasks=["object_counting"], n_shots=3, enable_cot=True
+ )
+ assert problems[0].ground_truth == "42"