ai-dynamo · ajcasagrande · May 23, 2026 · May 22, 2026
diff --git a/docs/accuracy/accuracy-benchmarking.md b/docs/accuracy/accuracy-benchmarking.md
@@ -73,6 +73,7 @@ system message).
 | `mmlu` | `multiple_choice` | 5 | `lighteval/mmlu` (57 subjects) |
 | `aime` | `math` | 8 | `Maxwell-Jia/AIME_2024` (trt-llm reference, 8-shot CoT) |
 | `hellaswag` | `exact_match` | 10 | `Rowan/hellaswag` (trt-llm/DeepEval reference; one few-shot per unique activity_label) |
+| `bigbench` | `exact_match` | 3 | `lukaemon/bbh` (trt-llm/DeepEval reference; 27 subtasks, canonical CoT/non-CoT prompt files) |
 
 ## CLI Flags
 

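The new `bigbench` row pairs the loader with the `exact_match` grader. As a minimal sketch of the strict comparison the loader's docstring describes (`pred.strip() == gold.strip()`), the function below is illustrative only: its name and the empty-prediction guard are assumptions, not the project's `ExactMatchGrader` API.

```python
from __future__ import annotations


def exact_match_score(prediction: str | None, target: str) -> int:
    """Return 1 when the stripped prediction equals the stripped target, else 0.

    Hypothetical helper for illustration; treating an empty or missing
    prediction as a miss is an assumption of this sketch.
    """
    if not prediction:
        return 0
    return int(prediction.strip() == target.strip())


# Surrounding whitespace is forgiven; casing and extra words are not.
print(exact_match_score("(A)\n", "(A)"))              # 1
print(exact_match_score("(a)", "(A)"))                # 0
print(exact_match_score("The answer is (A)", "(A)"))  # 0
```

Because the comparison is this strict, a model that wraps its answer in extra prose scores zero, which is why the loader appends a per-task confinement statement to each prompt.
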
diff --git a/docs/accuracy/accuracy_stubs.md b/docs/accuracy/accuracy_stubs.md
@@ -7,7 +7,7 @@
 
 This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected.
 
-**Status summary:** With the HellaSwag loader landing on top of AIP-874, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, and `HellaSwagBenchmark` are fully implemented; the remaining benchmarks (`bigbench`, `aime24`, `aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
+**Status summary:** With the BigBench-Hard loader landing on top of the HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, and `BigBenchBenchmark` are fully implemented; the remaining benchmarks (`aime24`, `aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
 
 ## Table of Contents
 
@@ -172,17 +172,17 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method.
 | 1 | `MMLUBenchmark` | `benchmarks/mmlu.py` | `mmlu` | `multiple_choice` | 5 | **IMPLEMENTED in PR #815** — canonical reference for new benchmarks. Downloads via HuggingFace datasets, handles few-shot formatting and CoT. |
 | 2 | `AIMEBenchmark` | `benchmarks/aime.py` | `aime` | `math` | 8 | **IMPLEMENTED.** Loads `Maxwell-Jia/AIME_2024`, instructs the model to wrap its final integer in `\boxed{}`, supports few-shot priming and chain-of-thought. `default_enable_cot=true`. |
 | 3 | `HellaSwagBenchmark` | `benchmarks/hellaswag.py` | `hellaswag` | `exact_match` | 10 | **IMPLEMENTED.** Loads `Rowan/hellaswag` (validation split filtered per task by `activity_label`; train split feeds the "one few-shot per unique activity_label" rule). Prompt rendering delegates to `deepeval.benchmarks.HellaSwag`'s `HellaSwagTemplate.generate_output`, so output is byte-equal to the trt-llm recipe's DeepEval-backed path. Pairs with `exact_match` for strict `Scorer.exact_match_score` semantics. Requires the `[accuracy]` extras (deepeval). |
+| 4 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 | **IMPLEMENTED.** Loads `lukaemon/bbh` (27 BBH subtasks). Prompt rendering delegates to `deepeval.benchmarks.BigBenchHard`'s `BigBenchHardTemplate.generate_output`, which reads the canonical prompt files DeepEval ships as package data (27 CoT files in `cot_prompts/` and 27 non-CoT files in `shot_prompts/`, one of each per subtask). Pairs with `exact_match` for the recipe's strict `Scorer.exact_match_score` semantics. `default_n_shots=3`, `default_enable_cot=true`. Requires the `[accuracy]` extras (deepeval). |
 
 ### Still Stubbed
 
 | # | Class | File | Plugin Key | Default Grader | Default N-Shots |
 |---|-------|------|------------|----------------|-----------------|
-| 1 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 |
-| 2 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `math` | 0 |
-| 3 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
-| 4 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
-| 5 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
-| 6 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
+| 1 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `math` | 0 |
+| 2 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
+| 3 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
+| 4 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
+| 5 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
 
 **Each benchmark has 1 method to implement:**
 
@@ -308,26 +308,27 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu
 
 | Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods |
 |-----------|-------------|---------------|------------------|-------------------|
-| Graders | 1 (`MultipleChoiceGrader`) | 3 | 2 (`grade`, `extract_answer`) | 6 |
-| Benchmarks | 1 (`MMLUBenchmark`) | 8 | 1 (`load_problems`) | 8 |
+| Graders | 7 (all) | 0 | — | 0 |
+| Benchmarks | 4 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`) | 5 | 1 (`load_problems`) | 5 |
 | Record Processor | 1 (`AccuracyRecordProcessor`) | 0 | — | 0 |
 | Results Processor | 1 (`AccuracyResultsProcessor`) | 0 | — | 0 |
 | Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 | — | 0 |
 | Data Exporter | 1 (`AccuracyDataExporter`) | 0 | — | 0 |
 | Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 |
-| **Total** | **6** | **13** | | **15** |
+| **Total** | **15** | **6** | | **6** |
 
 ### Self-Disabling Pattern
 
 Processors and exporters raise their `Disabled` exception **in `__init__`** when accuracy is off. The existing framework catches these and silently skips the plugin. No code changes needed to support this — it uses the same pattern as `RawRecordWriterProcessor` and `ServerMetricsCsvExporter`.
 
 ### Suggested Implementation Order
 
-The processors, exporters, and one grader/benchmark pair are already wired end-to-end. Start from the already-working pipeline:
+The processors, exporters, all graders, and four benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`) are already wired end-to-end. The remaining work is the five stub benchmarks plus the stub-plugin validator; for each benchmark, mirror the closest existing loader:
 
-1. **Graders** — use `MultipleChoiceGrader` as reference; implement `ExactMatchGrader` next (simplest), then `MathGrader`
-2. **Benchmarks** — use `MMLUBenchmark` as reference; implement dataset loading for each remaining benchmark
-3. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` when a benchmark or grader moves from stubbed to supported
+1. **`aime24`, `aime25`, `math_500`** — mirror `AIMEBenchmark` (`benchmarks/aime.py`); pair with the `math` grader.
+2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `multiple_choice` grader.
+3. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
+4. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.
 
 ### Key Files for Reference
 

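The "Self-Disabling Pattern" section in `accuracy_stubs.md` says processors and exporters raise their `Disabled` exception in `__init__` and the framework skips them. A minimal sketch of that control flow follows; every name in it (`PluginDisabled`, `AccuracyExporterSketch`, `build_plugins`) is hypothetical and stands in for the real aiperf classes, which are not shown in this diff.

```python
from __future__ import annotations


class PluginDisabled(Exception):
    """Hypothetical stand-in for a plugin's own ``Disabled`` exception."""


class AccuracyExporterSketch:
    """Hypothetical exporter that opts out when accuracy mode is off."""

    def __init__(self, accuracy_enabled: bool) -> None:
        # Raising here, before any state is set up, is what lets the
        # caller skip the plugin without special-casing it.
        if not accuracy_enabled:
            raise PluginDisabled("accuracy mode is off")
        self.accuracy_enabled = accuracy_enabled


def build_plugins(classes: list[type], accuracy_enabled: bool) -> list[object]:
    """Instantiate each plugin, silently skipping any that disable themselves."""
    active: list[object] = []
    for cls in classes:
        try:
            active.append(cls(accuracy_enabled))
        except PluginDisabled:
            continue
    return active


print(len(build_plugins([AccuracyExporterSketch], accuracy_enabled=False)))  # 0
print(len(build_plugins([AccuracyExporterSketch], accuracy_enabled=True)))   # 1
```

The design keeps the enable/disable decision inside each plugin, so the registry needs no knowledge of accuracy mode.
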
diff --git a/pyproject.toml b/pyproject.toml
@@ -219,6 +219,7 @@ markers = [
     "server_unit: marks tests as unit tests for the mock server",
     "fern: marks tests that validate Fern documentation (requires fern CLI)",
     "network: marks tests that require network access",
+    "requires_deepeval: tests that need the real deepeval install (i.e. the [accuracy] extras) — skipped when only the fake-deepeval harness is registered",
 ]
 # Better console output
 console_output_style = "progress"

diff --git a/src/aiperf/accuracy/benchmarks/bigbench.py b/src/aiperf/accuracy/benchmarks/bigbench.py
@@ -1,31 +1,228 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 
+"""BigBench-Hard benchmark loader, aligned with the trt-llm DeepEval reference.
+
+The trt-llm benchmark recipe routes ``bigbench`` through DeepEval's
+``deepeval.benchmarks.BigBenchHard`` class
+(``trt-llm-benchmark-recipe/src/tools/acc_benchmark.py:338-356``). This
+loader produces prompts byte-equal to what DeepEval's
+``BigBenchHardTemplate.generate_output`` produces, by importing and
+calling that template directly. The 27 canonical CoT and non-CoT
+prompt files (one per BBH subtask) ship inside DeepEval as package
+data — DeepEval's template reads them via ``importlib.resources`` at
+load time.
+
+Pair with ``ExactMatchGrader`` for strict ``pred.strip() ==
+gold.strip()`` semantics matching DeepEval's
+``Scorer.exact_match_score``.
+
+Reference:
+    deepeval/benchmarks/big_bench_hard/big_bench_hard.py
+    deepeval/benchmarks/big_bench_hard/template.py
+    deepeval/benchmarks/big_bench_hard/cot_prompts/*.txt (27 files)
+    deepeval/benchmarks/big_bench_hard/shot_prompts/*.txt (27 files)
+    trt-llm-benchmark-recipe/src/tools/acc_benchmark.py:338
+"""
+
 from __future__ import annotations
 
-from typing import TYPE_CHECKING
+import asyncio
+from typing import TYPE_CHECKING, Any
 
-from aiperf.accuracy.models import BenchmarkProblem
+from datasets import Dataset, load_dataset
+
+from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem
 from aiperf.common.mixins import AIPerfLoggerMixin
 
 if TYPE_CHECKING:
     from aiperf.config.resolution.plan import BenchmarkRun
 
+try:
+    from deepeval.benchmarks.big_bench_hard.big_bench_hard import (
+        bbh_confinement_statements_dict,
+    )
+    from deepeval.benchmarks.big_bench_hard.task import BigBenchHardTask
+    from deepeval.benchmarks.big_bench_hard.template import (
+        BigBenchHardTemplate,
+    )
+
+    _HAS_DEEPEVAL = True
+except ImportError:  # pragma: no cover - exercised only without optional dep
+    _HAS_DEEPEVAL = False
+    BigBenchHardTask = None  # type: ignore[assignment]
+    BigBenchHardTemplate = None  # type: ignore[assignment]
+    bbh_confinement_statements_dict = None  # type: ignore[assignment]
+
+
+_MISSING_DEEPEVAL_HINT = (
+    "deepeval is not installed; BigBench-Hard's prompt templates and "
+    "the per-task confinement dict (the trt-llm reference) cannot be "
+    "loaded. Install with: uv pip install 'aiperf[accuracy]'."
+)
+
+DATASET_NAME = "lukaemon/bbh"
+TASK_NAME = "bigbench"
+
+# DeepEval's BigBenchHard caps n_shots at 3 (the canonical CoT files
+# only contain 3 worked examples each). Default and maximum mirror it.
+DEFAULT_N_SHOTS = 3
+MAX_N_SHOTS = 3
+
+# DeepEval's BigBenchHard default is ``enable_cot=True``.
+DEFAULT_ENABLE_COT = True
+
+# CoT solutions can run several hundred tokens; non-CoT answers are
+# typically a single bare token. 1024 covers both with headroom.
+DEFAULT_GENERATION_SIZE = 1024
+
+# Schema field names in lukaemon/bbh.
+INPUT_FIELD = "input"
+TARGET_FIELD = "target"
+
+
+def _resolve_tasks(tasks: list[str] | None) -> list[Any]:
+    """Convert ``--accuracy-tasks`` strings to ``BigBenchHardTask`` enums.
+
+    DeepEval evaluates one task at a time. Aiperf accepts either:
+      - ``None`` / empty / ``["all"]`` (case-insensitive) → every
+        BigBenchHardTask enum (27 subtasks).
+      - Lower-snake-case strings matching the enum's ``value``
+        (e.g. ``"boolean_expressions"``).
+      - Upper-snake-case enum names (e.g. ``"BOOLEAN_EXPRESSIONS"``)
+        for parity with the recipe's ``getattr(BigBenchHardTask,
+        task_name.upper(), None)`` lookup.
+
+    Mixing ``"all"`` with other task names is rejected so typos like
+    ``["all", "NOT_A_TASK"]`` don't silently bypass validation — that
+    used to slip through and return every task while swallowing the
+    invalid entry (the parallel HellaSwag bug fixed in AIP-877).
+
+    Unknown names raise ``ValueError`` with the full valid list so
+    typos fail loudly.
+    """
+    if not tasks:
+        return list(BigBenchHardTask)
+    lowered = [t.lower() for t in tasks]
+    if "all" in lowered:
+        if lowered == ["all"]:
+            return list(BigBenchHardTask)
+        raise ValueError(
+            "'all' cannot be mixed with other task names. Pass 'all' "
+            "by itself (or omit --accuracy-tasks) to select every BBH "
+            f"subtask, or list specific subtasks. Got: {tasks!r}"
+        )
+    valid_values = {t.value for t in BigBenchHardTask}
+    resolved: list[Any] = []
+    unknown: list[str] = []
+    for name in tasks:
+        if name in valid_values:
+            resolved.append(next(t for t in BigBenchHardTask if t.value == name))
+            continue
+        enum_member = getattr(BigBenchHardTask, name.upper(), None)
+        if enum_member is not None:
+            resolved.append(enum_member)
+        else:
+            unknown.append(name)
+    if unknown:
+        raise ValueError(
+            f"Unknown BBH subtask(s): {unknown}. Valid subtasks: {sorted(valid_values)}"
+        )
+    return resolved
+
 
 class BigBenchBenchmark(AIPerfLoggerMixin):
-    """Registered placeholder for a future BigBench loader.
+    """BigBench-Hard benchmark loader, byte-equal to DeepEval's prompts.
 
-    `load_problems()` intentionally raises NotImplementedError in this release;
-    use the MMLU benchmark when a working accuracy loader is required.
+    Iterates the requested BBH subtasks and renders each problem's
+    prompt via ``BigBenchHardTemplate.generate_output`` (which reads
+    DeepEval's bundled CoT/shot prompt files). Pair with
+    ``ExactMatchGrader`` for the recipe's strict equality scoring.
     """
 
-    def __init__(self, run: BenchmarkRun, **kwargs) -> None:
+    def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
         super().__init__(**kwargs)
+        if not _HAS_DEEPEVAL:
+            raise RuntimeError(_MISSING_DEEPEVAL_HINT)
         self.run = run
 
     async def load_problems(
         self, tasks: list[str] | None, n_shots: int, enable_cot: bool
     ) -> list[BenchmarkProblem]:
-        raise NotImplementedError(
-            "bigbench benchmark is not yet implemented; only 'mmlu' is available in this release."
-        )
+        """Load BBH problems and format them DeepEval-style.
+
+        Args:
+            tasks: Subtask names (lower-snake-case enum values like
+                ``boolean_expressions`` or upper-snake-case enum names
+                like ``BOOLEAN_EXPRESSIONS``). ``None`` / ``["all"]``
+                selects every subtask. Unknown names raise.
+            n_shots: 0..3 (DeepEval asserts ``n_shots <= 3`` because
+                the canonical prompt files ship exactly 3 examples).
+            enable_cot: When True (the DeepEval default), use the
+                bundled CoT prompt files; when False, use the non-CoT
+                ``shot_prompts/`` files.
+
+        Returns:
+            One ``BenchmarkProblem`` per row across all selected
+            subtasks. ``task`` is the subtask name so results
+            aggregate per-subtask.
+        """
+        if not 0 <= n_shots <= MAX_N_SHOTS:
+            raise ValueError(
+                f"BBH supports 0..{MAX_N_SHOTS} few-shot examples "
+                f"(got {n_shots}); DeepEval asserts ``n_shots <= 3`` "
+                "because the canonical prompt files ship exactly "
+                f"{MAX_N_SHOTS} worked examples per subtask."
+            )
+        task_enums = _resolve_tasks(tasks)
+        problems: list[BenchmarkProblem] = []
+        for task in task_enums:
+            ds: Dataset = await asyncio.to_thread(
+                load_dataset, DATASET_NAME, task.value, split="test"
+            )
+            sub_problems = await asyncio.to_thread(
+                self._build_subtask_problems,
+                ds,
+                task,
+                n_shots,
+                enable_cot,
+            )
+            problems.extend(sub_problems)
+        return problems
+
+    def _build_subtask_problems(
+        self,
+        ds: Any,
+        task: Any,
+        n_shots: int,
+        enable_cot: bool,
+    ) -> list[BenchmarkProblem]:
+        problems: list[BenchmarkProblem] = []
+        for row in ds:
+            template_prompt = BigBenchHardTemplate.generate_output(
+                input=row[INPUT_FIELD],
+                task=task,
+                n_shots=n_shots,
+                enable_cot=enable_cot,
+            )
+            prompt = f"{template_prompt}{bbh_confinement_statements_dict[task]}"
+            messages: list[AccuracyChatMessage] = [{"role": "user", "content": prompt}]
+            problems.append(
+                BenchmarkProblem(
+                    prompt=prompt,
+                    # ``BenchmarkProblem.ground_truth`` is typed ``str`` in
+                    # strict mode; the upstream BBH schema stores targets
+                    # as strings today, but coerce defensively so a future
+                    # numeric column doesn't break the loader. Mirrors
+                    # DeepEval's ``str(expected_output)`` in its grader.
+                    ground_truth=str(row[TARGET_FIELD]),
+                    task=task.value,
+                    metadata={
+                        "bbh_task": task.value,
+                        "confinement": bbh_confinement_statements_dict[task],
+                        "generation_size": DEFAULT_GENERATION_SIZE,
+                    },
+                    raw_messages=messages,
+                )
+            )
+        return problems
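The task-name resolution in `_resolve_tasks` can be exercised without DeepEval installed. The sketch below reproduces the same rules (empty or `"all"` selects everything, `"all"` cannot be mixed, values and upper-case names both resolve, unknown names raise) against a stand-in enum. `DemoTask` and `resolve_tasks` are hypothetical names, and the lookup uses a dict plus `__members__` rather than the loader's `getattr` call.

```python
from __future__ import annotations

from enum import Enum


class DemoTask(Enum):
    """Stand-in enum (hypothetical); the loader uses DeepEval's BigBenchHardTask."""

    BOOLEAN_EXPRESSIONS = "boolean_expressions"
    DATE_UNDERSTANDING = "date_understanding"
    WORD_SORTING = "word_sorting"


def resolve_tasks(names: list[str] | None) -> list[DemoTask]:
    """Map user-supplied subtask strings to enum members, failing loudly on typos."""
    if not names:
        return list(DemoTask)
    lowered = [n.lower() for n in names]
    if "all" in lowered:
        if lowered == ["all"]:
            return list(DemoTask)
        raise ValueError(f"'all' cannot be mixed with other task names: {names!r}")
    by_value = {t.value: t for t in DemoTask}
    resolved: list[DemoTask] = []
    unknown: list[str] = []
    for name in names:
        # Try the lower-snake-case value first, then the upper-case enum name.
        member = by_value.get(name) or DemoTask.__members__.get(name.upper())
        if member is None:
            unknown.append(name)
        else:
            resolved.append(member)
    if unknown:
        raise ValueError(f"Unknown subtask(s): {unknown}. Valid: {sorted(by_value)}")
    return resolved


print([t.value for t in resolve_tasks(["ALL"])])
# ['boolean_expressions', 'date_understanding', 'word_sorting']
print([t.value for t in resolve_tasks(["word_sorting", "DATE_UNDERSTANDING"])])
# ['word_sorting', 'date_understanding']
```

Collecting every unknown name before raising means a single run reports all typos at once instead of one per attempt.
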
diff --git a/src/aiperf/plugin/plugins.yaml b/src/aiperf/plugin/plugins.yaml
@@ -1240,12 +1240,17 @@ accuracy_benchmark:
   bigbench:
     class: aiperf.accuracy.benchmarks.bigbench:BigBenchBenchmark
     description: |
-      BigBench benchmark for diverse language understanding tasks spanning
-      linguistics, reasoning, and world knowledge.
+      BigBench-Hard benchmark, aligned with the trt-llm benchmark recipe's
+      DeepEval-backed configuration. Prompts are byte-equal to
+      ``deepeval.benchmarks.BigBenchHard`` (n_shots=3, enable_cot=True,
+      using DeepEval's bundled per-subtask CoT prompt files). Pairs with
+      ``exact_match`` for the recipe's strict ``Scorer.exact_match_score``
+      semantics. Requires the ``[accuracy]`` extras (deepeval ships the
+      canonical per-subtask prompt files).
     metadata:
       default_grader: exact_match
       default_n_shots: 3
-      is_implemented: false
+      default_enable_cot: true
 
   aime24:
     class: aiperf.accuracy.benchmarks.aime24:AIME24Benchmark