
feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924

Merged
ajcasagrande merged 10 commits into main from dbermudez/aip-878-implement-bigbench-benchmark-loader
May 23, 2026

Conversation

@debermudez
Contributor

@debermudez debermudez commented May 12, 2026

  • Plugin: registers bigbench benchmark in plugins.yaml (default_grader: exact_match, default_n_shots: 3); scaffold loader raises NotImplementedError until the full DeepEval-backed implementation lands.
  • BigBench covers diverse language understanding tasks spanning linguistics, common-sense reasoning, and world knowledge; 3-shot evaluation mirrors the trt-llm-benchmark-recipe default.
  • Grading uses ExactMatchGrader introduced in AIP-877 (same DeepEval sub-stack).
  • Depends on AIP-877 for the exact_match grader plugin registration.

Reference: trt-llm-benchmark-recipe DeepEval-backed bigbench evaluation

Summary by CodeRabbit

  • New Features

    • BigBench‑Hard benchmark is implemented and available for accuracy testing (defaults to CoT enabled; shots capped at 3; grader: exact_match; source: lukaemon/bbh).
  • Documentation

    • Accuracy docs updated to list BigBench‑Hard as implemented and document its defaults and source.
  • Tests

    • Comprehensive tests added, including a marker for real DeepEval, a deterministic fake harness for CI, and fixtures to conditionally use the real or fake templates.


@copy-pr-bot

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented May 12, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@91fba387e2eafa1d3d64b02e97e6741af5995321

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@91fba387e2eafa1d3d64b02e97e6741af5995321

Last updated for commit: 91fba38

@github-actions github-actions Bot added the feat label May 12, 2026
@github-actions

github-actions Bot commented May 12, 2026

@debermudez
Contributor Author

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

  1. AIP-874 — feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
  2. AIP-877 — feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
  3. AIP-878 — feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
  4. AIP-875 — feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
  5. AIP-876 — feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
  6. AIP-879 — feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
  7. AIP-880 — feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
  8. AIP-881 — feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 3 of 8 — base branch is dbermudez/aip-877-implement-hellaswag-benchmark-loader,
depends on #923 (AIP-877) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

@debermudez debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from 88bafae to 3c150cd Compare May 12, 2026 23:24
@debermudez debermudez force-pushed the dbermudez/aip-878-implement-bigbench-benchmark-loader branch from b5f8fb3 to 362fa36 Compare May 12, 2026 23:24
@debermudez debermudez marked this pull request as draft May 12, 2026 23:27
@debermudez debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from b93469a to 45a2dff Compare May 13, 2026 00:33
@debermudez debermudez force-pushed the dbermudez/aip-878-implement-bigbench-benchmark-loader branch from fed3c38 to 8f988d9 Compare May 13, 2026 00:36
@debermudez debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from 439b1ad to 923b9c9 Compare May 13, 2026 21:17
@debermudez debermudez force-pushed the dbermudez/aip-878-implement-bigbench-benchmark-loader branch from 8f988d9 to 1601f56 Compare May 13, 2026 21:22
@debermudez debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from 923b9c9 to 24e1ef3 Compare May 13, 2026 22:04
@debermudez debermudez force-pushed the dbermudez/aip-878-implement-bigbench-benchmark-loader branch from 1601f56 to 03d59ff Compare May 13, 2026 22:04
@codecov

codecov Bot commented May 13, 2026

Codecov Report

❌ Patch coverage is 95.00000% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/aiperf/accuracy/benchmarks/bigbench.py | 95.00% | 3 Missing ⚠️ |


@debermudez debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from 24e1ef3 to dcb42d0 Compare May 13, 2026 23:32
@debermudez debermudez force-pushed the dbermudez/aip-878-implement-bigbench-benchmark-loader branch from 03d59ff to 718f430 Compare May 13, 2026 23:32
@debermudez debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch 4 times, most recently from c71dbac to f10c9a7 Compare May 19, 2026 17:32
@debermudez debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from f3097f6 to 98a2c25 Compare May 22, 2026 03:50
Base automatically changed from dbermudez/aip-877-implement-hellaswag-benchmark-loader to main May 22, 2026 16:52
@debermudez debermudez force-pushed the dbermudez/aip-878-implement-bigbench-benchmark-loader branch from 718f430 to e15f931 Compare May 22, 2026 20:08
@coderabbitai

coderabbitai Bot commented May 22, 2026

Walkthrough

Implements BigBench‑Hard loader: resolves subtasks, loads lukaemon/bbh per task, generates DeepEval‑aligned prompts (n_shots ≤ 3), builds BenchmarkProblem objects with metadata, updates plugin/docs, adds a fake deepeval test harness and pytest marker, and includes extensive unit tests covering invariants and edge cases.
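The `n_shots ≤ 3` check mentioned in the walkthrough could look roughly like the sketch below. The constant and function names are hypothetical, and rejecting negative values is an assumption; the thread only states the upper cap and that zero-shot is supported.

```python
MAX_N_SHOTS = 3  # cap stated in the walkthrough; the constant name is hypothetical


def validate_n_shots(n_shots: int) -> int:
    """Return n_shots unchanged when it lies in [0, MAX_N_SHOTS], else raise."""
    # Assumption: negative values are rejected alongside values above the cap.
    if not 0 <= n_shots <= MAX_N_SHOTS:
        raise ValueError(
            f"BigBench-Hard supports at most {MAX_N_SHOTS} shots, got {n_shots}"
        )
    return n_shots
```

Failing early, before any dataset is loaded, keeps a bad configuration from triggering downloads for every subtask.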

Changes

BigBench-Hard benchmark implementation and test coverage

| Layer | File(s) | Summary |
|---|---|---|
| Constants and task resolution | src/aiperf/accuracy/benchmarks/bigbench.py | Module constants, optional DeepEval import gating with installation hint, and _resolve_tasks() to normalize/validate subtasks (supports None/["all"], forbids mixing "all", reports valid names). |
| BigBenchBenchmark core implementation | src/aiperf/accuracy/benchmarks/bigbench.py | __init__ enforces DeepEval presence; load_problems() validates n_shots ≤ 3, resolves tasks, loads lukaemon/bbh per task, and aggregates BenchmarkProblems; _build_subtask_problems() generates DeepEval-byte-equal prompts, wraps as chat messages, coerces ground_truth, and attaches bbh_task, confinement, and generation_size metadata. |
| Configuration and documentation updates | src/aiperf/plugin/plugins.yaml, docs/accuracy/accuracy-benchmarking.md, docs/accuracy/accuracy_stubs.md | Plugin YAML updated for BigBench‑Hard (default_enable_cot: true, removed is_implemented:false); accuracy docs mark BigBenchBenchmark implemented and add bigbench to the benchmarks table with source/prompt-alignment note. |
| Test harness and pytest marker | tests/harness/fake_deepeval.py, pyproject.toml | Adds a fake deepeval.benchmarks.big_bench_hard surface (enum, confinement dict, generate_output) and a requires_deepeval pytest marker for byte-pinning tests. |
| Conftest monkeypatch and skip logic | tests/unit/accuracy/conftest.py | Autouse fixture and collection hook detect real deepeval availability; skip requires_deepeval when missing and monkeypatch bigbench deepeval names to the fake harness for tests. |
| Test config updates | tests/unit/accuracy/test_accuracy_config.py | Removes "bigbench" from STUB_BENCHMARKS and updates the uppercase-stub regression test to use "LCB_CODEGENERATION". |
| BigBenchBenchmark unit tests | tests/unit/accuracy/test_bigbench_benchmark.py | Adds utilities and comprehensive tests: defaults, task resolution (including adversarial inputs), prompt byte/pattern invariants (CoT vs non‑CoT, zero‑shot), n_shots effects and cap enforcement, per‑problem fields and metadata, load_dataset invocation checks, aggregation ordering, and hostile-row/pathology cases. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through BBH with a twitching nose,
Prompts byte-pinned where DeepEval flows,
Shots and CoT stitched in tidy line,
Tests guard each task so outputs align,
A carrot of code — precise and composed!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 45.28% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: implementing a BigBench-Hard benchmark backed by DeepEval. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/accuracy/benchmarks/bigbench.py`:
- Line 33: The current import and type annotation are wrong: change the import
from "Dataset" to "DatasetDict" (i.e., from datasets import DatasetDict,
load_dataset) and update the variable/type annotation for the result of
load_dataset (the variable named "ds" or wherever load_dataset(DATASET_NAME,
task.value) is assigned) from Dataset to DatasetDict so that ds["test"] access
is correctly typed; keep all existing uses (e.g., ds["test"]) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3c9742e7-fe8e-41c5-9c1e-a5afce10ac6c

📥 Commits

Reviewing files that changed from the base of the PR and between bd5467f and 7d1aaa7.

📒 Files selected for processing (6)
  • docs/accuracy/accuracy-benchmarking.md
  • docs/accuracy/accuracy_stubs.md
  • src/aiperf/accuracy/benchmarks/bigbench.py
  • src/aiperf/plugin/plugins.yaml
  • tests/unit/accuracy/test_accuracy_config.py
  • tests/unit/accuracy/test_bigbench_benchmark.py

Comment thread src/aiperf/accuracy/benchmarks/bigbench.py
debermudez and others added 5 commits May 22, 2026 16:06
Implements the BigBench-Hard accuracy benchmark by delegating prompt
rendering to ``deepeval.benchmarks.BigBenchHard``'s
``BigBenchHardTemplate.generate_output``. Output is byte-equal to the
trt-llm benchmark recipe's DeepEval-backed configuration so reference
parity is preserved end-to-end. Pairs with the existing
``ExactMatchGrader`` (landed via AIP-877) for the recipe's strict
``Scorer.exact_match_score`` semantics.

Loader uses the new ``BenchmarkRun`` constructor signature introduced
by PR #912 (no ``UserConfig``), and the test fixture wires through the
``make_benchmark_run`` conftest helper. ``deepeval`` is already pinned
in the ``[accuracy]`` extras via AIP-877 — the test guards on
``pytest.importorskip("deepeval")`` so the suite still runs without
the optional install.

Drops ``bigbench`` from ``STUB_BENCHMARKS``, removes ``is_implemented:
false`` from the ``plugins.yaml`` entry, and updates the accuracy docs
to reflect the new implemented status. The uppercase-stub validator
test now exercises ``LCB_CODEGENERATION`` since ``BIGBENCH`` is no
longer stubbed.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
…878)

Previously ``_resolve_tasks(["all", "NOT_A_REAL_TASK"])`` silently
returned every BigBenchHardTask enum and swallowed the typo — the
``"all" in {t.lower() for t in tasks}`` shortcut fired before any
membership validation. Mirror the HellaSwag fix from AIP-877
(``f3097f66`` upstream) and raise ``ValueError`` unless ``tasks`` is
empty / ``["all"]`` (case-insensitive) alone. A typo paired with
``all`` now fails loudly instead of running the entire benchmark.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
Adds 16 tests across five new classes addressing coverage gaps the
existing suite leaves open. Reviewer flagged that accuracy coverage
has been slipping across the recent stack; this pulls BBH back to
parity with HellaSwag's exhaustive resolver/invariant pinning.

New coverage:

* ``TestResolveTasksAdversarial`` (8 tests) — empty list, mixed-case
  ``All`` / ``ALL``, ``all`` paired with a typo or a valid name (both
  raise after the loader fix), whitespace-bearing and hyphenated task
  names, mixed valid+invalid lists, and duplicate task names (pins
  the no-dedupe behavior so callers know they'll hit ``load_dataset``
  twice).
* ``TestConstructorWithoutDeepEval`` — pins the ``RuntimeError`` path
  for environments without the ``[accuracy]`` extras (previously
  unreachable because the module-level ``importorskip`` runs first).
* ``TestOutputInvariants`` (4 tests) — ``prompt`` matches
  ``raw_messages[0]['content']``; ``problem.task`` matches
  ``metadata['bbh_task']``; ``metadata['generation_size']`` carries
  ``DEFAULT_GENERATION_SIZE``; multi-task output order preserves the
  caller's task order so the accuracy CSV's per-task grouping stays
  contiguous.
* ``TestLoadDatasetInvocation`` — asserts ``load_dataset`` is called
  with ``("lukaemon/bbh", task.value)`` positionally for each task.
  Regression guard against ``DATASET_NAME`` renames or a future swap
  to kwargs.
* ``TestPathologicalRowContent`` (2 tests) — empty ``input`` still
  renders; numeric ``target`` is coerced to ``str`` by Pydantic.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
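The `TestLoadDatasetInvocation` idea can be illustrated with `unittest.mock`. In this sketch `load_rows` is a hypothetical helper standing in for the loader, and the mock replaces `datasets.load_dataset`; the positional call shape mirrors the `load_dataset(DATASET_NAME, task.value)` form quoted in the review.

```python
import asyncio
from unittest.mock import MagicMock, call

DATASET_NAME = "lukaemon/bbh"


async def load_rows(load_dataset, task_values: list[str]) -> list[dict]:
    """Hypothetical helper: collect the test split of each subtask in order."""
    rows: list[dict] = []
    for value in task_values:
        # Run the blocking loader off the event loop, passing both args positionally.
        ds = await asyncio.to_thread(load_dataset, DATASET_NAME, value)
        rows.extend(ds["test"])
    return rows


# The mock returns a split mapping with a single canned row for every call.
fake_load = MagicMock(return_value={"test": [{"input": "q", "target": "a"}]})
rows = asyncio.run(load_rows(fake_load, ["navigate", "snarks"]))

# Pins the dataset name, the per-task order, and the positional-argument form.
assert fake_load.call_args_list == [
    call(DATASET_NAME, "navigate"),
    call(DATASET_NAME, "snarks"),
]
```

Comparing against `call(...)` objects makes the assertion fail if the loader ever switches to keyword arguments, which is the regression the commit message says it wants to catch.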
…-878)

CI's `make ci-install` installs only `[dev]` extras, so `deepeval` was
absent and `pytest.importorskip("deepeval")` at the top of
`test_bigbench_benchmark.py` skipped all 39 tests at collection. The
loader was getting only ~30.5% import-only coverage in CodeCov.

Replace the module-level skip with a fake-deepeval harness:

- `tests/harness/fake_deepeval.py`: 27-task enum, confinement dict
  (verbatim from upstream), and a synthetic `BigBenchHardTemplate` that
  honours the loader's contracts (n_shots affects length, CoT branch
  differs from non-CoT) without reproducing upstream prompt bytes.
- `tests/unit/accuracy/conftest.py`: autouse function-scope fixture
  patches the loader module's deepeval names with the fakes when the
  real install is absent. Does NOT inject into `sys.modules`, so the
  HellaSwag tests' own `importorskip` skip mechanism keeps working.
  Also skips items tagged `@pytest.mark.requires_deepeval` at
  collection.
- Tag `TestPromptByteEqualWithDeepEval` with `requires_deepeval` — the
  four byte-equality assertions read specific strings out of DeepEval's
  bundled `.txt` files and can't be reproduced by a fake.
- Register the `requires_deepeval` marker in `pyproject.toml`.

Also fix `test_numeric_target_coerced_to_string` by adding defensive
`str(row[TARGET_FIELD])` in `_build_subtask_problems`. The test was
written against an aspirational Pydantic lax-validation contract that
`BenchmarkProblem.ground_truth` doesn't honour; explicit coercion in
the loader mirrors DeepEval's own `str(expected_output)` pattern.

Result on `src/aiperf/accuracy/benchmarks/bigbench.py`:
- Without [accuracy]: 30.5% -> 97% (3 unreachable real-import lines)
- With [accuracy]: 100% (all 39 tests run including byte-equality)

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
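A sketch of what a synthetic template honoring those two contracts might look like. The class name, the header format, and the `generate_output` parameter list are assumptions made for illustration; this is not the content of `tests/harness/fake_deepeval.py`.

```python
class FakeBigBenchHardTemplate:
    """Synthetic stand-in; deliberately does not reproduce upstream prompt bytes."""

    @staticmethod
    def generate_output(input: str, task: str, n_shots: int, enable_cot: bool) -> str:
        # The CoT and non-CoT branches differ in the header line.
        mode = "cot" if enable_cot else "direct"
        header = f"[fake-bbh task={task} mode={mode}]\n"
        # Each shot adds a fixed-size example, so length grows with n_shots.
        shots = "".join(f"Q: example {i}\nA: answer {i}\n\n" for i in range(n_shots))
        # The query is rendered last, ending in "A: ".
        return f"{header}{shots}Q: {input}\nA: "
```

Because the fake only guarantees structural properties, tests that compare against DeepEval's bundled prompt text still need the real package, which is why those stay behind the `requires_deepeval` marker.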
@debermudez debermudez force-pushed the dbermudez/aip-878-implement-bigbench-benchmark-loader branch from 36aa3b4 to 56337d7 Compare May 22, 2026 23:06
@debermudez debermudez marked this pull request as ready for review May 22, 2026 23:06

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
src/aiperf/accuracy/benchmarks/bigbench.py (1)

33-33: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use DatasetDict typing for load_dataset without split.

Line 180 annotates ds as Dataset, but this call form returns split mappings and is used as ds["test"] on Line 185. The type should be DatasetDict for an accurate API contract.

Proposed fix
-from datasets import Dataset, load_dataset
+from datasets import DatasetDict, load_dataset
@@
-            ds: Dataset = await asyncio.to_thread(
+            ds: DatasetDict = await asyncio.to_thread(
                 load_dataset, DATASET_NAME, task.value
             )
For datasets==3.0.0, what is the return type of load_dataset(path, name) when split is omitted?

Also applies to: 180-186

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/accuracy/benchmarks/bigbench.py` at line 33, The variable
annotated as Dataset (ds) is actually a split mapping returned by load_dataset
when no split is provided; update the import to include DatasetDict and change
the type of ds from Dataset to DatasetDict where it’s assigned/annotated (e.g.,
the ds variable used with ds["test"] in the bigbench benchmark code), ensuring
any type hints or function signatures reflect DatasetDict instead of Dataset.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/accuracy/accuracy_stubs.md`:
- Line 10: Update the “Method Count Summary” so it matches the status summary:
mark MultipleChoiceGrader, MathGrader, CodeExecutionGrader, LightevalExprGrader,
LightevalLatexGrader, LightevalGPQAGrader, ExactMatchGrader as implemented
graders and mark MMLUBenchmark, AIMEBenchmark, HellaSwagBenchmark,
BigBenchBenchmark as implemented benchmarks; adjust the counts and remove those
names from the stub/NotImplementedError list (leave aime24, aime25, math_500,
gpqa_diamond, lcb_codegeneration as stubs). Ensure the Method Count Summary uses
the same wording and numbers as the status summary so both sections are
consistent.

In `@tests/unit/accuracy/conftest.py`:
- Line 43: Add explicit type annotations to the pytest hook and fixture
signatures: annotate the pytest_collection_modifyitems function parameters and
return type (use config: pytest.Config, items: list[pytest.Item] -> None) and
likewise add appropriate parameter/return annotations to the fixture defined at
line 58 (e.g., request: pytest.FixtureRequest or the specific types it uses and
-> Generator/Any/None as appropriate). Update imports to include typing or
pytest symbols if needed so the signatures are fully typed.

In `@tests/unit/accuracy/test_bigbench_benchmark.py`:
- Line 46: Add explicit type annotations: give _make_run a concrete return type
instead of implicit Any, and annotate _per_task_loader with a precise Callable
return type and annotate the nested loader's parameter and return types.
Specifically, update the signatures of _make_run, _per_task_loader, and the
inner loader function (named loader) to use typing annotations (e.g., import
Callable, Iterable/Iterator, Union or the specific Run/Task/Example types used
in the test) so the outer function returns the correct typed callable and the
inner loader has explicit parameter and return types rather than untyped
args/returns.

---

Duplicate comments:
In `@src/aiperf/accuracy/benchmarks/bigbench.py`:
- Line 33: The variable annotated as Dataset (ds) is actually a split mapping
returned by load_dataset when no split is provided; update the import to include
DatasetDict and change the type of ds from Dataset to DatasetDict where it’s
assigned/annotated (e.g., the ds variable used with ds["test"] in the bigbench
benchmark code), ensuring any type hints or function signatures reflect
DatasetDict instead of Dataset.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d5e9fc2f-4084-4734-ba75-efc96b3304e3

📥 Commits

Reviewing files that changed from the base of the PR and between 7d1aaa7 and 56337d7.

📒 Files selected for processing (9)
  • docs/accuracy/accuracy-benchmarking.md
  • docs/accuracy/accuracy_stubs.md
  • pyproject.toml
  • src/aiperf/accuracy/benchmarks/bigbench.py
  • src/aiperf/plugin/plugins.yaml
  • tests/harness/fake_deepeval.py
  • tests/unit/accuracy/conftest.py
  • tests/unit/accuracy/test_accuracy_config.py
  • tests/unit/accuracy/test_bigbench_benchmark.py
✅ Files skipped from review due to trivial changes (1)
  • docs/accuracy/accuracy-benchmarking.md

Comment thread docs/accuracy/accuracy_stubs.md
Comment thread tests/unit/accuracy/conftest.py
Comment thread tests/unit/accuracy/test_bigbench_benchmark.py Outdated
The Method Count Summary table in `docs/accuracy/accuracy_stubs.md` was
still claiming `Graders: 1 implemented / 3 stubbed` and
`Benchmarks: 1 implemented / 8 stubbed` — stale since the AIME +
HellaSwag + BigBench loaders and the lighteval / exact_match / math /
code_execution graders all landed. The status-summary paragraph at the
top of the same file was already up to date, so the two sections
disagreed.

Update the table to match the status summary and the
implemented/stubbed grader/benchmark tables further down: 7 graders
implemented (all done), 4 benchmarks implemented (MMLU/AIME/HellaSwag/
BigBench), 5 benchmarks still stubbed (aime24/aime25/math_500/
gpqa_diamond/lcb_codegeneration), total 15 implemented / 6 stubbed /
6 remaining methods.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
Comment thread src/aiperf/accuracy/benchmarks/bigbench.py Outdated
…stubs (AIP-878)

The "Suggested Implementation Order" still recommended implementing
`ExactMatchGrader` and `MathGrader` next and pointed at
`MultipleChoiceGrader` as the canonical grader reference, but all
seven graders are now done. Drop the grader step entirely and
re-target the benchmark step at the five remaining stubs (`aime24`,
`aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) with each
paired to the closest existing loader and its grader (math via
`AIMEBenchmark`, multiple_choice via `MMLUBenchmark`, code_execution
via `MMLUBenchmark` scaffolding).

Pairings match the Default Grader column in the "Still Stubbed"
benchmarks table earlier in the same file.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
`_make_run`, `_per_task_loader`, and the inner `loader` callback all
had implicit `Any` signatures. Add explicit annotations:

- `_make_run() -> BenchmarkRun` (TYPE_CHECKING import to keep the
  runtime cost zero).
- `_per_task_loader(per_task: dict[...]) -> Callable[..., dict[str, Any]]`.
- The inner `loader(_dataset_name: str, task_name: str | None = None,
  **_kwargs: Any) -> dict[str, Any]`.

Inside the inner body, gate the dict lookup on `task_name is not None`
so mypy doesn't complain about passing `None` to `dict[str, …].get`.
Semantics are unchanged — `per_task.get(None, [])` already returned
`[]` because `None` is never a key.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
`_build_subtask_problems` now appends the per-task confinement string
(e.g. "Output 'True' or 'False'. Full answer not needed.") onto the
``BigBenchHardTemplate.generate_output(...)`` result, so the rendered
prompt the LLM sees ends with the constraint. Matches the trt-llm
benchmark recipe's flow and parallels the equivalent change AIP-877
made for HellaSwag (`fix(accuracy): append DeepEval confinement
instruction to HellaSwag prompt`).

Mechanically:

- Bind the raw template output to ``template_prompt``.
- ``prompt = f"{template_prompt}{bbh_confinement_statements_dict[task]}"``.
- Use the combined ``prompt`` for both the chat message content and
  ``BenchmarkProblem.prompt``.

Direct subscript ``[task]`` rather than ``.get(task, "")`` because the
confinement dict is exhaustive across all 27 BBH subtasks; a missing
key would be an upstream regression we want to surface loudly.
``metadata["confinement"]`` retains the ``.get(task, "")`` form so
callers can still read the confinement without re-parsing the prompt.

Rename ``test_query_appended_at_end`` to
``test_query_appended_before_confinement`` and update its assertion
to pin the new shape: the "Q: <input>\\nA: " segment is immediately
followed by the confinement instruction, and the prompt now ends with
the confinement's terminal phrase ("Full answer not needed.").

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
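The prompt assembly described in this commit can be sketched as below. `CONFINEMENT` is a one-entry stand-in for DeepEval's `bbh_confinement_statements_dict`; pairing the example string with `boolean_expressions` is an assumption, and `build_fields` is a hypothetical helper that also shows the `str(...)` coercion of the target added earlier in the PR.

```python
CONFINEMENT = {
    # Example string taken from the commit message; the task key is assumed.
    "boolean_expressions": "Output 'True' or 'False'. Full answer not needed.",
}


def build_fields(template_prompt: str, task: str, target: object) -> dict[str, object]:
    # Direct subscript: a missing task raises KeyError instead of silently
    # producing a prompt without its constraint.
    prompt = f"{template_prompt}{CONFINEMENT[task]}"
    return {
        "prompt": prompt,
        # Explicit coercion so numeric targets become strings.
        "ground_truth": str(target),
        # Metadata keeps the tolerant lookup so readers need not re-parse the prompt.
        "metadata": {"bbh_task": task, "confinement": CONFINEMENT.get(task, "")},
    }
```

Using the strict lookup for the prompt and the tolerant one for metadata means an incomplete confinement table stops the run at prompt construction rather than quietly changing what the model is asked.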

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/accuracy/test_bigbench_benchmark.py`:
- Around line 72-84: The loader inner function in _per_task_loader currently
types the variadic kwargs as **_kwargs: Any; remove the explicit Any annotation
so the signature uses an unannotated variadic keyword parameter (**_kwargs)
instead. Update the loader definition in _per_task_loader (function name:
loader, parent: _per_task_loader) to drop the ": Any" on **_kwargs while keeping
the parameter name and behavior unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7650396c-89c5-457c-bc5d-a230921f49a0

📥 Commits

Reviewing files that changed from the base of the PR and between 56337d7 and 91fba38.

📒 Files selected for processing (4)
  • docs/accuracy/accuracy_stubs.md
  • pyproject.toml
  • src/aiperf/accuracy/benchmarks/bigbench.py
  • tests/unit/accuracy/test_bigbench_benchmark.py
✅ Files skipped from review due to trivial changes (1)
  • docs/accuracy/accuracy_stubs.md

Comment thread tests/unit/accuracy/test_bigbench_benchmark.py
Contributor

@ajcasagrande ajcasagrande left a comment


Great work man!

@ajcasagrande ajcasagrande merged commit 8f45778 into main May 23, 2026
26 checks passed
@ajcasagrande ajcasagrande deleted the dbermudez/aip-878-implement-bigbench-benchmark-loader branch May 23, 2026 05:57