feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878)#924
Conversation
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@91fba387e2eafa1d3d64b02e97e6741af5995321Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@91fba387e2eafa1d3d64b02e97e6741af5995321Last updated for commit: |
Stack dependencyThis PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with Merge order:
This PR: position 3 of 8 — base branch is After each upstream PR merges, the downstream PR's branch will be rebased |
88bafae to
3c150cd
Compare
b5f8fb3 to
362fa36
Compare
b93469a to
45a2dff
Compare
fed3c38 to
8f988d9
Compare
439b1ad to
923b9c9
Compare
8f988d9 to
1601f56
Compare
923b9c9 to
24e1ef3
Compare
1601f56 to
03d59ff
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
24e1ef3 to
dcb42d0
Compare
03d59ff to
718f430
Compare
c71dbac to
f10c9a7
Compare
f3097f6 to
98a2c25
Compare
718f430 to
e15f931
Compare
WalkthroughImplements BigBench‑Hard loader: resolves subtasks, loads lukaemon/bbh per task, generates DeepEval‑aligned prompts (n_shots ≤ 3), builds BenchmarkProblem objects with metadata, updates plugin/docs, adds a fake deepeval test harness and pytest marker, and includes extensive unit tests covering invariants and edge cases. ChangesBigBench-Hard benchmark implementation and test coverage
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/aiperf/accuracy/benchmarks/bigbench.py`:
- Line 33: The current import and type annotation are wrong: change the import
from "Dataset" to "DatasetDict" (i.e., from datasets import DatasetDict,
load_dataset) and update the variable/type annotation for the result of
load_dataset (the variable named "ds" or wherever load_dataset(DATASET_NAME,
task.value) is assigned) from Dataset to DatasetDict so that ds["test"] access
is correctly typed; keep all existing uses (e.g., ds["test"]) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3c9742e7-fe8e-41c5-9c1e-a5afce10ac6c
📒 Files selected for processing (6)
docs/accuracy/accuracy-benchmarking.mddocs/accuracy/accuracy_stubs.mdsrc/aiperf/accuracy/benchmarks/bigbench.pysrc/aiperf/plugin/plugins.yamltests/unit/accuracy/test_accuracy_config.pytests/unit/accuracy/test_bigbench_benchmark.py
Implements the BigBench-Hard accuracy benchmark by delegating prompt rendering to ``deepeval.benchmarks.BigBenchHard``'s ``BigBenchHardTemplate.generate_output``. Output is byte-equal to the trt-llm benchmark recipe's DeepEval-backed configuration so reference parity is preserved end-to-end. Pairs with the existing ``ExactMatchGrader`` (landed via AIP-877) for the recipe's strict ``Scorer.exact_match_score`` semantics. Loader uses the new ``BenchmarkRun`` constructor signature introduced by PR #912 (no ``UserConfig``), and the test fixture wires through the ``make_benchmark_run`` conftest helper. ``deepeval`` is already pinned in the ``[accuracy]`` extras via AIP-877 — the test guards on ``pytest.importorskip("deepeval")`` so the suite still runs without the optional install. Drops ``bigbench`` from ``STUB_BENCHMARKS``, removes ``is_implemented: false`` from the ``plugins.yaml`` entry, and updates the accuracy docs to reflect the new implemented status. The uppercase-stub validator test now exercises ``LCB_CODEGENERATION`` since ``BIGBENCH`` is no longer stubbed. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
…878)
Previously ``_resolve_tasks(["all", "NOT_A_REAL_TASK"])`` silently
returned every BigBenchHardTask enum and swallowed the typo — the
``"all" in {t.lower() for t in tasks}`` shortcut fired before any
membership validation. Mirror the HellaSwag fix from AIP-877
(``f3097f66`` upstream) and raise ``ValueError`` unless ``tasks`` is
empty / ``["all"]`` (case-insensitive) alone. A typo paired with
``all`` now fails loudly instead of running the entire benchmark.
Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
Adds 16 tests across five new classes addressing coverage gaps the
existing suite leaves open. Reviewer flagged that accuracy coverage
has been slipping across the recent stack; this pulls BBH back to
parity with HellaSwag's exhaustive resolver/invariant pinning.
New coverage:
* ``TestResolveTasksAdversarial`` (8 tests) — empty list, mixed-case
``All`` / ``ALL``, ``all`` paired with a typo or a valid name (both
raise after the loader fix), whitespace-bearing and hyphenated task
names, mixed valid+invalid lists, and duplicate task names (pins
the no-dedupe behavior so callers know they'll hit ``load_dataset``
twice).
* ``TestConstructorWithoutDeepEval`` — pins the ``RuntimeError`` path
for environments without the ``[accuracy]`` extras (previously
unreachable because the module-level ``importorskip`` runs first).
* ``TestOutputInvariants`` (4 tests) — ``prompt`` matches
``raw_messages[0]['content']``; ``problem.task`` matches
``metadata['bbh_task']``; ``metadata['generation_size']`` carries
``DEFAULT_GENERATION_SIZE``; multi-task output order preserves the
caller's task order so the accuracy CSV's per-task grouping stays
contiguous.
* ``TestLoadDatasetInvocation`` — asserts ``load_dataset`` is called
with ``("lukaemon/bbh", task.value)`` positionally for each task.
Regression guard against ``DATASET_NAME`` renames or a future swap
to kwargs.
* ``TestPathologicalRowContent`` (2 tests) — empty ``input`` still
renders; numeric ``target`` is coerced to ``str`` by Pydantic.
Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
for more information, see https://pre-commit.ci
…-878)
CI's `make ci-install` installs only `[dev]` extras, so `deepeval` was
absent and `pytest.importorskip("deepeval")` at the top of
`test_bigbench_benchmark.py` skipped all 39 tests at collection. The
loader was getting only ~30.5% import-only coverage in CodeCov.
Replace the module-level skip with a fake-deepeval harness:
- `tests/harness/fake_deepeval.py`: 27-task enum, confinement dict
(verbatim from upstream), and a synthetic `BigBenchHardTemplate` that
honours the loader's contracts (n_shots affects length, CoT branch
differs from non-CoT) without reproducing upstream prompt bytes.
- `tests/unit/accuracy/conftest.py`: autouse function-scope fixture
patches the loader module's deepeval names with the fakes when the
real install is absent. Does NOT inject into `sys.modules`, so the
HellaSwag tests' own `importorskip` skip mechanism keeps working.
Also skips items tagged `@pytest.mark.requires_deepeval` at
collection.
- Tag `TestPromptByteEqualWithDeepEval` with `requires_deepeval` — the
four byte-equality assertions read specific strings out of DeepEval's
bundled `.txt` files and can't be reproduced by a fake.
- Register the `requires_deepeval` marker in `pyproject.toml`.
Also fix `test_numeric_target_coerced_to_string` by adding defensive
`str(row[TARGET_FIELD])` in `_build_subtask_problems`. The test was
written against an aspirational Pydantic lax-validation contract that
`BenchmarkProblem.ground_truth` doesn't honour; explicit coercion in
the loader mirrors DeepEval's own `str(expected_output)` pattern.
Result on `src/aiperf/accuracy/benchmarks/bigbench.py`:
- Without [accuracy]: 30.5% -> 97% (3 unreachable real-import lines)
- With [accuracy]: 100% (all 39 tests run including byte-equality)
Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
36aa3b4 to
56337d7
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
src/aiperf/accuracy/benchmarks/bigbench.py (1)
33-33:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse
DatasetDicttyping forload_datasetwithoutsplit.Line 180 annotates
dsasDataset, but this call form returns split mappings and is used asds["test"]on Line 185. The type should beDatasetDictfor an accurate API contract.Proposed fix
-from datasets import Dataset, load_dataset +from datasets import DatasetDict, load_dataset @@ - ds: Dataset = await asyncio.to_thread( + ds: DatasetDict = await asyncio.to_thread( load_dataset, DATASET_NAME, task.value )For datasets==3.0.0, what is the return type of load_dataset(path, name) when split is omitted?Also applies to: 180-186
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/aiperf/accuracy/benchmarks/bigbench.py` at line 33, The variable annotated as Dataset (ds) is actually a split mapping returned by load_dataset when no split is provided; update the import to include DatasetDict and change the type of ds from Dataset to DatasetDict where it’s assigned/annotated (e.g., the ds variable used with ds["test"] in the bigbench benchmark code), ensuring any type hints or function signatures reflect DatasetDict instead of Dataset.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/accuracy/accuracy_stubs.md`:
- Line 10: Update the “Method Count Summary” so it matches the status summary:
mark MultipleChoiceGrader, MathGrader, CodeExecutionGrader, LightevalExprGrader,
LightevalLatexGrader, LightevalGPQAGrader, ExactMatchGrader as implemented
graders and mark MMLUBenchmark, AIMEBenchmark, HellaSwagBenchmark,
BigBenchBenchmark as implemented benchmarks; adjust the counts and remove those
names from the stub/NotImplementedError list (leave aime24, aime25, math_500,
gpqa_diamond, lcb_codegeneration as stubs). Ensure the Method Count Summary uses
the same wording and numbers as the status summary so both sections are
consistent.
In `@tests/unit/accuracy/conftest.py`:
- Line 43: Add explicit type annotations to the pytest hook and fixture
signatures: annotate the pytest_collection_modifyitems function parameters and
return type (use config: pytest.Config, items: list[pytest.Item] -> None) and
likewise add appropriate parameter/return annotations to the fixture defined at
line 58 (e.g., request: pytest.FixtureRequest or the specific types it uses and
-> Generator/Any/None as appropriate). Update imports to include typing or
pytest symbols if needed so the signatures are fully typed.
In `@tests/unit/accuracy/test_bigbench_benchmark.py`:
- Line 46: _add explicit type annotations: give _make_run a concrete return type
instead of implicit Any, and annotate _per_task_loader with a precise Callable
return type and annotate the nested loader's parameter and return types.
Specifically, update the signatures of _make_run, _per_task_loader, and the
inner loader function (named loader) to use typing annotations (e.g., import
Callable, Iterable/Iterator, Union or the specific Run/Task/Example types used
in the test) so the outer function returns the correct typed callable and the
inner loader has explicit parameter and return types rather than untyped
args/returns.
---
Duplicate comments:
In `@src/aiperf/accuracy/benchmarks/bigbench.py`:
- Line 33: The variable annotated as Dataset (ds) is actually a split mapping
returned by load_dataset when no split is provided; update the import to include
DatasetDict and change the type of ds from Dataset to DatasetDict where it’s
assigned/annotated (e.g., the ds variable used with ds["test"] in the bigbench
benchmark code), ensuring any type hints or function signatures reflect
DatasetDict instead of Dataset.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d5e9fc2f-4084-4734-ba75-efc96b3304e3
📒 Files selected for processing (9)
docs/accuracy/accuracy-benchmarking.mddocs/accuracy/accuracy_stubs.mdpyproject.tomlsrc/aiperf/accuracy/benchmarks/bigbench.pysrc/aiperf/plugin/plugins.yamltests/harness/fake_deepeval.pytests/unit/accuracy/conftest.pytests/unit/accuracy/test_accuracy_config.pytests/unit/accuracy/test_bigbench_benchmark.py
✅ Files skipped from review due to trivial changes (1)
- docs/accuracy/accuracy-benchmarking.md
The Method Count Summary table in `docs/accuracy/accuracy_stubs.md` was still claiming `Graders: 1 implemented / 3 stubbed` and `Benchmarks: 1 implemented / 8 stubbed` — stale since the AIME + HellaSwag + BigBench loaders and the lighteval / exact_match / math / code_execution graders all landed. The status-summary paragraph at the top of the same file was already up to date, so the two sections disagreed. Update the table to match the status summary and the implemented/stubbed grader/benchmark tables further down: 7 graders implemented (all done), 4 benchmarks implemented (MMLU/AIME/HellaSwag/ BigBench), 5 benchmarks still stubbed (aime24/aime25/math_500/ gpqa_diamond/lcb_codegeneration), total 15 implemented / 6 stubbed / 6 remaining methods. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
…stubs (AIP-878) The "Suggested Implementation Order" still recommended implementing `ExactMatchGrader` and `MathGrader` next and pointed at `MultipleChoiceGrader` as the canonical grader reference, but all seven graders are now done. Drop the grader step entirely and re-target the benchmark step at the five remaining stubs (`aime24`, `aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) with each paired to the closest existing loader and its grader (math via `AIMEBenchmark`, multiple_choice via `MMLUBenchmark`, code_execution via `MMLUBenchmark` scaffolding). Pairings match the Default Grader column in the "Still Stubbed" benchmarks table earlier in the same file. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
`_make_run`, `_per_task_loader`, and the inner `loader` callback all had implicit `Any` signatures. Add explicit annotations: - `_make_run() -> BenchmarkRun` (TYPE_CHECKING import to keep the runtime cost zero). - `_per_task_loader(per_task: dict[...]) -> Callable[..., dict[str, Any]]`. - The inner `loader(_dataset_name: str, task_name: str | None = None, **_kwargs: Any) -> dict[str, Any]`. Inside the inner body, gate the dict lookup on `task_name is not None` so mypy doesn't complain about passing `None` to `dict[str, …].get`. Semantics are unchanged — `per_task.get(None, [])` already returned `[]` because `None` is never a key. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
`_build_subtask_problems` now appends the per-task confinement string
(e.g. "Output 'True' or 'False'. Full answer not needed.") onto the
``BigBenchHardTemplate.generate_output(...)`` result, so the rendered
prompt the LLM sees ends with the constraint. Matches the trt-llm
benchmark recipe's flow and parallels the equivalent change AIP-877
made for HellaSwag (`fix(accuracy): append DeepEval confinement
instruction to HellaSwag prompt`).
Mechanically:
- Bind the raw template output to ``template_prompt``.
- ``prompt = f"{template_prompt}{bbh_confinement_statements_dict[task]}"``.
- Use the combined ``prompt`` for both the chat message content and
``BenchmarkProblem.prompt``.
Direct subscript ``[task]`` rather than ``.get(task, "")`` because the
confinement dict is exhaustive across all 27 BBH subtasks; a missing
key would be an upstream regression we want to surface loudly.
``metadata["confinement"]`` retains the ``.get(task, "")`` form so
callers can still read the confinement without re-parsing the prompt.
Rename ``test_query_appended_at_end`` to
``test_query_appended_before_confinement`` and update its assertion
to pin the new shape: the "Q: <input>\\nA: " segment is immediately
followed by the confinement instruction, and the prompt now ends with
the confinement's terminal phrase ("Full answer not needed.").
Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/unit/accuracy/test_bigbench_benchmark.py`:
- Around line 72-84: The loader inner function in _per_task_loader currently
types the variadic kwargs as **_kwargs: Any; remove the explicit Any annotation
so the signature uses an unannotated variadic keyword parameter (**_kwargs)
instead. Update the loader definition in _per_task_loader (function name:
loader, parent: _per_task_loader) to drop the ": Any" on **_kwargs while keeping
the parameter name and behavior unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7650396c-89c5-457c-bc5d-a230921f49a0
📒 Files selected for processing (4)
docs/accuracy/accuracy_stubs.mdpyproject.tomlsrc/aiperf/accuracy/benchmarks/bigbench.pytests/unit/accuracy/test_bigbench_benchmark.py
✅ Files skipped from review due to trivial changes (1)
- docs/accuracy/accuracy_stubs.md
bigbenchbenchmark inplugins.yaml(default_grader: exact_match,default_n_shots: 3); scaffold loader raisesNotImplementedErroruntil the full DeepEval-backed implementation lands.ExactMatchGraderintroduced in AIP-877 (same DeepEval sub-stack).exact_matchgrader plugin registration.Reference: trt-llm-benchmark-recipe DeepEval-backed bigbench evaluation
Summary by CodeRabbit
New Features
Documentation
Tests