feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) by debermudez · Pull Request #924 · ai-dynamo/aiperf

debermudez · 2026-05-12T23:06:31Z

Plugin: registers bigbench benchmark in plugins.yaml (default_grader: exact_match, default_n_shots: 3); scaffold loader raises NotImplementedError until the full DeepEval-backed implementation lands.
BigBench covers diverse language understanding tasks spanning linguistics, common-sense reasoning, and world knowledge; 3-shot evaluation mirrors the trt-llm-benchmark-recipe default.
Grading uses ExactMatchGrader introduced in AIP-877 (same DeepEval sub-stack).
Depends on AIP-877 for the exact_match grader plugin registration.

Reference: trt-llm-benchmark-recipe DeepEval-backed bigbench evaluation

Summary by CodeRabbit

New Features
- BigBench‑Hard benchmark is implemented and available for accuracy testing (defaults to CoT enabled; shots capped at 3; grader: exact_match; source: lukaemon/bbh).
Documentation
- Accuracy docs updated to list BigBench‑Hard as implemented and document its defaults and source.
Tests
- Comprehensive tests added, including a marker for real DeepEval, a deterministic fake harness for CI, and fixtures to conditionally use the real or fake templates.

copy-pr-bot · 2026-05-12T23:06:35Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-12T23:06:47Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@91fba387e2eafa1d3d64b02e97e6741af5995321

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@91fba387e2eafa1d3d64b02e97e6741af5995321

Last updated for commit: 91fba38 • Browse code

github-actions · 2026-05-12T23:07:10Z

Fern Docs Preview: https://nvidia-preview-a1e36707-50b9-415d-9d22-4a7e98e2f038.docs.buildwithfern.com/aiperf/dev

debermudez · 2026-05-12T23:10:37Z

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

AIP-874 — feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
AIP-877 — feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
AIP-878 — feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
AIP-875 — feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
AIP-876 — feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
AIP-879 — feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
AIP-880 — feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
AIP-881 — feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 3 of 8 — base branch is dbermudez/aip-877-implement-hellaswag-benchmark-loader,
depends on #923 (AIP-877) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

codecov · 2026-05-13T22:15:15Z

Codecov Report

❌ Patch coverage is 95.00000% with 3 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/aiperf/accuracy/benchmarks/bigbench.py	95.00%	3 Missing ⚠️

📢 Thoughts on this report? Let us know!

coderabbitai · 2026-05-22T20:10:52Z

Walkthrough

Implements BigBench‑Hard loader: resolves subtasks, loads lukaemon/bbh per task, generates DeepEval‑aligned prompts (n_shots ≤ 3), builds BenchmarkProblem objects with metadata, updates plugin/docs, adds a fake deepeval test harness and pytest marker, and includes extensive unit tests covering invariants and edge cases.

Changes

BigBench-Hard benchmark implementation and test coverage

Layer / File(s)	Summary
Constants and task resolution `src/aiperf/accuracy/benchmarks/bigbench.py`	Module constants, optional DeepEval import gating with installation hint, and `_resolve_tasks()` to normalize/validate subtasks (supports None/["all"], forbids mixing "all", reports valid names).
BigBenchBenchmark core implementation `src/aiperf/accuracy/benchmarks/bigbench.py`	`__init__` enforces DeepEval presence; `load_problems()` validates `n_shots` ≤ 3, resolves tasks, loads `lukaemon/bbh` per task, and aggregates `BenchmarkProblem`s; `_build_subtask_problems()` generates DeepEval-byte-equal prompts, wraps as chat messages, coerces `ground_truth`, and attaches `bbh_task`, `confinement`, and `generation_size` metadata.
Configuration and documentation updates `src/aiperf/plugin/plugins.yaml`, `docs/accuracy/accuracy-benchmarking.md`, `docs/accuracy/accuracy_stubs.md`	Plugin YAML updated for BigBench‑Hard (`default_enable_cot: true`, removed `is_implemented:false`); accuracy docs mark `BigBenchBenchmark` implemented and add `bigbench` to the benchmarks table with source/prompt-alignment note.
Test harness and pytest marker `tests/harness/fake_deepeval.py`, `pyproject.toml`	Adds a fake `deepeval.benchmarks.big_bench_hard` surface (enum, confinement dict, `generate_output`) and a `requires_deepeval` pytest marker for byte-pinning tests.
Conftest monkeypatch and skip logic `tests/unit/accuracy/conftest.py`	Autouse fixture and collection hook detect real deepeval availability; skip `requires_deepeval` when missing and monkeypatch bigbench deepeval names to the fake harness for tests.
Test config updates `tests/unit/accuracy/test_accuracy_config.py`	Removes `"bigbench"` from `STUB_BENCHMARKS` and updates the uppercase-stub regression test to use `"LCB_CODEGENERATION"`.
BigBenchBenchmark unit tests `tests/unit/accuracy/test_bigbench_benchmark.py`	Adds utilities and comprehensive tests: defaults, task resolution (including adversarial inputs), prompt byte/pattern invariants (CoT vs non‑CoT, zero‑shot), `n_shots` effects and cap enforcement, per‑problem fields and metadata, load_dataset invocation checks, aggregation ordering, and hostile-row/pathology cases.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through BBH with a twitching nose,
Prompts byte-pinned where DeepEval flows,
Shots and CoT stitched in tidy line,
Tests guard each task so outputs align,
A carrot of code — precise and composed!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 45.28% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: implementing a BigBench-Hard benchmark backed by DeepEval.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/accuracy/benchmarks/bigbench.py`:
- Line 33: The current import and type annotation are wrong: change the import
from "Dataset" to "DatasetDict" (i.e., from datasets import DatasetDict,
load_dataset) and update the variable/type annotation for the result of
load_dataset (the variable named "ds" or wherever load_dataset(DATASET_NAME,
task.value) is assigned) from Dataset to DatasetDict so that ds["test"] access
is correctly typed; keep all existing uses (e.g., ds["test"]) unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3c9742e7-fe8e-41c5-9c1e-a5afce10ac6c

📥 Commits

Reviewing files that changed from the base of the PR and between bd5467f and 7d1aaa7.

📒 Files selected for processing (6)

docs/accuracy/accuracy-benchmarking.md
docs/accuracy/accuracy_stubs.md
src/aiperf/accuracy/benchmarks/bigbench.py
src/aiperf/plugin/plugins.yaml
tests/unit/accuracy/test_accuracy_config.py
tests/unit/accuracy/test_bigbench_benchmark.py

Implements the BigBench-Hard accuracy benchmark by delegating prompt rendering to ``deepeval.benchmarks.BigBenchHard``'s ``BigBenchHardTemplate.generate_output``. Output is byte-equal to the trt-llm benchmark recipe's DeepEval-backed configuration so reference parity is preserved end-to-end. Pairs with the existing ``ExactMatchGrader`` (landed via AIP-877) for the recipe's strict ``Scorer.exact_match_score`` semantics. Loader uses the new ``BenchmarkRun`` constructor signature introduced by PR #912 (no ``UserConfig``), and the test fixture wires through the ``make_benchmark_run`` conftest helper. ``deepeval`` is already pinned in the ``[accuracy]`` extras via AIP-877 — the test guards on ``pytest.importorskip("deepeval")`` so the suite still runs without the optional install. Drops ``bigbench`` from ``STUB_BENCHMARKS``, removes ``is_implemented: false`` from the ``plugins.yaml`` entry, and updates the accuracy docs to reflect the new implemented status. The uppercase-stub validator test now exercises ``LCB_CODEGENERATION`` since ``BIGBENCH`` is no longer stubbed. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

…878) Previously ``_resolve_tasks(["all", "NOT_A_REAL_TASK"])`` silently returned every BigBenchHardTask enum and swallowed the typo — the ``"all" in {t.lower() for t in tasks}`` shortcut fired before any membership validation. Mirror the HellaSwag fix from AIP-877 (``f3097f66`` upstream) and raise ``ValueError`` unless ``tasks`` is empty / ``["all"]`` (case-insensitive) alone. A typo paired with ``all`` now fails loudly instead of running the entire benchmark. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

Adds 16 tests across five new classes addressing coverage gaps the existing suite leaves open. Reviewer flagged that accuracy coverage has been slipping across the recent stack; this pulls BBH back to parity with HellaSwag's exhaustive resolver/invariant pinning. New coverage: * ``TestResolveTasksAdversarial`` (8 tests) — empty list, mixed-case ``All`` / ``ALL``, ``all`` paired with a typo or a valid name (both raise after the loader fix), whitespace-bearing and hyphenated task names, mixed valid+invalid lists, and duplicate task names (pins the no-dedupe behavior so callers know they'll hit ``load_dataset`` twice). * ``TestConstructorWithoutDeepEval`` — pins the ``RuntimeError`` path for environments without the ``[accuracy]`` extras (previously unreachable because the module-level ``importorskip`` runs first). * ``TestOutputInvariants`` (4 tests) — ``prompt`` matches ``raw_messages[0]['content']``; ``problem.task`` matches ``metadata['bbh_task']``; ``metadata['generation_size']`` carries ``DEFAULT_GENERATION_SIZE``; multi-task output order preserves the caller's task order so the accuracy CSV's per-task grouping stays contiguous. * ``TestLoadDatasetInvocation`` — asserts ``load_dataset`` is called with ``("lukaemon/bbh", task.value)`` positionally for each task. Regression guard against ``DATASET_NAME`` renames or a future swap to kwargs. * ``TestPathologicalRowContent`` (2 tests) — empty ``input`` still renders; numeric ``target`` is coerced to ``str`` by Pydantic. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

for more information, see https://pre-commit.ci

…-878) CI's `make ci-install` installs only `[dev]` extras, so `deepeval` was absent and `pytest.importorskip("deepeval")` at the top of `test_bigbench_benchmark.py` skipped all 39 tests at collection. The loader was getting only ~30.5% import-only coverage in CodeCov. Replace the module-level skip with a fake-deepeval harness: - `tests/harness/fake_deepeval.py`: 27-task enum, confinement dict (verbatim from upstream), and a synthetic `BigBenchHardTemplate` that honours the loader's contracts (n_shots affects length, CoT branch differs from non-CoT) without reproducing upstream prompt bytes. - `tests/unit/accuracy/conftest.py`: autouse function-scope fixture patches the loader module's deepeval names with the fakes when the real install is absent. Does NOT inject into `sys.modules`, so the HellaSwag tests' own `importorskip` skip mechanism keeps working. Also skips items tagged `@pytest.mark.requires_deepeval` at collection. - Tag `TestPromptByteEqualWithDeepEval` with `requires_deepeval` — the four byte-equality assertions read specific strings out of DeepEval's bundled `.txt` files and can't be reproduced by a fake. - Register the `requires_deepeval` marker in `pyproject.toml`. Also fix `test_numeric_target_coerced_to_string` by adding defensive `str(row[TARGET_FIELD])` in `_build_subtask_problems`. The test was written against an aspirational Pydantic lax-validation contract that `BenchmarkProblem.ground_truth` doesn't honour; explicit coercion in the loader mirrors DeepEval's own `str(expected_output)` pattern. Result on `src/aiperf/accuracy/benchmarks/bigbench.py`: - Without [accuracy]: 30.5% -> 97% (3 unreachable real-import lines) - With [accuracy]: 100% (all 39 tests run including byte-equality) Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

src/aiperf/accuracy/benchmarks/bigbench.py (1)

33-33: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use DatasetDict typing for load_dataset without split.

Line 180 annotates ds as Dataset, but this call form returns split mappings and is used as ds["test"] on Line 185. The type should be DatasetDict for an accurate API contract.

Proposed fix

-from datasets import Dataset, load_dataset
+from datasets import DatasetDict, load_dataset
@@
-            ds: Dataset = await asyncio.to_thread(
+            ds: DatasetDict = await asyncio.to_thread(
                 load_dataset, DATASET_NAME, task.value
             )

For datasets==3.0.0, what is the return type of load_dataset(path, name) when split is omitted?

Also applies to: 180-186

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/accuracy/benchmarks/bigbench.py` at line 33, The variable
annotated as Dataset (ds) is actually a split mapping returned by load_dataset
when no split is provided; update the import to include DatasetDict and change
the type of ds from Dataset to DatasetDict where it’s assigned/annotated (e.g.,
the ds variable used with ds["test"] in the bigbench benchmark code), ensuring
any type hints or function signatures reflect DatasetDict instead of Dataset.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/accuracy/accuracy_stubs.md`:
- Line 10: Update the “Method Count Summary” so it matches the status summary:
mark MultipleChoiceGrader, MathGrader, CodeExecutionGrader, LightevalExprGrader,
LightevalLatexGrader, LightevalGPQAGrader, ExactMatchGrader as implemented
graders and mark MMLUBenchmark, AIMEBenchmark, HellaSwagBenchmark,
BigBenchBenchmark as implemented benchmarks; adjust the counts and remove those
names from the stub/NotImplementedError list (leave aime24, aime25, math_500,
gpqa_diamond, lcb_codegeneration as stubs). Ensure the Method Count Summary uses
the same wording and numbers as the status summary so both sections are
consistent.

In `@tests/unit/accuracy/conftest.py`:
- Line 43: Add explicit type annotations to the pytest hook and fixture
signatures: annotate the pytest_collection_modifyitems function parameters and
return type (use config: pytest.Config, items: list[pytest.Item] -> None) and
likewise add appropriate parameter/return annotations to the fixture defined at
line 58 (e.g., request: pytest.FixtureRequest or the specific types it uses and
-> Generator/Any/None as appropriate). Update imports to include typing or
pytest symbols if needed so the signatures are fully typed.

In `@tests/unit/accuracy/test_bigbench_benchmark.py`:
- Line 46: _add explicit type annotations: give _make_run a concrete return type
instead of implicit Any, and annotate _per_task_loader with a precise Callable
return type and annotate the nested loader's parameter and return types.
Specifically, update the signatures of _make_run, _per_task_loader, and the
inner loader function (named loader) to use typing annotations (e.g., import
Callable, Iterable/Iterator, Union or the specific Run/Task/Example types used
in the test) so the outer function returns the correct typed callable and the
inner loader has explicit parameter and return types rather than untyped
args/returns.

---

Duplicate comments:
In `@src/aiperf/accuracy/benchmarks/bigbench.py`:
- Line 33: The variable annotated as Dataset (ds) is actually a split mapping
returned by load_dataset when no split is provided; update the import to include
DatasetDict and change the type of ds from Dataset to DatasetDict where it’s
assigned/annotated (e.g., the ds variable used with ds["test"] in the bigbench
benchmark code), ensuring any type hints or function signatures reflect
DatasetDict instead of Dataset.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d5e9fc2f-4084-4734-ba75-efc96b3304e3

📥 Commits

Reviewing files that changed from the base of the PR and between 7d1aaa7 and 56337d7.

📒 Files selected for processing (9)

docs/accuracy/accuracy-benchmarking.md
docs/accuracy/accuracy_stubs.md
pyproject.toml
src/aiperf/accuracy/benchmarks/bigbench.py
src/aiperf/plugin/plugins.yaml
tests/harness/fake_deepeval.py
tests/unit/accuracy/conftest.py
tests/unit/accuracy/test_accuracy_config.py
tests/unit/accuracy/test_bigbench_benchmark.py

✅ Files skipped from review due to trivial changes (1)

docs/accuracy/accuracy-benchmarking.md

The Method Count Summary table in `docs/accuracy/accuracy_stubs.md` was still claiming `Graders: 1 implemented / 3 stubbed` and `Benchmarks: 1 implemented / 8 stubbed` — stale since the AIME + HellaSwag + BigBench loaders and the lighteval / exact_match / math / code_execution graders all landed. The status-summary paragraph at the top of the same file was already up to date, so the two sections disagreed. Update the table to match the status summary and the implemented/stubbed grader/benchmark tables further down: 7 graders implemented (all done), 4 benchmarks implemented (MMLU/AIME/HellaSwag/ BigBench), 5 benchmarks still stubbed (aime24/aime25/math_500/ gpqa_diamond/lcb_codegeneration), total 15 implemented / 6 stubbed / 6 remaining methods. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

…stubs (AIP-878) The "Suggested Implementation Order" still recommended implementing `ExactMatchGrader` and `MathGrader` next and pointed at `MultipleChoiceGrader` as the canonical grader reference, but all seven graders are now done. Drop the grader step entirely and re-target the benchmark step at the five remaining stubs (`aime24`, `aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) with each paired to the closest existing loader and its grader (math via `AIMEBenchmark`, multiple_choice via `MMLUBenchmark`, code_execution via `MMLUBenchmark` scaffolding). Pairings match the Default Grader column in the "Still Stubbed" benchmarks table earlier in the same file. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

`_make_run`, `_per_task_loader`, and the inner `loader` callback all had implicit `Any` signatures. Add explicit annotations: - `_make_run() -> BenchmarkRun` (TYPE_CHECKING import to keep the runtime cost zero). - `_per_task_loader(per_task: dict[...]) -> Callable[..., dict[str, Any]]`. - The inner `loader(_dataset_name: str, task_name: str | None = None, **_kwargs: Any) -> dict[str, Any]`. Inside the inner body, gate the dict lookup on `task_name is not None` so mypy doesn't complain about passing `None` to `dict[str, …].get`. Semantics are unchanged — `per_task.get(None, [])` already returned `[]` because `None` is never a key. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

`_build_subtask_problems` now appends the per-task confinement string (e.g. "Output 'True' or 'False'. Full answer not needed.") onto the ``BigBenchHardTemplate.generate_output(...)`` result, so the rendered prompt the LLM sees ends with the constraint. Matches the trt-llm benchmark recipe's flow and parallels the equivalent change AIP-877 made for HellaSwag (`fix(accuracy): append DeepEval confinement instruction to HellaSwag prompt`). Mechanically: - Bind the raw template output to ``template_prompt``. - ``prompt = f"{template_prompt}{bbh_confinement_statements_dict[task]}"``. - Use the combined ``prompt`` for both the chat message content and ``BenchmarkProblem.prompt``. Direct subscript ``[task]`` rather than ``.get(task, "")`` because the confinement dict is exhaustive across all 27 BBH subtasks; a missing key would be an upstream regression we want to surface loudly. ``metadata["confinement"]`` retains the ``.get(task, "")`` form so callers can still read the confinement without re-parsing the prompt. Rename ``test_query_appended_at_end`` to ``test_query_appended_before_confinement`` and update its assertion to pin the new shape: the "Q: <input>\\nA: " segment is immediately followed by the confinement instruction, and the prompt now ends with the confinement's terminal phrase ("Full answer not needed."). Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

…rk-loader

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/accuracy/test_bigbench_benchmark.py`:
- Around line 72-84: The loader inner function in _per_task_loader currently
types the variadic kwargs as **_kwargs: Any; remove the explicit Any annotation
so the signature uses an unannotated variadic keyword parameter (**_kwargs)
instead. Update the loader definition in _per_task_loader (function name:
loader, parent: _per_task_loader) to drop the ": Any" on **_kwargs while keeping
the parameter name and behavior unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7650396c-89c5-457c-bc5d-a230921f49a0

📥 Commits

Reviewing files that changed from the base of the PR and between 56337d7 and 91fba38.

📒 Files selected for processing (4)

docs/accuracy/accuracy_stubs.md
pyproject.toml
src/aiperf/accuracy/benchmarks/bigbench.py
tests/unit/accuracy/test_bigbench_benchmark.py

✅ Files skipped from review due to trivial changes (1)

docs/accuracy/accuracy_stubs.md

ajcasagrande

Great work man!

github-actions Bot added the feat label May 12, 2026

This was referenced May 12, 2026

feat(accuracy): implement AIME accuracy benchmark #849

Merged

feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923

Merged

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from 88bafae to 3c150cd Compare May 12, 2026 23:24

debermudez force-pushed the dbermudez/aip-878-implement-bigbench-benchmark-loader branch from b5f8fb3 to 362fa36 Compare May 12, 2026 23:24

debermudez marked this pull request as draft May 12, 2026 23:27

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from b93469a to 45a2dff Compare May 13, 2026 00:33

debermudez force-pushed the dbermudez/aip-878-implement-bigbench-benchmark-loader branch from fed3c38 to 8f988d9 Compare May 13, 2026 00:36

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from 439b1ad to 923b9c9 Compare May 13, 2026 21:17

debermudez force-pushed the dbermudez/aip-878-implement-bigbench-benchmark-loader branch from 8f988d9 to 1601f56 Compare May 13, 2026 21:22

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from 923b9c9 to 24e1ef3 Compare May 13, 2026 22:04

debermudez force-pushed the dbermudez/aip-878-implement-bigbench-benchmark-loader branch from 1601f56 to 03d59ff Compare May 13, 2026 22:04

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from 24e1ef3 to dcb42d0 Compare May 13, 2026 23:32

debermudez force-pushed the dbermudez/aip-878-implement-bigbench-benchmark-loader branch from 03d59ff to 718f430 Compare May 13, 2026 23:32

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch 4 times, most recently from c71dbac to f10c9a7 Compare May 19, 2026 17:32

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from f3097f6 to 98a2c25 Compare May 22, 2026 03:50

Base automatically changed from dbermudez/aip-877-implement-hellaswag-benchmark-loader to main May 22, 2026 16:52

debermudez force-pushed the dbermudez/aip-878-implement-bigbench-benchmark-loader branch from 718f430 to e15f931 Compare May 22, 2026 20:08

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

Comment thread src/aiperf/accuracy/benchmarks/bigbench.py

debermudez and others added 5 commits May 22, 2026 16:06

[pre-commit.ci] auto fixes from pre-commit.com hooks

5e9cab6

for more information, see https://pre-commit.ci

debermudez force-pushed the dbermudez/aip-878-implement-bigbench-benchmark-loader branch from 36aa3b4 to 56337d7 Compare May 22, 2026 23:06

debermudez marked this pull request as ready for review May 22, 2026 23:06

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

Comment thread docs/accuracy/accuracy_stubs.md

Comment thread tests/unit/accuracy/conftest.py

Comment thread tests/unit/accuracy/test_bigbench_benchmark.py Outdated

dynamo-ops reviewed May 22, 2026

View reviewed changes

Comment thread src/aiperf/accuracy/benchmarks/bigbench.py Outdated

debermudez added 3 commits May 22, 2026 16:24

dynamo-ops approved these changes May 22, 2026

View reviewed changes

Merge branch 'main' into dbermudez/aip-878-implement-bigbench-benchma…

91fba38

…rk-loader

coderabbitai Bot reviewed May 23, 2026

View reviewed changes

Comment thread tests/unit/accuracy/test_bigbench_benchmark.py

dynamo-ops approved these changes May 23, 2026

View reviewed changes

ajcasagrande approved these changes May 23, 2026

View reviewed changes

ajcasagrande merged commit 8f45778 into main May 23, 2026
26 checks passed

ajcasagrande deleted the dbermudez/aip-878-implement-bigbench-benchmark-loader branch May 23, 2026 05:57

Conversation

debermudez commented May 12, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented May 12, 2026

Uh oh!

github-actions Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Try out this PR

Uh oh!

github-actions Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

debermudez commented May 12, 2026

Stack dependency

Uh oh!

codecov Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ajcasagrande left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

debermudez commented May 12, 2026 •

edited by coderabbitai Bot

Loading

github-actions Bot commented May 12, 2026 •

edited

Loading

github-actions Bot commented May 12, 2026 •

edited

Loading

codecov Bot commented May 13, 2026 •

edited

Loading

coderabbitai Bot commented May 22, 2026 •

edited

Loading