feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) by debermudez · Pull Request #923 · ai-dynamo/aiperf

debermudez · 2026-05-12T23:05:35Z

Plugin: registers hellaswag benchmark in plugins.yaml (default_grader: multiple_choice, default_n_shots: 0); scaffold loader raises NotImplementedError until the full DeepEval-backed implementation lands.
ExactMatchGrader: registers exact_match grader plugin in plugins.yaml; scaffold raises NotImplementedError — wired here because it is needed by the upcoming BigBench PR (AIP-878).
HellaSwag is a commonsense reasoning benchmark for sentence-completion tasks aligned with the trt-llm-benchmark-recipe DeepEval-backed hellaswag path.
Depends on AIP-874 for the grader infrastructure ([accuracy] optional group).

Reference: trt-llm-benchmark-recipe DeepEval-backed hellaswag evaluation

Summary by CodeRabbit

New Features
- HellaSwag benchmark implemented with exact-match grading and a default of 10 few-shot examples.
Documentation
- Accuracy docs updated to mark HellaSwag and Exact Match as implemented and document defaults and semantics.
Tests
- Added comprehensive unit tests for HellaSwag prompt/problem construction and Exact Match grading behavior.
Chores
- Added optional support for an upstream evaluation library to enable benchmark prompt generation.

copy-pr-bot · 2026-05-12T23:05:38Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-12T23:05:45Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@be3620de923fa3d63f51713882aebcd07e56f0d4

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@be3620de923fa3d63f51713882aebcd07e56f0d4

Last updated for commit: be3620d • Browse code

github-actions · 2026-05-12T23:06:27Z

Fern Docs Preview: https://nvidia-preview-a58760fc-7f56-43de-844c-c0e81136bac4.docs.buildwithfern.com/aiperf/dev

debermudez · 2026-05-12T23:10:30Z

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

AIP-874 — feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
AIP-877 — feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
AIP-878 — feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
AIP-875 — feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
AIP-876 — feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
AIP-879 — feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
AIP-880 — feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
AIP-881 — feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 2 of 8 — base branch is dbermudez/aip-874-implement-aime-benchmark-loader,
depends on #849 (AIP-874) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

codecov · 2026-05-13T22:15:34Z

Codecov Report

❌ Patch coverage is 33.69565% with 61 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/aiperf/accuracy/benchmarks/hellaswag.py	26.50%	61 Missing ⚠️

📢 Thoughts on this report? Let us know!

coderabbitai · 2026-05-14T18:07:08Z

Caution

Review failed

Pull request was closed or merged during review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7d2fc67a-8e1e-43b5-8fea-d938366cd0c8

📥 Commits

Reviewing files that changed from the base of the PR and between 98a2c25 and be3620d.

📒 Files selected for processing (1)

pyproject.toml

Walkthrough

Implements ExactMatchGrader and a DeepEval-aligned HellaSwagBenchmark, adds deepeval>=2.9.0, updates plugin metadata and docs, and adds unit tests validating grader and benchmark behaviors.

Changes

HellaSwag benchmark + ExactMatch grader

Layer / File(s)	Summary
ExactMatchGrader implementation & contract `src/aiperf/accuracy/graders/exact_match.py`	Implements `extract_answer` (returns `response_text.strip()` or `""`) and `grade` (strict equality of stripped pred vs ground truth, sets `unparsed`, `confidence`, `reasoning`, and returns `GradingResult`).
ExactMatchGrader unit tests `tests/unit/accuracy/test_exact_match_grader.py`	New test suite covering strict whitespace stripping, case sensitivity, punctuation differences, empty/whitespace handling (unparsed semantics), multi-line/explanation-prefix rejection, Unicode behavior, extract_answer contract, and GradingResult fields.
HellaSwag benchmark foundation & helpers `src/aiperf/accuracy/benchmarks/hellaswag.py`	Adds module docstring, conditional DeepEval imports with install hint, dataset/task constants, `_build_unique_activity_label_shots_set`, and `_resolve_tasks` (case-insensitive value and enum-name fallback).
HellaSwag benchmark runtime & problem building `src/aiperf/accuracy/benchmarks/hellaswag.py`	Implements `__init__` (fail-fast if deepeval missing), `load_problems` (n_shots validation, async dataset load, task resolution), `_build_problems` (filter by activity_label, render DeepEval-aligned prompts with `HellaSwagTemplate.format_question` + `generate_output`, map labels to `A`–`D`, build `BenchmarkProblem`), and `_build_chat_messages` (single user message).
HellaSwagBenchmark unit tests `tests/unit/accuracy/test_hellaswag_benchmark.py`	New tests (skipped when deepeval not installed) patch `load_dataset`, assert constants, unique-shot selection, `_resolve_tasks` behaviors (None/all/value/enum-name/case-insensitive), prompt formatting parity zero-/few-shot, ground-truth letter mapping, activity_label filtering, enable_cot transparency, n_shots cap, empty/invalid-label handling, and raw_messages shape.
Plugin, dependency, test config, and docs `pyproject.toml`, `src/aiperf/plugin/plugins.yaml`, `tests/unit/accuracy/test_accuracy_config.py`, `docs/accuracy/*`	Adds `deepeval>=2.9.0` to accuracy extras; marks `exact_match` and `hellaswag` implemented in plugins.yaml and updates defaults (`default_grader: exact_match`, `default_n_shots: 10`); removes `hellaswag`/`exact_match` from stub lists in tests and updates uppercase validation; moves HellaSwag and ExactMatch entries from stubbed to implemented in documentation and adds `hellaswag` to benchmarking table.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop through prompts with whiskers bright,
I strip the edges, match by byte.
HellaSwag arrives with templates true,
Tests and prompts align, byte-equal too.
Hop, assert, and pass — a rabbit’s delight.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 26.32% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877)' accurately and specifically summarizes the main changes: implementing a HellaSwag benchmark and ExactMatch grader using DeepEval.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (4)

src/aiperf/accuracy/benchmarks/hellaswag.py (3)
153-153: ⚡ Quick win

Drop the Any annotation on **kwargs.

Per the established convention in this repo, variadic keyword arguments should be left untyped rather than annotated as **kwargs: Any. The AI summary calls out this typing change explicitly, so it's worth correcting before merge.
♻️ Proposed fix
-    def __init__(self, user_config: UserConfig, **kwargs: Any) -> None:
+    def __init__(self, user_config: UserConfig, **kwargs) -> None:
Based on learnings: "In Python projects (e.g., in aiperf), avoid adding type annotations to **kwargs like **kwargs: Any. The variadic keyword arguments are inherently dynamic; leave **kwargs untyped or replace with explicit, named keyword parameters if a concrete contract is needed."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/accuracy/benchmarks/hellaswag.py` at line 153, The __init__
signature in the HellaSWAG benchmark currently types the variadic keyword
parameter as **kwargs: Any; remove the explicit Any annotation and leave it
untyped (use **kwargs) in the __init__ method (the constructor taking
user_config: UserConfig) so it follows the repository convention for variadic
keyword arguments.
219-219: 💤 Low value

Defensive bound on the gold-letter index.

choices[int(label_raw)] will IndexError if a row ever has a label outside 0–3 (or int()-incompatible content like "abc"). The Rowan/hellaswag schema today is 0–3 so this is mostly defensive, but a clearer try/except (or explicit range check) drops the row and logs rather than aborting the whole load. Optional — feel free to skip if the dataset contract is considered authoritative.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/accuracy/benchmarks/hellaswag.py` at line 219, Wrap the
conversion/indexing of choices[int(label_raw)] in a defensive try/except that
catches ValueError and IndexError, logs the offending label_raw (and row
identifier if available) and skips that row instead of letting the exception
abort the load; update the code around gold_letter, choices and label_raw to
validate int(label_raw) is in range(len(choices)) or use try: gold_letter =
choices[idx] except (ValueError, IndexError): logger.warning(...); continue so
malformed or out-of-range labels are dropped gracefully.
192-234: ⚡ Quick win

Quadratic scan of the validation split for tasks=["all"].

_build_problems iterates the full val_set once per task. When the caller passes None/"all", _resolve_tasks returns every HellaSwagTask member (~190 categories), so this becomes an O(tasks × val_rows) walk over a ~10k-row split — roughly 1.9M iterations per load_problems call, all on the executor thread. Indexing the validation rows by activity_label once and then iterating the right bucket per task makes this O(N + M).
♻️ Proposed refactor
     def _build_problems(
         self, ds: DatasetDict, tasks: list[Any], n_shots: int
     ) -> list[BenchmarkProblem]:
         train_set = ds["train"]
         shots_set = _build_unique_activity_label_shots_set(train_set)
         val_set = ds["validation"]
         problems: list[BenchmarkProblem] = []
         choices = ["A", "B", "C", "D"]
+        # Pre-bucket validation rows by activity_label so each task
+        # only scans its own subset instead of the whole split.
+        val_by_label: dict[str, list[dict[str, Any]]] = {}
+        for row in val_set:
+            label = row.get(ACTIVITY_LABEL_FIELD)
+            if label is None:
+                continue
+            val_by_label.setdefault(label, []).append(row)
         for task in tasks:
-            for row in val_set:
-                if row.get(ACTIVITY_LABEL_FIELD) != task.value:
-                    continue
+            for row in val_by_label.get(task.value, ()):
                 label_raw = row.get(LABEL_FIELD)
                 if label_raw == "" or label_raw is None:
                     continue
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/accuracy/benchmarks/hellaswag.py` around lines 192 - 234, The
current _build_problems scans val_set once per task causing O(tasks × val_rows);
fix it by first building a dict index mapping activity_label -> list of
validation rows (keyed by ACTIVITY_LABEL_FIELD) once, then iterate tasks and for
each use bucket = index.get(task.value, []) instead of scanning val_set; keep
the existing logic that builds shots_set, uses
HellaSwagTemplate.format_question/generate_output, maps label_raw to gold_letter
via choices and appends BenchmarkProblem with the same fields and raw_messages
from self._build_chat_messages(prompt).
tests/unit/accuracy/test_hellaswag_benchmark.py (1)
327-332: 💤 Low value

Consider asserting on the constant rather than the literal "15".

The match="at most 15" substring couples the test to the hardcoded message text. Using match=f"at most {MAX_N_SHOTS}" (the constant is already imported) keeps the test correct if the cap ever changes. Minor.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/accuracy/test_hellaswag_benchmark.py` around lines 327 - 332, The
test hardcodes the string "at most 15" in the pytest.raises match; change it to
reference the imported constant so the test follows the configured cap. In
TestNShotsCap::test_n_shots_above_15_raises replace match="at most 15" with
match=f"at most {MAX_N_SHOTS}" (the constant MAX_N_SHOTS is already imported) so
the assertion uses the canonical limit used by HellaSwagBenchmark.load_problems.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/accuracy/accuracy_stubs.md`:
- Line 173: The table entry for HellaSwagBenchmark is out of sync with
plugins.yaml: update the default grader and n-shots to match the plugin settings
by changing the grader from `multiple_choice` to `exact_match` and the `0` shots
to `10` in the HellaSwagBenchmark row; ensure the descriptive text (the
`HellaSwagBenchmark` row contents and any mention of "renders 4-choice MCQs") is
consistent with using `exact_match` and default_n_shots 10 so readers and
HellaSwagBenchmark references reflect the plugin defaults.

In `@src/aiperf/accuracy/benchmarks/hellaswag.py`:
- Around line 105-141: The docstring says matching is case-insensitive but
_resolve_tasks currently checks activity_label strings case-sensitively against
valid_values (HellaSwagTask.value); make value-matching truly case-insensitive
by building a mapping from lowercased enum values to the enum member (e.g.,
lower_value_map = {t.value.lower(): t for t in HellaSwagTask}), then when
iterating names use name_lower = name.lower() to look up in that map (and append
the found HellaSwagTask), keep the existing upper-snake-case fallback using
getattr(HellaSwagTask, name.upper(), None), and preserve the unknown collection
and ValueError behavior if neither lookup succeeds so the error message remains
the same.

In `@tests/unit/accuracy/test_hellaswag_benchmark.py`:
- Around line 208-212: The failing codespell match comes from the literal
sequences containing "nD" in the test assertions (the prompt string and the
options block asserted against variable prompt); instead of relying on "#
codespell:ignore", break or rework the string literals so "nD" never appears as
a single token—e.g., use implicit adjacent string concatenation or split the
offending token across two strings when forming the prompt and the option block
used in the assertions (the lines with the prompt variable and the assertion
containing "A. applies\nB. watches\nC. sings\nD. cooks\nAnswer:"); update those
assertions to compare against the restructured strings and remove the
ineffective pragma.

---

Nitpick comments:
In `@src/aiperf/accuracy/benchmarks/hellaswag.py`:
- Line 153: The __init__ signature in the HellaSWAG benchmark currently types
the variadic keyword parameter as **kwargs: Any; remove the explicit Any
annotation and leave it untyped (use **kwargs) in the __init__ method (the
constructor taking user_config: UserConfig) so it follows the repository
convention for variadic keyword arguments.
- Line 219: Wrap the conversion/indexing of choices[int(label_raw)] in a
defensive try/except that catches ValueError and IndexError, logs the offending
label_raw (and row identifier if available) and skips that row instead of
letting the exception abort the load; update the code around gold_letter,
choices and label_raw to validate int(label_raw) is in range(len(choices)) or
use try: gold_letter = choices[idx] except (ValueError, IndexError):
logger.warning(...); continue so malformed or out-of-range labels are dropped
gracefully.
- Around line 192-234: The current _build_problems scans val_set once per task
causing O(tasks × val_rows); fix it by first building a dict index mapping
activity_label -> list of validation rows (keyed by ACTIVITY_LABEL_FIELD) once,
then iterate tasks and for each use bucket = index.get(task.value, []) instead
of scanning val_set; keep the existing logic that builds shots_set, uses
HellaSwagTemplate.format_question/generate_output, maps label_raw to gold_letter
via choices and appends BenchmarkProblem with the same fields and raw_messages
from self._build_chat_messages(prompt).

In `@tests/unit/accuracy/test_hellaswag_benchmark.py`:
- Around line 327-332: The test hardcodes the string "at most 15" in the
pytest.raises match; change it to reference the imported constant so the test
follows the configured cap. In TestNShotsCap::test_n_shots_above_15_raises
replace match="at most 15" with match=f"at most {MAX_N_SHOTS}" (the constant
MAX_N_SHOTS is already imported) so the assertion uses the canonical limit used
by HellaSwagBenchmark.load_problems.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5a065e80-9e1f-4b07-87d5-d5129709a5fe

📥 Commits

Reviewing files that changed from the base of the PR and between 837bda0 and fcee141.

📒 Files selected for processing (9)

docs/accuracy/accuracy-benchmarking.md
docs/accuracy/accuracy_stubs.md
pyproject.toml
src/aiperf/accuracy/benchmarks/hellaswag.py
src/aiperf/accuracy/graders/exact_match.py
src/aiperf/plugin/plugins.yaml
tests/unit/accuracy/test_accuracy_config.py
tests/unit/accuracy/test_exact_match_grader.py
tests/unit/accuracy/test_hellaswag_benchmark.py

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/accuracy/benchmarks/hellaswag.py`:
- Around line 189-194: The current guard only enforces the upper bound
(MAX_N_SHOTS) but lets negative n_shots proceed; update the validation around
n_shots (the same block that references MAX_N_SHOTS and the HellaSwag benchmark)
to also reject negative values by raising a ValueError when n_shots < 0 (or
n_shots is negative), with a clear message like "n_shots must be non-negative"
that mirrors the existing style; keep the existing upper-bound check for
MAX_N_SHOTS and combine both checks in the same validation block that currently
contains the f"HellaSwag supports at most {MAX_N_SHOTS} ..." error.
- Around line 120-121: The current selector logic returns list(HellaSwagTask)
whenever "all" appears in tasks, which lets inputs like ["all","typo"] bypass
validation; change the guard in the task-selection routine that references tasks
and HellaSwagTask so that you only treat "all" as meaning "select every task"
when tasks is empty or exactly ["all"] (case-insensitive). Specifically, if not
tasks or tasks == ["all"] (ignoring case) return list(HellaSwagTask); if "all"
is present alongside other selectors raise a ValueError (or similar) explaining
that "all" cannot be mixed with other task names so invalid selectors don’t get
silently ignored.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0d738e6e-7dfd-4eb5-b2a5-0b9dec70791c

📥 Commits

Reviewing files that changed from the base of the PR and between 41d5d73 and c71dbac.

📒 Files selected for processing (5)

docs/accuracy/accuracy-benchmarking.md
docs/accuracy/accuracy_stubs.md
pyproject.toml
src/aiperf/accuracy/benchmarks/hellaswag.py
src/aiperf/accuracy/graders/exact_match.py

✅ Files skipped from review due to trivial changes (1)

docs/accuracy/accuracy-benchmarking.md

🚧 Files skipped from review as they are similar to previous changes (1)

src/aiperf/accuracy/graders/exact_match.py

matthewkotila

Solid PR. The reference-parity discipline (byte-equal-to-DeepEval pinned via direct HellaSwagTemplate.generate_output() comparison in the tests) and the three principled fixup commits (validate-before-download, confinement-string append, reject 'all' mixed with other selectors) raise the bar above "ship a benchmark that runs."

Verification I did locally before writing this:

58/58 unit tests pass against the installed deepeval==4.0.3 (a full major above the floor) — the deep import paths into deepeval.benchmarks.hellaswag.template survived the bump cleanly.
End-to-end smoke run: HellaSwag → mock server → grader → exporter. Pipeline completes, 5/5 requests dispatched, accuracy CSV + console output render correctly with spaced task name "Applying sunscreen". inputs.json confirms the DEEPEVAL_CONFINEMENT string is appended to live prompts exactly as the fixup commit pins.
Console + CSV exporters correctly handle spaces in task names (cross-checked via direct exercise — no false delimiter splits, no transform).
enable_cot=True confirmed silently dropped (see inline comment on the docstring).

Closed by verification: the original "deep imports will silently break across deepeval majors" concern is substantially weakened; spaced-task-name exporter handling is fine; confinement string reaches real prompts.

Recommended before merge:

Add the enable_cot log line (see inline).
Cap deepeval (see inline).
Rebase to current main — branch is 3 commits behind merge-base on unrelated bits, inflating the visible diff against main.
Optional: pin the HF dataset revision (see inline).

…ked (AIP-877) Adds HellaSwag commonsense reasoning benchmark + the strict ExactMatchGrader that DeepEval's ``Scorer.exact_match_score`` uses. Both align byte-for- byte with the trt-llm benchmark recipe's DeepEval-backed configuration. **HellaSwagBenchmark** (`src/aiperf/accuracy/benchmarks/hellaswag.py`) - Loads ``Rowan/hellaswag``: validation split filtered per task by ``activity_label``, train split feeds the "one few-shot per unique activity_label" rule (mirrors ``deepeval.benchmarks.HellaSwag``'s ``categories_seen`` dedupe loop). - Prompt rendering delegates to ``deepeval.benchmarks.hellaswag.template.HellaSwagTemplate.generate_output`` — output is byte-equal to what the trt-llm recipe ships. - Defaults: ``n_shots=10`` (DeepEval cap is 15), ``generation_size=5``, ``default_grader=exact_match``. - Ground truth: bare ``A``/``B``/``C``/``D`` letter (DeepEval's convention for ``Scorer.exact_match_score``). - ``_resolve_tasks`` matches activity labels case-insensitively via a lowercased-value map; falls back to upper-snake-case enum name (``HellaSwagTask.APPLYING_SUNSCREEN`` form) for the recipe's ``getattr(HellaSwagTask, name.upper(), None)`` parity. **ExactMatchGrader** (`src/aiperf/accuracy/graders/exact_match.py`) - Strict ``pred.strip() == gold.strip()`` semantics matching DeepEval's ``Scorer.exact_match_score``: case-sensitive, no normalization, empty response → ``unparsed=True``. - Used by HellaSwag and (in AIP-878) BigBench-Hard for reference parity. **Plugin registration** (`src/aiperf/plugin/plugins.yaml`) - ``hellaswag`` → ``default_grader: exact_match``, ``default_n_shots: 10`` with the DeepEval-backed description. - ``exact_match`` → strict-equality description; drops the ``is_implemented: false`` flag. **Dependencies** (`pyproject.toml`) - Adds ``deepeval>=2.9.0`` to the ``[accuracy]`` optional-dependency group. Aiperf calls DeepEval's bundled prompt template directly so the dep is required for HellaSwag (and BigBench-Hard in AIP-878). **Tests** (`tests/unit/accuracy/`) - ``test_hellaswag_benchmark.py``: ~22 tests covering DeepEval prompt byte-equality, the unique-activity-label shots set rule, validation filtering, task resolution (exact, lower, upper, mixed case), and pathological dataset rows (empty validation, unlabeled rows). - ``test_exact_match_grader.py``: strict-equality semantics including the empty-response → ``unparsed=True`` path and case-sensitivity. - ``test_accuracy_config.py``: drops ``hellaswag`` from ``STUB_BENCHMARKS`` and ``exact_match`` from ``STUB_GRADERS``; the uppercase-stub test now uses ``BIGBENCH`` (a still-stub name). **Docs** (`docs/accuracy/`) - ``accuracy-benchmarking.md``: add HellaSwag row to the benchmarks table. - ``accuracy_stubs.md``: status summary + move HellaSwag from "Still Stubbed" to "Implemented"; move ExactMatchGrader to "Implemented". **Constructor signature** Loader + grader use the v2 ``BenchmarkRun`` API (post-#912 refactor on main) rather than the legacy ``UserConfig`` shape — matches how ``MMLUBenchmark`` and ``AIMEBenchmark`` are wired on current main. Validation: - 70/70 accuracy tests pass (HellaSwag + ExactMatch + AccuracyConfig). - Ruff format + ruff check clean on all modified Python files. - Codespell clean (v2.4.2, matches CI). - HellaSwag prompts verified byte-equal against ``HellaSwagTemplate.generate_output`` on synthetic fixtures. Reference: - ``deepeval/benchmarks/hellaswag/hellaswag.py`` - ``deepeval/benchmarks/hellaswag/template.py`` - ``deepeval/scorer/scorer.py:Scorer.exact_match_score`` - ``trt-llm-benchmark-recipe/src/tools/acc_benchmark.py:319-336`` Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

…IP-877) ``HellaSwagBenchmark.load_problems`` called ``load_dataset()`` before ``_resolve_tasks(tasks)``, so an invalid ``--accuracy-tasks`` value triggered a multi-MB HF download (with potential network/cache failures) just to surface the validation error. Swap the order: validate tasks first, then download. Neither call depends on the other, so the reorder is mechanical. Adds a regression test (``TestTaskValidationPrecedesDatasetDownload``) that patches ``load_dataset`` and asserts it's never called when ``tasks`` resolves to a ``ValueError``. Full HellaSwag suite: 26/26 pass. Ruff + codespell clean. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

…rompt (AIP-877) ``HellaSwagTemplate.generate_output()`` renders a prompt that ends with ``"\nAnswer:"`` and never instructs the model to emit just a letter. DeepEval enforces bare-letter output two ways: primary: ``model.generate(prompt, schema=MultipleChoiceSchema)`` — structured generation pinned to A/B/C/D. fallback: on ``TypeError`` (when the model client doesn't accept ``schema=``), append ``self.confinement_instructions`` to the prompt and call ``model.generate(prompt)``. The default ``confinement_instructions`` set when the user passes ``None`` is: "Output 'A', 'B', 'C', or 'D'. Full answer not needed." Aiperf takes neither path: we send the rendered template prompt verbatim to whatever OpenAI-compatible endpoint the user pointed at, which corresponds to DeepEval's fallback path WITHOUT the confinement-string append. ``ExactMatchGrader`` (mirroring ``Scorer.exact_match_score``) then strictly compares the response against the bare ``A``/``B``/``C``/``D`` gold, so verbose-but-correct responses like ``"The answer is A."`` grade as wrong — systematically under-grading vs DeepEval reference numbers. Append the confinement string to the rendered prompt in ``_build_problems``, separated by ``\n\n`` (matching DeepEval's ``prompt += f"\n\n{self.confinement_instructions}"`` in ``predict()``). Pull the literal into a module-level constant ``DEEPEVAL_CONFINEMENT`` so the source-of-truth is documented and the test suite can pin it byte-equal. Adds three regression tests: * ``test_constant_matches_deepeval_default`` — pins the constant matches DeepEval's hardcoded default in ``HellaSwag.__init__``. * ``test_prompt_ends_with_confinement`` — every rendered prompt ends with ``"\n\n{DEEPEVAL_CONFINEMENT}"``. * ``test_template_output_is_a_prefix_of_prompt`` — preserves the ``HellaSwagTemplate.generate_output`` byte-equal pin by asserting the template output is a strict prefix of the final prompt, with only the confinement suffix added. Full HellaSwag suite: 29/29 pass (was 26). Ruff + codespell clean. Reference: deepeval/benchmarks/hellaswag/hellaswag.py:HellaSwag.__init__ deepeval/benchmarks/hellaswag/hellaswag.py:HellaSwag.predict Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

… (AIP-877) The previous ``_resolve_tasks`` guard treated ANY occurrence of ``"all"`` in the task list as "select every task": if not tasks or "all" in {t.lower() for t in tasks}: return list(HellaSwagTask) So ``["all", "NOT_A_REAL_TASK"]`` returned all 192 tasks and silently swallowed the typo. Same trap with ``["all", "Applying sunscreen"]`` — the model would evaluate every task instead of just the one specified, with no error surfaced. Tighten the guard: * ``None`` / empty list / exactly ``["all"]`` (case-insensitive) → return every task. * ``"all"`` mixed with any other name → raise ``ValueError`` so the user sees their mistake instead of getting an unexpected 192-task run. Adds two parametrized regression test groups: * ``test_all_alone_is_case_insensitive`` (4 cases: ``all``, ``ALL``, ``All``, ``aLl``) — every casing of the bare sentinel still selects every task. * ``test_all_mixed_with_other_selectors_raises`` (4 cases: typo, real-task, real-task-then-all, duplicate-all) — every mixed input raises with the documented "'all' cannot be mixed" message. Full HellaSwag suite: 37/37 pass (was 29). Ruff + codespell clean. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

Adds an explicit upper bound to the deepeval pin so the accuracy extras follow the same >=X,<Y style used elsewhere in this file (opentelemetry-*, etc.). Reviewer verified deepeval==4.0.3 still passes the HellaSwag unit tests and an E2E run, so the cap at the next major (5.0.0) leaves headroom while preserving version-policy hygiene. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

…g (AIP-877) HellaSwag accepts ``enable_cot`` for protocol uniformity but DeepEval's HellaSwag template has no CoT variant, so the flag is silently dropped. A user setting it had no way to discover this — the rendered prompts were byte-identical with and without ``--accuracy-enable-cot`` and no log signal fired. Emits a single ``info`` line at the top of ``load_problems`` so the drop is visible, keeping the accept-but-ignore stance otherwise unchanged. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

``load_dataset("Rowan/hellaswag")`` is intentionally unpinned to match the trt-llm benchmark recipe's resolution behavior. A reviewer flagged this as an observability gap — if the upstream HF dataset is rebased or re-uploaded, the byte-equal pin against ``HellaSwagTemplate.generate_output`` silently follows along with no test-suite signal. Spells out the policy in the module docstring so the trade-off (parity-with-recipe over pin-stability) is explicit, and flags that a SHA pin should only be added when/if the recipe pins one. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

…877) The lowercased-value → ``HellaSwagTask`` enum dict was being rebuilt inside ``_resolve_tasks`` on every ``load_problems`` call (~190 entries each). Since ``HellaSwagTask`` is immutable, the dict can live at module scope and be built once. Guarded with ``if _HAS_DEEPEVAL`` so the module still imports cleanly when the optional accuracy extras aren't installed. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

…IP-877) ``_build_problems`` previously scanned the full validation set once per selected task. With ``--accuracy-tasks=all`` that's ~190 tasks over a ~10K-row split — ~1.9M activity_label comparisons per load. Pre-bucketing val_set by ``activity_label`` in one pass collapses the inner scan to a direct dict lookup, making the loop O(val_rows + tasks) without changing problem order or content. DeepEval's internal ``val_set.filter(...)`` per-task is a reference-only detail (only prompt rendering is part of the parity goal), so the faster shape is free here. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

tests/unit/accuracy/test_exact_match_grader.py (1)

54-153: ⚡ Quick win

Normalize test names to test_<function>_<scenario>_<expected>.

Several names don’t clearly encode the function-under-test and expected outcome in the required format, which hurts suite consistency.

As per coding guidelines: "Test naming: test_<function>_<scenario>_<expected> e.g. test_parse_config_missing_field_raises_error".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/accuracy/test_exact_match_grader.py` around lines 54 - 153, The
test names in TestEmptyAndUnparsed, TestMultiLineNotForgiven,
TestUnicodeAndNonAscii, TestExtractAnswerInterface and TestGradingResultFields
do not follow the required test_<function>_<scenario>_<expected> convention;
rename each test function to include the function-under-test (grade or
extract_answer), a brief scenario, and the expected outcome (for example change
test_empty_response_unparsed_and_incorrect ->
test_grade_empty_response_unparsed_false, test_whitespace_only_response_unparsed
-> test_grade_whitespace_only_unparsed_true,
test_empty_pred_and_empty_gold_neither_correct_nor_unparsed ->
test_grade_empty_pred_and_gold_not_unparsed_false,
test_multi_line_response_does_not_match_single_letter ->
test_grade_multiline_response_mismatch_false,
test_explanation_prefix_does_not_match ->
test_grade_explanation_prefix_mismatch_false, test_unicode_match ->
test_grade_unicode_exact_match_true, test_unicode_case_sensitive ->
test_grade_unicode_case_sensitive_false, test_extract_answer_strips_only ->
test_extract_answer_strips_only_returns_stripped,
test_extract_answer_preserves_inner -> test_extract_answer_preserves_inner_text,
test_extract_answer_empty -> test_extract_answer_empty_returns_empty,
test_reasoning_includes_stripped_forms ->
test_grade_reasoning_contains_stripped_forms, and
test_extracted_answer_is_stripped_pred ->
test_grade_extracted_answer_is_stripped; update any references or
parametrization to use these new names and ensure they still call
ExactMatchGrader.grade and grader.extract_answer unchanged.

tests/unit/accuracy/test_hellaswag_benchmark.py (1)

51-57: ⚡ Quick win

Add a return type hint to _make_run.

Please annotate the helper return type so all functions in this module are fully typed.

🔧 Suggested fix

-def _make_run():
+def _make_run() -> Any:
     return make_benchmark_run(
         model_names=["test-model"],
         endpoint_type=EndpointType.COMPLETIONS,
         streaming=False,
         accuracy={"benchmark": AccuracyBenchmarkType.HELLASWAG},
     )

As per coding guidelines: `**/*.py`: “Type hints on ALL functions (params and return)”.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/accuracy/test_hellaswag_benchmark.py` around lines 51 - 57, The
helper function _make_run lacks a return type annotation; add a precise return
type hint that matches the value returned by calling make_benchmark_run (e.g.,
the BenchmarkRun or equivalent type used across tests) so the signature reads
something like def _make_run() -> BenchmarkRun:, ensuring imports or forward
references for that type are added if necessary; reference _make_run,
make_benchmark_run, EndpointType and AccuracyBenchmarkType to locate the code.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/accuracy/benchmarks/hellaswag.py`:
- Line 262: Replace the non-ASCII multiplication glyph in the comment "loop is
O(val_rows + tasks) instead of O(tasks × val_rows)." by changing the `×`
character to an ASCII `x` (so it reads "O(tasks x val_rows)" or use `*` if
preferred), updating the comment in the same location in
src/aiperf/accuracy/benchmarks/hellaswag.py to satisfy Ruff RUF003.

In `@tests/unit/accuracy/test_exact_match_grader.py`:
- Around line 22-28: The helper function _make_run is missing a return type
annotation; add an explicit return type that matches the value returned by
make_benchmark_run (e.g., the BenchMarkRun/BenchmarkRun type used by
make_benchmark_run in this test suite) so the signature becomes def _make_run()
-> <appropriate_return_type>: and keep the existing body unchanged; reference
the function name _make_run and the factory call make_benchmark_run (and related
enums EndpointType and AccuracyBenchmarkType) to locate where to apply the
annotation.

---

Nitpick comments:
In `@tests/unit/accuracy/test_exact_match_grader.py`:
- Around line 54-153: The test names in TestEmptyAndUnparsed,
TestMultiLineNotForgiven, TestUnicodeAndNonAscii, TestExtractAnswerInterface and
TestGradingResultFields do not follow the required
test_<function>_<scenario>_<expected> convention; rename each test function to
include the function-under-test (grade or extract_answer), a brief scenario, and
the expected outcome (for example change
test_empty_response_unparsed_and_incorrect ->
test_grade_empty_response_unparsed_false, test_whitespace_only_response_unparsed
-> test_grade_whitespace_only_unparsed_true,
test_empty_pred_and_empty_gold_neither_correct_nor_unparsed ->
test_grade_empty_pred_and_gold_not_unparsed_false,
test_multi_line_response_does_not_match_single_letter ->
test_grade_multiline_response_mismatch_false,
test_explanation_prefix_does_not_match ->
test_grade_explanation_prefix_mismatch_false, test_unicode_match ->
test_grade_unicode_exact_match_true, test_unicode_case_sensitive ->
test_grade_unicode_case_sensitive_false, test_extract_answer_strips_only ->
test_extract_answer_strips_only_returns_stripped,
test_extract_answer_preserves_inner -> test_extract_answer_preserves_inner_text,
test_extract_answer_empty -> test_extract_answer_empty_returns_empty,
test_reasoning_includes_stripped_forms ->
test_grade_reasoning_contains_stripped_forms, and
test_extracted_answer_is_stripped_pred ->
test_grade_extracted_answer_is_stripped; update any references or
parametrization to use these new names and ensure they still call
ExactMatchGrader.grade and grader.extract_answer unchanged.

In `@tests/unit/accuracy/test_hellaswag_benchmark.py`:
- Around line 51-57: The helper function _make_run lacks a return type
annotation; add a precise return type hint that matches the value returned by
calling make_benchmark_run (e.g., the BenchmarkRun or equivalent type used
across tests) so the signature reads something like def _make_run() ->
BenchmarkRun:, ensuring imports or forward references for that type are added if
necessary; reference _make_run, make_benchmark_run, EndpointType and
AccuracyBenchmarkType to locate the code.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 23a76a96-7459-4a65-b142-8665bded0979

📥 Commits

Reviewing files that changed from the base of the PR and between c71dbac and 98a2c25.

📒 Files selected for processing (9)

docs/accuracy/accuracy-benchmarking.md
docs/accuracy/accuracy_stubs.md
pyproject.toml
src/aiperf/accuracy/benchmarks/hellaswag.py
src/aiperf/accuracy/graders/exact_match.py
src/aiperf/plugin/plugins.yaml
tests/unit/accuracy/test_accuracy_config.py
tests/unit/accuracy/test_exact_match_grader.py
tests/unit/accuracy/test_hellaswag_benchmark.py

✅ Files skipped from review due to trivial changes (3)

docs/accuracy/accuracy-benchmarking.md
src/aiperf/accuracy/graders/exact_match.py
docs/accuracy/accuracy_stubs.md

matthewkotila

Nice work 🎉

dynamo-ops

Previous review comments have been addressed. Approving.

…ark-loader

github-actions Bot added the feat label May 12, 2026

debermudez mentioned this pull request May 12, 2026

feat(accuracy): implement AIME accuracy benchmark #849

Merged

debermudez force-pushed the dbermudez/aip-874-implement-aime-benchmark-loader branch from 5824827 to 03e10c5 Compare May 12, 2026 23:23

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from 88bafae to 3c150cd Compare May 12, 2026 23:24

debermudez marked this pull request as draft May 12, 2026 23:27

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from b93469a to 45a2dff Compare May 13, 2026 00:33

debermudez force-pushed the dbermudez/aip-874-implement-aime-benchmark-loader branch from f87725a to 09a0337 Compare May 13, 2026 17:06

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch 2 times, most recently from 923b9c9 to 24e1ef3 Compare May 13, 2026 22:04

debermudez force-pushed the dbermudez/aip-874-implement-aime-benchmark-loader branch from 118534c to 8b16196 Compare May 13, 2026 23:21

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from 24e1ef3 to dcb42d0 Compare May 13, 2026 23:32

debermudez force-pushed the dbermudez/aip-874-implement-aime-benchmark-loader branch from 8b16196 to 44663c3 Compare May 14, 2026 04:08

Base automatically changed from dbermudez/aip-874-implement-aime-benchmark-loader to main May 14, 2026 04:25

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from dcb42d0 to fcee141 Compare May 14, 2026 18:02

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

Comment thread docs/accuracy/accuracy_stubs.md Outdated

Comment thread src/aiperf/accuracy/benchmarks/hellaswag.py

Comment thread tests/unit/accuracy/test_hellaswag_benchmark.py Outdated

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from e3dea75 to 41d5d73 Compare May 14, 2026 20:02

debermudez marked this pull request as ready for review May 14, 2026 20:58

dynamo-ops reviewed May 14, 2026

View reviewed changes

Comment thread src/aiperf/accuracy/benchmarks/hellaswag.py Outdated

Comment thread src/aiperf/plugin/plugins.yaml

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch 2 times, most recently from c71dbac to f10c9a7 Compare May 19, 2026 17:32

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread src/aiperf/accuracy/benchmarks/hellaswag.py Outdated

Comment thread src/aiperf/accuracy/benchmarks/hellaswag.py

matthewkotila reviewed May 22, 2026

View reviewed changes

debermudez added 9 commits May 21, 2026 20:17

debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from f3097f6 to 98a2c25 Compare May 22, 2026 03:50

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

Comment thread src/aiperf/accuracy/benchmarks/hellaswag.py

Comment thread tests/unit/accuracy/test_exact_match_grader.py

matthewkotila approved these changes May 22, 2026

View reviewed changes

dynamo-ops approved these changes May 22, 2026

View reviewed changes

Merge branch 'main' into dbermudez/aip-877-implement-hellaswag-benchm…

be3620d

…ark-loader

debermudez enabled auto-merge (squash) May 22, 2026 16:51

debermudez merged commit a0c173c into main May 22, 2026
23 of 25 checks passed

debermudez deleted the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch May 22, 2026 16:52

Conversation

debermudez commented May 12, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented May 12, 2026

Uh oh!

github-actions Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Try out this PR

Uh oh!

github-actions Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

debermudez commented May 12, 2026

Stack dependency

Uh oh!

codecov Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Reviews paused

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

matthewkotila left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

matthewkotila left a comment

Choose a reason for hiding this comment

Uh oh!

dynamo-ops left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

debermudez commented May 12, 2026 •

edited by coderabbitai Bot

Loading

github-actions Bot commented May 12, 2026 •

edited

Loading

github-actions Bot commented May 12, 2026 •

edited

Loading

codecov Bot commented May 13, 2026 •

edited

Loading

coderabbitai Bot commented May 14, 2026 •

edited

Loading