Skip to content

feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877)#923

Merged
debermudez merged 10 commits into
mainfrom
dbermudez/aip-877-implement-hellaswag-benchmark-loader
May 22, 2026
Merged

feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877)#923
debermudez merged 10 commits into
mainfrom
dbermudez/aip-877-implement-hellaswag-benchmark-loader

Conversation

@debermudez
Copy link
Copy Markdown
Contributor

@debermudez debermudez commented May 12, 2026

  • Plugin: registers hellaswag benchmark in plugins.yaml (default_grader: multiple_choice, default_n_shots: 0); scaffold loader raises NotImplementedError until the full DeepEval-backed implementation lands.
  • ExactMatchGrader: registers exact_match grader plugin in plugins.yaml; scaffold raises NotImplementedError — wired here because it is needed by the upcoming BigBench PR (AIP-878).
  • HellaSwag is a commonsense reasoning benchmark for sentence-completion tasks aligned with the trt-llm-benchmark-recipe DeepEval-backed hellaswag path.
  • Depends on AIP-874 for the grader infrastructure ([accuracy] optional group).

Reference: trt-llm-benchmark-recipe DeepEval-backed hellaswag evaluation

Summary by CodeRabbit

  • New Features

    • HellaSwag benchmark implemented with exact-match grading and a default of 10 few-shot examples.
  • Documentation

    • Accuracy docs updated to mark HellaSwag and Exact Match as implemented and document defaults and semantics.
  • Tests

    • Added comprehensive unit tests for HellaSwag prompt/problem construction and Exact Match grading behavior.
  • Chores

    • Added optional support for an upstream evaluation library to enable benchmark prompt generation.

Review Change Stack

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the feat label May 12, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 12, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@be3620de923fa3d63f51713882aebcd07e56f0d4

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@be3620de923fa3d63f51713882aebcd07e56f0d4

Last updated for commit: be3620dBrowse code

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 12, 2026

@debermudez
Copy link
Copy Markdown
Contributor Author

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

  1. AIP-874 — feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
  2. AIP-877 — feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
  3. AIP-878 — feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
  4. AIP-875 — feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
  5. AIP-876 — feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
  6. AIP-879 — feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
  7. AIP-880 — feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
  8. AIP-881 — feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 2 of 8 — base branch is dbermudez/aip-874-implement-aime-benchmark-loader,
depends on #849 (AIP-874) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 13, 2026

Codecov Report

❌ Patch coverage is 33.69565% with 61 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/aiperf/accuracy/benchmarks/hellaswag.py 26.50% 61 Missing ⚠️

📢 Thoughts on this report? Let us know!

@debermudez debermudez force-pushed the dbermudez/aip-874-implement-aime-benchmark-loader branch from 118534c to 8b16196 Compare May 13, 2026 23:21
@debermudez debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from 24e1ef3 to dcb42d0 Compare May 13, 2026 23:32
@debermudez debermudez force-pushed the dbermudez/aip-874-implement-aime-benchmark-loader branch from 8b16196 to 44663c3 Compare May 14, 2026 04:08
Base automatically changed from dbermudez/aip-874-implement-aime-benchmark-loader to main May 14, 2026 04:25
@debermudez debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from dcb42d0 to fcee141 Compare May 14, 2026 18:02
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 14, 2026

Caution

Review failed

Pull request was closed or merged during review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7d2fc67a-8e1e-43b5-8fea-d938366cd0c8

📥 Commits

Reviewing files that changed from the base of the PR and between 98a2c25 and be3620d.

📒 Files selected for processing (1)
  • pyproject.toml

Walkthrough

Implements ExactMatchGrader and a DeepEval-aligned HellaSwagBenchmark, adds deepeval>=2.9.0, updates plugin metadata and docs, and adds unit tests validating grader and benchmark behaviors.

Changes

HellaSwag benchmark + ExactMatch grader

Layer / File(s) Summary
ExactMatchGrader implementation & contract
src/aiperf/accuracy/graders/exact_match.py
Implements extract_answer (returns response_text.strip() or "") and grade (strict equality of stripped pred vs ground truth, sets unparsed, confidence, reasoning, and returns GradingResult).
ExactMatchGrader unit tests
tests/unit/accuracy/test_exact_match_grader.py
New test suite covering strict whitespace stripping, case sensitivity, punctuation differences, empty/whitespace handling (unparsed semantics), multi-line/explanation-prefix rejection, Unicode behavior, extract_answer contract, and GradingResult fields.
HellaSwag benchmark foundation & helpers
src/aiperf/accuracy/benchmarks/hellaswag.py
Adds module docstring, conditional DeepEval imports with install hint, dataset/task constants, _build_unique_activity_label_shots_set, and _resolve_tasks (case-insensitive value and enum-name fallback).
HellaSwag benchmark runtime & problem building
src/aiperf/accuracy/benchmarks/hellaswag.py
Implements __init__ (fail-fast if deepeval missing), load_problems (n_shots validation, async dataset load, task resolution), _build_problems (filter by activity_label, render DeepEval-aligned prompts with HellaSwagTemplate.format_question + generate_output, map labels to AD, build BenchmarkProblem), and _build_chat_messages (single user message).
HellaSwagBenchmark unit tests
tests/unit/accuracy/test_hellaswag_benchmark.py
New tests (skipped when deepeval not installed) patch load_dataset, assert constants, unique-shot selection, _resolve_tasks behaviors (None/all/value/enum-name/case-insensitive), prompt formatting parity zero-/few-shot, ground-truth letter mapping, activity_label filtering, enable_cot transparency, n_shots cap, empty/invalid-label handling, and raw_messages shape.
Plugin, dependency, test config, and docs
pyproject.toml, src/aiperf/plugin/plugins.yaml, tests/unit/accuracy/test_accuracy_config.py, docs/accuracy/*
Adds deepeval>=2.9.0 to accuracy extras; marks exact_match and hellaswag implemented in plugins.yaml and updates defaults (default_grader: exact_match, default_n_shots: 10); removes hellaswag/exact_match from stub lists in tests and updates uppercase validation; moves HellaSwag and ExactMatch entries from stubbed to implemented in documentation and adds hellaswag to benchmarking table.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop through prompts with whiskers bright,
I strip the edges, match by byte.
HellaSwag arrives with templates true,
Tests and prompts align, byte-equal too.
Hop, assert, and pass — a rabbit’s delight.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 26.32% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877)' accurately and specifically summarizes the main changes: implementing a HellaSwag benchmark and ExactMatch grader using DeepEval.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
src/aiperf/accuracy/benchmarks/hellaswag.py (3)

153-153: ⚡ Quick win

Drop the Any annotation on **kwargs.

Per the established convention in this repo, variadic keyword arguments should be left untyped rather than annotated as **kwargs: Any. The AI summary calls out this typing change explicitly, so it's worth correcting before merge.

♻️ Proposed fix
-    def __init__(self, user_config: UserConfig, **kwargs: Any) -> None:
+    def __init__(self, user_config: UserConfig, **kwargs) -> None:

Based on learnings: "In Python projects (e.g., in aiperf), avoid adding type annotations to **kwargs like **kwargs: Any. The variadic keyword arguments are inherently dynamic; leave **kwargs untyped or replace with explicit, named keyword parameters if a concrete contract is needed."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/accuracy/benchmarks/hellaswag.py` at line 153, The __init__
signature in the HellaSWAG benchmark currently types the variadic keyword
parameter as **kwargs: Any; remove the explicit Any annotation and leave it
untyped (use **kwargs) in the __init__ method (the constructor taking
user_config: UserConfig) so it follows the repository convention for variadic
keyword arguments.

219-219: 💤 Low value

Defensive bound on the gold-letter index.

choices[int(label_raw)] will IndexError if a row ever has a label outside 0–3 (or int()-incompatible content like "abc"). The Rowan/hellaswag schema today is 0–3 so this is mostly defensive, but a clearer try/except (or explicit range check) drops the row and logs rather than aborting the whole load. Optional — feel free to skip if the dataset contract is considered authoritative.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/accuracy/benchmarks/hellaswag.py` at line 219, Wrap the
conversion/indexing of choices[int(label_raw)] in a defensive try/except that
catches ValueError and IndexError, logs the offending label_raw (and row
identifier if available) and skips that row instead of letting the exception
abort the load; update the code around gold_letter, choices and label_raw to
validate int(label_raw) is in range(len(choices)) or use try: gold_letter =
choices[idx] except (ValueError, IndexError): logger.warning(...); continue so
malformed or out-of-range labels are dropped gracefully.

192-234: ⚡ Quick win

Quadratic scan of the validation split for tasks=["all"].

_build_problems iterates the full val_set once per task. When the caller passes None/"all", _resolve_tasks returns every HellaSwagTask member (~190 categories), so this becomes an O(tasks × val_rows) walk over a ~10k-row split — roughly 1.9M iterations per load_problems call, all on the executor thread. Indexing the validation rows by activity_label once and then iterating the right bucket per task makes this O(N + M).

♻️ Proposed refactor
     def _build_problems(
         self, ds: DatasetDict, tasks: list[Any], n_shots: int
     ) -> list[BenchmarkProblem]:
         train_set = ds["train"]
         shots_set = _build_unique_activity_label_shots_set(train_set)
         val_set = ds["validation"]
         problems: list[BenchmarkProblem] = []
         choices = ["A", "B", "C", "D"]
+        # Pre-bucket validation rows by activity_label so each task
+        # only scans its own subset instead of the whole split.
+        val_by_label: dict[str, list[dict[str, Any]]] = {}
+        for row in val_set:
+            label = row.get(ACTIVITY_LABEL_FIELD)
+            if label is None:
+                continue
+            val_by_label.setdefault(label, []).append(row)
         for task in tasks:
-            for row in val_set:
-                if row.get(ACTIVITY_LABEL_FIELD) != task.value:
-                    continue
+            for row in val_by_label.get(task.value, ()):
                 label_raw = row.get(LABEL_FIELD)
                 if label_raw == "" or label_raw is None:
                     continue
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/accuracy/benchmarks/hellaswag.py` around lines 192 - 234, The
current _build_problems scans val_set once per task causing O(tasks × val_rows);
fix it by first building a dict index mapping activity_label -> list of
validation rows (keyed by ACTIVITY_LABEL_FIELD) once, then iterate tasks and for
each use bucket = index.get(task.value, []) instead of scanning val_set; keep
the existing logic that builds shots_set, uses
HellaSwagTemplate.format_question/generate_output, maps label_raw to gold_letter
via choices and appends BenchmarkProblem with the same fields and raw_messages
from self._build_chat_messages(prompt).
tests/unit/accuracy/test_hellaswag_benchmark.py (1)

327-332: 💤 Low value

Consider asserting on the constant rather than the literal "15".

The match="at most 15" substring couples the test to the hardcoded message text. Using match=f"at most {MAX_N_SHOTS}" (the constant is already imported) keeps the test correct if the cap ever changes. Minor.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/accuracy/test_hellaswag_benchmark.py` around lines 327 - 332, The
test hardcodes the string "at most 15" in the pytest.raises match; change it to
reference the imported constant so the test follows the configured cap. In
TestNShotsCap::test_n_shots_above_15_raises replace match="at most 15" with
match=f"at most {MAX_N_SHOTS}" (the constant MAX_N_SHOTS is already imported) so
the assertion uses the canonical limit used by HellaSwagBenchmark.load_problems.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/accuracy/accuracy_stubs.md`:
- Line 173: The table entry for HellaSwagBenchmark is out of sync with
plugins.yaml: update the default grader and n-shots to match the plugin settings
by changing the grader from `multiple_choice` to `exact_match` and the `0` shots
to `10` in the HellaSwagBenchmark row; ensure the descriptive text (the
`HellaSwagBenchmark` row contents and any mention of "renders 4-choice MCQs") is
consistent with using `exact_match` and default_n_shots 10 so readers and
HellaSwagBenchmark references reflect the plugin defaults.

In `@src/aiperf/accuracy/benchmarks/hellaswag.py`:
- Around line 105-141: The docstring says matching is case-insensitive but
_resolve_tasks currently checks activity_label strings case-sensitively against
valid_values (HellaSwagTask.value); make value-matching truly case-insensitive
by building a mapping from lowercased enum values to the enum member (e.g.,
lower_value_map = {t.value.lower(): t for t in HellaSwagTask}), then when
iterating names use name_lower = name.lower() to look up in that map (and append
the found HellaSwagTask), keep the existing upper-snake-case fallback using
getattr(HellaSwagTask, name.upper(), None), and preserve the unknown collection
and ValueError behavior if neither lookup succeeds so the error message remains
the same.

In `@tests/unit/accuracy/test_hellaswag_benchmark.py`:
- Around line 208-212: The failing codespell match comes from the literal
sequences containing "nD" in the test assertions (the prompt string and the
options block asserted against variable prompt); instead of relying on "#
codespell:ignore", break or rework the string literals so "nD" never appears as
a single token—e.g., use implicit adjacent string concatenation or split the
offending token across two strings when forming the prompt and the option block
used in the assertions (the lines with the prompt variable and the assertion
containing "A. applies\nB. watches\nC. sings\nD. cooks\nAnswer:"); update those
assertions to compare against the restructured strings and remove the
ineffective pragma.

---

Nitpick comments:
In `@src/aiperf/accuracy/benchmarks/hellaswag.py`:
- Line 153: The __init__ signature in the HellaSWAG benchmark currently types
the variadic keyword parameter as **kwargs: Any; remove the explicit Any
annotation and leave it untyped (use **kwargs) in the __init__ method (the
constructor taking user_config: UserConfig) so it follows the repository
convention for variadic keyword arguments.
- Line 219: Wrap the conversion/indexing of choices[int(label_raw)] in a
defensive try/except that catches ValueError and IndexError, logs the offending
label_raw (and row identifier if available) and skips that row instead of
letting the exception abort the load; update the code around gold_letter,
choices and label_raw to validate int(label_raw) is in range(len(choices)) or
use try: gold_letter = choices[idx] except (ValueError, IndexError):
logger.warning(...); continue so malformed or out-of-range labels are dropped
gracefully.
- Around line 192-234: The current _build_problems scans val_set once per task
causing O(tasks × val_rows); fix it by first building a dict index mapping
activity_label -> list of validation rows (keyed by ACTIVITY_LABEL_FIELD) once,
then iterate tasks and for each use bucket = index.get(task.value, []) instead
of scanning val_set; keep the existing logic that builds shots_set, uses
HellaSwagTemplate.format_question/generate_output, maps label_raw to gold_letter
via choices and appends BenchmarkProblem with the same fields and raw_messages
from self._build_chat_messages(prompt).

In `@tests/unit/accuracy/test_hellaswag_benchmark.py`:
- Around line 327-332: The test hardcodes the string "at most 15" in the
pytest.raises match; change it to reference the imported constant so the test
follows the configured cap. In TestNShotsCap::test_n_shots_above_15_raises
replace match="at most 15" with match=f"at most {MAX_N_SHOTS}" (the constant
MAX_N_SHOTS is already imported) so the assertion uses the canonical limit used
by HellaSwagBenchmark.load_problems.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5a065e80-9e1f-4b07-87d5-d5129709a5fe

📥 Commits

Reviewing files that changed from the base of the PR and between 837bda0 and fcee141.

📒 Files selected for processing (9)
  • docs/accuracy/accuracy-benchmarking.md
  • docs/accuracy/accuracy_stubs.md
  • pyproject.toml
  • src/aiperf/accuracy/benchmarks/hellaswag.py
  • src/aiperf/accuracy/graders/exact_match.py
  • src/aiperf/plugin/plugins.yaml
  • tests/unit/accuracy/test_accuracy_config.py
  • tests/unit/accuracy/test_exact_match_grader.py
  • tests/unit/accuracy/test_hellaswag_benchmark.py

Comment thread docs/accuracy/accuracy_stubs.md Outdated
Comment thread src/aiperf/accuracy/benchmarks/hellaswag.py
Comment thread tests/unit/accuracy/test_hellaswag_benchmark.py Outdated
@debermudez debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from e3dea75 to 41d5d73 Compare May 14, 2026 20:02
@debermudez debermudez marked this pull request as ready for review May 14, 2026 20:58
Comment thread src/aiperf/accuracy/benchmarks/hellaswag.py Outdated
Comment thread src/aiperf/plugin/plugins.yaml
@debermudez debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch 2 times, most recently from c71dbac to f10c9a7 Compare May 19, 2026 17:32
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/accuracy/benchmarks/hellaswag.py`:
- Around line 189-194: The current guard only enforces the upper bound
(MAX_N_SHOTS) but lets negative n_shots proceed; update the validation around
n_shots (the same block that references MAX_N_SHOTS and the HellaSwag benchmark)
to also reject negative values by raising a ValueError when n_shots < 0 (or
n_shots is negative), with a clear message like "n_shots must be non-negative"
that mirrors the existing style; keep the existing upper-bound check for
MAX_N_SHOTS and combine both checks in the same validation block that currently
contains the f"HellaSwag supports at most {MAX_N_SHOTS} ..." error.
- Around line 120-121: The current selector logic returns list(HellaSwagTask)
whenever "all" appears in tasks, which lets inputs like ["all","typo"] bypass
validation; change the guard in the task-selection routine that references tasks
and HellaSwagTask so that you only treat "all" as meaning "select every task"
when tasks is empty or exactly ["all"] (case-insensitive). Specifically, if not
tasks or tasks == ["all"] (ignoring case) return list(HellaSwagTask); if "all"
is present alongside other selectors raise a ValueError (or similar) explaining
that "all" cannot be mixed with other task names so invalid selectors don’t get
silently ignored.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0d738e6e-7dfd-4eb5-b2a5-0b9dec70791c

📥 Commits

Reviewing files that changed from the base of the PR and between 41d5d73 and c71dbac.

📒 Files selected for processing (5)
  • docs/accuracy/accuracy-benchmarking.md
  • docs/accuracy/accuracy_stubs.md
  • pyproject.toml
  • src/aiperf/accuracy/benchmarks/hellaswag.py
  • src/aiperf/accuracy/graders/exact_match.py
✅ Files skipped from review due to trivial changes (1)
  • docs/accuracy/accuracy-benchmarking.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/aiperf/accuracy/graders/exact_match.py

Comment thread src/aiperf/accuracy/benchmarks/hellaswag.py Outdated
Comment thread src/aiperf/accuracy/benchmarks/hellaswag.py
Copy link
Copy Markdown
Contributor

@matthewkotila matthewkotila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid PR. The reference-parity discipline (byte-equal-to-DeepEval pinned via direct HellaSwagTemplate.generate_output() comparison in the tests) and the three principled fixup commits (validate-before-download, confinement-string append, reject 'all' mixed with other selectors) raise the bar above "ship a benchmark that runs."

Verification I did locally before writing this:

  • 58/58 unit tests pass against the installed deepeval==4.0.3 (a full major above the floor) — the deep import paths into deepeval.benchmarks.hellaswag.template survived the bump cleanly.
  • End-to-end smoke run: HellaSwag → mock server → grader → exporter. Pipeline completes, 5/5 requests dispatched, accuracy CSV + console output render correctly with spaced task name "Applying sunscreen". inputs.json confirms the DEEPEVAL_CONFINEMENT string is appended to live prompts exactly as the fixup commit pins.
  • Console + CSV exporters correctly handle spaces in task names (cross-checked via direct exercise — no false delimiter splits, no transform).
  • enable_cot=True confirmed silently dropped (see inline comment on the docstring).

Closed by verification: the original "deep imports will silently break across deepeval majors" concern is substantially weakened; spaced-task-name exporter handling is fine; confinement string reaches real prompts.

Recommended before merge:

  1. Add the enable_cot log line (see inline).
  2. Cap deepeval (see inline).
  3. Rebase to current main — branch is 3 commits behind merge-base on unrelated bits, inflating the visible diff against main.
  4. Optional: pin the HF dataset revision (see inline).

Comment thread pyproject.toml Outdated
Comment thread src/aiperf/accuracy/benchmarks/hellaswag.py
Comment thread src/aiperf/accuracy/benchmarks/hellaswag.py
Comment thread src/aiperf/accuracy/benchmarks/hellaswag.py Outdated
Comment thread src/aiperf/accuracy/benchmarks/hellaswag.py Outdated
Comment thread tests/unit/accuracy/test_accuracy_config.py
…ked (AIP-877)

Adds HellaSwag commonsense reasoning benchmark + the strict ExactMatchGrader
that DeepEval's ``Scorer.exact_match_score`` uses. Both align byte-for-
byte with the trt-llm benchmark recipe's DeepEval-backed configuration.

**HellaSwagBenchmark** (`src/aiperf/accuracy/benchmarks/hellaswag.py`)
- Loads ``Rowan/hellaswag``: validation split filtered per task by
  ``activity_label``, train split feeds the "one few-shot per unique
  activity_label" rule (mirrors ``deepeval.benchmarks.HellaSwag``'s
  ``categories_seen`` dedupe loop).
- Prompt rendering delegates to
  ``deepeval.benchmarks.hellaswag.template.HellaSwagTemplate.generate_output``
  — output is byte-equal to what the trt-llm recipe ships.
- Defaults: ``n_shots=10`` (DeepEval cap is 15), ``generation_size=5``,
  ``default_grader=exact_match``.
- Ground truth: bare ``A``/``B``/``C``/``D`` letter (DeepEval's
  convention for ``Scorer.exact_match_score``).
- ``_resolve_tasks`` matches activity labels case-insensitively via a
  lowercased-value map; falls back to upper-snake-case enum name
  (``HellaSwagTask.APPLYING_SUNSCREEN`` form) for the recipe's
  ``getattr(HellaSwagTask, name.upper(), None)`` parity.

**ExactMatchGrader** (`src/aiperf/accuracy/graders/exact_match.py`)
- Strict ``pred.strip() == gold.strip()`` semantics matching DeepEval's
  ``Scorer.exact_match_score``: case-sensitive, no normalization, empty
  response → ``unparsed=True``.
- Used by HellaSwag and (in AIP-878) BigBench-Hard for reference parity.

**Plugin registration** (`src/aiperf/plugin/plugins.yaml`)
- ``hellaswag`` → ``default_grader: exact_match``, ``default_n_shots: 10``
  with the DeepEval-backed description.
- ``exact_match`` → strict-equality description; drops the
  ``is_implemented: false`` flag.

**Dependencies** (`pyproject.toml`)
- Adds ``deepeval>=2.9.0`` to the ``[accuracy]`` optional-dependency
  group. Aiperf calls DeepEval's bundled prompt template directly so
  the dep is required for HellaSwag (and BigBench-Hard in AIP-878).

**Tests** (`tests/unit/accuracy/`)
- ``test_hellaswag_benchmark.py``: ~22 tests covering DeepEval prompt
  byte-equality, the unique-activity-label shots set rule, validation
  filtering, task resolution (exact, lower, upper, mixed case), and
  pathological dataset rows (empty validation, unlabeled rows).
- ``test_exact_match_grader.py``: strict-equality semantics including
  the empty-response → ``unparsed=True`` path and case-sensitivity.
- ``test_accuracy_config.py``: drops ``hellaswag`` from
  ``STUB_BENCHMARKS`` and ``exact_match`` from ``STUB_GRADERS``; the
  uppercase-stub test now uses ``BIGBENCH`` (a still-stub name).

**Docs** (`docs/accuracy/`)
- ``accuracy-benchmarking.md``: add HellaSwag row to the benchmarks
  table.
- ``accuracy_stubs.md``: status summary + move HellaSwag from "Still
  Stubbed" to "Implemented"; move ExactMatchGrader to "Implemented".

**Constructor signature**
Loader + grader use the v2 ``BenchmarkRun`` API (post-#912 refactor on
main) rather than the legacy ``UserConfig`` shape — matches how
``MMLUBenchmark`` and ``AIMEBenchmark`` are wired on current main.

Validation:
- 70/70 accuracy tests pass (HellaSwag + ExactMatch + AccuracyConfig).
- Ruff format + ruff check clean on all modified Python files.
- Codespell clean (v2.4.2, matches CI).
- HellaSwag prompts verified byte-equal against
  ``HellaSwagTemplate.generate_output`` on synthetic fixtures.

Reference:
- ``deepeval/benchmarks/hellaswag/hellaswag.py``
- ``deepeval/benchmarks/hellaswag/template.py``
- ``deepeval/scorer/scorer.py:Scorer.exact_match_score``
- ``trt-llm-benchmark-recipe/src/tools/acc_benchmark.py:319-336``

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
…IP-877)

``HellaSwagBenchmark.load_problems`` called ``load_dataset()`` before
``_resolve_tasks(tasks)``, so an invalid ``--accuracy-tasks`` value
triggered a multi-MB HF download (with potential network/cache
failures) just to surface the validation error.

Swap the order: validate tasks first, then download. Neither call
depends on the other, so the reorder is mechanical.

Adds a regression test
(``TestTaskValidationPrecedesDatasetDownload``) that patches
``load_dataset`` and asserts it's never called when ``tasks`` resolves
to a ``ValueError``.

Full HellaSwag suite: 26/26 pass. Ruff + codespell clean.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
…rompt (AIP-877)

``HellaSwagTemplate.generate_output()`` renders a prompt that ends with
``"\nAnswer:"`` and never instructs the model to emit just a letter.
DeepEval enforces bare-letter output two ways:

  primary:   ``model.generate(prompt, schema=MultipleChoiceSchema)``
             — structured generation pinned to A/B/C/D.
  fallback:  on ``TypeError`` (when the model client doesn't accept
             ``schema=``), append ``self.confinement_instructions`` to
             the prompt and call ``model.generate(prompt)``.

The default ``confinement_instructions`` set when the user passes
``None`` is:

    "Output 'A', 'B', 'C', or 'D'. Full answer not needed."

Aiperf takes neither path: we send the rendered template prompt
verbatim to whatever OpenAI-compatible endpoint the user pointed at,
which corresponds to DeepEval's fallback path WITHOUT the
confinement-string append. ``ExactMatchGrader`` (mirroring
``Scorer.exact_match_score``) then strictly compares the response
against the bare ``A``/``B``/``C``/``D`` gold, so verbose-but-correct
responses like ``"The answer is A."`` grade as wrong — systematically
under-grading vs DeepEval reference numbers.

Append the confinement string to the rendered prompt in
``_build_problems``, separated by ``\n\n`` (matching DeepEval's
``prompt += f"\n\n{self.confinement_instructions}"`` in
``predict()``). Pull the literal into a module-level constant
``DEEPEVAL_CONFINEMENT`` so the source-of-truth is documented and
the test suite can pin it byte-equal.

Adds three regression tests:

* ``test_constant_matches_deepeval_default`` — pins the constant
  matches DeepEval's hardcoded default in ``HellaSwag.__init__``.
* ``test_prompt_ends_with_confinement`` — every rendered prompt ends
  with ``"\n\n{DEEPEVAL_CONFINEMENT}"``.
* ``test_template_output_is_a_prefix_of_prompt`` — preserves the
  ``HellaSwagTemplate.generate_output`` byte-equal pin by asserting
  the template output is a strict prefix of the final prompt, with
  only the confinement suffix added.

Full HellaSwag suite: 29/29 pass (was 26). Ruff + codespell clean.

Reference:
  deepeval/benchmarks/hellaswag/hellaswag.py:HellaSwag.__init__
  deepeval/benchmarks/hellaswag/hellaswag.py:HellaSwag.predict
Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
… (AIP-877)

The previous ``_resolve_tasks`` guard treated ANY occurrence of ``"all"``
in the task list as "select every task":

    if not tasks or "all" in {t.lower() for t in tasks}:
        return list(HellaSwagTask)

So ``["all", "NOT_A_REAL_TASK"]`` returned all 192 tasks and silently
swallowed the typo. Same trap with ``["all", "Applying sunscreen"]``
— the model would evaluate every task instead of just the one
specified, with no error surfaced.

Tighten the guard:

* ``None`` / empty list / exactly ``["all"]`` (case-insensitive) →
  return every task.
* ``"all"`` mixed with any other name → raise ``ValueError`` so the
  user sees their mistake instead of getting an unexpected 192-task
  run.

Adds two parametrized regression test groups:

* ``test_all_alone_is_case_insensitive`` (4 cases: ``all``, ``ALL``,
  ``All``, ``aLl``) — every casing of the bare sentinel still selects
  every task.
* ``test_all_mixed_with_other_selectors_raises`` (4 cases: typo,
  real-task, real-task-then-all, duplicate-all) — every mixed input
  raises with the documented "'all' cannot be mixed" message.

Full HellaSwag suite: 37/37 pass (was 29). Ruff + codespell clean.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
Adds an explicit upper bound to the deepeval pin so the accuracy
extras follow the same >=X,<Y style used elsewhere in this file
(opentelemetry-*, etc.). Reviewer verified deepeval==4.0.3 still
passes the HellaSwag unit tests and an E2E run, so the cap at the
next major (5.0.0) leaves headroom while preserving version-policy
hygiene.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
…g (AIP-877)

HellaSwag accepts ``enable_cot`` for protocol uniformity but DeepEval's
HellaSwag template has no CoT variant, so the flag is silently dropped.
A user setting it had no way to discover this — the rendered prompts
were byte-identical with and without ``--accuracy-enable-cot`` and no
log signal fired. Emits a single ``info`` line at the top of
``load_problems`` so the drop is visible, keeping the accept-but-ignore
stance otherwise unchanged.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
``load_dataset("Rowan/hellaswag")`` is intentionally unpinned to match
the trt-llm benchmark recipe's resolution behavior. A reviewer flagged
this as an observability gap — if the upstream HF dataset is rebased
or re-uploaded, the byte-equal pin against
``HellaSwagTemplate.generate_output`` silently follows along with no
test-suite signal. Spells out the policy in the module docstring so
the trade-off (parity-with-recipe over pin-stability) is explicit, and
flags that a SHA pin should only be added when/if the recipe pins one.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
…877)

The lowercased-value → ``HellaSwagTask`` enum dict was being rebuilt
inside ``_resolve_tasks`` on every ``load_problems`` call (~190 entries
each). Since ``HellaSwagTask`` is immutable, the dict can live at
module scope and be built once. Guarded with ``if _HAS_DEEPEVAL`` so
the module still imports cleanly when the optional accuracy extras
aren't installed.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
…IP-877)

``_build_problems`` previously scanned the full validation set once
per selected task. With ``--accuracy-tasks=all`` that's ~190 tasks
over a ~10K-row split — ~1.9M activity_label comparisons per load.
Pre-bucketing val_set by ``activity_label`` in one pass collapses the
inner scan to a direct dict lookup, making the loop O(val_rows + tasks)
without changing problem order or content. DeepEval's internal
``val_set.filter(...)`` per-task is a reference-only detail (only
prompt rendering is part of the parity goal), so the faster shape is
free here.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
@debermudez debermudez force-pushed the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch from f3097f6 to 98a2c25 Compare May 22, 2026 03:50
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/unit/accuracy/test_exact_match_grader.py (1)

54-153: ⚡ Quick win

Normalize test names to test_<function>_<scenario>_<expected>.

Several names don’t clearly encode the function-under-test and expected outcome in the required format, which hurts suite consistency.

As per coding guidelines: "Test naming: test_<function>_<scenario>_<expected> e.g. test_parse_config_missing_field_raises_error".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/accuracy/test_exact_match_grader.py` around lines 54 - 153, The
test names in TestEmptyAndUnparsed, TestMultiLineNotForgiven,
TestUnicodeAndNonAscii, TestExtractAnswerInterface and TestGradingResultFields
do not follow the required test_<function>_<scenario>_<expected> convention;
rename each test function to include the function-under-test (grade or
extract_answer), a brief scenario, and the expected outcome (for example change
test_empty_response_unparsed_and_incorrect ->
test_grade_empty_response_unparsed_false, test_whitespace_only_response_unparsed
-> test_grade_whitespace_only_unparsed_true,
test_empty_pred_and_empty_gold_neither_correct_nor_unparsed ->
test_grade_empty_pred_and_gold_not_unparsed_false,
test_multi_line_response_does_not_match_single_letter ->
test_grade_multiline_response_mismatch_false,
test_explanation_prefix_does_not_match ->
test_grade_explanation_prefix_mismatch_false, test_unicode_match ->
test_grade_unicode_exact_match_true, test_unicode_case_sensitive ->
test_grade_unicode_case_sensitive_false, test_extract_answer_strips_only ->
test_extract_answer_strips_only_returns_stripped,
test_extract_answer_preserves_inner -> test_extract_answer_preserves_inner_text,
test_extract_answer_empty -> test_extract_answer_empty_returns_empty,
test_reasoning_includes_stripped_forms ->
test_grade_reasoning_contains_stripped_forms, and
test_extracted_answer_is_stripped_pred ->
test_grade_extracted_answer_is_stripped; update any references or
parametrization to use these new names and ensure they still call
ExactMatchGrader.grade and grader.extract_answer unchanged.
tests/unit/accuracy/test_hellaswag_benchmark.py (1)

51-57: ⚡ Quick win

Add a return type hint to _make_run.

Please annotate the helper return type so all functions in this module are fully typed.

🔧 Suggested fix
-def _make_run():
+def _make_run() -> Any:
     return make_benchmark_run(
         model_names=["test-model"],
         endpoint_type=EndpointType.COMPLETIONS,
         streaming=False,
         accuracy={"benchmark": AccuracyBenchmarkType.HELLASWAG},
     )
As per coding guidelines: `**/*.py`: “Type hints on ALL functions (params and return)”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/accuracy/test_hellaswag_benchmark.py` around lines 51 - 57, The
helper function _make_run lacks a return type annotation; add a precise return
type hint that matches the value returned by calling make_benchmark_run (e.g.,
the BenchmarkRun or equivalent type used across tests) so the signature reads
something like def _make_run() -> BenchmarkRun:, ensuring imports or forward
references for that type are added if necessary; reference _make_run,
make_benchmark_run, EndpointType and AccuracyBenchmarkType to locate the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/accuracy/benchmarks/hellaswag.py`:
- Line 262: Replace the non-ASCII multiplication glyph in the comment "loop is
O(val_rows + tasks) instead of O(tasks × val_rows)." by changing the `×`
character to an ASCII `x` (so it reads "O(tasks x val_rows)" or use `*` if
preferred), updating the comment in the same location in
src/aiperf/accuracy/benchmarks/hellaswag.py to satisfy Ruff RUF003.

In `@tests/unit/accuracy/test_exact_match_grader.py`:
- Around line 22-28: The helper function _make_run is missing a return type
annotation; add an explicit return type that matches the value returned by
make_benchmark_run (e.g., the BenchMarkRun/BenchmarkRun type used by
make_benchmark_run in this test suite) so the signature becomes def _make_run()
-> <appropriate_return_type>: and keep the existing body unchanged; reference
the function name _make_run and the factory call make_benchmark_run (and related
enums EndpointType and AccuracyBenchmarkType) to locate where to apply the
annotation.

---

Nitpick comments:
In `@tests/unit/accuracy/test_exact_match_grader.py`:
- Around line 54-153: The test names in TestEmptyAndUnparsed,
TestMultiLineNotForgiven, TestUnicodeAndNonAscii, TestExtractAnswerInterface and
TestGradingResultFields do not follow the required
test_<function>_<scenario>_<expected> convention; rename each test function to
include the function-under-test (grade or extract_answer), a brief scenario, and
the expected outcome (for example change
test_empty_response_unparsed_and_incorrect ->
test_grade_empty_response_unparsed_false, test_whitespace_only_response_unparsed
-> test_grade_whitespace_only_unparsed_true,
test_empty_pred_and_empty_gold_neither_correct_nor_unparsed ->
test_grade_empty_pred_and_gold_not_unparsed_false,
test_multi_line_response_does_not_match_single_letter ->
test_grade_multiline_response_mismatch_false,
test_explanation_prefix_does_not_match ->
test_grade_explanation_prefix_mismatch_false, test_unicode_match ->
test_grade_unicode_exact_match_true, test_unicode_case_sensitive ->
test_grade_unicode_case_sensitive_false, test_extract_answer_strips_only ->
test_extract_answer_strips_only_returns_stripped,
test_extract_answer_preserves_inner -> test_extract_answer_preserves_inner_text,
test_extract_answer_empty -> test_extract_answer_empty_returns_empty,
test_reasoning_includes_stripped_forms ->
test_grade_reasoning_contains_stripped_forms, and
test_extracted_answer_is_stripped_pred ->
test_grade_extracted_answer_is_stripped; update any references or
parametrization to use these new names and ensure they still call
ExactMatchGrader.grade and grader.extract_answer unchanged.

In `@tests/unit/accuracy/test_hellaswag_benchmark.py`:
- Around line 51-57: The helper function _make_run lacks a return type
annotation; add a precise return type hint that matches the value returned by
calling make_benchmark_run (e.g., the BenchmarkRun or equivalent type used
across tests) so the signature reads something like def _make_run() ->
BenchmarkRun:, ensuring imports or forward references for that type are added if
necessary; reference _make_run, make_benchmark_run, EndpointType and
AccuracyBenchmarkType to locate the code.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 23a76a96-7459-4a65-b142-8665bded0979

📥 Commits

Reviewing files that changed from the base of the PR and between c71dbac and 98a2c25.

📒 Files selected for processing (9)
  • docs/accuracy/accuracy-benchmarking.md
  • docs/accuracy/accuracy_stubs.md
  • pyproject.toml
  • src/aiperf/accuracy/benchmarks/hellaswag.py
  • src/aiperf/accuracy/graders/exact_match.py
  • src/aiperf/plugin/plugins.yaml
  • tests/unit/accuracy/test_accuracy_config.py
  • tests/unit/accuracy/test_exact_match_grader.py
  • tests/unit/accuracy/test_hellaswag_benchmark.py
✅ Files skipped from review due to trivial changes (3)
  • docs/accuracy/accuracy-benchmarking.md
  • src/aiperf/accuracy/graders/exact_match.py
  • docs/accuracy/accuracy_stubs.md

Comment thread src/aiperf/accuracy/benchmarks/hellaswag.py
Comment thread tests/unit/accuracy/test_exact_match_grader.py
Copy link
Copy Markdown
Contributor

@matthewkotila matthewkotila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work 🎉

Copy link
Copy Markdown

@dynamo-ops dynamo-ops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous review comments have been addressed. Approving.

@debermudez debermudez enabled auto-merge (squash) May 22, 2026 16:51
@debermudez debermudez merged commit a0c173c into main May 22, 2026
23 of 25 checks passed
@debermudez debermudez deleted the dbermudez/aip-877-implement-hellaswag-benchmark-loader branch May 22, 2026 16:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants