feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877)#923
Conversation
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@be3620de923fa3d63f51713882aebcd07e56f0d4Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@be3620de923fa3d63f51713882aebcd07e56f0d4Last updated for commit: |
Stack dependencyThis PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with Merge order:
This PR: position 2 of 8 — base branch is After each upstream PR merges, the downstream PR's branch will be rebased |
5824827 to
03e10c5
Compare
88bafae to
3c150cd
Compare
b93469a to
45a2dff
Compare
f87725a to
09a0337
Compare
923b9c9 to
24e1ef3
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
118534c to
8b16196
Compare
24e1ef3 to
dcb42d0
Compare
8b16196 to
44663c3
Compare
dcb42d0 to
fcee141
Compare
|
Caution Review failedPull request was closed or merged during review Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughImplements ExactMatchGrader and a DeepEval-aligned HellaSwagBenchmark, adds deepeval>=2.9.0, updates plugin metadata and docs, and adds unit tests validating grader and benchmark behaviors. ChangesHellaSwag benchmark + ExactMatch grader
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (4)
src/aiperf/accuracy/benchmarks/hellaswag.py (3)
153-153: ⚡ Quick winDrop the
Anyannotation on**kwargs.Per the established convention in this repo, variadic keyword arguments should be left untyped rather than annotated as
**kwargs: Any. The AI summary calls out this typing change explicitly, so it's worth correcting before merge.♻️ Proposed fix
- def __init__(self, user_config: UserConfig, **kwargs: Any) -> None: + def __init__(self, user_config: UserConfig, **kwargs) -> None:Based on learnings: "In Python projects (e.g., in aiperf), avoid adding type annotations to **kwargs like **kwargs: Any. The variadic keyword arguments are inherently dynamic; leave **kwargs untyped or replace with explicit, named keyword parameters if a concrete contract is needed."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/aiperf/accuracy/benchmarks/hellaswag.py` at line 153, The __init__ signature in the HellaSWAG benchmark currently types the variadic keyword parameter as **kwargs: Any; remove the explicit Any annotation and leave it untyped (use **kwargs) in the __init__ method (the constructor taking user_config: UserConfig) so it follows the repository convention for variadic keyword arguments.
219-219: 💤 Low valueDefensive bound on the gold-letter index.
choices[int(label_raw)]willIndexErrorif a row ever has a label outside 0–3 (orint()-incompatible content like"abc"). The Rowan/hellaswag schema today is 0–3 so this is mostly defensive, but a clearertry/except(or explicit range check) drops the row and logs rather than aborting the whole load. Optional — feel free to skip if the dataset contract is considered authoritative.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/aiperf/accuracy/benchmarks/hellaswag.py` at line 219, Wrap the conversion/indexing of choices[int(label_raw)] in a defensive try/except that catches ValueError and IndexError, logs the offending label_raw (and row identifier if available) and skips that row instead of letting the exception abort the load; update the code around gold_letter, choices and label_raw to validate int(label_raw) is in range(len(choices)) or use try: gold_letter = choices[idx] except (ValueError, IndexError): logger.warning(...); continue so malformed or out-of-range labels are dropped gracefully.
192-234: ⚡ Quick winQuadratic scan of the validation split for
tasks=["all"].
_build_problemsiterates the fullval_setonce per task. When the caller passesNone/"all",_resolve_tasksreturns everyHellaSwagTaskmember (~190 categories), so this becomes an O(tasks × val_rows) walk over a ~10k-row split — roughly 1.9M iterations perload_problemscall, all on the executor thread. Indexing the validation rows byactivity_labelonce and then iterating the right bucket per task makes this O(N + M).♻️ Proposed refactor
def _build_problems( self, ds: DatasetDict, tasks: list[Any], n_shots: int ) -> list[BenchmarkProblem]: train_set = ds["train"] shots_set = _build_unique_activity_label_shots_set(train_set) val_set = ds["validation"] problems: list[BenchmarkProblem] = [] choices = ["A", "B", "C", "D"] + # Pre-bucket validation rows by activity_label so each task + # only scans its own subset instead of the whole split. + val_by_label: dict[str, list[dict[str, Any]]] = {} + for row in val_set: + label = row.get(ACTIVITY_LABEL_FIELD) + if label is None: + continue + val_by_label.setdefault(label, []).append(row) for task in tasks: - for row in val_set: - if row.get(ACTIVITY_LABEL_FIELD) != task.value: - continue + for row in val_by_label.get(task.value, ()): label_raw = row.get(LABEL_FIELD) if label_raw == "" or label_raw is None: continue🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/aiperf/accuracy/benchmarks/hellaswag.py` around lines 192 - 234, The current _build_problems scans val_set once per task causing O(tasks × val_rows); fix it by first building a dict index mapping activity_label -> list of validation rows (keyed by ACTIVITY_LABEL_FIELD) once, then iterate tasks and for each use bucket = index.get(task.value, []) instead of scanning val_set; keep the existing logic that builds shots_set, uses HellaSwagTemplate.format_question/generate_output, maps label_raw to gold_letter via choices and appends BenchmarkProblem with the same fields and raw_messages from self._build_chat_messages(prompt).tests/unit/accuracy/test_hellaswag_benchmark.py (1)
327-332: 💤 Low valueConsider asserting on the constant rather than the literal "15".
The
match="at most 15"substring couples the test to the hardcoded message text. Usingmatch=f"at most {MAX_N_SHOTS}"(the constant is already imported) keeps the test correct if the cap ever changes. Minor.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/accuracy/test_hellaswag_benchmark.py` around lines 327 - 332, The test hardcodes the string "at most 15" in the pytest.raises match; change it to reference the imported constant so the test follows the configured cap. In TestNShotsCap::test_n_shots_above_15_raises replace match="at most 15" with match=f"at most {MAX_N_SHOTS}" (the constant MAX_N_SHOTS is already imported) so the assertion uses the canonical limit used by HellaSwagBenchmark.load_problems.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/accuracy/accuracy_stubs.md`:
- Line 173: The table entry for HellaSwagBenchmark is out of sync with
plugins.yaml: update the default grader and n-shots to match the plugin settings
by changing the grader from `multiple_choice` to `exact_match` and the `0` shots
to `10` in the HellaSwagBenchmark row; ensure the descriptive text (the
`HellaSwagBenchmark` row contents and any mention of "renders 4-choice MCQs") is
consistent with using `exact_match` and default_n_shots 10 so readers and
HellaSwagBenchmark references reflect the plugin defaults.
In `@src/aiperf/accuracy/benchmarks/hellaswag.py`:
- Around line 105-141: The docstring says matching is case-insensitive but
_resolve_tasks currently checks activity_label strings case-sensitively against
valid_values (HellaSwagTask.value); make value-matching truly case-insensitive
by building a mapping from lowercased enum values to the enum member (e.g.,
lower_value_map = {t.value.lower(): t for t in HellaSwagTask}), then when
iterating names use name_lower = name.lower() to look up in that map (and append
the found HellaSwagTask), keep the existing upper-snake-case fallback using
getattr(HellaSwagTask, name.upper(), None), and preserve the unknown collection
and ValueError behavior if neither lookup succeeds so the error message remains
the same.
In `@tests/unit/accuracy/test_hellaswag_benchmark.py`:
- Around line 208-212: The failing codespell match comes from the literal
sequences containing "nD" in the test assertions (the prompt string and the
options block asserted against variable prompt); instead of relying on "#
codespell:ignore", break or rework the string literals so "nD" never appears as
a single token—e.g., use implicit adjacent string concatenation or split the
offending token across two strings when forming the prompt and the option block
used in the assertions (the lines with the prompt variable and the assertion
containing "A. applies\nB. watches\nC. sings\nD. cooks\nAnswer:"); update those
assertions to compare against the restructured strings and remove the
ineffective pragma.
---
Nitpick comments:
In `@src/aiperf/accuracy/benchmarks/hellaswag.py`:
- Line 153: The __init__ signature in the HellaSWAG benchmark currently types
the variadic keyword parameter as **kwargs: Any; remove the explicit Any
annotation and leave it untyped (use **kwargs) in the __init__ method (the
constructor taking user_config: UserConfig) so it follows the repository
convention for variadic keyword arguments.
- Line 219: Wrap the conversion/indexing of choices[int(label_raw)] in a
defensive try/except that catches ValueError and IndexError, logs the offending
label_raw (and row identifier if available) and skips that row instead of
letting the exception abort the load; update the code around gold_letter,
choices and label_raw to validate int(label_raw) is in range(len(choices)) or
use try: gold_letter = choices[idx] except (ValueError, IndexError):
logger.warning(...); continue so malformed or out-of-range labels are dropped
gracefully.
- Around line 192-234: The current _build_problems scans val_set once per task
causing O(tasks × val_rows); fix it by first building a dict index mapping
activity_label -> list of validation rows (keyed by ACTIVITY_LABEL_FIELD) once,
then iterate tasks and for each use bucket = index.get(task.value, []) instead
of scanning val_set; keep the existing logic that builds shots_set, uses
HellaSwagTemplate.format_question/generate_output, maps label_raw to gold_letter
via choices and appends BenchmarkProblem with the same fields and raw_messages
from self._build_chat_messages(prompt).
In `@tests/unit/accuracy/test_hellaswag_benchmark.py`:
- Around line 327-332: The test hardcodes the string "at most 15" in the
pytest.raises match; change it to reference the imported constant so the test
follows the configured cap. In TestNShotsCap::test_n_shots_above_15_raises
replace match="at most 15" with match=f"at most {MAX_N_SHOTS}" (the constant
MAX_N_SHOTS is already imported) so the assertion uses the canonical limit used
by HellaSwagBenchmark.load_problems.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5a065e80-9e1f-4b07-87d5-d5129709a5fe
📒 Files selected for processing (9)
docs/accuracy/accuracy-benchmarking.mddocs/accuracy/accuracy_stubs.mdpyproject.tomlsrc/aiperf/accuracy/benchmarks/hellaswag.pysrc/aiperf/accuracy/graders/exact_match.pysrc/aiperf/plugin/plugins.yamltests/unit/accuracy/test_accuracy_config.pytests/unit/accuracy/test_exact_match_grader.pytests/unit/accuracy/test_hellaswag_benchmark.py
e3dea75 to
41d5d73
Compare
c71dbac to
f10c9a7
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/aiperf/accuracy/benchmarks/hellaswag.py`:
- Around line 189-194: The current guard only enforces the upper bound
(MAX_N_SHOTS) but lets negative n_shots proceed; update the validation around
n_shots (the same block that references MAX_N_SHOTS and the HellaSwag benchmark)
to also reject negative values by raising a ValueError when n_shots < 0 (or
n_shots is negative), with a clear message like "n_shots must be non-negative"
that mirrors the existing style; keep the existing upper-bound check for
MAX_N_SHOTS and combine both checks in the same validation block that currently
contains the f"HellaSwag supports at most {MAX_N_SHOTS} ..." error.
- Around line 120-121: The current selector logic returns list(HellaSwagTask)
whenever "all" appears in tasks, which lets inputs like ["all","typo"] bypass
validation; change the guard in the task-selection routine that references tasks
and HellaSwagTask so that you only treat "all" as meaning "select every task"
when tasks is empty or exactly ["all"] (case-insensitive). Specifically, if not
tasks or tasks == ["all"] (ignoring case) return list(HellaSwagTask); if "all"
is present alongside other selectors raise a ValueError (or similar) explaining
that "all" cannot be mixed with other task names so invalid selectors don’t get
silently ignored.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0d738e6e-7dfd-4eb5-b2a5-0b9dec70791c
📒 Files selected for processing (5)
docs/accuracy/accuracy-benchmarking.mddocs/accuracy/accuracy_stubs.mdpyproject.tomlsrc/aiperf/accuracy/benchmarks/hellaswag.pysrc/aiperf/accuracy/graders/exact_match.py
✅ Files skipped from review due to trivial changes (1)
- docs/accuracy/accuracy-benchmarking.md
🚧 Files skipped from review as they are similar to previous changes (1)
- src/aiperf/accuracy/graders/exact_match.py
matthewkotila
left a comment
There was a problem hiding this comment.
Solid PR. The reference-parity discipline (byte-equal-to-DeepEval pinned via direct HellaSwagTemplate.generate_output() comparison in the tests) and the three principled fixup commits (validate-before-download, confinement-string append, reject 'all' mixed with other selectors) raise the bar above "ship a benchmark that runs."
Verification I did locally before writing this:
- 58/58 unit tests pass against the installed
deepeval==4.0.3(a full major above the floor) — the deep import paths intodeepeval.benchmarks.hellaswag.templatesurvived the bump cleanly. - End-to-end smoke run: HellaSwag → mock server → grader → exporter. Pipeline completes, 5/5 requests dispatched, accuracy CSV + console output render correctly with spaced task name
"Applying sunscreen".inputs.jsonconfirms theDEEPEVAL_CONFINEMENTstring is appended to live prompts exactly as the fixup commit pins. - Console + CSV exporters correctly handle spaces in task names (cross-checked via direct exercise — no false delimiter splits, no transform).
enable_cot=Trueconfirmed silently dropped (see inline comment on the docstring).
Closed by verification: the original "deep imports will silently break across deepeval majors" concern is substantially weakened; spaced-task-name exporter handling is fine; confinement string reaches real prompts.
Recommended before merge:
- Add the
enable_cotlog line (see inline). - Cap
deepeval(see inline). - Rebase to current
main— branch is 3 commits behind merge-base on unrelated bits, inflating the visible diff against main. - Optional: pin the HF dataset revision (see inline).
…ked (AIP-877) Adds HellaSwag commonsense reasoning benchmark + the strict ExactMatchGrader that DeepEval's ``Scorer.exact_match_score`` uses. Both align byte-for- byte with the trt-llm benchmark recipe's DeepEval-backed configuration. **HellaSwagBenchmark** (`src/aiperf/accuracy/benchmarks/hellaswag.py`) - Loads ``Rowan/hellaswag``: validation split filtered per task by ``activity_label``, train split feeds the "one few-shot per unique activity_label" rule (mirrors ``deepeval.benchmarks.HellaSwag``'s ``categories_seen`` dedupe loop). - Prompt rendering delegates to ``deepeval.benchmarks.hellaswag.template.HellaSwagTemplate.generate_output`` — output is byte-equal to what the trt-llm recipe ships. - Defaults: ``n_shots=10`` (DeepEval cap is 15), ``generation_size=5``, ``default_grader=exact_match``. - Ground truth: bare ``A``/``B``/``C``/``D`` letter (DeepEval's convention for ``Scorer.exact_match_score``). - ``_resolve_tasks`` matches activity labels case-insensitively via a lowercased-value map; falls back to upper-snake-case enum name (``HellaSwagTask.APPLYING_SUNSCREEN`` form) for the recipe's ``getattr(HellaSwagTask, name.upper(), None)`` parity. **ExactMatchGrader** (`src/aiperf/accuracy/graders/exact_match.py`) - Strict ``pred.strip() == gold.strip()`` semantics matching DeepEval's ``Scorer.exact_match_score``: case-sensitive, no normalization, empty response → ``unparsed=True``. - Used by HellaSwag and (in AIP-878) BigBench-Hard for reference parity. **Plugin registration** (`src/aiperf/plugin/plugins.yaml`) - ``hellaswag`` → ``default_grader: exact_match``, ``default_n_shots: 10`` with the DeepEval-backed description. - ``exact_match`` → strict-equality description; drops the ``is_implemented: false`` flag. **Dependencies** (`pyproject.toml`) - Adds ``deepeval>=2.9.0`` to the ``[accuracy]`` optional-dependency group. Aiperf calls DeepEval's bundled prompt template directly so the dep is required for HellaSwag (and BigBench-Hard in AIP-878). **Tests** (`tests/unit/accuracy/`) - ``test_hellaswag_benchmark.py``: ~22 tests covering DeepEval prompt byte-equality, the unique-activity-label shots set rule, validation filtering, task resolution (exact, lower, upper, mixed case), and pathological dataset rows (empty validation, unlabeled rows). - ``test_exact_match_grader.py``: strict-equality semantics including the empty-response → ``unparsed=True`` path and case-sensitivity. - ``test_accuracy_config.py``: drops ``hellaswag`` from ``STUB_BENCHMARKS`` and ``exact_match`` from ``STUB_GRADERS``; the uppercase-stub test now uses ``BIGBENCH`` (a still-stub name). **Docs** (`docs/accuracy/`) - ``accuracy-benchmarking.md``: add HellaSwag row to the benchmarks table. - ``accuracy_stubs.md``: status summary + move HellaSwag from "Still Stubbed" to "Implemented"; move ExactMatchGrader to "Implemented". **Constructor signature** Loader + grader use the v2 ``BenchmarkRun`` API (post-#912 refactor on main) rather than the legacy ``UserConfig`` shape — matches how ``MMLUBenchmark`` and ``AIMEBenchmark`` are wired on current main. Validation: - 70/70 accuracy tests pass (HellaSwag + ExactMatch + AccuracyConfig). - Ruff format + ruff check clean on all modified Python files. - Codespell clean (v2.4.2, matches CI). - HellaSwag prompts verified byte-equal against ``HellaSwagTemplate.generate_output`` on synthetic fixtures. Reference: - ``deepeval/benchmarks/hellaswag/hellaswag.py`` - ``deepeval/benchmarks/hellaswag/template.py`` - ``deepeval/scorer/scorer.py:Scorer.exact_match_score`` - ``trt-llm-benchmark-recipe/src/tools/acc_benchmark.py:319-336`` Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
…IP-877) ``HellaSwagBenchmark.load_problems`` called ``load_dataset()`` before ``_resolve_tasks(tasks)``, so an invalid ``--accuracy-tasks`` value triggered a multi-MB HF download (with potential network/cache failures) just to surface the validation error. Swap the order: validate tasks first, then download. Neither call depends on the other, so the reorder is mechanical. Adds a regression test (``TestTaskValidationPrecedesDatasetDownload``) that patches ``load_dataset`` and asserts it's never called when ``tasks`` resolves to a ``ValueError``. Full HellaSwag suite: 26/26 pass. Ruff + codespell clean. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
…rompt (AIP-877)
``HellaSwagTemplate.generate_output()`` renders a prompt that ends with
``"\nAnswer:"`` and never instructs the model to emit just a letter.
DeepEval enforces bare-letter output two ways:
primary: ``model.generate(prompt, schema=MultipleChoiceSchema)``
— structured generation pinned to A/B/C/D.
fallback: on ``TypeError`` (when the model client doesn't accept
``schema=``), append ``self.confinement_instructions`` to
the prompt and call ``model.generate(prompt)``.
The default ``confinement_instructions`` set when the user passes
``None`` is:
"Output 'A', 'B', 'C', or 'D'. Full answer not needed."
Aiperf takes neither path: we send the rendered template prompt
verbatim to whatever OpenAI-compatible endpoint the user pointed at,
which corresponds to DeepEval's fallback path WITHOUT the
confinement-string append. ``ExactMatchGrader`` (mirroring
``Scorer.exact_match_score``) then strictly compares the response
against the bare ``A``/``B``/``C``/``D`` gold, so verbose-but-correct
responses like ``"The answer is A."`` grade as wrong — systematically
under-grading vs DeepEval reference numbers.
Append the confinement string to the rendered prompt in
``_build_problems``, separated by ``\n\n`` (matching DeepEval's
``prompt += f"\n\n{self.confinement_instructions}"`` in
``predict()``). Pull the literal into a module-level constant
``DEEPEVAL_CONFINEMENT`` so the source-of-truth is documented and
the test suite can pin it byte-equal.
Adds three regression tests:
* ``test_constant_matches_deepeval_default`` — pins the constant
matches DeepEval's hardcoded default in ``HellaSwag.__init__``.
* ``test_prompt_ends_with_confinement`` — every rendered prompt ends
with ``"\n\n{DEEPEVAL_CONFINEMENT}"``.
* ``test_template_output_is_a_prefix_of_prompt`` — preserves the
``HellaSwagTemplate.generate_output`` byte-equal pin by asserting
the template output is a strict prefix of the final prompt, with
only the confinement suffix added.
Full HellaSwag suite: 29/29 pass (was 26). Ruff + codespell clean.
Reference:
deepeval/benchmarks/hellaswag/hellaswag.py:HellaSwag.__init__
deepeval/benchmarks/hellaswag/hellaswag.py:HellaSwag.predict
Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
… (AIP-877)
The previous ``_resolve_tasks`` guard treated ANY occurrence of ``"all"``
in the task list as "select every task":
if not tasks or "all" in {t.lower() for t in tasks}:
return list(HellaSwagTask)
So ``["all", "NOT_A_REAL_TASK"]`` returned all 192 tasks and silently
swallowed the typo. Same trap with ``["all", "Applying sunscreen"]``
— the model would evaluate every task instead of just the one
specified, with no error surfaced.
Tighten the guard:
* ``None`` / empty list / exactly ``["all"]`` (case-insensitive) →
return every task.
* ``"all"`` mixed with any other name → raise ``ValueError`` so the
user sees their mistake instead of getting an unexpected 192-task
run.
Adds two parametrized regression test groups:
* ``test_all_alone_is_case_insensitive`` (4 cases: ``all``, ``ALL``,
``All``, ``aLl``) — every casing of the bare sentinel still selects
every task.
* ``test_all_mixed_with_other_selectors_raises`` (4 cases: typo,
real-task, real-task-then-all, duplicate-all) — every mixed input
raises with the documented "'all' cannot be mixed" message.
Full HellaSwag suite: 37/37 pass (was 29). Ruff + codespell clean.
Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
Adds an explicit upper bound to the deepeval pin so the accuracy extras follow the same >=X,<Y style used elsewhere in this file (opentelemetry-*, etc.). Reviewer verified deepeval==4.0.3 still passes the HellaSwag unit tests and an E2E run, so the cap at the next major (5.0.0) leaves headroom while preserving version-policy hygiene. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
…g (AIP-877) HellaSwag accepts ``enable_cot`` for protocol uniformity but DeepEval's HellaSwag template has no CoT variant, so the flag is silently dropped. A user setting it had no way to discover this — the rendered prompts were byte-identical with and without ``--accuracy-enable-cot`` and no log signal fired. Emits a single ``info`` line at the top of ``load_problems`` so the drop is visible, keeping the accept-but-ignore stance otherwise unchanged. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
``load_dataset("Rowan/hellaswag")`` is intentionally unpinned to match
the trt-llm benchmark recipe's resolution behavior. A reviewer flagged
this as an observability gap — if the upstream HF dataset is rebased
or re-uploaded, the byte-equal pin against
``HellaSwagTemplate.generate_output`` silently follows along with no
test-suite signal. Spells out the policy in the module docstring so
the trade-off (parity-with-recipe over pin-stability) is explicit, and
flags that a SHA pin should only be added when/if the recipe pins one.
Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
…877) The lowercased-value → ``HellaSwagTask`` enum dict was being rebuilt inside ``_resolve_tasks`` on every ``load_problems`` call (~190 entries each). Since ``HellaSwagTask`` is immutable, the dict can live at module scope and be built once. Guarded with ``if _HAS_DEEPEVAL`` so the module still imports cleanly when the optional accuracy extras aren't installed. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
…IP-877) ``_build_problems`` previously scanned the full validation set once per selected task. With ``--accuracy-tasks=all`` that's ~190 tasks over a ~10K-row split — ~1.9M activity_label comparisons per load. Pre-bucketing val_set by ``activity_label`` in one pass collapses the inner scan to a direct dict lookup, making the loop O(val_rows + tasks) without changing problem order or content. DeepEval's internal ``val_set.filter(...)`` per-task is a reference-only detail (only prompt rendering is part of the parity goal), so the faster shape is free here. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
f3097f6 to
98a2c25
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/unit/accuracy/test_exact_match_grader.py (1)
54-153: ⚡ Quick winNormalize test names to
test_<function>_<scenario>_<expected>.Several names don’t clearly encode the function-under-test and expected outcome in the required format, which hurts suite consistency.
As per coding guidelines: "Test naming:
test_<function>_<scenario>_<expected>e.g.test_parse_config_missing_field_raises_error".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/accuracy/test_exact_match_grader.py` around lines 54 - 153, The test names in TestEmptyAndUnparsed, TestMultiLineNotForgiven, TestUnicodeAndNonAscii, TestExtractAnswerInterface and TestGradingResultFields do not follow the required test_<function>_<scenario>_<expected> convention; rename each test function to include the function-under-test (grade or extract_answer), a brief scenario, and the expected outcome (for example change test_empty_response_unparsed_and_incorrect -> test_grade_empty_response_unparsed_false, test_whitespace_only_response_unparsed -> test_grade_whitespace_only_unparsed_true, test_empty_pred_and_empty_gold_neither_correct_nor_unparsed -> test_grade_empty_pred_and_gold_not_unparsed_false, test_multi_line_response_does_not_match_single_letter -> test_grade_multiline_response_mismatch_false, test_explanation_prefix_does_not_match -> test_grade_explanation_prefix_mismatch_false, test_unicode_match -> test_grade_unicode_exact_match_true, test_unicode_case_sensitive -> test_grade_unicode_case_sensitive_false, test_extract_answer_strips_only -> test_extract_answer_strips_only_returns_stripped, test_extract_answer_preserves_inner -> test_extract_answer_preserves_inner_text, test_extract_answer_empty -> test_extract_answer_empty_returns_empty, test_reasoning_includes_stripped_forms -> test_grade_reasoning_contains_stripped_forms, and test_extracted_answer_is_stripped_pred -> test_grade_extracted_answer_is_stripped; update any references or parametrization to use these new names and ensure they still call ExactMatchGrader.grade and grader.extract_answer unchanged.tests/unit/accuracy/test_hellaswag_benchmark.py (1)
51-57: ⚡ Quick winAdd a return type hint to
_make_run.Please annotate the helper return type so all functions in this module are fully typed.
As per coding guidelines: `**/*.py`: “Type hints on ALL functions (params and return)”.🔧 Suggested fix
-def _make_run(): +def _make_run() -> Any: return make_benchmark_run( model_names=["test-model"], endpoint_type=EndpointType.COMPLETIONS, streaming=False, accuracy={"benchmark": AccuracyBenchmarkType.HELLASWAG}, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/accuracy/test_hellaswag_benchmark.py` around lines 51 - 57, The helper function _make_run lacks a return type annotation; add a precise return type hint that matches the value returned by calling make_benchmark_run (e.g., the BenchmarkRun or equivalent type used across tests) so the signature reads something like def _make_run() -> BenchmarkRun:, ensuring imports or forward references for that type are added if necessary; reference _make_run, make_benchmark_run, EndpointType and AccuracyBenchmarkType to locate the code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/aiperf/accuracy/benchmarks/hellaswag.py`:
- Line 262: Replace the non-ASCII multiplication glyph in the comment "loop is
O(val_rows + tasks) instead of O(tasks × val_rows)." by changing the `×`
character to an ASCII `x` (so it reads "O(tasks x val_rows)" or use `*` if
preferred), updating the comment in the same location in
src/aiperf/accuracy/benchmarks/hellaswag.py to satisfy Ruff RUF003.
In `@tests/unit/accuracy/test_exact_match_grader.py`:
- Around line 22-28: The helper function _make_run is missing a return type
annotation; add an explicit return type that matches the value returned by
make_benchmark_run (e.g., the BenchMarkRun/BenchmarkRun type used by
make_benchmark_run in this test suite) so the signature becomes def _make_run()
-> <appropriate_return_type>: and keep the existing body unchanged; reference
the function name _make_run and the factory call make_benchmark_run (and related
enums EndpointType and AccuracyBenchmarkType) to locate where to apply the
annotation.
---
Nitpick comments:
In `@tests/unit/accuracy/test_exact_match_grader.py`:
- Around line 54-153: The test names in TestEmptyAndUnparsed,
TestMultiLineNotForgiven, TestUnicodeAndNonAscii, TestExtractAnswerInterface and
TestGradingResultFields do not follow the required
test_<function>_<scenario>_<expected> convention; rename each test function to
include the function-under-test (grade or extract_answer), a brief scenario, and
the expected outcome (for example change
test_empty_response_unparsed_and_incorrect ->
test_grade_empty_response_unparsed_false, test_whitespace_only_response_unparsed
-> test_grade_whitespace_only_unparsed_true,
test_empty_pred_and_empty_gold_neither_correct_nor_unparsed ->
test_grade_empty_pred_and_gold_not_unparsed_false,
test_multi_line_response_does_not_match_single_letter ->
test_grade_multiline_response_mismatch_false,
test_explanation_prefix_does_not_match ->
test_grade_explanation_prefix_mismatch_false, test_unicode_match ->
test_grade_unicode_exact_match_true, test_unicode_case_sensitive ->
test_grade_unicode_case_sensitive_false, test_extract_answer_strips_only ->
test_extract_answer_strips_only_returns_stripped,
test_extract_answer_preserves_inner -> test_extract_answer_preserves_inner_text,
test_extract_answer_empty -> test_extract_answer_empty_returns_empty,
test_reasoning_includes_stripped_forms ->
test_grade_reasoning_contains_stripped_forms, and
test_extracted_answer_is_stripped_pred ->
test_grade_extracted_answer_is_stripped; update any references or
parametrization to use these new names and ensure they still call
ExactMatchGrader.grade and grader.extract_answer unchanged.
In `@tests/unit/accuracy/test_hellaswag_benchmark.py`:
- Around line 51-57: The helper function _make_run lacks a return type
annotation; add a precise return type hint that matches the value returned by
calling make_benchmark_run (e.g., the BenchmarkRun or equivalent type used
across tests) so the signature reads something like def _make_run() ->
BenchmarkRun:, ensuring imports or forward references for that type are added if
necessary; reference _make_run, make_benchmark_run, EndpointType and
AccuracyBenchmarkType to locate the code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 23a76a96-7459-4a65-b142-8665bded0979
📒 Files selected for processing (9)
docs/accuracy/accuracy-benchmarking.mddocs/accuracy/accuracy_stubs.mdpyproject.tomlsrc/aiperf/accuracy/benchmarks/hellaswag.pysrc/aiperf/accuracy/graders/exact_match.pysrc/aiperf/plugin/plugins.yamltests/unit/accuracy/test_accuracy_config.pytests/unit/accuracy/test_exact_match_grader.pytests/unit/accuracy/test_hellaswag_benchmark.py
✅ Files skipped from review due to trivial changes (3)
- docs/accuracy/accuracy-benchmarking.md
- src/aiperf/accuracy/graders/exact_match.py
- docs/accuracy/accuracy_stubs.md
dynamo-ops
left a comment
There was a problem hiding this comment.
Previous review comments have been addressed. Approving.
hellaswagbenchmark inplugins.yaml(default_grader: multiple_choice,default_n_shots: 0); scaffold loader raisesNotImplementedErroruntil the full DeepEval-backed implementation lands.ExactMatchGrader: registersexact_matchgrader plugin inplugins.yaml; scaffold raisesNotImplementedError— wired here because it is needed by the upcoming BigBench PR (AIP-878).[accuracy]optional group).Reference: trt-llm-benchmark-recipe DeepEval-backed hellaswag evaluation
Summary by CodeRabbit
New Features
Documentation
Tests
Chores