feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) by debermudez · Pull Request #925 · ai-dynamo/aiperf

debermudez · 2026-05-12T23:07:00Z

Plugin: registers aime24 benchmark in plugins.yaml (default_grader: math, default_n_shots: 0); scaffold loader raises NotImplementedError until the full lighteval-backed implementation lands.
This is the lighteval-backed AIME 2024 path — distinct from AIP-874's DeepEval-backed aime loader; uses LightevalExprGrader (expr_gold_metric, ExprExtractionConfig) introduced in AIP-874.
Depends on AIP-874 for LightevalExprGrader and the [accuracy] optional group (lighteval).

Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py (aime24 task)

Summary by CodeRabbit

New Features
- AIME 2024 benchmark is now fully implemented and available for use, featuring integration with lighteval expression-based evaluation.
Tests
- Comprehensive test coverage added for AIME 2024 benchmark functionality.

copy-pr-bot · 2026-05-12T23:07:04Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-12T23:07:09Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@8913804ff856e2eac8f2910f7713539eea638c82

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@8913804ff856e2eac8f2910f7713539eea638c82

Last updated for commit: 8913804 • Browse code

github-actions · 2026-05-12T23:07:42Z

Fern Docs Preview: https://nvidia-preview-c5abb165-a730-4621-940a-7b2a3e9fbd07.docs.buildwithfern.com/aiperf/dev

debermudez · 2026-05-12T23:10:43Z

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

AIP-874 — feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
AIP-877 — feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
AIP-878 — feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
AIP-875 — feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
AIP-876 — feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
AIP-879 — feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
AIP-880 — feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
AIP-881 — feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 4 of 8 — base branch is dbermudez/aip-874-implement-aime-benchmark-loader,
depends on #849 (AIP-874) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

codecov · 2026-05-13T21:34:58Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


Implement ``AIME24Benchmark`` to mirror the trt-llm benchmark recipe's ``acc_bench_lighteval.py`` configuration for AIME 2024:

    aime24 = LightevalTaskConfig(
        name="aime24",
        prompt_function=aime_prompt_fn,
        hf_repo="HuggingFaceH4/aime_2024",
        evaluation_splits=["train"],
        few_shots_split=None,
        few_shots_select=None,
        generation_size=32768,
        metric=[expr_gold_metric],
    )

The recipe's ``aime_prompt_fn`` produces a ``Doc`` whose ``query`` is the bare problem text; lighteval's prompt manager wraps it as a single user message with no instruction prefix and no few-shot priming. The loader emits prompts the same way: one ``BenchmarkProblem`` per dataset row, ``prompt`` = the bare ``problem`` field, ``ground_truth`` = ``str(answer)``, ``metadata.generation_size`` = 32768. ``tasks`` / ``n_shots`` / ``enable_cot`` arguments are accepted for protocol uniformity but ignored (any of them changing the prompt would diverge from the reference). Pair with ``LightevalExprGrader`` for the recipe's ``expr_gold_metric`` extraction.

Built on the v2 ``BenchmarkRun`` API (post-PR-#912) and on the AIP-878 test harness conventions: ``make_benchmark_run`` for fixtures, ``BenchmarkProblem``-driven assertions, ``patch`` on ``aime24.load_dataset`` for deterministic rows. The loader has no heavy optional dependency (``datasets`` is a core dep), so no fake-harness is needed; CI gets 100% line + branch coverage out of the box.

Updates the stub registry:

- drop ``aime24`` from ``test_accuracy_config.STUB_BENCHMARKS``
- drop the ``is_implemented: false`` flag from the ``aime24`` plugins.yaml entry
- switch ``default_grader`` to ``lighteval_expr``
- add an ``aime24`` row to ``docs/accuracy/accuracy-benchmarking.md``
- move it from "Still Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing the Status Summary, Method Count Summary, and Suggested Implementation Order sections accordingly)

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
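The row-to-problem mapping described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `aime24.py`: the `BenchmarkProblem` dataclass here is a hypothetical stand-in for aiperf's class, the `task` field and the `rows_to_problems` helper are assumptions made for the example, and the sample row is invented rather than taken from the dataset.

```python
from dataclasses import dataclass, field
from typing import Any

TASK_NAME = "aime24"
DEFAULT_GENERATION_SIZE = 32768  # mirrors generation_size in the recipe config


@dataclass
class BenchmarkProblem:
    """Hypothetical stand-in for aiperf's BenchmarkProblem (fields assumed)."""

    prompt: str
    ground_truth: str
    task: str
    metadata: dict[str, Any] = field(default_factory=dict)


def rows_to_problems(rows: list[dict[str, Any]]) -> list[BenchmarkProblem]:
    """Emit one problem per row: bare problem text, stringified answer."""
    return [
        BenchmarkProblem(
            prompt=row["problem"],  # no instruction prefix, no few-shot priming
            ground_truth=str(row["answer"]),  # integer answers become strings
            task=TASK_NAME,
            metadata={"generation_size": DEFAULT_GENERATION_SIZE},
        )
        for row in rows
    ]


# Invented sample row, used only to exercise the mapping.
rows = [{"problem": "Find x such that 2x = 84.", "answer": 42}]
problems = rows_to_problems(rows)
```

Keeping the prompt identical to the `problem` field is what lets results stay comparable with the reference recipe; any added decoration would change what the model sees.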

coderabbitai · 2026-05-27T19:45:46Z

Walkthrough

AIME24Benchmark is implemented to load the HuggingFace AIME 2024 dataset, convert rows into problems with bare problem text prompts and stringified answers, and is registered as a production benchmark with lighteval_expr grader. The implementation is tested comprehensively and marked complete in documentation.

Changes

AIME24Benchmark Implementation and Testing

Layer / File(s)	Summary
AIME24Benchmark Core Implementation `src/aiperf/accuracy/benchmarks/aime24.py`	`AIME24Benchmark.load_problems()` loads HuggingFaceH4/aime_2024 train split and converts each row to a `BenchmarkProblem` with bare problem text as prompt, stringified integer answer as ground_truth, and metadata including `generation_size=32768`; replaces previous `NotImplementedError` placeholder with working async loader.
Plugin Registration and Metadata `src/aiperf/plugin/plugins.yaml`	`accuracy_benchmark.aime24` plugin entry updated to set `default_grader: lighteval_expr`, remove `is_implemented: false` flag, and update description to reflect lighteval-backed `expr_gold_metric` alignment.
Test Coverage for AIME24Benchmark `tests/unit/accuracy/test_aime24_benchmark.py`, `tests/unit/accuracy/test_accuracy_config.py`	New test module validates that prompts are exactly bare problem text (no instruction/decoration), parameters `n_shots` and `enable_cot` do not affect prompt generation, one problem per dataset row, metadata includes 32768 `generation_size`, and edge cases (empty dataset, unicode) are handled; test configuration updated to replace `aime24` with `aime25` in stub validation.
Documentation Updates `docs/accuracy/accuracy-benchmarking.md`, `docs/accuracy/accuracy_stubs.md`	Benchmark documentation updated to mark `AIME24Benchmark` as implemented with `lighteval_expr` grader, added to available benchmarks table, removed from stubbed benchmarks list, and method count/implementation order sections adjusted to reflect completion.
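The testing approach summarized in the table (deterministic rows, ignored `n_shots` / `enable_cot`, empty and unicode inputs) might look roughly like the sketch below. It is an assumption-laden illustration: `FakeAIME24Loader`, `_fetch_rows`, and `load_with_fake_rows` are hypothetical names, and the real tests patch `aime24.load_dataset` on the actual `AIME24Benchmark` rather than a static method on a stand-in class.

```python
import asyncio
from typing import Any
from unittest.mock import patch


class FakeAIME24Loader:
    """Hypothetical stand-in for AIME24Benchmark; not the aiperf class."""

    @staticmethod
    def _fetch_rows() -> list[dict[str, Any]]:
        # A real loader would read the HuggingFaceH4/aime_2024 train split here.
        raise RuntimeError("dataset access is not expected in unit tests")

    async def load_problems(
        self, n_shots: int = 0, enable_cot: bool = False
    ) -> list[dict[str, str]]:
        # n_shots and enable_cot are accepted but deliberately ignored.
        return [
            {"prompt": row["problem"], "ground_truth": str(row["answer"])}
            for row in self._fetch_rows()
        ]


def load_with_fake_rows(
    rows: list[dict[str, Any]], **kwargs: Any
) -> list[dict[str, str]]:
    """Run the async loader with the dataset fetch patched to return `rows`."""
    with patch.object(FakeAIME24Loader, "_fetch_rows", return_value=rows):
        return asyncio.run(FakeAIME24Loader().load_problems(**kwargs))


# Invented row with non-ASCII text to check that unicode survives unchanged.
sample_rows = [{"problem": "Résoudre ∑ k pour k ≤ 3", "answer": 6}]
baseline = load_with_fake_rows(sample_rows)
with_options = load_with_fake_rows(sample_rows, n_shots=5, enable_cot=True)
```

Comparing `baseline` with `with_options` is one way to express the "parameters do not affect prompt generation" property as a single equality check.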

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 A hop through the AIME now complete,
Where bare problem text keeps prompts sweet,
Lighteval blessed with lighteval_expr grace,
Tests ensure each answer finds its place,
From stub to finished—mark the victory!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 17.65% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely describes the main change: implementing a lighteval-backed AIME 2024 benchmark, with the reference ticket AIP-875 for context.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

🧹 Nitpick comments (2)

tests/unit/accuracy/test_aime24_benchmark.py (2)

31-37: ⚡ Quick win

Add an explicit return type to _make_run.

Line 31 defines _make_run without a return annotation.

Proposed fix

+from aiperf.config.resolution.plan import BenchmarkRun
+
-def _make_run():
+def _make_run() -> BenchmarkRun:
     return make_benchmark_run(
         model_names=["test-model"],
         endpoint_type=EndpointType.COMPLETIONS,
         streaming=False,
         accuracy={"benchmark": AccuracyBenchmarkType.AIME24},
     )

As per coding guidelines, `**/*.py`: Type hints on ALL functions (parameters and return).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/accuracy/test_aime24_benchmark.py` around lines 31 - 37, Add an
explicit return type annotation to the helper function _make_run so it follows
the project's typing guideline; locate the _make_run definition and annotate its
signature with the return type produced by make_benchmark_run (the type used for
benchmark runs in your codebase), e.g., the Run/BenchmarkRun type used by
make_benchmark_run, ensuring imports are added if needed; verify the signature
references _make_run, make_benchmark_run, EndpointType and
AccuracyBenchmarkType.AIME24 so the function remains identical in behavior but
now has a concrete return type.

58-223: ⚡ Quick win

Rename test methods to the required test_<function>_<scenario>_<expected> format.

Several test names (for example, Line 58 and Line 130) currently use scenario-style names but omit the target function and expected-outcome suffix.

As per coding guidelines, tests/**/*.py: Name test functions as test_<function>_<scenario>_<expected>.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/accuracy/test_aime24_benchmark.py` around lines 58 - 223, Several
test functions (e.g., test_flat_prompt_is_problem_text,
test_no_instruction_prefix, test_chat_message_is_single_user_message,
test_n_shots_argument_does_not_affect_prompt,
test_enable_cot_does_not_affect_prompt, test_returns_one_problem_per_row,
test_ground_truth_is_string_form_of_answer, test_task_name_is_aime24,
test_generation_size_is_32k, test_empty_dataset_returns_empty_list,
test_unicode_problem_text_preserved) do not follow the required naming
convention; rename each to the pattern test_<function>_<scenario>_<expected>
(for example rename test_flat_prompt_is_problem_text ->
test_load_problems_flat_prompt_is_problem_text,
test_n_shots_argument_does_not_affect_prompt ->
test_load_problems_n_shots_ignored_prompts_equal, etc.) keeping the same bodies
and references to AIME24Benchmark, load_problems, TASK_NAME,
DEFAULT_GENERATION_SIZE, and _make_row/_make_fake_dataset to ensure tests still
locate the correct symbols.
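As a rough aid for the rename, a name audit along these lines could flag tests that omit the target function. This is only a heuristic sketch: `follows_convention` is a hypothetical helper, not part of aiperf or CodeRabbit, and it assumes the scenario and expected-outcome parts are each at least one underscore-separated word.

```python
def follows_convention(test_name: str, function_name: str) -> bool:
    """Heuristically check the test_<function>_<scenario>_<expected> pattern."""
    prefix = f"test_{function_name}_"
    if not test_name.startswith(prefix):
        return False
    # Whatever follows the function name should hold at least two words:
    # one for the scenario and one for the expected outcome.
    remainder = test_name[len(prefix):]
    parts = [part for part in remainder.split("_") if part]
    return len(parts) >= 2


old_name_ok = follows_convention(
    "test_flat_prompt_is_problem_text", "load_problems"
)
new_name_ok = follows_convention(
    "test_load_problems_n_shots_ignored_prompts_equal", "load_problems"
)
```

The check cannot tell where the scenario ends and the expected outcome begins, so it only catches names that skip the function prefix or are too short.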


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 60a2bedb-329f-424d-a424-9861f61f7789

📥 Commits

Reviewing files that changed from the base of the PR and between d797629 and 80e91dc.

📒 Files selected for processing (6)

docs/accuracy/accuracy-benchmarking.md
docs/accuracy/accuracy_stubs.md
src/aiperf/accuracy/benchmarks/aime24.py
src/aiperf/plugin/plugins.yaml
tests/unit/accuracy/test_accuracy_config.py
tests/unit/accuracy/test_aime24_benchmark.py

💤 Files with no reviewable changes (1)

tests/unit/accuracy/test_accuracy_config.py


github-actions Bot added the feat label May 12, 2026

This was referenced May 12, 2026

feat(accuracy): implement AIME accuracy benchmark #849

Merged

feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923

Merged

feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924

Merged

debermudez force-pushed the dbermudez/aip-874-implement-aime-benchmark-loader branch from 5824827 to 03e10c5 Compare May 12, 2026 23:23

debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from e0576be to cd239a5 Compare May 12, 2026 23:24

debermudez marked this pull request as draft May 12, 2026 23:27

debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from cd239a5 to ed0edf6 Compare May 13, 2026 00:37

debermudez force-pushed the dbermudez/aip-874-implement-aime-benchmark-loader branch from f87725a to 09a0337 Compare May 13, 2026 17:06

debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from ed0edf6 to 9eabc25 Compare May 13, 2026 21:22

debermudez force-pushed the dbermudez/aip-874-implement-aime-benchmark-loader branch from 118534c to 8b16196 Compare May 13, 2026 23:21

debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from 9eabc25 to 05ece47 Compare May 13, 2026 23:32

debermudez force-pushed the dbermudez/aip-874-implement-aime-benchmark-loader branch from 8b16196 to 44663c3 Compare May 14, 2026 04:08

Base automatically changed from dbermudez/aip-874-implement-aime-benchmark-loader to main May 14, 2026 04:25

debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from 05ece47 to 80e91dc Compare May 27, 2026 19:41

debermudez marked this pull request as ready for review May 27, 2026 19:41

coderabbitai Bot reviewed May 27, 2026


dynamo-ops approved these changes May 27, 2026


Merge branch 'main' into dbermudez/aip-875-implement-aime24-benchmark…

8913804

…-loader

FrankD412 enabled auto-merge (squash) May 29, 2026 00:24

FrankD412 approved these changes May 29, 2026


FrankD412 merged commit abf5d23 into main May 29, 2026
23 of 24 checks passed

FrankD412 deleted the dbermudez/aip-875-implement-aime24-benchmark-loader branch May 29, 2026 00:25

FrankD412 merged 2 commits into main from dbermudez/aip-875-implement-aime24-benchmark-loader
