
feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925

Merged
FrankD412 merged 2 commits into main from dbermudez/aip-875-implement-aime24-benchmark-loader
May 29, 2026

Conversation

@debermudez
Contributor

@debermudez debermudez commented May 12, 2026

  • Plugin: registers aime24 benchmark in plugins.yaml (default_grader: math, default_n_shots: 0); scaffold loader raises NotImplementedError until the full lighteval-backed implementation lands.
  • This is the lighteval-backed AIME 2024 path — distinct from AIP-874's DeepEval-backed aime loader; uses LightevalExprGrader (expr_gold_metric, ExprExtractionConfig) introduced in AIP-874.
  • Depends on AIP-874 for LightevalExprGrader and the [accuracy] optional group (lighteval).

Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py (aime24 task)

Summary by CodeRabbit

  • New Features

    • AIME 2024 benchmark is now fully implemented and available for use, featuring integration with lighteval expression-based evaluation.
  • Tests

    • Comprehensive test coverage added for AIME 2024 benchmark functionality.

Review Change Stack

@copy-pr-bot

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the feat label May 12, 2026
@github-actions

github-actions Bot commented May 12, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@8913804ff856e2eac8f2910f7713539eea638c82

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@8913804ff856e2eac8f2910f7713539eea638c82

Last updated for commit: 8913804

@github-actions

github-actions Bot commented May 12, 2026

@debermudez
Contributor Author

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

  1. AIP-874 — feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
  2. AIP-877 — feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
  3. AIP-878 — feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
  4. AIP-875 — feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
  5. AIP-876 — feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
  6. AIP-879 — feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
  7. AIP-880 — feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
  8. AIP-881 — feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 4 of 8 — base branch is dbermudez/aip-874-implement-aime-benchmark-loader,
depends on #849 (AIP-874) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

@debermudez debermudez force-pushed the dbermudez/aip-874-implement-aime-benchmark-loader branch from 5824827 to 03e10c5 Compare May 12, 2026 23:23
@debermudez debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from e0576be to cd239a5 Compare May 12, 2026 23:24
@debermudez debermudez marked this pull request as draft May 12, 2026 23:27
@debermudez debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from cd239a5 to ed0edf6 Compare May 13, 2026 00:37
@debermudez debermudez force-pushed the dbermudez/aip-874-implement-aime-benchmark-loader branch from f87725a to 09a0337 Compare May 13, 2026 17:06
@debermudez debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from ed0edf6 to 9eabc25 Compare May 13, 2026 21:22
@codecov

codecov Bot commented May 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@debermudez debermudez force-pushed the dbermudez/aip-874-implement-aime-benchmark-loader branch from 118534c to 8b16196 Compare May 13, 2026 23:21
@debermudez debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from 9eabc25 to 05ece47 Compare May 13, 2026 23:32
@debermudez debermudez force-pushed the dbermudez/aip-874-implement-aime-benchmark-loader branch from 8b16196 to 44663c3 Compare May 14, 2026 04:08
Base automatically changed from dbermudez/aip-874-implement-aime-benchmark-loader to main May 14, 2026 04:25
Implement ``AIME24Benchmark`` to mirror the trt-llm benchmark recipe's
``acc_bench_lighteval.py`` configuration for AIME 2024:

    aime24 = LightevalTaskConfig(
        name="aime24",
        prompt_function=aime_prompt_fn,
        hf_repo="HuggingFaceH4/aime_2024",
        evaluation_splits=["train"],
        few_shots_split=None,
        few_shots_select=None,
        generation_size=32768,
        metric=[expr_gold_metric],
    )

The recipe's ``aime_prompt_fn`` produces a ``Doc`` whose ``query`` is
the bare problem text — lighteval's prompt manager wraps it as a
single user message with no instruction prefix and no few-shot
priming. The loader emits prompts the same way: one
``BenchmarkProblem`` per dataset row, ``prompt`` = the bare
``problem`` field, ``ground_truth`` = ``str(answer)``,
``metadata.generation_size`` = 32768. ``tasks`` / ``n_shots`` /
``enable_cot`` arguments are accepted for protocol uniformity but
ignored (any of them changing the prompt would diverge from the
reference). Pair with ``LightevalExprGrader`` for the recipe's
``expr_gold_metric`` extraction.
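
The row-to-problem mapping described above can be sketched roughly as follows. This is a minimal illustration, not the code in ``aime24.py``: the ``BenchmarkProblem`` dataclass here is a simplified stand-in whose fields are assumed from the description, and ``rows_to_problems`` is a hypothetical helper name.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable, Mapping

TASK_NAME = "aime24"
DEFAULT_GENERATION_SIZE = 32768  # mirrors generation_size in the recipe config


@dataclass
class BenchmarkProblem:
    """Simplified stand-in for aiperf's BenchmarkProblem (fields are assumed)."""

    prompt: str
    ground_truth: str
    task: str
    metadata: dict[str, Any] = field(default_factory=dict)


def rows_to_problems(
    rows: Iterable[Mapping[str, Any]],
    n_shots: int = 0,
    enable_cot: bool = False,
) -> list[BenchmarkProblem]:
    """Hypothetical helper: one problem per row; n_shots/enable_cot are ignored."""
    return [
        BenchmarkProblem(
            prompt=row["problem"],            # bare problem text, no prefix
            ground_truth=str(row["answer"]),  # integer answer, stringified
            task=TASK_NAME,
            metadata={"generation_size": DEFAULT_GENERATION_SIZE},
        )
        for row in rows
    ]


rows = [{"problem": "Find the remainder when 2^10 is divided by 7.", "answer": 2}]
problems = rows_to_problems(rows, n_shots=5, enable_cot=True)
print(problems[0].prompt, problems[0].ground_truth)
```

Because the extra arguments never reach the prompt, calling the helper with or without them yields identical problems, which is the property that keeps the loader aligned with the reference.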

Built on the v2 ``BenchmarkRun`` API (post-PR-#912) and on the AIP-878
test harness conventions: ``make_benchmark_run`` for fixtures,
``BenchmarkProblem``-driven assertions, ``patch`` on
``aime24.load_dataset`` for deterministic rows. The loader has no
heavy optional dependency (``datasets`` is a core dep), so no
fake-harness is needed; CI gets 100% line + branch coverage out of
the box.

Updates the stub registry: drop ``aime24`` from
``test_accuracy_config.STUB_BENCHMARKS``, drop the ``is_implemented:
false`` flag from the ``aime24`` plugins.yaml entry, switch
``default_grader`` to ``lighteval_expr``, add an ``aime24`` row to
``docs/accuracy/accuracy-benchmarking.md``, and move it from "Still
Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing the
Status Summary, Method Count Summary, and Suggested Implementation
Order sections accordingly).

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
@debermudez debermudez force-pushed the dbermudez/aip-875-implement-aime24-benchmark-loader branch from 05ece47 to 80e91dc Compare May 27, 2026 19:41
@debermudez debermudez marked this pull request as ready for review May 27, 2026 19:41
@coderabbitai

coderabbitai Bot commented May 27, 2026

Walkthrough

AIME24Benchmark now loads the HuggingFace AIME 2024 dataset and converts each row into a problem whose prompt is the bare problem text and whose ground truth is the stringified answer. It is registered as a production benchmark with the lighteval_expr grader, is covered by a dedicated test module, and is marked complete in the documentation.

Changes

AIME24Benchmark Implementation and Testing

| Layer / File(s) | Summary |
| --- | --- |
| AIME24Benchmark Core Implementation (src/aiperf/accuracy/benchmarks/aime24.py) | AIME24Benchmark.load_problems() loads HuggingFaceH4/aime_2024 train split and converts each row to a BenchmarkProblem with bare problem text as prompt, stringified integer answer as ground_truth, and metadata including generation_size=32768; replaces previous NotImplementedError placeholder with working async loader. |
| Plugin Registration and Metadata (src/aiperf/plugin/plugins.yaml) | accuracy_benchmark.aime24 plugin entry updated to set default_grader: lighteval_expr, remove is_implemented: false flag, and update description to reflect lighteval-backed expr_gold_metric alignment. |
| Test Coverage for AIME24Benchmark (tests/unit/accuracy/test_aime24_benchmark.py, tests/unit/accuracy/test_accuracy_config.py) | New test module validates that prompts are exactly bare problem text (no instruction/decoration), parameters n_shots and enable_cot do not affect prompt generation, one problem per dataset row, metadata includes 32768 generation_size, and edge cases (empty dataset, unicode) are handled; test configuration updated to replace aime24 with aime25 in stub validation. |
| Documentation Updates (docs/accuracy/accuracy-benchmarking.md, docs/accuracy/accuracy_stubs.md) | Benchmark documentation updated to mark AIME24Benchmark as implemented with lighteval_expr grader, added to available benchmarks table, removed from stubbed benchmarks list, and method count/implementation order sections adjusted to reflect completion. |
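
The invariants listed in the test-coverage row can be sketched as plain assertions against an async loader. This is only an illustration under stated assumptions: ``FakeAIME24Loader`` is a hypothetical stand-in fed rows directly, its ``load_problems`` signature is assumed, and problems are represented as dicts rather than the real ``BenchmarkProblem`` type.

```python
import asyncio
from typing import Any


class FakeAIME24Loader:
    """Hypothetical stand-in for AIME24Benchmark that takes rows directly."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self._rows = rows

    async def load_problems(
        self, n_shots: int = 0, enable_cot: bool = False
    ) -> list[dict[str, Any]]:
        # n_shots and enable_cot are accepted but deliberately unused.
        return [
            {
                "prompt": row["problem"],
                "ground_truth": str(row["answer"]),
                "metadata": {"generation_size": 32768},
            }
            for row in self._rows
        ]


async def check_invariants() -> int:
    rows = [
        {"problem": "Évaluez ∑ k pour k ≤ 3.", "answer": 6},  # unicode survives
        {"problem": "Find x if 2x = 46.", "answer": 23},
    ]
    loader = FakeAIME24Loader(rows)
    base = await loader.load_problems()

    assert len(base) == len(rows)                                 # one per row
    assert [p["prompt"] for p in base] == [r["problem"] for r in rows]
    assert base == await loader.load_problems(n_shots=5)          # ignored
    assert base == await loader.load_problems(enable_cot=True)    # ignored
    assert all(isinstance(p["ground_truth"], str) for p in base)
    assert await FakeAIME24Loader([]).load_problems() == []       # empty input
    return len(base)


checked = asyncio.run(check_invariants())
print(f"checked {checked} problems")
```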

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 A hop through the AIME now complete,
Where bare problem text keeps prompts sweet,
Lighteval blessed with lighteval_expr grace,
Tests ensure each answer finds its place,
From stub to finished—mark the victory!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 17.65% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: implementing a lighteval-backed AIME 2024 benchmark, with the reference ticket AIP-875 for context. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
tests/unit/accuracy/test_aime24_benchmark.py (2)

31-37: ⚡ Quick win

Add an explicit return type to _make_run.

Line 31 defines _make_run without a return annotation.

Proposed fix
+from aiperf.config.resolution.plan import BenchmarkRun
+
-def _make_run():
+def _make_run() -> BenchmarkRun:
     return make_benchmark_run(
         model_names=["test-model"],
         endpoint_type=EndpointType.COMPLETIONS,
         streaming=False,
         accuracy={"benchmark": AccuracyBenchmarkType.AIME24},
     )
As per coding guidelines, `**/*.py`: Type hints on ALL functions (parameters and return).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/accuracy/test_aime24_benchmark.py` around lines 31 - 37, Add an
explicit return type annotation to the helper function _make_run so it follows
the project's typing guideline; locate the _make_run definition and annotate its
signature with the return type produced by make_benchmark_run (the type used for
benchmark runs in your codebase), e.g., the Run/BenchmarkRun type used by
make_benchmark_run, ensuring imports are added if needed; verify the signature
references _make_run, make_benchmark_run, EndpointType and
AccuracyBenchmarkType.AIME24 so the function remains identical in behavior but
now has a concrete return type.

58-223: ⚡ Quick win

Rename test methods to the required test_<function>_<scenario>_<expected> format.

Several test names (for example, Line 58 and Line 130) currently use scenario-style names but omit the target function and expected-outcome suffix.

As per coding guidelines, tests/**/*.py: Name test functions as test_<function>_<scenario>_<expected>.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/accuracy/test_aime24_benchmark.py` around lines 58 - 223, Several
test functions (e.g., test_flat_prompt_is_problem_text,
test_no_instruction_prefix, test_chat_message_is_single_user_message,
test_n_shots_argument_does_not_affect_prompt,
test_enable_cot_does_not_affect_prompt, test_returns_one_problem_per_row,
test_ground_truth_is_string_form_of_answer, test_task_name_is_aime24,
test_generation_size_is_32k, test_empty_dataset_returns_empty_list,
test_unicode_problem_text_preserved) do not follow the required naming
convention; rename each to the pattern test_<function>_<scenario>_<expected>
(for example rename test_flat_prompt_is_problem_text ->
test_load_problems_flat_prompt_is_problem_text,
test_n_shots_argument_does_not_affect_prompt ->
test_load_problems_n_shots_ignored_prompts_equal, etc.) keeping the same bodies
and references to AIME24Benchmark, load_problems, TASK_NAME,
DEFAULT_GENERATION_SIZE, and _make_row/_make_fake_dataset to ensure tests still
locate the correct symbols.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 60a2bedb-329f-424d-a424-9861f61f7789

📥 Commits

Reviewing files that changed from the base of the PR and between d797629 and 80e91dc.

📒 Files selected for processing (6)
  • docs/accuracy/accuracy-benchmarking.md
  • docs/accuracy/accuracy_stubs.md
  • src/aiperf/accuracy/benchmarks/aime24.py
  • src/aiperf/plugin/plugins.yaml
  • tests/unit/accuracy/test_accuracy_config.py
  • tests/unit/accuracy/test_aime24_benchmark.py
💤 Files with no reviewable changes (1)
  • tests/unit/accuracy/test_accuracy_config.py

@FrankD412 FrankD412 enabled auto-merge (squash) May 29, 2026 00:24
@FrankD412 FrankD412 merged commit abf5d23 into main May 29, 2026
23 of 24 checks passed
@FrankD412 FrankD412 deleted the dbermudez/aip-875-implement-aime24-benchmark-loader branch May 29, 2026 00:25

3 participants