feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875)#925
Conversation
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@8913804ff856e2eac8f2910f7713539eea638c82Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@8913804ff856e2eac8f2910f7713539eea638c82Last updated for commit: |
Stack dependencyThis PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with Merge order:
This PR: position 4 of 8 — base branch is After each upstream PR merges, the downstream PR's branch will be rebased |
5824827 to
03e10c5
Compare
e0576be to
cd239a5
Compare
cd239a5 to
ed0edf6
Compare
f87725a to
09a0337
Compare
ed0edf6 to
9eabc25
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
118534c to
8b16196
Compare
9eabc25 to
05ece47
Compare
8b16196 to
44663c3
Compare
Implement ``AIME24Benchmark`` to mirror the trt-llm benchmark recipe's
``acc_bench_lighteval.py`` configuration for AIME 2024:
aime24 = LightevalTaskConfig(
name="aime24",
prompt_function=aime_prompt_fn,
hf_repo="HuggingFaceH4/aime_2024",
evaluation_splits=["train"],
few_shots_split=None,
few_shots_select=None,
generation_size=32768,
metric=[expr_gold_metric],
)
The recipe's ``aime_prompt_fn`` produces a ``Doc`` whose ``query`` is
the bare problem text — lighteval's prompt manager wraps it as a
single user message with no instruction prefix and no few-shot
priming. The loader emits prompts the same way: one
``BenchmarkProblem`` per dataset row, ``prompt`` = the bare
``problem`` field, ``ground_truth`` = ``str(answer)``,
``metadata.generation_size`` = 32768. ``tasks`` / ``n_shots`` /
``enable_cot`` arguments are accepted for protocol uniformity but
ignored (any of them changing the prompt would diverge from the
reference). Pair with ``LightevalExprGrader`` for the recipe's
``expr_gold_metric`` extraction.
Built on the v2 ``BenchmarkRun`` API (post-PR-#912) and on the AIP-878
test harness conventions: ``make_benchmark_run`` for fixtures,
``BenchmarkProblem``-driven assertions, ``patch`` on
``aime24.load_dataset`` for deterministic rows. The loader has no
heavy optional dependency (``datasets`` is a core dep), so no
fake-harness is needed; CI gets 100% line + branch coverage out of
the box.
Updates the stub registry: drop ``aime24`` from
``test_accuracy_config.STUB_BENCHMARKS``, drop the ``is_implemented:
false`` flag from the ``aime24`` plugins.yaml entry, switch
``default_grader`` to ``lighteval_expr``, add an ``aime24`` row to
``docs/accuracy/accuracy-benchmarking.md``, and move it from "Still
Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing the
Status Summary, Method Count Summary, and Suggested Implementation
Order sections accordingly).
Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
05ece47 to
80e91dc
Compare
WalkthroughAIME24Benchmark is implemented to load the HuggingFace AIME 2024 dataset, convert rows into problems with bare problem text prompts and stringified answers, and is registered as a production benchmark with lighteval_expr grader. The implementation is tested comprehensively and marked complete in documentation. ChangesAIME24Benchmark Implementation and Testing
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/unit/accuracy/test_aime24_benchmark.py (2)
31-37: ⚡ Quick winAdd an explicit return type to
_make_run.Line 31 defines
_make_runwithout a return annotation.As per coding guidelines, `**/*.py`: Type hints on ALL functions (parameters and return).Proposed fix
+from aiperf.config.resolution.plan import BenchmarkRun + -def _make_run(): +def _make_run() -> BenchmarkRun: return make_benchmark_run( model_names=["test-model"], endpoint_type=EndpointType.COMPLETIONS, streaming=False, accuracy={"benchmark": AccuracyBenchmarkType.AIME24}, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/accuracy/test_aime24_benchmark.py` around lines 31 - 37, Add an explicit return type annotation to the helper function _make_run so it follows the project's typing guideline; locate the _make_run definition and annotate its signature with the return type produced by make_benchmark_run (the type used for benchmark runs in your codebase), e.g., the Run/BenchmarkRun type used by make_benchmark_run, ensuring imports are added if needed; verify the signature references _make_run, make_benchmark_run, EndpointType and AccuracyBenchmarkType.AIME24 so the function remains identical in behavior but now has a concrete return type.
58-223: ⚡ Quick winRename test methods to the required
test_<function>_<scenario>_<expected>format.Several test names (for example, Line 58 and Line 130) currently use scenario-style names but omit the target function and expected-outcome suffix.
As per coding guidelines,
tests/**/*.py: Name test functions astest_<function>_<scenario>_<expected>.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/accuracy/test_aime24_benchmark.py` around lines 58 - 223, Several test functions (e.g., test_flat_prompt_is_problem_text, test_no_instruction_prefix, test_chat_message_is_single_user_message, test_n_shots_argument_does_not_affect_prompt, test_enable_cot_does_not_affect_prompt, test_returns_one_problem_per_row, test_ground_truth_is_string_form_of_answer, test_task_name_is_aime24, test_generation_size_is_32k, test_empty_dataset_returns_empty_list, test_unicode_problem_text_preserved) do not follow the required naming convention; rename each to the pattern test_<function>_<scenario>_<expected> (for example rename test_flat_prompt_is_problem_text -> test_load_problems_flat_prompt_is_problem_text, test_n_shots_argument_does_not_affect_prompt -> test_load_problems_n_shots_ignored_prompts_equal, etc.) keeping the same bodies and references to AIME24Benchmark, load_problems, TASK_NAME, DEFAULT_GENERATION_SIZE, and _make_row/_make_fake_dataset to ensure tests still locate the correct symbols.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/unit/accuracy/test_aime24_benchmark.py`:
- Around line 31-37: Add an explicit return type annotation to the helper
function _make_run so it follows the project's typing guideline; locate the
_make_run definition and annotate its signature with the return type produced by
make_benchmark_run (the type used for benchmark runs in your codebase), e.g.,
the Run/BenchmarkRun type used by make_benchmark_run, ensuring imports are added
if needed; verify the signature references _make_run, make_benchmark_run,
EndpointType and AccuracyBenchmarkType.AIME24 so the function remains identical
in behavior but now has a concrete return type.
- Around line 58-223: Several test functions (e.g.,
test_flat_prompt_is_problem_text, test_no_instruction_prefix,
test_chat_message_is_single_user_message,
test_n_shots_argument_does_not_affect_prompt,
test_enable_cot_does_not_affect_prompt, test_returns_one_problem_per_row,
test_ground_truth_is_string_form_of_answer, test_task_name_is_aime24,
test_generation_size_is_32k, test_empty_dataset_returns_empty_list,
test_unicode_problem_text_preserved) do not follow the required naming
convention; rename each to the pattern test_<function>_<scenario>_<expected>
(for example rename test_flat_prompt_is_problem_text ->
test_load_problems_flat_prompt_is_problem_text,
test_n_shots_argument_does_not_affect_prompt ->
test_load_problems_n_shots_ignored_prompts_equal, etc.) keeping the same bodies
and references to AIME24Benchmark, load_problems, TASK_NAME,
DEFAULT_GENERATION_SIZE, and _make_row/_make_fake_dataset to ensure tests still
locate the correct symbols.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 60a2bedb-329f-424d-a424-9861f61f7789
📒 Files selected for processing (6)
docs/accuracy/accuracy-benchmarking.mddocs/accuracy/accuracy_stubs.mdsrc/aiperf/accuracy/benchmarks/aime24.pysrc/aiperf/plugin/plugins.yamltests/unit/accuracy/test_accuracy_config.pytests/unit/accuracy/test_aime24_benchmark.py
💤 Files with no reviewable changes (1)
- tests/unit/accuracy/test_accuracy_config.py
aime24benchmark inplugins.yaml(default_grader: math,default_n_shots: 0); scaffold loader raisesNotImplementedErroruntil the full lighteval-backed implementation lands.aimeloader; usesLightevalExprGrader(expr_gold_metric,ExprExtractionConfig) introduced in AIP-874.LightevalExprGraderand the[accuracy]optional group (lighteval).Reference:
trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py(aime24task)Summary by CodeRabbit
New Features
Tests