feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) by debermudez · Pull Request #927 · ai-dynamo/aiperf

debermudez · 2026-05-12T23:07:41Z

Plugin: registers math_500 benchmark in plugins.yaml (default_grader: math, default_n_shots: 0); scaffold loader raises NotImplementedError until the full lighteval-backed implementation lands.
500 curated math problems spanning algebra, geometry, number theory, and combinatorics; gold answers are LaTeX snippets (e.g. \frac{1}{3}, \sqrt{2}).
Uses LightevalLatexGrader (lighteval_latex, latex_gold_metric, LatexExtractionConfig on the gold side) introduced in AIP-874.
Depends on AIP-876 (lighteval sub-stack ordering).

Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py (math_500 task)

Summary by CodeRabbit

New Features
- MATH-500 benchmark implemented and available for accuracy testing with lighteval LaTeX grading and boxed-expression extraction.
Documentation
- Accuracy docs updated to mark MATH-500 implemented and adjust benchmark status (remaining stubs reduced to two).
Tests
- Added unit tests covering prompts, ground truth, task mapping, metadata, and dataset edge cases for MATH-500.

copy-pr-bot · 2026-05-12T23:07:46Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-12T23:07:52Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@0765b00d11c862b194ca0315ad85f94bbc14a7d0

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@0765b00d11c862b194ca0315ad85f94bbc14a7d0

Last updated for commit: 0765b00 • Browse code

github-actions · 2026-05-12T23:08:24Z

Fern Docs Preview: https://nvidia-preview-5735448c-1a74-4670-b822-cb2389d2949d.docs.buildwithfern.com/aiperf/dev

debermudez · 2026-05-12T23:11:33Z

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

1. AIP-874 - feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
2. AIP-877 - feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
3. AIP-878 - feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
4. AIP-875 - feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
5. AIP-876 - feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
6. AIP-879 - feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
7. AIP-880 - feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
8. AIP-881 - feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 6 of 8 — base branch is dbermudez/aip-876-implement-aime25-benchmark-loader,
depends on #926 (AIP-876) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

codecov · 2026-05-13T21:36:33Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

coderabbitai · 2026-05-29T23:34:34Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 996f9955-0414-4bec-a4af-ef060cb586ed

📥 Commits

Reviewing files that changed from the base of the PR and between d5d733f and 191a831.

📒 Files selected for processing (2)

src/aiperf/accuracy/benchmarks/math_500.py
tests/unit/accuracy/test_math_500_benchmark.py

🚧 Files skipped from review as they are similar to previous changes (2)

src/aiperf/accuracy/benchmarks/math_500.py
tests/unit/accuracy/test_math_500_benchmark.py

Walkthrough

This PR converts Math500Benchmark from a placeholder raising NotImplementedError into a fully functional, lighteval-aligned MATH-500 dataset loader that asynchronously loads problems, extracts ground truth solutions for boxed-answer grading, and passes comprehensive unit tests.

Changes

Math500Benchmark Lighteval Implementation

| Layer / File(s) | Summary |
|---|---|
| Math500Benchmark loader implementation `src/aiperf/accuracy/benchmarks/math_500.py` | Module docstring describes lighteval workflow and `\boxed{answer}` extraction; constants define dataset name and field names; `load_problems()` asynchronously loads HuggingFace MATH-500 test split via `asyncio.to_thread()` and delegates to `_build_problems()`; `_build_problems()` constructs `BenchmarkProblem` per row with problem text as prompt, full solution as ground_truth, subject as task (with fallback), and metadata fields (subject, level, generation_size). |
| Plugin configuration and metadata `src/aiperf/plugin/plugins.yaml` | Benchmark description rewritten for lighteval/LaTeX alignment; `default_grader` changed from `math` to `lighteval_latex`; `is_implemented: false` flag removed to mark benchmark as operational. |
| Math500Benchmark test suite `tests/unit/accuracy/test_math_500_benchmark.py` | New test module validates bare problem-text prompts (no instruction/CoT strings), full-solution ground_truth, subject-derived task names with fallback, insensitivity to n_shots and enable_cot, one-to-one dataset row mapping, metadata correctness (subject, level, generation_size), and edge cases (empty datasets, unicode preservation) using mocked dataset rows. |
| Test configuration stub list update `tests/unit/accuracy/test_accuracy_config.py` | `STUB_BENCHMARKS` removes `math_500` and adds `gpqa_diamond` and `lcb_codegeneration` to track remaining unimplemented benchmarks. |
| Documentation updates for implementation status `docs/accuracy/accuracy-benchmarking.md`, `docs/accuracy/accuracy_stubs.md` | accuracy-benchmarking.md adds `math_500` row to Available Benchmarks table; accuracy_stubs.md status summary marks `Math500Benchmark` as implemented; "Still Stubbed" table removes `Math500Benchmark` and lists only two remaining stubs; method counts updated to reflect 7 implemented and 2 stubbed benchmarks. |
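As a rough illustration of the loader shape summarized in the table, the sketch below pushes a blocking fetch onto a worker thread with `asyncio.to_thread()` and then maps rows to problems. It is not the code from `math_500.py`: the `BenchmarkProblem` dataclass, the `_fetch_rows()` helper, the sample rows, and the `PROBLEM_FIELD` / `LEVEL_FIELD` names are hypothetical stand-ins chosen so the example runs without aiperf or a dataset download.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any

# Assumed names for illustration; the real constants live in math_500.py.
PROBLEM_FIELD = "problem"
SOLUTION_FIELD = "solution"
SUBJECT_FIELD = "subject"
LEVEL_FIELD = "level"
TASK_NAME = "math_500"
GENERATION_SIZE = 32768


@dataclass
class BenchmarkProblem:
    """Hypothetical stand-in for aiperf's BenchmarkProblem model."""

    prompt: str
    ground_truth: str
    task: str
    metadata: dict[str, Any] = field(default_factory=dict)


def _fetch_rows() -> list[dict[str, Any]]:
    """Blocking stand-in for the dataset download (made-up sample rows)."""
    return [
        {"problem": "Compute $1/2 + 1/6$.",
         "solution": "The sum is $\\boxed{\\frac{2}{3}}$.",
         "subject": "Prealgebra", "level": 1},
        {"problem": "Solve $x^2 = 2$ for $x > 0$.",
         "solution": "So $x = \\boxed{\\sqrt{2}}$.",
         "subject": None, "level": 2},
    ]


def _build_problems(rows: list[dict[str, Any]]) -> list[BenchmarkProblem]:
    """Map each row to one problem: bare prompt, full solution as gold."""
    problems = []
    for row in rows:
        subject = row.get(SUBJECT_FIELD)
        problems.append(
            BenchmarkProblem(
                prompt=row.get(PROBLEM_FIELD) or "",
                ground_truth=row.get(SOLUTION_FIELD) or "",
                # Fall back to the benchmark name when subject is missing.
                task=subject or TASK_NAME,
                metadata={
                    "subject": subject,
                    "level": row.get(LEVEL_FIELD),
                    "generation_size": GENERATION_SIZE,
                },
            )
        )
    return problems


async def load_problems() -> list[BenchmarkProblem]:
    # Run the blocking fetch off the event loop, then build in-process.
    rows = await asyncio.to_thread(_fetch_rows)
    return _build_problems(rows)


problems = asyncio.run(load_problems())
```

Keeping `_build_problems()` synchronous and free of I/O is what lets a test suite feed it mocked rows directly, which matches how the table describes the unit tests.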

🎯 3 (Moderate) | ⏱️ ~25 minutes

🐰 A benchmark now blooms in lighteval's glow,
With boxes extracted where answers do flow,
Math-500 problems, from MATH dataset drawn,
Complete with true grading—the stub-work is gone! ✨📦

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879)' directly and clearly summarizes the main change: implementing a MATH-500 benchmark with lighteval integration, covering the primary objective of the PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/accuracy/benchmarks/math_500.py`:
- Line 98: The assignment for solution uses str(row.get(SOLUTION_FIELD, ""))
which converts None to the literal "None"; change it to use the same "or"
pattern used for task so None/empty values become an empty string — replace the
use of row.get(SOLUTION_FIELD, "") with row.get(SOLUTION_FIELD) or "" when
constructing solution (referencing the SOLUTION_FIELD symbol and the solution
variable).

In `@tests/unit/accuracy/test_math_500_benchmark.py`:
- Around line 29-36: Add a return type hint to the _make_run() function so it
declares it returns a BenchmarkRun: update the signature of _make_run to return
BenchmarkRun and ensure BenchmarkRun is imported (from
aiperf.config.resolution.plan import BenchmarkRun) so the type name resolves;
leave the function body and call to make_benchmark_run (and symbols
EndpointType, AccuracyBenchmarkType) unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 75d81a1a-d580-4a0a-a262-2dff57cc279c

📥 Commits

Reviewing files that changed from the base of the PR and between 1070bf2 and eae6f8a.

📒 Files selected for processing (6)

docs/accuracy/accuracy-benchmarking.md
docs/accuracy/accuracy_stubs.md
src/aiperf/accuracy/benchmarks/math_500.py
src/aiperf/plugin/plugins.yaml
tests/unit/accuracy/test_accuracy_config.py
tests/unit/accuracy/test_math_500_benchmark.py

💤 Files with no reviewable changes (1)

tests/unit/accuracy/test_accuracy_config.py

Implement ``Math500Benchmark`` mirroring the trt-llm benchmark recipe's ``acc_bench_lighteval.py:math_500`` configuration.

Same lighteval-aligned shape as the AIME24/25 loaders (bare problem text, zero-shot, no CoT trigger, ``generation_size=32768``) with two reference-mandated differences:

1. ``ground_truth`` is the full ``solution`` text (which contains ``\\boxed{answer}``), not a bare answer. The recipe's ``latex_gold_metric.gold_extraction_target=(LatexExtractionConfig(),)`` extracts the boxed expression at grade time via ``LightevalLatexGrader``.
2. Default pairing is ``lighteval_latex`` (not ``lighteval_expr``) because gold answers are LaTeX expressions (fractions, square roots, etc.).

Per-row ``task`` field is set to ``row["subject"]`` (falling back to the constant ``"math_500"``) so the accuracy CSV breaks down by MATH subject (algebra, geometry, number theory, ...).

Built on top of AIP-876 in the lighteval sub-stack (875 → 876 → 879). No heavy optional dep (``datasets`` is core), so CI gets 100% line + branch coverage out of the box.

Updates the stub registry: drop ``math_500`` from ``test_accuracy_config.STUB_BENCHMARKS``, drop ``is_implemented: false`` from the ``math_500`` plugins.yaml entry, switch ``default_grader`` to ``lighteval_latex``, add the ``math_500`` row to ``docs/accuracy/accuracy-benchmarking.md``, and move it from "Still Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing the Status Summary, Method Count Summary, and Suggested Implementation Order accordingly).

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
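To make the "full solution as gold, boxed expression extracted at grade time" idea concrete, here is a minimal brace-matching sketch. It is only an illustration of what pulling a boxed answer out of a solution string can look like; it is not lighteval's extraction logic, and `extract_last_boxed` is a hypothetical helper name. The sample solution text is made up.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are counted so nested LaTeX such as \\frac{1}{3} survives intact.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # no boxed expression at all

    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: treat as not extractable


solution = "Adding gives $\\frac{1}{2}+\\frac{1}{6}=\\boxed{\\frac{2}{3}}$."
gold = extract_last_boxed(solution)
```

A plain regex such as `\\boxed\{(.*?)\}` would stop at the first closing brace and cut `\frac{2}{3}` short, which is why this sketch counts brace depth instead.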

coderabbitai

♻️ Duplicate comments (1)

src/aiperf/accuracy/benchmarks/math_500.py (1)
98-98: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize nullable solution values before stringifying.

Line 98 still turns an explicit None solution into the literal "None", which then becomes the gold answer passed to the grader. Normalize first, then stringify.
Proposed fix
-            solution = str(row.get(SOLUTION_FIELD, ""))
+            solution = str(row.get(SOLUTION_FIELD) or "")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/accuracy/benchmarks/math_500.py` at line 98, Normalize the
nullable solution value before stringifying: instead of directly calling
solution = str(row.get(SOLUTION_FIELD, "")), first retrieve the raw value (e.g.,
raw_solution = row.get(SOLUTION_FIELD, "")), check if raw_solution is None and
set it to "" if so, then set solution = str(raw_solution). This ensures explicit
None values do not become the literal "None" passed to the grader.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 734a1d74-a077-4b59-9f41-6902afc0ed5b

📥 Commits

Reviewing files that changed from the base of the PR and between eae6f8a and d5d733f.

📒 Files selected for processing (6)

docs/accuracy/accuracy-benchmarking.md
docs/accuracy/accuracy_stubs.md
src/aiperf/accuracy/benchmarks/math_500.py
src/aiperf/plugin/plugins.yaml
tests/unit/accuracy/test_accuracy_config.py
tests/unit/accuracy/test_math_500_benchmark.py

💤 Files with no reviewable changes (1)

tests/unit/accuracy/test_accuracy_config.py

✅ Files skipped from review due to trivial changes (1)

docs/accuracy/accuracy_stubs.md

🚧 Files skipped from review as they are similar to previous changes (3)

docs/accuracy/accuracy-benchmarking.md
src/aiperf/plugin/plugins.yaml
tests/unit/accuracy/test_math_500_benchmark.py

…879)

`_make_run()` had no return annotation, leaving callers reading implicit ``Any``. Add ``-> BenchmarkRun`` (via a ``TYPE_CHECKING`` import of ``aiperf.config.resolution.plan.BenchmarkRun``, so the annotation is string-only thanks to ``from __future__ import annotations`` already on this file and the runtime cost is zero).

Mirrors the same annotation pattern used in the ``test_bigbench_benchmark.py`` helper after AIP-878 review feedback. Function body, ``make_benchmark_run`` call, and the ``EndpointType`` / ``AccuracyBenchmarkType`` imports are unchanged.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
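The annotation pattern described in this commit can be sketched as follows. This is an assumption-laden illustration, not the test file itself: `make_benchmark_run` here is a hypothetical stand-in that returns a plain dict, and the annotation is written as a quoted string, which behaves the same at runtime as an unquoted one under ``from __future__ import annotations``.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers; never executed at runtime,
    # so the import costs nothing and cannot fail when tests run.
    from aiperf.config.resolution.plan import BenchmarkRun


def make_benchmark_run(**kwargs):
    """Hypothetical stand-in for the shared test factory."""
    return dict(kwargs)


def _make_run() -> "BenchmarkRun":
    # The annotation stays a string, so BenchmarkRun need not exist here.
    return make_benchmark_run(benchmark="math_500")


run = _make_run()
return_annotation = _make_run.__annotations__["return"]
```

Because the name is never bound at runtime, anything that needs the real class object (for example an `isinstance` check) would still require a normal import.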

…(AIP-879)

``_build_problems`` was doing ``str(row.get(SOLUTION_FIELD, ""))``, which only falls back to ``""`` when the key is absent. If a row actually carries ``solution: None`` (key present, value ``None``), ``row.get`` returns ``None`` and ``str(None)`` produces the literal four-character string ``"None"``, which then propagates as ``BenchmarkProblem.ground_truth`` and would corrupt ``LightevalLatexGrader`` extraction at grade time.

Switch to ``row.get(SOLUTION_FIELD) or ""`` so both the absent-key case and the present-but-None case collapse to ``""``, matching the ``or``-pattern already used a few lines below for ``task`` (``row.get(SUBJECT_FIELD) or TASK_NAME``).

Tradeoff: drops the implicit ``str()`` coercion, so a future upstream schema regression that ships a non-string ``solution`` would now surface as a Pydantic ``ValidationError`` on ``BenchmarkProblem.ground_truth`` instead of being silently stringified. Upstream ``HuggingFaceH4/MATH-500`` ships ``solution`` as a string, so this is the desired loud-failure mode.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
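The difference between the two expressions is easy to check in isolation. The snippet below is a small standalone comparison using made-up rows; only ``SOLUTION_FIELD`` and the two expressions come from the commit message, and the field's string value is assumed to be ``"solution"``.

```python
SOLUTION_FIELD = "solution"  # assumed value of the constant

rows = [
    {"solution": "So the answer is $\\boxed{4}$."},  # normal row
    {"solution": None},                              # key present, value None
    {},                                              # key absent
    {"solution": 42},                                # non-string regression
]

# Old expression: the default only applies when the key is missing.
before = [str(row.get(SOLUTION_FIELD, "")) for row in rows]

# New expression: any falsy value (None, missing, "") collapses to "".
after = [row.get(SOLUTION_FIELD) or "" for row in rows]
```

Here `before[1]` is the string `"None"` while `after[1]` is `""`. The last row shows the tradeoff the commit message mentions: the old form silently produces `"42"`, while the new form passes the integer through unchanged, leaving it to downstream validation to reject.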

…k-loader

github-actions Bot added the feat label May 12, 2026

This was referenced May 12, 2026

feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928

Merged

feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

Open

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 358d5bd to 9bbe752 Compare May 12, 2026 23:25

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from e4b5d58 to 28c1485 Compare May 12, 2026 23:25

debermudez marked this pull request as draft May 12, 2026 23:27

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 9bbe752 to 599d8f4 Compare May 13, 2026 00:39

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 28c1485 to 9e0491c Compare May 13, 2026 00:39

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from 599d8f4 to d7552b6 Compare May 13, 2026 21:23

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 9e0491c to 3770e19 Compare May 13, 2026 21:23

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch from d7552b6 to b8645f1 Compare May 13, 2026 23:32

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 3770e19 to c2ac1d4 Compare May 13, 2026 23:32

debermudez force-pushed the dbermudez/aip-876-implement-aime25-benchmark-loader branch 2 times, most recently from 62d9fbe to 3e85d67 Compare May 29, 2026 20:53

Base automatically changed from dbermudez/aip-876-implement-aime25-benchmark-loader to main May 29, 2026 23:18

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from c2ac1d4 to eae6f8a Compare May 29, 2026 23:28

coderabbitai Bot reviewed May 29, 2026

Comment thread src/aiperf/accuracy/benchmarks/math_500.py Outdated

Comment thread tests/unit/accuracy/test_math_500_benchmark.py Outdated

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from eae6f8a to d5d733f Compare June 2, 2026 17:14

coderabbitai Bot reviewed Jun 2, 2026

debermudez added 2 commits June 2, 2026 10:18

debermudez marked this pull request as ready for review June 2, 2026 17:22

FrankD412 approved these changes Jun 2, 2026

Merge branch 'main' into dbermudez/aip-879-implement-math500-benchmar…

0765b00

…k-loader

debermudez merged commit 94ae26e into main Jun 2, 2026
24 of 25 checks passed

debermudez deleted the dbermudez/aip-879-implement-math500-benchmark-loader branch June 2, 2026 23:42

debermudez merged 4 commits into main from dbermudez/aip-879-implement-math500-benchmark-loader
