feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879)#927
Conversation
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@0765b00d11c862b194ca0315ad85f94bbc14a7d0Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@0765b00d11c862b194ca0315ad85f94bbc14a7d0Last updated for commit: |
Stack dependencyThis PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with Merge order:
This PR: position 6 of 8 — base branch is After each upstream PR merges, the downstream PR's branch will be rebased |
358d5bd to
9bbe752
Compare
e4b5d58 to
28c1485
Compare
9bbe752 to
599d8f4
Compare
28c1485 to
9e0491c
Compare
599d8f4 to
d7552b6
Compare
9e0491c to
3770e19
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
d7552b6 to
b8645f1
Compare
3770e19 to
c2ac1d4
Compare
62d9fbe to
3e85d67
Compare
c2ac1d4 to
eae6f8a
Compare
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
WalkthroughThis PR converts ChangesMath500Benchmark Lighteval Implementation
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/aiperf/accuracy/benchmarks/math_500.py`:
- Line 98: The assignment for solution uses str(row.get(SOLUTION_FIELD, ""))
which converts None to the literal "None"; change it to use the same "or"
pattern used for task so None/empty values become an empty string — replace the
use of row.get(SOLUTION_FIELD, "") with row.get(SOLUTION_FIELD) or "" when
constructing solution (referencing the SOLUTION_FIELD symbol and the solution
variable).
In `@tests/unit/accuracy/test_math_500_benchmark.py`:
- Around line 29-36: Add a return type hint to the _make_run() function so it
declares it returns a BenchmarkRun: update the signature of _make_run to return
BenchmarkRun and ensure BenchmarkRun is imported (from
aiperf.config.resolution.plan import BenchmarkRun) so the type name resolves;
leave the function body and call to make_benchmark_run (and symbols
EndpointType, AccuracyBenchmarkType) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 75d81a1a-d580-4a0a-a262-2dff57cc279c
📒 Files selected for processing (6)
docs/accuracy/accuracy-benchmarking.mddocs/accuracy/accuracy_stubs.mdsrc/aiperf/accuracy/benchmarks/math_500.pysrc/aiperf/plugin/plugins.yamltests/unit/accuracy/test_accuracy_config.pytests/unit/accuracy/test_math_500_benchmark.py
💤 Files with no reviewable changes (1)
- tests/unit/accuracy/test_accuracy_config.py
Implement ``Math500Benchmark`` mirroring the trt-llm benchmark recipe's
``acc_bench_lighteval.py:math_500`` configuration. Same lighteval-aligned
shape as the AIME24/25 loaders (bare problem text, zero-shot, no CoT
trigger, ``generation_size=32768``) with two reference-mandated
differences:
1. ``ground_truth`` is the full ``solution`` text (which contains
``\\boxed{answer}``), not a bare answer. The recipe's
``latex_gold_metric.gold_extraction_target=(LatexExtractionConfig(),)``
extracts the boxed expression at grade time via
``LightevalLatexGrader``.
2. Default pairing is ``lighteval_latex`` (not ``lighteval_expr``)
because gold answers are LaTeX expressions (fractions, square
roots, etc.).
Per-row ``task`` field is set to ``row["subject"]`` (falling back to
the constant ``"math_500"``) so the accuracy CSV breaks down by MATH
subject (algebra, geometry, number theory, ...).
Built on top of AIP-876 in the lighteval sub-stack
(875 → 876 → 879). No heavy optional dep — ``datasets`` is core —
so CI gets 100% line + branch coverage out of the box.
Updates the stub registry: drop ``math_500`` from
``test_accuracy_config.STUB_BENCHMARKS``, drop ``is_implemented:
false`` from the ``math_500`` plugins.yaml entry, switch
``default_grader`` to ``lighteval_latex``, add the ``math_500`` row
to ``docs/accuracy/accuracy-benchmarking.md``, and move it from
"Still Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing
the Status Summary, Method Count Summary, and Suggested
Implementation Order accordingly).
Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
eae6f8a to
d5d733f
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/aiperf/accuracy/benchmarks/math_500.py (1)
98-98:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winNormalize nullable
solutionvalues before stringifying.Line 98 still turns an explicit
Nonesolution into the literal"None", which then becomes the gold answer passed to the grader. Normalize first, then stringify.Proposed fix
- solution = str(row.get(SOLUTION_FIELD, "")) + solution = str(row.get(SOLUTION_FIELD) or "")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/aiperf/accuracy/benchmarks/math_500.py` at line 98, Normalize the nullable solution value before stringifying: instead of directly calling solution = str(row.get(SOLUTION_FIELD, "")), first retrieve the raw value (e.g., raw_solution = row.get(SOLUTION_FIELD, "")), check if raw_solution is None and set it to "" if so, then set solution = str(raw_solution). This ensures explicit None values do not become the literal "None" passed to the grader.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/aiperf/accuracy/benchmarks/math_500.py`:
- Line 98: Normalize the nullable solution value before stringifying: instead of
directly calling solution = str(row.get(SOLUTION_FIELD, "")), first retrieve the
raw value (e.g., raw_solution = row.get(SOLUTION_FIELD, "")), check if
raw_solution is None and set it to "" if so, then set solution =
str(raw_solution). This ensures explicit None values do not become the literal
"None" passed to the grader.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 734a1d74-a077-4b59-9f41-6902afc0ed5b
📒 Files selected for processing (6)
docs/accuracy/accuracy-benchmarking.mddocs/accuracy/accuracy_stubs.mdsrc/aiperf/accuracy/benchmarks/math_500.pysrc/aiperf/plugin/plugins.yamltests/unit/accuracy/test_accuracy_config.pytests/unit/accuracy/test_math_500_benchmark.py
💤 Files with no reviewable changes (1)
- tests/unit/accuracy/test_accuracy_config.py
✅ Files skipped from review due to trivial changes (1)
- docs/accuracy/accuracy_stubs.md
🚧 Files skipped from review as they are similar to previous changes (3)
- docs/accuracy/accuracy-benchmarking.md
- src/aiperf/plugin/plugins.yaml
- tests/unit/accuracy/test_math_500_benchmark.py
…879) `_make_run()` had no return annotation, leaving callers reading implicit ``Any``. Add ``-> BenchmarkRun`` (via a ``TYPE_CHECKING`` import of ``aiperf.config.resolution.plan.BenchmarkRun``, so the annotation is string-only thanks to ``from __future__ import annotations`` already on this file and the runtime cost is zero). Mirrors the same annotation pattern used in the ``test_bigbench_benchmark.py`` helper after AIP-878 review feedback. Function body, ``make_benchmark_run`` call, and the ``EndpointType`` / ``AccuracyBenchmarkType`` imports are unchanged. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
…(AIP-879) ``_build_problems`` was doing ``str(row.get(SOLUTION_FIELD, ""))``, which only falls back to ``""`` when the key is absent. If a row actually carries ``solution: None`` (key present, value ``None``), ``row.get`` returns ``None`` and ``str(None)`` produces the literal four-character string ``"None"`` — which then propagates as ``BenchmarkProblem.ground_truth`` and would corrupt ``LightevalLatexGrader`` extraction at grade time. Switch to ``row.get(SOLUTION_FIELD) or ""`` so both the absent-key case and the present-but-None case collapse to ``""``, matching the ``or``-pattern already used a few lines below for ``task`` (``row.get(SUBJECT_FIELD) or TASK_NAME``). Tradeoff: drops the implicit ``str()`` coercion, so a future upstream schema regression that ships a non-string ``solution`` would now surface as a Pydantic ``ValidationError`` on ``BenchmarkProblem.ground_truth`` instead of being silently stringified. Upstream ``HuggingFaceH4/MATH-500`` ships ``solution`` as a string, so this is the desired loud-failure mode. Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
math_500benchmark inplugins.yaml(default_grader: math,default_n_shots: 0); scaffold loader raisesNotImplementedErroruntil the full lighteval-backed implementation lands.\frac{1}{3},\sqrt{2}).LightevalLatexGrader(lighteval_latex,latex_gold_metric,LatexExtractionConfigon the gold side) introduced in AIP-874.Reference:
trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py(math_500task)Summary by CodeRabbit
New Features
Documentation
Tests