From d5d733f0d954d8ab05456eee1ac12a9fe6b9047e Mon Sep 17 00:00:00 2001 From: Elias Bermudez Date: Tue, 26 May 2026 10:07:34 -0700 Subject: [PATCH 1/3] feat(accuracy): MATH-500 lighteval-aligned benchmark loader (AIP-879) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Implement ``Math500Benchmark`` mirroring the trt-llm benchmark recipe's ``acc_bench_lighteval.py:math_500`` configuration. Same lighteval-aligned shape as the AIME24/25 loaders (bare problem text, zero-shot, no CoT trigger, ``generation_size=32768``) with two reference-mandated differences: 1. ``ground_truth`` is the full ``solution`` text (which contains ``\\boxed{answer}``), not a bare answer. The recipe's ``latex_gold_metric.gold_extraction_target=(LatexExtractionConfig(),)`` extracts the boxed expression at grade time via ``LightevalLatexGrader``. 2. Default pairing is ``lighteval_latex`` (not ``lighteval_expr``) because gold answers are LaTeX expressions (fractions, square roots, etc.). Per-row ``task`` field is set to ``row["subject"]`` (falling back to the constant ``"math_500"``) so the accuracy CSV breaks down by MATH subject (algebra, geometry, number theory, ...). Built on top of AIP-876 in the lighteval sub-stack (875 → 876 → 879). No heavy optional dep — ``datasets`` is core — so CI gets 100% line + branch coverage out of the box. Updates the stub registry: drop ``math_500`` from ``test_accuracy_config.STUB_BENCHMARKS``, drop ``is_implemented: false`` from the ``math_500`` plugins.yaml entry, switch ``default_grader`` to ``lighteval_latex``, add the ``math_500`` row to ``docs/accuracy/accuracy-benchmarking.md``, and move it from "Still Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing the Status Summary, Method Count Summary, and Suggested Implementation Order accordingly). Signed-off-by: Elias Bermudez --- docs/accuracy/accuracy-benchmarking.md | 1 + docs/accuracy/accuracy_stubs.md | 21 +- src/aiperf/accuracy/benchmarks/math_500.py | 104 ++++++- src/aiperf/plugin/plugins.yaml | 9 +- tests/unit/accuracy/test_accuracy_config.py | 1 - .../unit/accuracy/test_math_500_benchmark.py | 256 ++++++++++++++++++ 6 files changed, 367 insertions(+), 25 deletions(-) create mode 100644 tests/unit/accuracy/test_math_500_benchmark.py diff --git a/docs/accuracy/accuracy-benchmarking.md b/docs/accuracy/accuracy-benchmarking.md index 18e8dabab..adabfe136 100644 --- a/docs/accuracy/accuracy-benchmarking.md +++ b/docs/accuracy/accuracy-benchmarking.md @@ -76,6 +76,7 @@ system message). | `bigbench` | `exact_match` | 3 | `lukaemon/bbh` (trt-llm/DeepEval reference; 27 subtasks, canonical CoT/non-CoT prompt files) | | `aime24` | `lighteval_expr` | 0 | `HuggingFaceH4/aime_2024` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) | | `aime25` | `lighteval_expr` | 0 | `yentinglin/aime_2025` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) | +| `math_500` | `lighteval_latex` | 0 | `HuggingFaceH4/MATH-500` (trt-llm/lighteval reference, gold is full solution containing `\boxed{answer}`, `latex_gold_metric`) | ## CLI Flags diff --git a/docs/accuracy/accuracy_stubs.md b/docs/accuracy/accuracy_stubs.md index e02953cf6..aeafe8703 100644 --- a/docs/accuracy/accuracy_stubs.md +++ b/docs/accuracy/accuracy_stubs.md @@ -7,7 +7,7 @@ This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected. -**Status summary:** With the AIME25 loader landing on top of the AIME24 / BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, and `AIME25Benchmark` are fully implemented; the remaining benchmarks (`math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs. +**Status summary:** With the MATH-500 loader landing on top of the AIME25 / AIME24 / BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`, and `Math500Benchmark` are fully implemented; the remaining benchmarks (`gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs. ## Table of Contents @@ -175,14 +175,14 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method. | 4 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 | **IMPLEMENTED.** Loads `lukaemon/bbh` (27 BBH subtasks). Prompt rendering delegates to `deepeval.benchmarks.BigBenchHard`'s `BigBenchHardTemplate.generate_output`, which reads the 27 canonical CoT/shot prompt files DeepEval ships as package data. Pairs with `exact_match` for the recipe's strict `Scorer.exact_match_score` semantics. `default_n_shots=3`, `default_enable_cot=true`. Requires the `[accuracy]` extras (deepeval). | | 5 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `lighteval_expr` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/aime_2024` (train split) and emits the bare problem text as a single user message — no instruction prefix, no few-shot priming. Mirrors the trt-llm benchmark recipe's `acc_bench_lighteval.py` configuration (`few_shots_split=None`, `generation_size=32768`). Pairs with `lighteval_expr` for the recipe's `expr_gold_metric` extraction. | | 6 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `lighteval_expr` | 0 | **IMPLEMENTED.** Same lighteval-aligned shape as `AIME24Benchmark` but pointed at `yentinglin/aime_2025` (the recipe's `aime25` task config). Identical prompt rendering, generation size, and grader pairing. | +| 7 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `lighteval_latex` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/MATH-500` (test split). Same lighteval-aligned shape as AIME24/25, but `ground_truth` is the full `solution` text (containing `\boxed{answer}`); `LightevalLatexGrader` extracts the boxed expression at grade time. Per-row `task` = `subject` so the accuracy CSV breaks down by MATH subject. | ### Still Stubbed | # | Class | File | Plugin Key | Default Grader | Default N-Shots | |---|-------|------|------------|----------------|-----------------| -| 1 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 | -| 2 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 | -| 3 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 | +| 1 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 | +| 2 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 | **Each benchmark has 1 method to implement:** @@ -309,13 +309,13 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu | Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods | |-----------|-------------|---------------|------------------|-------------------| | Graders | 7 (all) | 0 | — | 0 | -| Benchmarks | 6 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`) | 3 | 1 (`load_problems`) | 3 | +| Benchmarks | 7 (incl. MMLU, AIME, HellaSwag, BigBench, AIME24, AIME25, Math500) | 2 | 1 (`load_problems`) | 2 | | Record Processor | 1 (`AccuracyRecordProcessor`) | 0 | — | 0 | | Results Processor | 1 (`AccuracyResultsProcessor`) | 0 | — | 0 | | Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 | — | 0 | | Data Exporter | 1 (`AccuracyDataExporter`) | 0 | — | 0 | | Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 | -| **Total** | **17** | **4** | | **4** | +| **Total** | **18** | **3** | | **3** | ### Self-Disabling Pattern @@ -323,12 +323,11 @@ Processors and exporters raise their `Disabled` exception **in `__init__`** when ### Suggested Implementation Order -The processors, exporters, all graders, and six benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`) are already wired end-to-end. The remaining work is the three stub benchmarks; mirror the existing loader whose grader matches: +The processors, exporters, all graders, and seven benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`, `Math500Benchmark`) are already wired end-to-end. The remaining work is the two stub benchmarks; mirror the existing loader whose grader matches: -1. **`math_500`** — mirror `AIME24Benchmark` (`benchmarks/aime24.py`) for the lighteval-aligned shape; pair with `lighteval_latex`. -2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader. -3. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader. -4. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported. +1. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader. +2. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader. +3. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported. ### Key Files for Reference diff --git a/src/aiperf/accuracy/benchmarks/math_500.py b/src/aiperf/accuracy/benchmarks/math_500.py index 04588fa4b..14f85bf73 100644 --- a/src/aiperf/accuracy/benchmarks/math_500.py +++ b/src/aiperf/accuracy/benchmarks/math_500.py @@ -1,31 +1,117 @@ # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 +"""MATH-500 benchmark loader, aligned with the trt-llm lighteval reference. + +Mirrors ``acc_bench_lighteval.py:math_500``: + + math_500 = LightevalTaskConfig( + name="math_500", + prompt_function=prompt_fn, # query=line["problem"], choices=[line["solution"]] + hf_repo="HuggingFaceH4/MATH-500", + evaluation_splits=["test"], + few_shots_split=None, + generation_size=32768, + metric=[latex_gold_metric], + ) + +Two notable differences from the AIME24/AIME25 loaders: + +1. ``ground_truth`` is the full ``solution`` text (which contains a + ``\\boxed{answer}``), not a bare answer. ``LightevalLatexGrader``'s + ``LatexExtractionConfig`` extracts the boxed answer from the + solution at grade time. This matches the recipe's + ``latex_gold_metric.gold_extraction_target=(LatexExtractionConfig(),)``. +2. Pair with ``LightevalLatexGrader`` (default), not + ``LightevalExprGrader`` — gold answers are LaTeX expressions + (fractions, square roots, etc.). + +Reference: + trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:156 +""" + from __future__ import annotations -from typing import TYPE_CHECKING +import asyncio +from typing import TYPE_CHECKING, Any + +from datasets import Dataset, load_dataset -from aiperf.accuracy.models import BenchmarkProblem +from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem from aiperf.common.mixins import AIPerfLoggerMixin if TYPE_CHECKING: from aiperf.config.resolution.plan import BenchmarkRun +DATASET_NAME = "HuggingFaceH4/MATH-500" +TASK_NAME = "math_500" + +# lighteval's math_500 task config: ``generation_size=32768``. +DEFAULT_GENERATION_SIZE = 32768 + +# Schema field names in HuggingFaceH4/MATH-500. +PROBLEM_FIELD = "problem" +SOLUTION_FIELD = "solution" +SUBJECT_FIELD = "subject" +LEVEL_FIELD = "level" + class Math500Benchmark(AIPerfLoggerMixin): - """Registered placeholder for a future MATH-500 loader. + """MATH-500 lighteval-aligned benchmark loader. - `load_problems()` intentionally raises NotImplementedError in this release; - use the MMLU benchmark when a working accuracy loader is required. + Loads ``HuggingFaceH4/MATH-500`` (test split) and emits one user + message per problem containing the bare problem text — matching + lighteval's ``prompt_fn``. Gold is the full ``solution`` text; + ``LightevalLatexGrader`` extracts the boxed answer at grade time. """ - def __init__(self, run: BenchmarkRun, **kwargs) -> None: + def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None: super().__init__(**kwargs) self.run = run async def load_problems( self, tasks: list[str] | None, n_shots: int, enable_cot: bool ) -> list[BenchmarkProblem]: - raise NotImplementedError( - "math_500 benchmark is not yet implemented; only 'mmlu' is available in this release." - ) + """Load MATH-500 problems lighteval-style. + + Args: + tasks: Ignored — lighteval's MATH-500 task has no subtask + filtering (subjects are kept in metadata for reporting, + but lighteval evaluates the full split). Use the + aggregated CSV per-subject row to break results down + after the run. + n_shots: Ignored — the lighteval reference is zero-shot + (``few_shots_split=None``). + enable_cot: Ignored — lighteval's ``prompt_fn`` does not + add a CoT trigger. + + Returns: + One ``BenchmarkProblem`` per dataset row, in dataset order. + """ + ds: Dataset = await asyncio.to_thread(load_dataset, DATASET_NAME, split="test") + return await asyncio.to_thread(self._build_problems, ds) + + def _build_problems(self, ds: Dataset) -> list[BenchmarkProblem]: + problems: list[BenchmarkProblem] = [] + for row in ds: + problem = row[PROBLEM_FIELD] + solution = str(row.get(SOLUTION_FIELD, "")) + messages: list[AccuracyChatMessage] = [{"role": "user", "content": problem}] + problems.append( + BenchmarkProblem( + prompt=problem, + # Gold is the full solution containing \\boxed{answer}; + # LightevalLatexGrader extracts the boxed expression. + ground_truth=solution, + # Use ``subject`` as the per-row task so the + # accuracy CSV breaks down by MATH subject. + task=row.get(SUBJECT_FIELD) or TASK_NAME, + metadata={ + "subject": row.get(SUBJECT_FIELD, ""), + "level": row.get(LEVEL_FIELD), + "generation_size": DEFAULT_GENERATION_SIZE, + }, + raw_messages=messages, + ) + ) + return problems diff --git a/src/aiperf/plugin/plugins.yaml b/src/aiperf/plugin/plugins.yaml index 6353e491c..d18e42880 100644 --- a/src/aiperf/plugin/plugins.yaml +++ b/src/aiperf/plugin/plugins.yaml @@ -1506,12 +1506,13 @@ accuracy_benchmark: math_500: class: aiperf.accuracy.benchmarks.math_500:Math500Benchmark description: | - MATH-500 benchmark with 500 curated mathematical reasoning problems - spanning algebra, geometry, number theory, and combinatorics. + MATH-500 benchmark, aligned with the trt-llm benchmark recipe's + lighteval-backed configuration (HuggingFaceH4/MATH-500 + lighteval + ``latex_gold_metric``). Gold is the full ``solution`` text; + ``LightevalLatexGrader`` extracts the boxed expression at grade time. metadata: - default_grader: math + default_grader: lighteval_latex default_n_shots: 0 - is_implemented: false gpqa_diamond: class: aiperf.accuracy.benchmarks.gpqa_diamond:GPQADiamondBenchmark diff --git a/tests/unit/accuracy/test_accuracy_config.py b/tests/unit/accuracy/test_accuracy_config.py index 2beb4dc87..109e8e3fc 100644 --- a/tests/unit/accuracy/test_accuracy_config.py +++ b/tests/unit/accuracy/test_accuracy_config.py @@ -23,7 +23,6 @@ # This branch (AIP-874) implements ``aime``, ``math``, and ``code_execution``, # so those names are absent from the stub lists. STUB_BENCHMARKS = ( - "math_500", "gpqa_diamond", "lcb_codegeneration", ) diff --git a/tests/unit/accuracy/test_math_500_benchmark.py b/tests/unit/accuracy/test_math_500_benchmark.py new file mode 100644 index 000000000..1dac7de6a --- /dev/null +++ b/tests/unit/accuracy/test_math_500_benchmark.py @@ -0,0 +1,256 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""Unit tests for ``Math500Benchmark`` after lighteval alignment. + +The recipe's ``acc_bench_lighteval.py:math_500`` uses ``prompt_fn`` +which produces ``Doc(query=line["problem"], choices=[line["solution"]], +gold_index=0)``. Aiperf mirrors this: prompt is bare problem text; +ground_truth is the full solution (containing the boxed answer). +""" + +from __future__ import annotations + +from typing import Any +from unittest.mock import MagicMock, patch + +import pytest + +from aiperf.accuracy.benchmarks.math_500 import ( + DEFAULT_GENERATION_SIZE, + TASK_NAME, + Math500Benchmark, +) +from aiperf.accuracy.models import BenchmarkProblem +from aiperf.plugin.enums import AccuracyBenchmarkType, EndpointType +from tests.unit.conftest import make_benchmark_run + + +def _make_run(): + return make_benchmark_run( + model_names=["test-model"], + endpoint_type=EndpointType.COMPLETIONS, + streaming=False, + accuracy={"benchmark": AccuracyBenchmarkType.MATH_500}, + ) + + +def _make_row( + problem: str = "What is 1+1?", + solution: str = "The answer is $\\boxed{2}$.", + subject: str = "Algebra", + level: int | None = 1, +) -> dict[str, Any]: + return { + "problem": problem, + "solution": solution, + "subject": subject, + "level": level, + } + + +def _make_fake_dataset(rows: list[dict[str, Any]]) -> MagicMock: + ds = MagicMock() + ds.__iter__ = MagicMock(side_effect=lambda: iter(rows)) + ds.__len__ = MagicMock(return_value=len(rows)) + ds.__getitem__ = MagicMock(side_effect=lambda i: rows[i]) + return ds + + +class TestPromptIsBareProblemText: + @pytest.mark.asyncio + async def test_flat_prompt_is_problem_text(self) -> None: + rows = [_make_row(problem="Find x.")] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + assert problems[0].prompt == "Find x." + + @pytest.mark.asyncio + async def test_no_instruction_prefix(self) -> None: + rows = [_make_row(problem="Q?")] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + prompt = problems[0].prompt + assert "Solve the following" not in prompt + assert "boxed" not in prompt + assert "Let's think" not in prompt + + @pytest.mark.asyncio + async def test_chat_message_is_single_user_message(self) -> None: + rows = [_make_row()] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + msgs = problems[0].raw_messages + assert msgs is not None + assert len(msgs) == 1 + assert msgs[0]["role"] == "user" + + +class TestGroundTruthIsFullSolution: + """Lighteval's ``prompt_fn`` puts ``line["solution"]`` in + ``choices[0]``; ``latex_gold_metric`` extracts the boxed answer + from it. Aiperf stores the full solution in + ``BenchmarkProblem.ground_truth`` so the lighteval grader can do + the same extraction at grade time.""" + + @pytest.mark.asyncio + async def test_ground_truth_is_full_solution(self) -> None: + rows = [_make_row(solution="Step one: simplify. Step two: \\boxed{42}.")] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + assert problems[0].ground_truth == ( + "Step one: simplify. Step two: \\boxed{42}." + ) + + +class TestTaskFieldIsSubject: + """Per-row ``subject`` becomes the ``task`` so the accuracy CSV + breaks down by MATH subject.""" + + @pytest.mark.asyncio + async def test_subject_used_as_task_name(self) -> None: + rows = [ + _make_row(subject="Geometry"), + _make_row(subject="Algebra"), + _make_row(subject="Number Theory"), + ] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + assert {p.task for p in problems} == { + "Geometry", + "Algebra", + "Number Theory", + } + + @pytest.mark.asyncio + async def test_missing_subject_falls_back_to_task_name(self) -> None: + rows = [{"problem": "Q", "solution": "S"}] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + assert problems[0].task == TASK_NAME + + +class TestNShotsAndCoTAreIgnored: + @pytest.mark.asyncio + async def test_n_shots_argument_does_not_affect_prompt(self) -> None: + rows = [_make_row()] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + zero_shot = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + five_shot = await bench.load_problems( + tasks=None, n_shots=5, enable_cot=False + ) + assert zero_shot[0].prompt == five_shot[0].prompt + + @pytest.mark.asyncio + async def test_enable_cot_does_not_affect_prompt(self) -> None: + rows = [_make_row()] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + no_cot = await bench.load_problems(tasks=None, n_shots=0, enable_cot=False) + with_cot = await bench.load_problems(tasks=None, n_shots=0, enable_cot=True) + assert no_cot[0].prompt == with_cot[0].prompt + + +class TestLoadProblemsCore: + @pytest.mark.asyncio + async def test_returns_one_problem_per_row(self) -> None: + rows = [_make_row(problem=f"q{i}") for i in range(5)] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + assert len(problems) == 5 + assert all(isinstance(p, BenchmarkProblem) for p in problems) + + @pytest.mark.asyncio + async def test_metadata_carries_subject_level_gen_size(self) -> None: + rows = [_make_row(subject="Geometry", level=4)] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + meta = problems[0].metadata + assert meta["subject"] == "Geometry" + assert meta["level"] == 4 + assert meta["generation_size"] == DEFAULT_GENERATION_SIZE + assert DEFAULT_GENERATION_SIZE == 32768 + + +class TestPathologicalDatasetRows: + @pytest.mark.asyncio + async def test_empty_dataset_returns_empty_list(self) -> None: + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset([]), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + assert problems == [] + + @pytest.mark.asyncio + async def test_unicode_in_problem_preserved(self) -> None: + rows = [_make_row(problem="∫ x dx = ?")] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + assert "∫" in problems[0].prompt From 66144686a144a0b850205bd7ce6b8f8dd4ed23a1 Mon Sep 17 00:00:00 2001 From: Elias Bermudez Date: Tue, 2 Jun 2026 10:18:20 -0700 Subject: [PATCH 2/3] test(accuracy): annotate _make_run return type in math_500 test (AIP-879) `_make_run()` had no return annotation, leaving callers reading implicit ``Any``. Add ``-> BenchmarkRun`` (via a ``TYPE_CHECKING`` import of ``aiperf.config.resolution.plan.BenchmarkRun``, so the annotation is string-only thanks to ``from __future__ import annotations`` already on this file and the runtime cost is zero). Mirrors the same annotation pattern used in the ``test_bigbench_benchmark.py`` helper after AIP-878 review feedback. Function body, ``make_benchmark_run`` call, and the ``EndpointType`` / ``AccuracyBenchmarkType`` imports are unchanged. Signed-off-by: Elias Bermudez --- tests/unit/accuracy/test_math_500_benchmark.py | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/tests/unit/accuracy/test_math_500_benchmark.py b/tests/unit/accuracy/test_math_500_benchmark.py index 1dac7de6a..112bc68a2 100644 --- a/tests/unit/accuracy/test_math_500_benchmark.py +++ b/tests/unit/accuracy/test_math_500_benchmark.py @@ -11,7 +11,7 @@ from __future__ import annotations -from typing import Any +from typing import TYPE_CHECKING, Any from unittest.mock import MagicMock, patch import pytest @@ -25,8 +25,11 @@ from aiperf.plugin.enums import AccuracyBenchmarkType, EndpointType from tests.unit.conftest import make_benchmark_run +if TYPE_CHECKING: + from aiperf.config.resolution.plan import BenchmarkRun -def _make_run(): + +def _make_run() -> BenchmarkRun: return make_benchmark_run( model_names=["test-model"], endpoint_type=EndpointType.COMPLETIONS, From 191a831ae5d7bf6453c61af9431cc7bbf185d686 Mon Sep 17 00:00:00 2001 From: Elias Bermudez Date: Tue, 2 Jun 2026 10:21:40 -0700 Subject: [PATCH 3/3] fix(accuracy): coerce missing/None MATH-500 solution to empty string (AIP-879) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ``_build_problems`` was doing ``str(row.get(SOLUTION_FIELD, ""))``, which only falls back to ``""`` when the key is absent. If a row actually carries ``solution: None`` (key present, value ``None``), ``row.get`` returns ``None`` and ``str(None)`` produces the literal four-character string ``"None"`` — which then propagates as ``BenchmarkProblem.ground_truth`` and would corrupt ``LightevalLatexGrader`` extraction at grade time. Switch to ``row.get(SOLUTION_FIELD) or ""`` so both the absent-key case and the present-but-None case collapse to ``""``, matching the ``or``-pattern already used a few lines below for ``task`` (``row.get(SUBJECT_FIELD) or TASK_NAME``). Tradeoff: drops the implicit ``str()`` coercion, so a future upstream schema regression that ships a non-string ``solution`` would now surface as a Pydantic ``ValidationError`` on ``BenchmarkProblem.ground_truth`` instead of being silently stringified. Upstream ``HuggingFaceH4/MATH-500`` ships ``solution`` as a string, so this is the desired loud-failure mode. Signed-off-by: Elias Bermudez --- src/aiperf/accuracy/benchmarks/math_500.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/aiperf/accuracy/benchmarks/math_500.py b/src/aiperf/accuracy/benchmarks/math_500.py index 14f85bf73..ed63c56f4 100644 --- a/src/aiperf/accuracy/benchmarks/math_500.py +++ b/src/aiperf/accuracy/benchmarks/math_500.py @@ -95,7 +95,7 @@ def _build_problems(self, ds: Dataset) -> list[BenchmarkProblem]: problems: list[BenchmarkProblem] = [] for row in ds: problem = row[PROBLEM_FIELD] - solution = str(row.get(SOLUTION_FIELD, "")) + solution = row.get(SOLUTION_FIELD) or "" messages: list[AccuracyChatMessage] = [{"role": "user", "content": problem}] problems.append( BenchmarkProblem(