From d5d733f0d954d8ab05456eee1ac12a9fe6b9047e Mon Sep 17 00:00:00 2001
From: Elias Bermudez <dbermudez@nvidia.com>
Date: Tue, 26 May 2026 10:07:34 -0700
Subject: [PATCH 1/3] feat(accuracy): MATH-500 lighteval-aligned benchmark
 loader (AIP-879)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Implement ``Math500Benchmark`` mirroring the trt-llm benchmark recipe's
``acc_bench_lighteval.py:math_500`` configuration. Same lighteval-aligned
shape as the AIME24/25 loaders (bare problem text, zero-shot, no CoT
trigger, ``generation_size=32768``) with two reference-mandated
differences:

1. ``ground_truth`` is the full ``solution`` text (which contains
   ``\\boxed{answer}``), not a bare answer. The recipe's
   ``latex_gold_metric.gold_extraction_target=(LatexExtractionConfig(),)``
   extracts the boxed expression at grade time via
   ``LightevalLatexGrader``.
2. Default pairing is ``lighteval_latex`` (not ``lighteval_expr``)
   because gold answers are LaTeX expressions (fractions, square
   roots, etc.).

Per-row ``task`` field is set to ``row["subject"]`` (falling back to
the constant ``"math_500"``) so the accuracy CSV breaks down by MATH
subject (algebra, geometry, number theory, ...).

Built on top of AIP-876 in the lighteval sub-stack
(875 → 876 → 879). No heavy optional dep — ``datasets`` is core —
so CI gets 100% line + branch coverage out of the box.

Updates the stub registry: drop ``math_500`` from
``test_accuracy_config.STUB_BENCHMARKS``, drop ``is_implemented:
false`` from the ``math_500`` plugins.yaml entry, switch
``default_grader`` to ``lighteval_latex``, add the ``math_500`` row
to ``docs/accuracy/accuracy-benchmarking.md``, and move it from
"Still Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing
the Status Summary, Method Count Summary, and Suggested
Implementation Order accordingly).

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
---
 docs/accuracy/accuracy-benchmarking.md        |   1 +
 docs/accuracy/accuracy_stubs.md               |  21 +-
 src/aiperf/accuracy/benchmarks/math_500.py    | 104 ++++++-
 src/aiperf/plugin/plugins.yaml                |   9 +-
 tests/unit/accuracy/test_accuracy_config.py   |   1 -
 .../unit/accuracy/test_math_500_benchmark.py  | 256 ++++++++++++++++++
 6 files changed, 367 insertions(+), 25 deletions(-)
 create mode 100644 tests/unit/accuracy/test_math_500_benchmark.py

diff --git a/docs/accuracy/accuracy-benchmarking.md b/docs/accuracy/accuracy-benchmarking.md
index 18e8dabab..adabfe136 100644
--- a/docs/accuracy/accuracy-benchmarking.md
+++ b/docs/accuracy/accuracy-benchmarking.md
@@ -76,6 +76,7 @@ system message).
 | `bigbench` | `exact_match` | 3 | `lukaemon/bbh` (trt-llm/DeepEval reference; 27 subtasks, canonical CoT/non-CoT prompt files) |
 | `aime24` | `lighteval_expr` | 0 | `HuggingFaceH4/aime_2024` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |
 | `aime25` | `lighteval_expr` | 0 | `yentinglin/aime_2025` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |
+| `math_500` | `lighteval_latex` | 0 | `HuggingFaceH4/MATH-500` (trt-llm/lighteval reference, gold is full solution containing `\boxed{answer}`, `latex_gold_metric`) |
 
 ## CLI Flags
 
diff --git a/docs/accuracy/accuracy_stubs.md b/docs/accuracy/accuracy_stubs.md
index e02953cf6..aeafe8703 100644
--- a/docs/accuracy/accuracy_stubs.md
+++ b/docs/accuracy/accuracy_stubs.md
@@ -7,7 +7,7 @@
 
 This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected.
 
-**Status summary:** With the AIME25 loader landing on top of the AIME24 / BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, and `AIME25Benchmark` are fully implemented; the remaining benchmarks (`math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
+**Status summary:** With the MATH-500 loader landing on top of the AIME25 / AIME24 / BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`, and `Math500Benchmark` are fully implemented; the remaining benchmarks (`gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
 
 ## Table of Contents
 
@@ -175,14 +175,14 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method.
 | 4 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 | **IMPLEMENTED.** Loads `lukaemon/bbh` (27 BBH subtasks). Prompt rendering delegates to `deepeval.benchmarks.BigBenchHard`'s `BigBenchHardTemplate.generate_output`, which reads the 27 canonical CoT/shot prompt files DeepEval ships as package data. Pairs with `exact_match` for the recipe's strict `Scorer.exact_match_score` semantics. `default_n_shots=3`, `default_enable_cot=true`. Requires the `[accuracy]` extras (deepeval). |
 | 5 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `lighteval_expr` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/aime_2024` (train split) and emits the bare problem text as a single user message — no instruction prefix, no few-shot priming. Mirrors the trt-llm benchmark recipe's `acc_bench_lighteval.py` configuration (`few_shots_split=None`, `generation_size=32768`). Pairs with `lighteval_expr` for the recipe's `expr_gold_metric` extraction. |
 | 6 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `lighteval_expr` | 0 | **IMPLEMENTED.** Same lighteval-aligned shape as `AIME24Benchmark` but pointed at `yentinglin/aime_2025` (the recipe's `aime25` task config). Identical prompt rendering, generation size, and grader pairing. |
+| 7 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `lighteval_latex` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/MATH-500` (test split). Same lighteval-aligned shape as AIME24/25, but `ground_truth` is the full `solution` text (containing `\boxed{answer}`); `LightevalLatexGrader` extracts the boxed expression at grade time. Per-row `task` = `subject` so the accuracy CSV breaks down by MATH subject. |
 
 ### Still Stubbed
 
 | # | Class | File | Plugin Key | Default Grader | Default N-Shots |
 |---|-------|------|------------|----------------|-----------------|
-| 1 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
-| 2 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
-| 3 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
+| 1 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
+| 2 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
 
 **Each benchmark has 1 method to implement:**
 
@@ -309,13 +309,13 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu
 | Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods |
 |-----------|-------------|---------------|------------------|-------------------|
 | Graders | 7 (all) | 0 | — | 0 |
-| Benchmarks | 6 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`) | 3 | 1 (`load_problems`) | 3 |
+| Benchmarks | 7 (incl. MMLU, AIME, HellaSwag, BigBench, AIME24, AIME25, Math500) | 2 | 1 (`load_problems`) | 2 |
 | Record Processor | 1 (`AccuracyRecordProcessor`) | 0 | — | 0 |
 | Results Processor | 1 (`AccuracyResultsProcessor`) | 0 | — | 0 |
 | Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 | — | 0 |
 | Data Exporter | 1 (`AccuracyDataExporter`) | 0 | — | 0 |
 | Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 |
-| **Total** | **17** | **4** | | **4** |
+| **Total** | **18** | **3** | | **3** |
 
 ### Self-Disabling Pattern
 
@@ -323,12 +323,11 @@ Processors and exporters raise their `Disabled` exception **in `__init__`** when
 
 ### Suggested Implementation Order
 
-The processors, exporters, all graders, and six benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`) are already wired end-to-end. The remaining work is the three stub benchmarks; mirror the existing loader whose grader matches:
+The processors, exporters, all graders, and seven benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`, `Math500Benchmark`) are already wired end-to-end. The remaining work is the two stub benchmarks; mirror the existing loader whose grader matches:
 
-1. **`math_500`** — mirror `AIME24Benchmark` (`benchmarks/aime24.py`) for the lighteval-aligned shape; pair with `lighteval_latex`.
-2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader.
-3. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
-4. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.
+1. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader.
+2. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
+3. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.
 
 ### Key Files for Reference
 
diff --git a/src/aiperf/accuracy/benchmarks/math_500.py b/src/aiperf/accuracy/benchmarks/math_500.py
index 04588fa4b..14f85bf73 100644
--- a/src/aiperf/accuracy/benchmarks/math_500.py
+++ b/src/aiperf/accuracy/benchmarks/math_500.py
@@ -1,31 +1,117 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 
+"""MATH-500 benchmark loader, aligned with the trt-llm lighteval reference.
+
+Mirrors ``acc_bench_lighteval.py:math_500``:
+
+    math_500 = LightevalTaskConfig(
+        name="math_500",
+        prompt_function=prompt_fn,        # query=line["problem"], choices=[line["solution"]]
+        hf_repo="HuggingFaceH4/MATH-500",
+        evaluation_splits=["test"],
+        few_shots_split=None,
+        generation_size=32768,
+        metric=[latex_gold_metric],
+    )
+
+Two notable differences from the AIME24/AIME25 loaders:
+
+1. ``ground_truth`` is the full ``solution`` text (which contains a
+   ``\\boxed{answer}``), not a bare answer. ``LightevalLatexGrader``'s
+   ``LatexExtractionConfig`` extracts the boxed answer from the
+   solution at grade time. This matches the recipe's
+   ``latex_gold_metric.gold_extraction_target=(LatexExtractionConfig(),)``.
+2. Pair with ``LightevalLatexGrader`` (default), not
+   ``LightevalExprGrader`` — gold answers are LaTeX expressions
+   (fractions, square roots, etc.).
+
+Reference:
+    trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:156
+"""
+
 from __future__ import annotations
 
-from typing import TYPE_CHECKING
+import asyncio
+from typing import TYPE_CHECKING, Any
+
+from datasets import Dataset, load_dataset
 
-from aiperf.accuracy.models import BenchmarkProblem
+from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem
 from aiperf.common.mixins import AIPerfLoggerMixin
 
 if TYPE_CHECKING:
     from aiperf.config.resolution.plan import BenchmarkRun
 
+DATASET_NAME = "HuggingFaceH4/MATH-500"
+TASK_NAME = "math_500"
+
+# lighteval's math_500 task config: ``generation_size=32768``.
+DEFAULT_GENERATION_SIZE = 32768
+
+# Schema field names in HuggingFaceH4/MATH-500.
+PROBLEM_FIELD = "problem"
+SOLUTION_FIELD = "solution"
+SUBJECT_FIELD = "subject"
+LEVEL_FIELD = "level"
+
 
 class Math500Benchmark(AIPerfLoggerMixin):
-    """Registered placeholder for a future MATH-500 loader.
+    """MATH-500 lighteval-aligned benchmark loader.
 
-    `load_problems()` intentionally raises NotImplementedError in this release;
-    use the MMLU benchmark when a working accuracy loader is required.
+    Loads ``HuggingFaceH4/MATH-500`` (test split) and emits one user
+    message per problem containing the bare problem text — matching
+    lighteval's ``prompt_fn``. Gold is the full ``solution`` text;
+    ``LightevalLatexGrader`` extracts the boxed answer at grade time.
     """
 
-    def __init__(self, run: BenchmarkRun, **kwargs) -> None:
+    def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
         super().__init__(**kwargs)
         self.run = run
 
     async def load_problems(
         self, tasks: list[str] | None, n_shots: int, enable_cot: bool
     ) -> list[BenchmarkProblem]:
-        raise NotImplementedError(
-            "math_500 benchmark is not yet implemented; only 'mmlu' is available in this release."
-        )
+        """Load MATH-500 problems lighteval-style.
+
+        Args:
+            tasks: Ignored — lighteval's MATH-500 task has no subtask
+                filtering (subjects are kept in metadata for reporting,
+                but lighteval evaluates the full split). Use the
+                aggregated CSV per-subject row to break results down
+                after the run.
+            n_shots: Ignored — the lighteval reference is zero-shot
+                (``few_shots_split=None``).
+            enable_cot: Ignored — lighteval's ``prompt_fn`` does not
+                add a CoT trigger.
+
+        Returns:
+            One ``BenchmarkProblem`` per dataset row, in dataset order.
+        """
+        ds: Dataset = await asyncio.to_thread(load_dataset, DATASET_NAME, split="test")
+        return await asyncio.to_thread(self._build_problems, ds)
+
+    def _build_problems(self, ds: Dataset) -> list[BenchmarkProblem]:
+        problems: list[BenchmarkProblem] = []
+        for row in ds:
+            problem = row[PROBLEM_FIELD]
+            solution = str(row.get(SOLUTION_FIELD, ""))
+            messages: list[AccuracyChatMessage] = [{"role": "user", "content": problem}]
+            problems.append(
+                BenchmarkProblem(
+                    prompt=problem,
+                    # Gold is the full solution containing \\boxed{answer};
+                    # LightevalLatexGrader extracts the boxed expression.
+                    ground_truth=solution,
+                    # Use ``subject`` as the per-row task so the
+                    # accuracy CSV breaks down by MATH subject.
+                    task=row.get(SUBJECT_FIELD) or TASK_NAME,
+                    metadata={
+                        "subject": row.get(SUBJECT_FIELD, ""),
+                        "level": row.get(LEVEL_FIELD),
+                        "generation_size": DEFAULT_GENERATION_SIZE,
+                    },
+                    raw_messages=messages,
+                )
+            )
+        return problems
diff --git a/src/aiperf/plugin/plugins.yaml b/src/aiperf/plugin/plugins.yaml
index 6353e491c..d18e42880 100644
--- a/src/aiperf/plugin/plugins.yaml
+++ b/src/aiperf/plugin/plugins.yaml
@@ -1506,12 +1506,13 @@ accuracy_benchmark:
   math_500:
     class: aiperf.accuracy.benchmarks.math_500:Math500Benchmark
     description: |
-      MATH-500 benchmark with 500 curated mathematical reasoning problems
-      spanning algebra, geometry, number theory, and combinatorics.
+      MATH-500 benchmark, aligned with the trt-llm benchmark recipe's
+      lighteval-backed configuration (HuggingFaceH4/MATH-500 + lighteval
+      ``latex_gold_metric``). Gold is the full ``solution`` text;
+      ``LightevalLatexGrader`` extracts the boxed expression at grade time.
     metadata:
-      default_grader: math
+      default_grader: lighteval_latex
       default_n_shots: 0
-      is_implemented: false
 
   gpqa_diamond:
     class: aiperf.accuracy.benchmarks.gpqa_diamond:GPQADiamondBenchmark
diff --git a/tests/unit/accuracy/test_accuracy_config.py b/tests/unit/accuracy/test_accuracy_config.py
index 2beb4dc87..109e8e3fc 100644
--- a/tests/unit/accuracy/test_accuracy_config.py
+++ b/tests/unit/accuracy/test_accuracy_config.py
@@ -23,7 +23,6 @@
 # This branch (AIP-874) implements ``aime``, ``math``, and ``code_execution``,
 # so those names are absent from the stub lists.
 STUB_BENCHMARKS = (
-    "math_500",
     "gpqa_diamond",
     "lcb_codegeneration",
 )
diff --git a/tests/unit/accuracy/test_math_500_benchmark.py b/tests/unit/accuracy/test_math_500_benchmark.py
new file mode 100644
index 000000000..1dac7de6a
--- /dev/null
+++ b/tests/unit/accuracy/test_math_500_benchmark.py
@@ -0,0 +1,256 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""Unit tests for ``Math500Benchmark`` after lighteval alignment.
+
+The recipe's ``acc_bench_lighteval.py:math_500`` uses ``prompt_fn``
+which produces ``Doc(query=line["problem"], choices=[line["solution"]],
+gold_index=0)``. Aiperf mirrors this: prompt is bare problem text;
+ground_truth is the full solution (containing the boxed answer).
+"""
+
+from __future__ import annotations
+
+from typing import Any
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from aiperf.accuracy.benchmarks.math_500 import (
+    DEFAULT_GENERATION_SIZE,
+    TASK_NAME,
+    Math500Benchmark,
+)
+from aiperf.accuracy.models import BenchmarkProblem
+from aiperf.plugin.enums import AccuracyBenchmarkType, EndpointType
+from tests.unit.conftest import make_benchmark_run
+
+
+def _make_run():
+    return make_benchmark_run(
+        model_names=["test-model"],
+        endpoint_type=EndpointType.COMPLETIONS,
+        streaming=False,
+        accuracy={"benchmark": AccuracyBenchmarkType.MATH_500},
+    )
+
+
+def _make_row(
+    problem: str = "What is 1+1?",
+    solution: str = "The answer is $\\boxed{2}$.",
+    subject: str = "Algebra",
+    level: int | None = 1,
+) -> dict[str, Any]:
+    return {
+        "problem": problem,
+        "solution": solution,
+        "subject": subject,
+        "level": level,
+    }
+
+
+def _make_fake_dataset(rows: list[dict[str, Any]]) -> MagicMock:
+    ds = MagicMock()
+    ds.__iter__ = MagicMock(side_effect=lambda: iter(rows))
+    ds.__len__ = MagicMock(return_value=len(rows))
+    ds.__getitem__ = MagicMock(side_effect=lambda i: rows[i])
+    return ds
+
+
+class TestPromptIsBareProblemText:
+    @pytest.mark.asyncio
+    async def test_flat_prompt_is_problem_text(self) -> None:
+        rows = [_make_row(problem="Find x.")]
+        with patch(
+            "aiperf.accuracy.benchmarks.math_500.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = Math500Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert problems[0].prompt == "Find x."
+
+    @pytest.mark.asyncio
+    async def test_no_instruction_prefix(self) -> None:
+        rows = [_make_row(problem="Q?")]
+        with patch(
+            "aiperf.accuracy.benchmarks.math_500.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = Math500Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        prompt = problems[0].prompt
+        assert "Solve the following" not in prompt
+        assert "boxed" not in prompt
+        assert "Let's think" not in prompt
+
+    @pytest.mark.asyncio
+    async def test_chat_message_is_single_user_message(self) -> None:
+        rows = [_make_row()]
+        with patch(
+            "aiperf.accuracy.benchmarks.math_500.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = Math500Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        msgs = problems[0].raw_messages
+        assert msgs is not None
+        assert len(msgs) == 1
+        assert msgs[0]["role"] == "user"
+
+
+class TestGroundTruthIsFullSolution:
+    """Lighteval's ``prompt_fn`` puts ``line["solution"]`` in
+    ``choices[0]``; ``latex_gold_metric`` extracts the boxed answer
+    from it. Aiperf stores the full solution in
+    ``BenchmarkProblem.ground_truth`` so the lighteval grader can do
+    the same extraction at grade time."""
+
+    @pytest.mark.asyncio
+    async def test_ground_truth_is_full_solution(self) -> None:
+        rows = [_make_row(solution="Step one: simplify. Step two: \\boxed{42}.")]
+        with patch(
+            "aiperf.accuracy.benchmarks.math_500.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = Math500Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert problems[0].ground_truth == (
+            "Step one: simplify. Step two: \\boxed{42}."
+        )
+
+
+class TestTaskFieldIsSubject:
+    """Per-row ``subject`` becomes the ``task`` so the accuracy CSV
+    breaks down by MATH subject."""
+
+    @pytest.mark.asyncio
+    async def test_subject_used_as_task_name(self) -> None:
+        rows = [
+            _make_row(subject="Geometry"),
+            _make_row(subject="Algebra"),
+            _make_row(subject="Number Theory"),
+        ]
+        with patch(
+            "aiperf.accuracy.benchmarks.math_500.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = Math500Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert {p.task for p in problems} == {
+            "Geometry",
+            "Algebra",
+            "Number Theory",
+        }
+
+    @pytest.mark.asyncio
+    async def test_missing_subject_falls_back_to_task_name(self) -> None:
+        rows = [{"problem": "Q", "solution": "S"}]
+        with patch(
+            "aiperf.accuracy.benchmarks.math_500.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = Math500Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert problems[0].task == TASK_NAME
+
+
+class TestNShotsAndCoTAreIgnored:
+    @pytest.mark.asyncio
+    async def test_n_shots_argument_does_not_affect_prompt(self) -> None:
+        rows = [_make_row()]
+        with patch(
+            "aiperf.accuracy.benchmarks.math_500.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = Math500Benchmark(run=_make_run())
+            zero_shot = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+            five_shot = await bench.load_problems(
+                tasks=None, n_shots=5, enable_cot=False
+            )
+        assert zero_shot[0].prompt == five_shot[0].prompt
+
+    @pytest.mark.asyncio
+    async def test_enable_cot_does_not_affect_prompt(self) -> None:
+        rows = [_make_row()]
+        with patch(
+            "aiperf.accuracy.benchmarks.math_500.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = Math500Benchmark(run=_make_run())
+            no_cot = await bench.load_problems(tasks=None, n_shots=0, enable_cot=False)
+            with_cot = await bench.load_problems(tasks=None, n_shots=0, enable_cot=True)
+        assert no_cot[0].prompt == with_cot[0].prompt
+
+
+class TestLoadProblemsCore:
+    @pytest.mark.asyncio
+    async def test_returns_one_problem_per_row(self) -> None:
+        rows = [_make_row(problem=f"q{i}") for i in range(5)]
+        with patch(
+            "aiperf.accuracy.benchmarks.math_500.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = Math500Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert len(problems) == 5
+        assert all(isinstance(p, BenchmarkProblem) for p in problems)
+
+    @pytest.mark.asyncio
+    async def test_metadata_carries_subject_level_gen_size(self) -> None:
+        rows = [_make_row(subject="Geometry", level=4)]
+        with patch(
+            "aiperf.accuracy.benchmarks.math_500.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = Math500Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        meta = problems[0].metadata
+        assert meta["subject"] == "Geometry"
+        assert meta["level"] == 4
+        assert meta["generation_size"] == DEFAULT_GENERATION_SIZE
+        assert DEFAULT_GENERATION_SIZE == 32768
+
+
+class TestPathologicalDatasetRows:
+    @pytest.mark.asyncio
+    async def test_empty_dataset_returns_empty_list(self) -> None:
+        with patch(
+            "aiperf.accuracy.benchmarks.math_500.load_dataset",
+            return_value=_make_fake_dataset([]),
+        ):
+            bench = Math500Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert problems == []
+
+    @pytest.mark.asyncio
+    async def test_unicode_in_problem_preserved(self) -> None:
+        rows = [_make_row(problem="∫ x dx = ?")]
+        with patch(
+            "aiperf.accuracy.benchmarks.math_500.load_dataset",
+            return_value=_make_fake_dataset(rows),
+        ):
+            bench = Math500Benchmark(run=_make_run())
+            problems = await bench.load_problems(
+                tasks=None, n_shots=0, enable_cot=False
+            )
+        assert "∫" in problems[0].prompt

From 66144686a144a0b850205bd7ce6b8f8dd4ed23a1 Mon Sep 17 00:00:00 2001
From: Elias Bermudez <dbermudez@nvidia.com>
Date: Tue, 2 Jun 2026 10:18:20 -0700
Subject: [PATCH 2/3] test(accuracy): annotate _make_run return type in
 math_500 test (AIP-879)

`_make_run()` had no return annotation, leaving callers reading
implicit ``Any``. Add ``-> BenchmarkRun`` (via a ``TYPE_CHECKING``
import of ``aiperf.config.resolution.plan.BenchmarkRun``, so the
annotation is string-only thanks to ``from __future__ import
annotations`` already on this file and the runtime cost is zero).

Mirrors the same annotation pattern used in the
``test_bigbench_benchmark.py`` helper after AIP-878 review feedback.
Function body, ``make_benchmark_run`` call, and the
``EndpointType`` / ``AccuracyBenchmarkType`` imports are unchanged.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
---
 tests/unit/accuracy/test_math_500_benchmark.py | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tests/unit/accuracy/test_math_500_benchmark.py b/tests/unit/accuracy/test_math_500_benchmark.py
index 1dac7de6a..112bc68a2 100644
--- a/tests/unit/accuracy/test_math_500_benchmark.py
+++ b/tests/unit/accuracy/test_math_500_benchmark.py
@@ -11,7 +11,7 @@
 
 from __future__ import annotations
 
-from typing import Any
+from typing import TYPE_CHECKING, Any
 from unittest.mock import MagicMock, patch
 
 import pytest
@@ -25,8 +25,11 @@
 from aiperf.plugin.enums import AccuracyBenchmarkType, EndpointType
 from tests.unit.conftest import make_benchmark_run
 
+if TYPE_CHECKING:
+    from aiperf.config.resolution.plan import BenchmarkRun
 
-def _make_run():
+
+def _make_run() -> BenchmarkRun:
     return make_benchmark_run(
         model_names=["test-model"],
         endpoint_type=EndpointType.COMPLETIONS,

From 191a831ae5d7bf6453c61af9431cc7bbf185d686 Mon Sep 17 00:00:00 2001
From: Elias Bermudez <dbermudez@nvidia.com>
Date: Tue, 2 Jun 2026 10:21:40 -0700
Subject: [PATCH 3/3] fix(accuracy): coerce missing/None MATH-500 solution to
 empty string (AIP-879)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

``_build_problems`` was doing ``str(row.get(SOLUTION_FIELD, ""))``,
which only falls back to ``""`` when the key is absent. If a row
actually carries ``solution: None`` (key present, value ``None``),
``row.get`` returns ``None`` and ``str(None)`` produces the literal
four-character string ``"None"`` — which then propagates as
``BenchmarkProblem.ground_truth`` and would corrupt
``LightevalLatexGrader`` extraction at grade time.

Switch to ``row.get(SOLUTION_FIELD) or ""`` so both the absent-key
case and the present-but-None case collapse to ``""``, matching the
``or``-pattern already used a few lines below for ``task``
(``row.get(SUBJECT_FIELD) or TASK_NAME``).

Tradeoff: drops the implicit ``str()`` coercion, so a future
upstream schema regression that ships a non-string ``solution``
would now surface as a Pydantic ``ValidationError`` on
``BenchmarkProblem.ground_truth`` instead of being silently
stringified. Upstream ``HuggingFaceH4/MATH-500`` ships ``solution``
as a string, so this is the desired loud-failure mode.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
---
 src/aiperf/accuracy/benchmarks/math_500.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/aiperf/accuracy/benchmarks/math_500.py b/src/aiperf/accuracy/benchmarks/math_500.py
index 14f85bf73..ed63c56f4 100644
--- a/src/aiperf/accuracy/benchmarks/math_500.py
+++ b/src/aiperf/accuracy/benchmarks/math_500.py
@@ -95,7 +95,7 @@ def _build_problems(self, ds: Dataset) -> list[BenchmarkProblem]:
         problems: list[BenchmarkProblem] = []
         for row in ds:
             problem = row[PROBLEM_FIELD]
-            solution = str(row.get(SOLUTION_FIELD, ""))
+            solution = row.get(SOLUTION_FIELD) or ""
             messages: list[AccuracyChatMessage] = [{"role": "user", "content": problem}]
             problems.append(
                 BenchmarkProblem(