diff --git a/docs/accuracy/accuracy-benchmarking.md b/docs/accuracy/accuracy-benchmarking.md index 18e8dabab..adabfe136 100644 --- a/docs/accuracy/accuracy-benchmarking.md +++ b/docs/accuracy/accuracy-benchmarking.md @@ -76,6 +76,7 @@ system message). | `bigbench` | `exact_match` | 3 | `lukaemon/bbh` (trt-llm/DeepEval reference; 27 subtasks, canonical CoT/non-CoT prompt files) | | `aime24` | `lighteval_expr` | 0 | `HuggingFaceH4/aime_2024` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) | | `aime25` | `lighteval_expr` | 0 | `yentinglin/aime_2025` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) | +| `math_500` | `lighteval_latex` | 0 | `HuggingFaceH4/MATH-500` (trt-llm/lighteval reference, gold is full solution containing `\boxed{answer}`, `latex_gold_metric`) | ## CLI Flags diff --git a/docs/accuracy/accuracy_stubs.md b/docs/accuracy/accuracy_stubs.md index e02953cf6..aeafe8703 100644 --- a/docs/accuracy/accuracy_stubs.md +++ b/docs/accuracy/accuracy_stubs.md @@ -7,7 +7,7 @@ This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected. -**Status summary:** With the AIME25 loader landing on top of the AIME24 / BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, and `AIME25Benchmark` are fully implemented; the remaining benchmarks (`math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs. +**Status summary:** With the MATH-500 loader landing on top of the AIME25 / AIME24 / BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`, and `Math500Benchmark` are fully implemented; the remaining benchmarks (`gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs. ## Table of Contents @@ -175,14 +175,14 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method. | 4 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 | **IMPLEMENTED.** Loads `lukaemon/bbh` (27 BBH subtasks). Prompt rendering delegates to `deepeval.benchmarks.BigBenchHard`'s `BigBenchHardTemplate.generate_output`, which reads the 27 canonical CoT/shot prompt files DeepEval ships as package data. Pairs with `exact_match` for the recipe's strict `Scorer.exact_match_score` semantics. `default_n_shots=3`, `default_enable_cot=true`. Requires the `[accuracy]` extras (deepeval). | | 5 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `lighteval_expr` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/aime_2024` (train split) and emits the bare problem text as a single user message — no instruction prefix, no few-shot priming. Mirrors the trt-llm benchmark recipe's `acc_bench_lighteval.py` configuration (`few_shots_split=None`, `generation_size=32768`). Pairs with `lighteval_expr` for the recipe's `expr_gold_metric` extraction. | | 6 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `lighteval_expr` | 0 | **IMPLEMENTED.** Same lighteval-aligned shape as `AIME24Benchmark` but pointed at `yentinglin/aime_2025` (the recipe's `aime25` task config). Identical prompt rendering, generation size, and grader pairing. | +| 7 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `lighteval_latex` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/MATH-500` (test split). Same lighteval-aligned shape as AIME24/25, but `ground_truth` is the full `solution` text (containing `\boxed{answer}`); `LightevalLatexGrader` extracts the boxed expression at grade time. Per-row `task` = `subject` so the accuracy CSV breaks down by MATH subject. | ### Still Stubbed | # | Class | File | Plugin Key | Default Grader | Default N-Shots | |---|-------|------|------------|----------------|-----------------| -| 1 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 | -| 2 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 | -| 3 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 | +| 1 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 | +| 2 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 | **Each benchmark has 1 method to implement:** @@ -309,13 +309,13 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu | Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods | |-----------|-------------|---------------|------------------|-------------------| | Graders | 7 (all) | 0 | — | 0 | -| Benchmarks | 6 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`) | 3 | 1 (`load_problems`) | 3 | +| Benchmarks | 7 (incl. MMLU, AIME, HellaSwag, BigBench, AIME24, AIME25, Math500) | 2 | 1 (`load_problems`) | 2 | | Record Processor | 1 (`AccuracyRecordProcessor`) | 0 | — | 0 | | Results Processor | 1 (`AccuracyResultsProcessor`) | 0 | — | 0 | | Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 | — | 0 | | Data Exporter | 1 (`AccuracyDataExporter`) | 0 | — | 0 | | Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 | -| **Total** | **17** | **4** | | **4** | +| **Total** | **18** | **3** | | **3** | ### Self-Disabling Pattern @@ -323,12 +323,11 @@ Processors and exporters raise their `Disabled` exception **in `__init__`** when ### Suggested Implementation Order -The processors, exporters, all graders, and six benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`) are already wired end-to-end. The remaining work is the three stub benchmarks; mirror the existing loader whose grader matches: +The processors, exporters, all graders, and seven benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`, `Math500Benchmark`) are already wired end-to-end. The remaining work is the two stub benchmarks; mirror the existing loader whose grader matches: -1. **`math_500`** — mirror `AIME24Benchmark` (`benchmarks/aime24.py`) for the lighteval-aligned shape; pair with `lighteval_latex`. -2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader. -3. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader. -4. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported. +1. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader. +2. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader. +3. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported. ### Key Files for Reference diff --git a/src/aiperf/accuracy/benchmarks/math_500.py b/src/aiperf/accuracy/benchmarks/math_500.py index 04588fa4b..ed63c56f4 100644 --- a/src/aiperf/accuracy/benchmarks/math_500.py +++ b/src/aiperf/accuracy/benchmarks/math_500.py @@ -1,31 +1,117 @@ # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 +"""MATH-500 benchmark loader, aligned with the trt-llm lighteval reference. + +Mirrors ``acc_bench_lighteval.py:math_500``: + + math_500 = LightevalTaskConfig( + name="math_500", + prompt_function=prompt_fn, # query=line["problem"], choices=[line["solution"]] + hf_repo="HuggingFaceH4/MATH-500", + evaluation_splits=["test"], + few_shots_split=None, + generation_size=32768, + metric=[latex_gold_metric], + ) + +Two notable differences from the AIME24/AIME25 loaders: + +1. ``ground_truth`` is the full ``solution`` text (which contains a + ``\\boxed{answer}``), not a bare answer. ``LightevalLatexGrader``'s + ``LatexExtractionConfig`` extracts the boxed answer from the + solution at grade time. This matches the recipe's + ``latex_gold_metric.gold_extraction_target=(LatexExtractionConfig(),)``. +2. Pair with ``LightevalLatexGrader`` (default), not + ``LightevalExprGrader`` — gold answers are LaTeX expressions + (fractions, square roots, etc.). + +Reference: + trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:156 +""" + from __future__ import annotations -from typing import TYPE_CHECKING +import asyncio +from typing import TYPE_CHECKING, Any + +from datasets import Dataset, load_dataset -from aiperf.accuracy.models import BenchmarkProblem +from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem from aiperf.common.mixins import AIPerfLoggerMixin if TYPE_CHECKING: from aiperf.config.resolution.plan import BenchmarkRun +DATASET_NAME = "HuggingFaceH4/MATH-500" +TASK_NAME = "math_500" + +# lighteval's math_500 task config: ``generation_size=32768``. +DEFAULT_GENERATION_SIZE = 32768 + +# Schema field names in HuggingFaceH4/MATH-500. +PROBLEM_FIELD = "problem" +SOLUTION_FIELD = "solution" +SUBJECT_FIELD = "subject" +LEVEL_FIELD = "level" + class Math500Benchmark(AIPerfLoggerMixin): - """Registered placeholder for a future MATH-500 loader. + """MATH-500 lighteval-aligned benchmark loader. - `load_problems()` intentionally raises NotImplementedError in this release; - use the MMLU benchmark when a working accuracy loader is required. + Loads ``HuggingFaceH4/MATH-500`` (test split) and emits one user + message per problem containing the bare problem text — matching + lighteval's ``prompt_fn``. Gold is the full ``solution`` text; + ``LightevalLatexGrader`` extracts the boxed answer at grade time. """ - def __init__(self, run: BenchmarkRun, **kwargs) -> None: + def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None: super().__init__(**kwargs) self.run = run async def load_problems( self, tasks: list[str] | None, n_shots: int, enable_cot: bool ) -> list[BenchmarkProblem]: - raise NotImplementedError( - "math_500 benchmark is not yet implemented; only 'mmlu' is available in this release." - ) + """Load MATH-500 problems lighteval-style. + + Args: + tasks: Ignored — lighteval's MATH-500 task has no subtask + filtering (subjects are kept in metadata for reporting, + but lighteval evaluates the full split). Use the + aggregated CSV per-subject row to break results down + after the run. + n_shots: Ignored — the lighteval reference is zero-shot + (``few_shots_split=None``). + enable_cot: Ignored — lighteval's ``prompt_fn`` does not + add a CoT trigger. + + Returns: + One ``BenchmarkProblem`` per dataset row, in dataset order. + """ + ds: Dataset = await asyncio.to_thread(load_dataset, DATASET_NAME, split="test") + return await asyncio.to_thread(self._build_problems, ds) + + def _build_problems(self, ds: Dataset) -> list[BenchmarkProblem]: + problems: list[BenchmarkProblem] = [] + for row in ds: + problem = row[PROBLEM_FIELD] + solution = row.get(SOLUTION_FIELD) or "" + messages: list[AccuracyChatMessage] = [{"role": "user", "content": problem}] + problems.append( + BenchmarkProblem( + prompt=problem, + # Gold is the full solution containing \\boxed{answer}; + # LightevalLatexGrader extracts the boxed expression. + ground_truth=solution, + # Use ``subject`` as the per-row task so the + # accuracy CSV breaks down by MATH subject. + task=row.get(SUBJECT_FIELD) or TASK_NAME, + metadata={ + "subject": row.get(SUBJECT_FIELD, ""), + "level": row.get(LEVEL_FIELD), + "generation_size": DEFAULT_GENERATION_SIZE, + }, + raw_messages=messages, + ) + ) + return problems diff --git a/src/aiperf/plugin/plugins.yaml b/src/aiperf/plugin/plugins.yaml index 6353e491c..d18e42880 100644 --- a/src/aiperf/plugin/plugins.yaml +++ b/src/aiperf/plugin/plugins.yaml @@ -1506,12 +1506,13 @@ accuracy_benchmark: math_500: class: aiperf.accuracy.benchmarks.math_500:Math500Benchmark description: | - MATH-500 benchmark with 500 curated mathematical reasoning problems - spanning algebra, geometry, number theory, and combinatorics. + MATH-500 benchmark, aligned with the trt-llm benchmark recipe's + lighteval-backed configuration (HuggingFaceH4/MATH-500 + lighteval + ``latex_gold_metric``). Gold is the full ``solution`` text; + ``LightevalLatexGrader`` extracts the boxed expression at grade time. metadata: - default_grader: math + default_grader: lighteval_latex default_n_shots: 0 - is_implemented: false gpqa_diamond: class: aiperf.accuracy.benchmarks.gpqa_diamond:GPQADiamondBenchmark diff --git a/tests/unit/accuracy/test_accuracy_config.py b/tests/unit/accuracy/test_accuracy_config.py index 2beb4dc87..109e8e3fc 100644 --- a/tests/unit/accuracy/test_accuracy_config.py +++ b/tests/unit/accuracy/test_accuracy_config.py @@ -23,7 +23,6 @@ # This branch (AIP-874) implements ``aime``, ``math``, and ``code_execution``, # so those names are absent from the stub lists. STUB_BENCHMARKS = ( - "math_500", "gpqa_diamond", "lcb_codegeneration", ) diff --git a/tests/unit/accuracy/test_math_500_benchmark.py b/tests/unit/accuracy/test_math_500_benchmark.py new file mode 100644 index 000000000..112bc68a2 --- /dev/null +++ b/tests/unit/accuracy/test_math_500_benchmark.py @@ -0,0 +1,259 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""Unit tests for ``Math500Benchmark`` after lighteval alignment. + +The recipe's ``acc_bench_lighteval.py:math_500`` uses ``prompt_fn`` +which produces ``Doc(query=line["problem"], choices=[line["solution"]], +gold_index=0)``. Aiperf mirrors this: prompt is bare problem text; +ground_truth is the full solution (containing the boxed answer). +""" + +from __future__ import annotations + +from typing import TYPE_CHECKING, Any +from unittest.mock import MagicMock, patch + +import pytest + +from aiperf.accuracy.benchmarks.math_500 import ( + DEFAULT_GENERATION_SIZE, + TASK_NAME, + Math500Benchmark, +) +from aiperf.accuracy.models import BenchmarkProblem +from aiperf.plugin.enums import AccuracyBenchmarkType, EndpointType +from tests.unit.conftest import make_benchmark_run + +if TYPE_CHECKING: + from aiperf.config.resolution.plan import BenchmarkRun + + +def _make_run() -> BenchmarkRun: + return make_benchmark_run( + model_names=["test-model"], + endpoint_type=EndpointType.COMPLETIONS, + streaming=False, + accuracy={"benchmark": AccuracyBenchmarkType.MATH_500}, + ) + + +def _make_row( + problem: str = "What is 1+1?", + solution: str = "The answer is $\\boxed{2}$.", + subject: str = "Algebra", + level: int | None = 1, +) -> dict[str, Any]: + return { + "problem": problem, + "solution": solution, + "subject": subject, + "level": level, + } + + +def _make_fake_dataset(rows: list[dict[str, Any]]) -> MagicMock: + ds = MagicMock() + ds.__iter__ = MagicMock(side_effect=lambda: iter(rows)) + ds.__len__ = MagicMock(return_value=len(rows)) + ds.__getitem__ = MagicMock(side_effect=lambda i: rows[i]) + return ds + + +class TestPromptIsBareProblemText: + @pytest.mark.asyncio + async def test_flat_prompt_is_problem_text(self) -> None: + rows = [_make_row(problem="Find x.")] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + assert problems[0].prompt == "Find x." + + @pytest.mark.asyncio + async def test_no_instruction_prefix(self) -> None: + rows = [_make_row(problem="Q?")] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + prompt = problems[0].prompt + assert "Solve the following" not in prompt + assert "boxed" not in prompt + assert "Let's think" not in prompt + + @pytest.mark.asyncio + async def test_chat_message_is_single_user_message(self) -> None: + rows = [_make_row()] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + msgs = problems[0].raw_messages + assert msgs is not None + assert len(msgs) == 1 + assert msgs[0]["role"] == "user" + + +class TestGroundTruthIsFullSolution: + """Lighteval's ``prompt_fn`` puts ``line["solution"]`` in + ``choices[0]``; ``latex_gold_metric`` extracts the boxed answer + from it. Aiperf stores the full solution in + ``BenchmarkProblem.ground_truth`` so the lighteval grader can do + the same extraction at grade time.""" + + @pytest.mark.asyncio + async def test_ground_truth_is_full_solution(self) -> None: + rows = [_make_row(solution="Step one: simplify. Step two: \\boxed{42}.")] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + assert problems[0].ground_truth == ( + "Step one: simplify. Step two: \\boxed{42}." + ) + + +class TestTaskFieldIsSubject: + """Per-row ``subject`` becomes the ``task`` so the accuracy CSV + breaks down by MATH subject.""" + + @pytest.mark.asyncio + async def test_subject_used_as_task_name(self) -> None: + rows = [ + _make_row(subject="Geometry"), + _make_row(subject="Algebra"), + _make_row(subject="Number Theory"), + ] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + assert {p.task for p in problems} == { + "Geometry", + "Algebra", + "Number Theory", + } + + @pytest.mark.asyncio + async def test_missing_subject_falls_back_to_task_name(self) -> None: + rows = [{"problem": "Q", "solution": "S"}] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + assert problems[0].task == TASK_NAME + + +class TestNShotsAndCoTAreIgnored: + @pytest.mark.asyncio + async def test_n_shots_argument_does_not_affect_prompt(self) -> None: + rows = [_make_row()] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + zero_shot = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + five_shot = await bench.load_problems( + tasks=None, n_shots=5, enable_cot=False + ) + assert zero_shot[0].prompt == five_shot[0].prompt + + @pytest.mark.asyncio + async def test_enable_cot_does_not_affect_prompt(self) -> None: + rows = [_make_row()] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + no_cot = await bench.load_problems(tasks=None, n_shots=0, enable_cot=False) + with_cot = await bench.load_problems(tasks=None, n_shots=0, enable_cot=True) + assert no_cot[0].prompt == with_cot[0].prompt + + +class TestLoadProblemsCore: + @pytest.mark.asyncio + async def test_returns_one_problem_per_row(self) -> None: + rows = [_make_row(problem=f"q{i}") for i in range(5)] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + assert len(problems) == 5 + assert all(isinstance(p, BenchmarkProblem) for p in problems) + + @pytest.mark.asyncio + async def test_metadata_carries_subject_level_gen_size(self) -> None: + rows = [_make_row(subject="Geometry", level=4)] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + meta = problems[0].metadata + assert meta["subject"] == "Geometry" + assert meta["level"] == 4 + assert meta["generation_size"] == DEFAULT_GENERATION_SIZE + assert DEFAULT_GENERATION_SIZE == 32768 + + +class TestPathologicalDatasetRows: + @pytest.mark.asyncio + async def test_empty_dataset_returns_empty_list(self) -> None: + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset([]), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + assert problems == [] + + @pytest.mark.asyncio + async def test_unicode_in_problem_preserved(self) -> None: + rows = [_make_row(problem="∫ x dx = ?")] + with patch( + "aiperf.accuracy.benchmarks.math_500.load_dataset", + return_value=_make_fake_dataset(rows), + ): + bench = Math500Benchmark(run=_make_run()) + problems = await bench.load_problems( + tasks=None, n_shots=0, enable_cot=False + ) + assert "∫" in problems[0].prompt