Merged
Changes from 3 commits
1 change: 1 addition & 0 deletions docs/accuracy/accuracy-benchmarking.md
@@ -76,6 +76,7 @@ system message).
| `bigbench` | `exact_match` | 3 | `lukaemon/bbh` (trt-llm/DeepEval reference; 27 subtasks, canonical CoT/non-CoT prompt files) |
| `aime24` | `lighteval_expr` | 0 | `HuggingFaceH4/aime_2024` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |
| `aime25` | `lighteval_expr` | 0 | `yentinglin/aime_2025` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |
+| `math_500` | `lighteval_latex` | 0 | `HuggingFaceH4/MATH-500` (trt-llm/lighteval reference, gold is full solution containing `\boxed{answer}`, `latex_gold_metric`) |

## CLI Flags

21 changes: 10 additions & 11 deletions docs/accuracy/accuracy_stubs.md
@@ -7,7 +7,7 @@

This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected.

-**Status summary:** With the AIME25 loader landing on top of the AIME24 / BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, and `AIME25Benchmark` are fully implemented; the remaining benchmarks (`math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
+**Status summary:** With the MATH-500 loader landing on top of the AIME25 / AIME24 / BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`, and `Math500Benchmark` are fully implemented; the remaining benchmarks (`gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.

## Table of Contents

@@ -175,14 +175,14 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method.
| 4 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 | **IMPLEMENTED.** Loads `lukaemon/bbh` (27 BBH subtasks). Prompt rendering delegates to `deepeval.benchmarks.BigBenchHard`'s `BigBenchHardTemplate.generate_output`, which reads the 27 canonical CoT/shot prompt files DeepEval ships as package data. Pairs with `exact_match` for the recipe's strict `Scorer.exact_match_score` semantics. `default_n_shots=3`, `default_enable_cot=true`. Requires the `[accuracy]` extras (deepeval). |
| 5 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `lighteval_expr` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/aime_2024` (train split) and emits the bare problem text as a single user message — no instruction prefix, no few-shot priming. Mirrors the trt-llm benchmark recipe's `acc_bench_lighteval.py` configuration (`few_shots_split=None`, `generation_size=32768`). Pairs with `lighteval_expr` for the recipe's `expr_gold_metric` extraction. |
| 6 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `lighteval_expr` | 0 | **IMPLEMENTED.** Same lighteval-aligned shape as `AIME24Benchmark` but pointed at `yentinglin/aime_2025` (the recipe's `aime25` task config). Identical prompt rendering, generation size, and grader pairing. |
+| 7 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `lighteval_latex` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/MATH-500` (test split). Same lighteval-aligned shape as AIME24/25, but `ground_truth` is the full `solution` text (containing `\boxed{answer}`); `LightevalLatexGrader` extracts the boxed expression at grade time. Per-row `task` = `subject` so the accuracy CSV breaks down by MATH subject. |
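
Because each row's `task` is set to its `subject`, a per-subject breakdown amounts to a group-by over graded results. The following is a minimal sketch, assuming graded results are available as `(task, is_correct)` pairs. That shape, the `accuracy_by_task` helper, and the sample rows are illustrative assumptions, not the actual `AccuracyResultsProcessor` interface or real results.

```python
from collections import defaultdict


def accuracy_by_task(results):
    """Map each task name to its fraction of correct answers.

    ``results`` is an iterable of ``(task, is_correct)`` pairs (assumed shape).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, is_correct in results:
        total[task] += 1
        correct[task] += int(is_correct)
    return {task: correct[task] / total[task] for task in total}


# Made-up sample rows, for illustration only.
graded = [
    ("Algebra", True),
    ("Algebra", False),
    ("Geometry", True),
]
print(accuracy_by_task(graded))
```

Rows whose `subject` is missing fall back to the single `math_500` task in the loader, so they would show up under that key in a breakdown like this one.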

### Still Stubbed

| # | Class | File | Plugin Key | Default Grader | Default N-Shots |
|---|-------|------|------------|----------------|-----------------|
-| 1 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
-| 2 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
-| 3 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
+| 1 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
+| 2 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |

**Each benchmark has 1 method to implement:**

@@ -309,26 +309,25 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu
| Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods |
|-----------|-------------|---------------|------------------|-------------------|
| Graders | 7 (all) | 0 | — | 0 |
-| Benchmarks | 6 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`) | 3 | 1 (`load_problems`) | 3 |
+| Benchmarks | 7 (incl. MMLU, AIME, HellaSwag, BigBench, AIME24, AIME25, Math500) | 2 | 1 (`load_problems`) | 2 |
| Record Processor | 1 (`AccuracyRecordProcessor`) | 0 | — | 0 |
| Results Processor | 1 (`AccuracyResultsProcessor`) | 0 | — | 0 |
| Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 | — | 0 |
| Data Exporter | 1 (`AccuracyDataExporter`) | 0 | — | 0 |
| Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 |
-| **Total** | **17** | **4** | | **4** |
+| **Total** | **18** | **3** | | **3** |

### Self-Disabling Pattern

Processors and exporters raise their `Disabled` exception **in `__init__`** when accuracy is off. The existing framework catches these and silently skips the plugin. No code changes needed to support this — it uses the same pattern as `RawRecordWriterProcessor` and `ServerMetricsCsvExporter`.
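
The self-disabling pattern can be sketched roughly as follows. Every name here (`PluginDisabled`, `AccuracyExporterSketch`, `build_enabled_plugins`) is hypothetical and stands in for AIPerf's real exception classes and plugin loader, which this diff does not show.

```python
class PluginDisabled(Exception):
    """Hypothetical stand-in for a plugin's ``Disabled`` exception."""


class AccuracyExporterSketch:
    """Hypothetical exporter that opts out when accuracy is off."""

    def __init__(self, accuracy_enabled: bool) -> None:
        if not accuracy_enabled:
            # Raising in __init__ means no half-built instance escapes.
            raise PluginDisabled("accuracy benchmarking is not enabled")
        self.accuracy_enabled = accuracy_enabled


def build_enabled_plugins(plugin_classes, **init_kwargs):
    """Instantiate each plugin, silently skipping ones that disable themselves."""
    plugins = []
    for cls in plugin_classes:
        try:
            plugins.append(cls(**init_kwargs))
        except PluginDisabled:
            continue  # the plugin opted out; nothing else to clean up
    return plugins


print(len(build_enabled_plugins([AccuracyExporterSketch], accuracy_enabled=False)))
print(len(build_enabled_plugins([AccuracyExporterSketch], accuracy_enabled=True)))
```

Keeping the check in `__init__` lets the caller treat every plugin uniformly: it never needs to know which features are switched on, only how to catch the opt-out exception.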

### Suggested Implementation Order

-The processors, exporters, all graders, and six benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`) are already wired end-to-end. The remaining work is the three stub benchmarks; mirror the existing loader whose grader matches:
+The processors, exporters, all graders, and seven benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`, `Math500Benchmark`) are already wired end-to-end. The remaining work is the two stub benchmarks; mirror the existing loader whose grader matches:

-1. **`math_500`** — mirror `AIME24Benchmark` (`benchmarks/aime24.py`) for the lighteval-aligned shape; pair with `lighteval_latex`.
-2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader.
-3. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
-4. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.
+1. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader.
+2. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
+3. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.
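
For the validator step, here is a rough sketch of what rejecting stubbed plugin names might look like. The real `AccuracyConfig._reject_stub_plugins()` is not shown in this diff, so the function below, its signature, and its error message are assumptions; only the two remaining stub names come from the Still Stubbed table.

```python
# The two benchmarks still stubbed after this change.
STUB_BENCHMARKS = ("gpqa_diamond", "lcb_codegeneration")


def reject_stub_benchmark(name: str, stubs: tuple[str, ...] = STUB_BENCHMARKS) -> str:
    """Return ``name`` unchanged, or raise if it refers to a stubbed benchmark.

    Hypothetical helper; not the actual AccuracyConfig validator.
    """
    if name in stubs:
        raise ValueError(f"benchmark '{name}' is not implemented yet")
    return name


# math_500 passes now that it is no longer in the stub list.
print(reject_stub_benchmark("math_500"))
```

Driving the check from a single tuple means promoting a benchmark from stubbed to supported is a one-line removal, which is the same shape as the `STUB_BENCHMARKS` change in the unit test.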

### Key Files for Reference

104 changes: 95 additions & 9 deletions src/aiperf/accuracy/benchmarks/math_500.py
@@ -1,31 +1,117 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0

+"""MATH-500 benchmark loader, aligned with the trt-llm lighteval reference.
+
+Mirrors ``acc_bench_lighteval.py:math_500``:
+
+    math_500 = LightevalTaskConfig(
+        name="math_500",
+        prompt_function=prompt_fn,  # query=line["problem"], choices=[line["solution"]]
+        hf_repo="HuggingFaceH4/MATH-500",
+        evaluation_splits=["test"],
+        few_shots_split=None,
+        generation_size=32768,
+        metric=[latex_gold_metric],
+    )
+
+Two notable differences from the AIME24/AIME25 loaders:
+
+1. ``ground_truth`` is the full ``solution`` text (which contains a
+   ``\\boxed{answer}``), not a bare answer. ``LightevalLatexGrader``'s
+   ``LatexExtractionConfig`` extracts the boxed answer from the
+   solution at grade time. This matches the recipe's
+   ``latex_gold_metric.gold_extraction_target=(LatexExtractionConfig(),)``.
+2. Pair with ``LightevalLatexGrader`` (default), not
+   ``LightevalExprGrader`` — gold answers are LaTeX expressions
+   (fractions, square roots, etc.).
+
+Reference:
+    trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:156
+"""
+
 from __future__ import annotations

-from typing import TYPE_CHECKING
+import asyncio
+from typing import TYPE_CHECKING, Any

+from datasets import Dataset, load_dataset
+
-from aiperf.accuracy.models import BenchmarkProblem
+from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem
 from aiperf.common.mixins import AIPerfLoggerMixin

 if TYPE_CHECKING:
     from aiperf.config.resolution.plan import BenchmarkRun

+DATASET_NAME = "HuggingFaceH4/MATH-500"
+TASK_NAME = "math_500"
+
+# lighteval's math_500 task config: ``generation_size=32768``.
+DEFAULT_GENERATION_SIZE = 32768
+
+# Schema field names in HuggingFaceH4/MATH-500.
+PROBLEM_FIELD = "problem"
+SOLUTION_FIELD = "solution"
+SUBJECT_FIELD = "subject"
+LEVEL_FIELD = "level"
+

 class Math500Benchmark(AIPerfLoggerMixin):
-    """Registered placeholder for a future MATH-500 loader.
+    """MATH-500 lighteval-aligned benchmark loader.

-    `load_problems()` intentionally raises NotImplementedError in this release;
-    use the MMLU benchmark when a working accuracy loader is required.
+    Loads ``HuggingFaceH4/MATH-500`` (test split) and emits one user
+    message per problem containing the bare problem text — matching
+    lighteval's ``prompt_fn``. Gold is the full ``solution`` text;
+    ``LightevalLatexGrader`` extracts the boxed answer at grade time.
     """

-    def __init__(self, run: BenchmarkRun, **kwargs) -> None:
+    def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
         super().__init__(**kwargs)
         self.run = run

     async def load_problems(
         self, tasks: list[str] | None, n_shots: int, enable_cot: bool
     ) -> list[BenchmarkProblem]:
-        raise NotImplementedError(
-            "math_500 benchmark is not yet implemented; only 'mmlu' is available in this release."
-        )
+        """Load MATH-500 problems lighteval-style.
+
+        Args:
+            tasks: Ignored — lighteval's MATH-500 task has no subtask
+                filtering (subjects are kept in metadata for reporting,
+                but lighteval evaluates the full split). Use the
+                aggregated CSV per-subject row to break results down
+                after the run.
+            n_shots: Ignored — the lighteval reference is zero-shot
+                (``few_shots_split=None``).
+            enable_cot: Ignored — lighteval's ``prompt_fn`` does not
+                add a CoT trigger.
+
+        Returns:
+            One ``BenchmarkProblem`` per dataset row, in dataset order.
+        """
+        ds: Dataset = await asyncio.to_thread(load_dataset, DATASET_NAME, split="test")
+        return await asyncio.to_thread(self._build_problems, ds)
+
+    def _build_problems(self, ds: Dataset) -> list[BenchmarkProblem]:
+        problems: list[BenchmarkProblem] = []
+        for row in ds:
+            problem = row[PROBLEM_FIELD]
+            solution = row.get(SOLUTION_FIELD) or ""
+            messages: list[AccuracyChatMessage] = [{"role": "user", "content": problem}]
+            problems.append(
+                BenchmarkProblem(
+                    prompt=problem,
+                    # Gold is the full solution containing \\boxed{answer};
+                    # LightevalLatexGrader extracts the boxed expression.
+                    ground_truth=solution,
+                    # Use ``subject`` as the per-row task so the
+                    # accuracy CSV breaks down by MATH subject.
+                    task=row.get(SUBJECT_FIELD) or TASK_NAME,
+                    metadata={
+                        "subject": row.get(SUBJECT_FIELD, ""),
+                        "level": row.get(LEVEL_FIELD),
+                        "generation_size": DEFAULT_GENERATION_SIZE,
+                    },
+                    raw_messages=messages,
+                )
+            )
+        return problems
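
To illustrate why the loader can store the whole `solution` string as gold, here is a minimal sketch of pulling the last `\boxed{...}` expression out of a solution by matching braces. `extract_last_boxed` is a hypothetical helper written for this note. It is not lighteval's `LatexExtractionConfig` logic, which does considerably more (normalization and parsing of the LaTeX), and the sample solution string is made up.

```python
from __future__ import annotations


def extract_last_boxed(text: str) -> str | None:
    """Return the contents of the last ``\\boxed{...}`` in ``text``, or None.

    Hypothetical helper for illustration only; not lighteval's extractor.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    content_start = start + len(marker)
    depth = 0
    # Begin at the opening brace of \boxed{ and walk until it is closed,
    # so nested groups such as \frac{14}{3} stay intact.
    for i in range(content_start - 1, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i]
    return None  # unbalanced braces


# Made-up solution text in the style of a MATH-500 ``solution`` field.
solution = "Adding the fractions gives $\\boxed{\\frac{14}{3}}$."
print(extract_last_boxed(solution))
```

A plain regular expression would stop at the first closing brace and return a truncated answer for nested LaTeX, which is why the sketch counts brace depth instead.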
9 changes: 5 additions & 4 deletions src/aiperf/plugin/plugins.yaml
@@ -1506,12 +1506,13 @@ accuracy_benchmark:
   math_500:
     class: aiperf.accuracy.benchmarks.math_500:Math500Benchmark
     description: |
-      MATH-500 benchmark with 500 curated mathematical reasoning problems
-      spanning algebra, geometry, number theory, and combinatorics.
+      MATH-500 benchmark, aligned with the trt-llm benchmark recipe's
+      lighteval-backed configuration (HuggingFaceH4/MATH-500 + lighteval
+      ``latex_gold_metric``). Gold is the full ``solution`` text;
+      ``LightevalLatexGrader`` extracts the boxed expression at grade time.
     metadata:
-      default_grader: math
+      default_grader: lighteval_latex
       default_n_shots: 0
-      is_implemented: false

   gpqa_diamond:
     class: aiperf.accuracy.benchmarks.gpqa_diamond:GPQADiamondBenchmark
1 change: 0 additions & 1 deletion tests/unit/accuracy/test_accuracy_config.py
@@ -23,7 +23,6 @@
 # This branch (AIP-874) implements ``aime``, ``math``, and ``code_execution``,
 # so those names are absent from the stub lists.
 STUB_BENCHMARKS = (
-    "math_500",
     "gpqa_diamond",
     "lcb_codegeneration",
 )