Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/accuracy/accuracy-benchmarking.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,7 @@ system message).
| `bigbench` | `exact_match` | 3 | `lukaemon/bbh` (trt-llm/DeepEval reference; 27 subtasks, canonical CoT/non-CoT prompt files) |
| `aime24` | `lighteval_expr` | 0 | `HuggingFaceH4/aime_2024` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |
| `aime25` | `lighteval_expr` | 0 | `yentinglin/aime_2025` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |
| `math_500` | `lighteval_latex` | 0 | `HuggingFaceH4/MATH-500` (trt-llm/lighteval reference, gold is full solution containing `\boxed{answer}`, `latex_gold_metric`) |

## CLI Flags

Expand Down
21 changes: 10 additions & 11 deletions docs/accuracy/accuracy_stubs.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@

This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected.

**Status summary:** With the AIME25 loader landing on top of the AIME24 / BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, and `AIME25Benchmark` are fully implemented; the remaining benchmarks (`math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
**Status summary:** With the MATH-500 loader landing on top of the AIME25 / AIME24 / BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`, and `Math500Benchmark` are fully implemented; the remaining benchmarks (`gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.

## Table of Contents

Expand Down Expand Up @@ -175,14 +175,14 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method.
| 4 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 | **IMPLEMENTED.** Loads `lukaemon/bbh` (27 BBH subtasks). Prompt rendering delegates to `deepeval.benchmarks.BigBenchHard`'s `BigBenchHardTemplate.generate_output`, which reads the 27 canonical CoT/shot prompt files DeepEval ships as package data. Pairs with `exact_match` for the recipe's strict `Scorer.exact_match_score` semantics. `default_n_shots=3`, `default_enable_cot=true`. Requires the `[accuracy]` extras (deepeval). |
| 5 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `lighteval_expr` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/aime_2024` (train split) and emits the bare problem text as a single user message — no instruction prefix, no few-shot priming. Mirrors the trt-llm benchmark recipe's `acc_bench_lighteval.py` configuration (`few_shots_split=None`, `generation_size=32768`). Pairs with `lighteval_expr` for the recipe's `expr_gold_metric` extraction. |
| 6 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `lighteval_expr` | 0 | **IMPLEMENTED.** Same lighteval-aligned shape as `AIME24Benchmark` but pointed at `yentinglin/aime_2025` (the recipe's `aime25` task config). Identical prompt rendering, generation size, and grader pairing. |
| 7 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `lighteval_latex` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/MATH-500` (test split). Same lighteval-aligned shape as AIME24/25, but `ground_truth` is the full `solution` text (containing `\boxed{answer}`); `LightevalLatexGrader` extracts the boxed expression at grade time. Per-row `task` = `subject` so the accuracy CSV breaks down by MATH subject. |

### Still Stubbed

| # | Class | File | Plugin Key | Default Grader | Default N-Shots |
|---|-------|------|------------|----------------|-----------------|
| 1 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
| 2 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
| 3 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
| 1 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
| 2 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |

**Each benchmark has 1 method to implement:**

Expand Down Expand Up @@ -309,26 +309,25 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu
| Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods |
|-----------|-------------|---------------|------------------|-------------------|
| Graders | 7 (all) | 0 | — | 0 |
| Benchmarks | 6 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`) | 3 | 1 (`load_problems`) | 3 |
| Benchmarks | 7 (incl. MMLU, AIME, HellaSwag, BigBench, AIME24, AIME25, Math500) | 2 | 1 (`load_problems`) | 2 |
| Record Processor | 1 (`AccuracyRecordProcessor`) | 0 | — | 0 |
| Results Processor | 1 (`AccuracyResultsProcessor`) | 0 | — | 0 |
| Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 | — | 0 |
| Data Exporter | 1 (`AccuracyDataExporter`) | 0 | — | 0 |
| Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 |
| **Total** | **17** | **4** | | **4** |
| **Total** | **18** | **3** | | **3** |

### Self-Disabling Pattern

Processors and exporters raise their `Disabled` exception **in `__init__`** when accuracy is off. The existing framework catches these and silently skips the plugin. No code changes needed to support this — it uses the same pattern as `RawRecordWriterProcessor` and `ServerMetricsCsvExporter`.

### Suggested Implementation Order

The processors, exporters, all graders, and six benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`) are already wired end-to-end. The remaining work is the three stub benchmarks; mirror the existing loader whose grader matches:
The processors, exporters, all graders, and seven benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`, `Math500Benchmark`) are already wired end-to-end. The remaining work is the two stub benchmarks; mirror the existing loader whose grader matches:

1. **`math_500`** — mirror `AIME24Benchmark` (`benchmarks/aime24.py`) for the lighteval-aligned shape; pair with `lighteval_latex`.
2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader.
3. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
4. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.
1. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader.
2. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
3. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.

### Key Files for Reference

Expand Down
104 changes: 95 additions & 9 deletions src/aiperf/accuracy/benchmarks/math_500.py
Original file line number Diff line number Diff line change
@@ -1,31 +1,117 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""MATH-500 benchmark loader, aligned with the trt-llm lighteval reference.

Mirrors ``acc_bench_lighteval.py:math_500``:

math_500 = LightevalTaskConfig(
name="math_500",
prompt_function=prompt_fn, # query=line["problem"], choices=[line["solution"]]
hf_repo="HuggingFaceH4/MATH-500",
evaluation_splits=["test"],
few_shots_split=None,
generation_size=32768,
metric=[latex_gold_metric],
)

Two notable differences from the AIME24/AIME25 loaders:

1. ``ground_truth`` is the full ``solution`` text (which contains a
``\\boxed{answer}``), not a bare answer. ``LightevalLatexGrader``'s
``LatexExtractionConfig`` extracts the boxed answer from the
solution at grade time. This matches the recipe's
``latex_gold_metric.gold_extraction_target=(LatexExtractionConfig(),)``.
2. Pair with ``LightevalLatexGrader`` (default), not
``LightevalExprGrader`` — gold answers are LaTeX expressions
(fractions, square roots, etc.).

Reference:
trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:156
"""

from __future__ import annotations

from typing import TYPE_CHECKING
import asyncio
from typing import TYPE_CHECKING, Any

from datasets import Dataset, load_dataset

from aiperf.accuracy.models import BenchmarkProblem
from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem
from aiperf.common.mixins import AIPerfLoggerMixin

if TYPE_CHECKING:
from aiperf.config.resolution.plan import BenchmarkRun

DATASET_NAME = "HuggingFaceH4/MATH-500"
TASK_NAME = "math_500"

# lighteval's math_500 task config: ``generation_size=32768``.
DEFAULT_GENERATION_SIZE = 32768

# Schema field names in HuggingFaceH4/MATH-500.
PROBLEM_FIELD = "problem"
SOLUTION_FIELD = "solution"
SUBJECT_FIELD = "subject"
LEVEL_FIELD = "level"


class Math500Benchmark(AIPerfLoggerMixin):
"""Registered placeholder for a future MATH-500 loader.
"""MATH-500 lighteval-aligned benchmark loader.

`load_problems()` intentionally raises NotImplementedError in this release;
use the MMLU benchmark when a working accuracy loader is required.
Loads ``HuggingFaceH4/MATH-500`` (test split) and emits one user
message per problem containing the bare problem text — matching
lighteval's ``prompt_fn``. Gold is the full ``solution`` text;
``LightevalLatexGrader`` extracts the boxed answer at grade time.
"""

def __init__(self, run: BenchmarkRun, **kwargs) -> None:
def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
super().__init__(**kwargs)
self.run = run

async def load_problems(
self, tasks: list[str] | None, n_shots: int, enable_cot: bool
) -> list[BenchmarkProblem]:
raise NotImplementedError(
"math_500 benchmark is not yet implemented; only 'mmlu' is available in this release."
)
"""Load MATH-500 problems lighteval-style.

Args:
tasks: Ignored — lighteval's MATH-500 task has no subtask
filtering (subjects are kept in metadata for reporting,
but lighteval evaluates the full split). Use the
aggregated CSV per-subject row to break results down
after the run.
n_shots: Ignored — the lighteval reference is zero-shot
(``few_shots_split=None``).
enable_cot: Ignored — lighteval's ``prompt_fn`` does not
add a CoT trigger.

Returns:
One ``BenchmarkProblem`` per dataset row, in dataset order.
"""
ds: Dataset = await asyncio.to_thread(load_dataset, DATASET_NAME, split="test")
return await asyncio.to_thread(self._build_problems, ds)

def _build_problems(self, ds: Dataset) -> list[BenchmarkProblem]:
problems: list[BenchmarkProblem] = []
for row in ds:
problem = row[PROBLEM_FIELD]
solution = row.get(SOLUTION_FIELD) or ""
messages: list[AccuracyChatMessage] = [{"role": "user", "content": problem}]
problems.append(
BenchmarkProblem(
prompt=problem,
# Gold is the full solution containing \\boxed{answer};
# LightevalLatexGrader extracts the boxed expression.
ground_truth=solution,
# Use ``subject`` as the per-row task so the
# accuracy CSV breaks down by MATH subject.
task=row.get(SUBJECT_FIELD) or TASK_NAME,
metadata={
"subject": row.get(SUBJECT_FIELD, ""),
"level": row.get(LEVEL_FIELD),
"generation_size": DEFAULT_GENERATION_SIZE,
},
raw_messages=messages,
)
)
return problems
9 changes: 5 additions & 4 deletions src/aiperf/plugin/plugins.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -1506,12 +1506,13 @@ accuracy_benchmark:
math_500:
class: aiperf.accuracy.benchmarks.math_500:Math500Benchmark
description: |
MATH-500 benchmark with 500 curated mathematical reasoning problems
spanning algebra, geometry, number theory, and combinatorics.
MATH-500 benchmark, aligned with the trt-llm benchmark recipe's
lighteval-backed configuration (HuggingFaceH4/MATH-500 + lighteval
``latex_gold_metric``). Gold is the full ``solution`` text;
``LightevalLatexGrader`` extracts the boxed expression at grade time.
metadata:
default_grader: math
default_grader: lighteval_latex
default_n_shots: 0
is_implemented: false

gpqa_diamond:
class: aiperf.accuracy.benchmarks.gpqa_diamond:GPQADiamondBenchmark
Expand Down
1 change: 0 additions & 1 deletion tests/unit/accuracy/test_accuracy_config.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,6 @@
# This branch (AIP-874) implements ``aime``, ``math``, and ``code_execution``,
# so those names are absent from the stub lists.
STUB_BENCHMARKS = (
"math_500",
"gpqa_diamond",
"lcb_codegeneration",
)
Expand Down
Loading
Loading