Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/accuracy/accuracy-benchmarking.md
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,7 @@ system message).
| `aime` | `math` | 8 | `Maxwell-Jia/AIME_2024` (trt-llm reference, 8-shot CoT) |
| `hellaswag` | `exact_match` | 10 | `Rowan/hellaswag` (trt-llm/DeepEval reference; one few-shot per unique activity_label) |
| `bigbench` | `exact_match` | 3 | `lukaemon/bbh` (trt-llm/DeepEval reference; 27 subtasks, canonical CoT/non-CoT prompt files) |
| `aime24` | `lighteval_expr` | 0 | `HuggingFaceH4/aime_2024` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |

## CLI Flags

Expand Down
22 changes: 11 additions & 11 deletions docs/accuracy/accuracy_stubs.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@

This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected.

**Status summary:** With the BigBench-Hard loader landing on top of the HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, and `BigBenchBenchmark` are fully implemented; the remaining benchmarks (`aime24`, `aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
**Status summary:** With the AIME24 loader landing on top of the BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, and `AIME24Benchmark` are fully implemented; the remaining benchmarks (`aime25`, `math_500`, `gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.

## Table of Contents

Expand Down Expand Up @@ -173,16 +173,16 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method.
| 2 | `AIMEBenchmark` | `benchmarks/aime.py` | `aime` | `math` | 8 | **IMPLEMENTED.** Loads `Maxwell-Jia/AIME_2024`, instructs the model to wrap its final integer in `\boxed{}`, supports few-shot priming and chain-of-thought. `default_enable_cot=true`. |
| 3 | `HellaSwagBenchmark` | `benchmarks/hellaswag.py` | `hellaswag` | `exact_match` | 10 | **IMPLEMENTED.** Loads `Rowan/hellaswag` (validation split filtered per task by `activity_label`; train split feeds the "one few-shot per unique activity_label" rule). Prompt rendering delegates to `deepeval.benchmarks.HellaSwag`'s `HellaSwagTemplate.generate_output`, so output is byte-equal to the trt-llm recipe's DeepEval-backed path. Pairs with `exact_match` for strict `Scorer.exact_match_score` semantics. Requires the `[accuracy]` extras (deepeval). |
| 4 | `BigBenchBenchmark` | `benchmarks/bigbench.py` | `bigbench` | `exact_match` | 3 | **IMPLEMENTED.** Loads `lukaemon/bbh` (27 BBH subtasks). Prompt rendering delegates to `deepeval.benchmarks.BigBenchHard`'s `BigBenchHardTemplate.generate_output`, which reads the 27 canonical CoT/shot prompt files DeepEval ships as package data. Pairs with `exact_match` for the recipe's strict `Scorer.exact_match_score` semantics. `default_n_shots=3`, `default_enable_cot=true`. Requires the `[accuracy]` extras (deepeval). |
| 5 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `lighteval_expr` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/aime_2024` (train split) and emits the bare problem text as a single user message — no instruction prefix, no few-shot priming. Mirrors the trt-llm benchmark recipe's `acc_bench_lighteval.py` configuration (`few_shots_split=None`, `generation_size=32768`). Pairs with `lighteval_expr` for the recipe's `expr_gold_metric` extraction. |

### Still Stubbed

| # | Class | File | Plugin Key | Default Grader | Default N-Shots |
|---|-------|------|------------|----------------|-----------------|
| 1 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `math` | 0 |
| 2 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
| 3 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
| 4 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
| 5 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
| 1 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `math` | 0 |
| 2 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `math` | 0 |
| 3 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
| 4 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |

**Each benchmark has 1 method to implement:**

Expand Down Expand Up @@ -309,24 +309,24 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu
| Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods |
|-----------|-------------|---------------|------------------|-------------------|
| Graders | 7 (all) | 0 | — | 0 |
| Benchmarks | 4 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`) | 5 | 1 (`load_problems`) | 5 |
| Benchmarks | 5 (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`) | 4 | 1 (`load_problems`) | 4 |
| Record Processor | 1 (`AccuracyRecordProcessor`) | 0 | — | 0 |
| Results Processor | 1 (`AccuracyResultsProcessor`) | 0 | — | 0 |
| Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 | — | 0 |
| Data Exporter | 1 (`AccuracyDataExporter`) | 0 | — | 0 |
| Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 |
| **Total** | **15** | **6** | | **6** |
| **Total** | **16** | **5** | | **5** |

### Self-Disabling Pattern

Processors and exporters raise their `Disabled` exception **in `__init__`** when accuracy is off. The existing framework catches these and silently skips the plugin. No code changes needed to support this — it uses the same pattern as `RawRecordWriterProcessor` and `ServerMetricsCsvExporter`.

### Suggested Implementation Order

The processors, exporters, all graders, and four benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`) are already wired end-to-end. The remaining work is the five stub benchmarks; mirror the existing loader whose grader matches:
The processors, exporters, all graders, and five benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`) are already wired end-to-end. The remaining work is the four stub benchmarks; mirror the existing loader whose grader matches:

1. **`aime24`, `aime25`, `math_500`** — mirror `AIMEBenchmark` (`benchmarks/aime.py`); pair with the `math` grader.
2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `multiple_choice` grader.
1. **`aime25`, `math_500`** — mirror `AIME24Benchmark` (`benchmarks/aime24.py`) for the lighteval-aligned shape; pair with `lighteval_expr` (aime25) or `lighteval_latex` (math_500).
2. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader.
3. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
4. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.

Expand Down
95 changes: 86 additions & 9 deletions src/aiperf/accuracy/benchmarks/aime24.py
Original file line number Diff line number Diff line change
@@ -1,31 +1,108 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""AIME 2024 benchmark loader, aligned with the trt-llm lighteval reference.

Mirrors the recipe's ``acc_bench_lighteval.py`` configuration:

aime24 = LightevalTaskConfig(
name="aime24",
prompt_function=aime_prompt_fn,
hf_repo="HuggingFaceH4/aime_2024",
hf_subset="default",
evaluation_splits=["train"],
few_shots_split=None,
few_shots_select=None,
generation_size=32768,
metric=[expr_gold_metric],
)

The recipe's ``aime_prompt_fn`` produces a ``Doc`` whose ``query`` is
the bare problem text — lighteval's prompt manager wraps it as a
single user message with no instruction prefix and no few-shot
priming (``few_shots_split=None``). We emit prompts the same way.
Pair with ``LightevalExprGrader`` for the recipe's ``expr_gold_metric``
extraction.

Reference:
trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:128
"""

from __future__ import annotations

from typing import TYPE_CHECKING
import asyncio
from typing import TYPE_CHECKING, Any

from datasets import Dataset, load_dataset

from aiperf.accuracy.models import BenchmarkProblem
from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem
from aiperf.common.mixins import AIPerfLoggerMixin

if TYPE_CHECKING:
from aiperf.config.resolution.plan import BenchmarkRun

DATASET_NAME = "HuggingFaceH4/aime_2024"
TASK_NAME = "aime24"

# lighteval's aime24 task config: ``generation_size=32768`` to give
# reasoning models room to think before emitting the boxed answer.
DEFAULT_GENERATION_SIZE = 32768

# Schema field names in HuggingFaceH4/aime_2024 (lowercase, lighteval
# canonical — distinct from the Maxwell-Jia mirror used by ``aime``).
PROBLEM_FIELD = "problem"
ANSWER_FIELD = "answer"


class AIME24Benchmark(AIPerfLoggerMixin):
"""Registered placeholder for a future AIME 2024 loader.
"""AIME 2024 lighteval-aligned benchmark loader.

`load_problems()` intentionally raises NotImplementedError in this release;
use the MMLU benchmark when a working accuracy loader is required.
Loads ``HuggingFaceH4/aime_2024`` (train split) and emits one user
message per problem containing the bare problem text — the format
lighteval's ``aime_prompt_fn`` + ``PromptManager`` produce when
``few_shots_split=None``. Pair with ``LightevalExprGrader`` for
grading parity with the recipe.
"""

def __init__(self, run: BenchmarkRun, **kwargs) -> None:
def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
super().__init__(**kwargs)
self.run = run

async def load_problems(
self, tasks: list[str] | None, n_shots: int, enable_cot: bool
) -> list[BenchmarkProblem]:
raise NotImplementedError(
"aime24 benchmark is not yet implemented; only 'mmlu' is available in this release."
)
"""Load AIME24 problems and format them lighteval-style.

Args:
tasks: Ignored — AIME24 has no subtasks.
n_shots: Ignored — the lighteval reference is zero-shot
(``few_shots_split=None``); accepting the parameter
keeps the protocol uniform but emitting few-shots
here would diverge from the reference.
enable_cot: Ignored — lighteval's ``aime_prompt_fn`` does
not add a CoT trigger; the model decides whether to
reason based on the system prompt the user provides
via ``--accuracy-system-prompt``.

Returns:
One ``BenchmarkProblem`` per dataset row, in dataset
order.
"""
ds: Dataset = await asyncio.to_thread(load_dataset, DATASET_NAME, split="train")
return await asyncio.to_thread(self._build_problems, ds)

def _build_problems(self, ds: Dataset) -> list[BenchmarkProblem]:
problems: list[BenchmarkProblem] = []
for row in ds:
problem = row[PROBLEM_FIELD]
messages: list[AccuracyChatMessage] = [{"role": "user", "content": problem}]
problems.append(
BenchmarkProblem(
prompt=problem,
ground_truth=str(row[ANSWER_FIELD]),
task=TASK_NAME,
metadata={"generation_size": DEFAULT_GENERATION_SIZE},
raw_messages=messages,
)
)
return problems
7 changes: 4 additions & 3 deletions src/aiperf/plugin/plugins.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -1255,11 +1255,12 @@ accuracy_benchmark:
aime24:
class: aiperf.accuracy.benchmarks.aime24:AIME24Benchmark
description: |
AIME 2024 benchmark with problems from the 2024 competition year.
AIME 2024 benchmark, aligned with the trt-llm benchmark recipe's
lighteval-backed configuration (HuggingFaceH4/aime_2024 + lighteval
``expr_gold_metric``).
metadata:
default_grader: math
default_grader: lighteval_expr
default_n_shots: 0
is_implemented: false

aime25:
class: aiperf.accuracy.benchmarks.aime25:AIME25Benchmark
Expand Down
1 change: 0 additions & 1 deletion tests/unit/accuracy/test_accuracy_config.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,6 @@
# This branch (AIP-874) implements ``aime``, ``math``, and ``code_execution``,
# so those names are absent from the stub lists.
STUB_BENCHMARKS = (
"aime24",
"aime25",
"math_500",
"gpqa_diamond",
Expand Down
Loading
Loading