Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/accuracy/accuracy-benchmarking.md
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,7 @@ system message).
| `aime24` | `lighteval_expr` | 0 | `HuggingFaceH4/aime_2024` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |
| `aime25` | `lighteval_expr` | 0 | `yentinglin/aime_2025` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |
| `math_500` | `lighteval_latex` | 0 | `HuggingFaceH4/MATH-500` (trt-llm/lighteval reference, gold is full solution containing `\boxed{answer}`, `latex_gold_metric`) |
| `gpqa_diamond` | `lighteval_gpqa` | 0 | `Idavidrein/gpqa` subset `gpqa_diamond` (trt-llm/lighteval reference, simple-evals template with SHA-256-seeded deterministic A/B/C/D shuffling, `gpqa_metric`) |

## CLI Flags

Expand Down
17 changes: 8 additions & 9 deletions docs/accuracy/accuracy_stubs.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@

This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected.

**Status summary:** With the MATH-500 loader landing on top of the AIME25 / AIME24 / BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`, and `Math500Benchmark` are fully implemented; the remaining benchmarks (`gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
**Status summary:** With the GPQA-Diamond loader landing on top of the MATH-500 / AIME25 / AIME24 / BigBench / HellaSwag stack, all seven graders and eight benchmark loaders are now implemented; only `lcb_codegeneration` remains stubbed (LiveCodeBench code-generation) and ships behind `NotImplementedError` until the AIP-881 branch lands. Use the implemented classes as canonical references when filling in the remaining stub.

## Table of Contents

Expand Down Expand Up @@ -176,13 +176,13 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method.
| 5 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `lighteval_expr` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/aime_2024` (train split) and emits the bare problem text as a single user message — no instruction prefix, no few-shot priming. Mirrors the trt-llm benchmark recipe's `acc_bench_lighteval.py` configuration (`few_shots_split=None`, `generation_size=32768`). Pairs with `lighteval_expr` for the recipe's `expr_gold_metric` extraction. |
| 6 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `lighteval_expr` | 0 | **IMPLEMENTED.** Same lighteval-aligned shape as `AIME24Benchmark` but pointed at `yentinglin/aime_2025` (the recipe's `aime25` task config). Identical prompt rendering, generation size, and grader pairing. |
| 7 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `lighteval_latex` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/MATH-500` (test split). Same lighteval-aligned shape as AIME24/25, but `ground_truth` is the full `solution` text (containing `\boxed{answer}`); `LightevalLatexGrader` extracts the boxed expression at grade time. Per-row `task` = `subject` so the accuracy CSV breaks down by MATH subject. |
| 8 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `lighteval_gpqa` | 0 | **IMPLEMENTED.** Loads `Idavidrein/gpqa` (subset `gpqa_diamond`, train split). Renders the simple-evals prompt template with **SHA-256-seeded deterministic A/B/C/D shuffling** of the correct + 3 distractor answers — one intentional deviation from the recipe's stochastic `random.randint(0, 3)` so gold positions reproduce across runs. Per-row `task` = `High-level domain` so the accuracy CSV breaks down by physics/chemistry/biology. |

### Still Stubbed

| # | Class | File | Plugin Key | Default Grader | Default N-Shots |
|---|-------|------|------------|----------------|-----------------|
| 1 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
| 2 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
| 1 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |

**Each benchmark has 1 method to implement:**

Expand Down Expand Up @@ -309,25 +309,24 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu
| Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods |
|-----------|-------------|---------------|------------------|-------------------|
| Graders | 7 (all) | 0 | — | 0 |
| Benchmarks | 7 (incl. MMLU, AIME, HellaSwag, BigBench, AIME24, AIME25, Math500) | 2 | 1 (`load_problems`) | 2 |
| Benchmarks | 8 (incl. MMLU, AIME, HellaSwag, BigBench, AIME24, AIME25, Math500, GPQADiamond) | 1 | 1 (`load_problems`) | 1 |
| Record Processor | 1 (`AccuracyRecordProcessor`) | 0 | — | 0 |
| Results Processor | 1 (`AccuracyResultsProcessor`) | 0 | — | 0 |
| Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 | — | 0 |
| Data Exporter | 1 (`AccuracyDataExporter`) | 0 | — | 0 |
| Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 |
| **Total** | **18** | **3** | | **3** |
| **Total** | **19** | **2** | | **2** |

### Self-Disabling Pattern

Processors and exporters raise their `Disabled` exception **in `__init__`** when accuracy is off. The existing framework catches these and silently skips the plugin. No code changes needed to support this — it uses the same pattern as `RawRecordWriterProcessor` and `ServerMetricsCsvExporter`.

### Suggested Implementation Order

The processors, exporters, all graders, and seven benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`, `Math500Benchmark`) are already wired end-to-end. The remaining work is the two stub benchmarks; mirror the existing loader whose grader matches:
The processors, exporters, all graders, and eight benchmarks are already wired end-to-end. The remaining work is the single stub benchmark:

1. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader.
2. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
3. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.
1. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
2. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` when `lcb_codegeneration` lands so the validator no longer rejects it.

### Key Files for Reference

Expand Down
203 changes: 195 additions & 8 deletions src/aiperf/accuracy/benchmarks/gpqa_diamond.py
Original file line number Diff line number Diff line change
@@ -1,31 +1,218 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""GPQA-Diamond benchmark loader, aligned with the trt-llm lighteval reference.

Mirrors ``acc_bench_lighteval.py:gpqa_diamond``:

gpqa_diamond = LightevalTaskConfig(
name="gpqa:diamond",
prompt_function=gpqa_prompt_fn,
hf_repo="Idavidrein/gpqa",
hf_subset="gpqa_diamond",
evaluation_splits=["train"],
few_shots_split=None,
generation_size=32768,
metric=[gpqa_metric],
stop_sequence=[],
trust_dataset=True,
)

The recipe's ``gpqa_prompt_fn`` builds the simple-evals template:

Answer the following multiple choice question. The last line of
your response should be of the following format: 'Answer: $LETTER'
(without quotes) where LETTER is one of ABCD. Think step by step
before answering.

{Question}

A) {A}
B) {B}
C) {C}
D) {D}

The recipe's prompt_fn shuffles options with ``random.randint(0, 3)``
(stochastic, different per call). Aiperf instead uses **SHA-256-seeded
deterministic shuffling** (per the user direction during the alignment
review) so gold positions are reproducible across runs while still
distributed uniformly. This is the one intentional deviation from the
trt-llm reference, documented in
``docs/accuracy/accuracy-benchmarking.md``.

Pair with ``LightevalGPQAGrader`` (default), which extracts via
``IndicesExtractionConfig(prefix_for_extraction="NativeLetters")`` to
match the recipe's ``gpqa_metric``.

Reference:
trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:108,170
"""

from __future__ import annotations

from typing import TYPE_CHECKING
import asyncio
import hashlib
import random
from typing import TYPE_CHECKING, Any

from aiperf.accuracy.models import BenchmarkProblem
from datasets import Dataset, load_dataset

from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem
from aiperf.common.mixins import AIPerfLoggerMixin

if TYPE_CHECKING:
from aiperf.config.resolution.plan import BenchmarkRun

DATASET_NAME = "Idavidrein/gpqa"
DATASET_CONFIG = "gpqa_diamond"
TASK_NAME = "gpqa_diamond"

# lighteval's gpqa_diamond task config: ``generation_size=32768``.
DEFAULT_GENERATION_SIZE = 32768

# 4 choices per question (1 correct + 3 distractors).
NUM_CHOICES = 4

# Width of the SHA-256-derived seed when modded down to a 32-bit
# Python ``random.Random`` seed.
_SEED_MODULUS = 2**32

# Schema field names in the Idavidrein/gpqa dataset (Title Case with
# spaces — the upstream's choice).
QUESTION_FIELD = "Question"
CORRECT_ANSWER_FIELD = "Correct Answer"
INCORRECT_ANSWER_FIELDS = (
"Incorrect Answer 1",
"Incorrect Answer 2",
"Incorrect Answer 3",
)
DOMAIN_FIELD = "High-level domain"
SUBDOMAIN_FIELD = "Subdomain"

# Recipe's ``gpqa_prompt_fn`` template. The model is told to emit
# ``Answer: $LETTER`` so ``LightevalGPQAGrader`` (with
# ``IndicesExtractionConfig(prefix_for_extraction="NativeLetters")``)
# can extract the letter cleanly.
_PROMPT_TEMPLATE = (
"Answer the following multiple choice question. The last line of "
"your response should be of the following format: 'Answer: $LETTER' "
"(without quotes) where LETTER is one of ABCD. Think step by step "
"before answering.\n\n"
"{Question}\n\n"
"A) {A}\n"
"B) {B}\n"
"C) {C}\n"
"D) {D}"
)


def _seeded_shuffle_indices(key: str, n: int) -> list[int]:
"""Return a deterministic permutation of ``range(n)`` seeded by ``key``.

Uses the leading 32 bits of SHA-256(key) as the seed for Python's
``random.Random``. This gives a stable, locale-independent,
Python-version-independent permutation: regenerating prompts on a
new machine produces identical letter orderings.

The recipe shuffles via ``random.randint(0, 3)`` (stochastic per
call) — see the module docstring for why aiperf chose
determinism instead.
"""
digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
seed = int(digest, 16) % _SEED_MODULUS
rng = random.Random(seed)
indices = list(range(n))
rng.shuffle(indices)
return indices


class GPQADiamondBenchmark(AIPerfLoggerMixin):
"""Registered placeholder for a future GPQA Diamond loader.
"""GPQA-Diamond lighteval-aligned benchmark loader.

`load_problems()` intentionally raises NotImplementedError in this release;
use the MMLU benchmark when a working accuracy loader is required.
Loads ``Idavidrein/gpqa`` (config ``gpqa_diamond``, train split).
Each row's correct + 3 incorrect answers are deterministically
shuffled into A/B/C/D positions and rendered with the simple-evals
template (matching ``gpqa_prompt_fn``). Pair with
``LightevalGPQAGrader`` for grading parity with the recipe.
"""

def __init__(self, run: BenchmarkRun, **kwargs) -> None:
def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
super().__init__(**kwargs)
self.run = run

async def load_problems(
self, tasks: list[str] | None, n_shots: int, enable_cot: bool
) -> list[BenchmarkProblem]:
raise NotImplementedError(
"gpqa_diamond benchmark is not yet implemented; only 'mmlu' is available in this release."
"""Load GPQA-Diamond problems lighteval-style.

Args:
tasks: Ignored — lighteval's gpqa_diamond task has no
subtask filtering (per-row High-level domain is in
metadata for post-run reporting).
n_shots: Ignored — the lighteval reference is zero-shot
(``few_shots_split=None``).
enable_cot: Ignored — the simple-evals template already
includes "Think step by step before answering."

Returns:
One ``BenchmarkProblem`` per dataset row, in dataset order.
``ground_truth`` is the gold letter ("A", "B", "C", or
"D") so ``LightevalGPQAGrader`` can pass it directly into
its ``Doc.choices=["A","B","C","D"], gold_index=...``
shape.
"""
ds: Dataset = await asyncio.to_thread(
load_dataset, DATASET_NAME, DATASET_CONFIG, split="train"
)
return await asyncio.to_thread(self._build_problems, ds)

def _build_problems(self, ds: Dataset) -> list[BenchmarkProblem]:
problems: list[BenchmarkProblem] = []
for row in ds:
choices, gold_letter = self._build_choices(row)
prompt = self._format_prompt(row, choices)
messages: list[AccuracyChatMessage] = [{"role": "user", "content": prompt}]
problems.append(
BenchmarkProblem(
prompt=prompt,
ground_truth=gold_letter,
task=TASK_NAME,
metadata={
"domain": row.get(DOMAIN_FIELD, ""),
"subdomain": row.get(SUBDOMAIN_FIELD, ""),
"generation_size": DEFAULT_GENERATION_SIZE,
},
raw_messages=messages,
)
)
return problems

@staticmethod
def _build_choices(row: dict[str, Any]) -> tuple[list[str], str]:
"""Assemble 4 lettered choices and report the gold letter.

Uses SHA-256-seeded permutation (see ``_seeded_shuffle_indices``)
— deterministic per-question shuffle, distinct from the
recipe's stochastic ``random.randint(0, 3)``.
"""
raw = [
row[CORRECT_ANSWER_FIELD],
row[INCORRECT_ANSWER_FIELDS[0]],
row[INCORRECT_ANSWER_FIELDS[1]],
row[INCORRECT_ANSWER_FIELDS[2]],
]
order = _seeded_shuffle_indices(row[QUESTION_FIELD], len(raw))
ordered = [raw[i] for i in order]
gold_index = order.index(0)
gold_letter = "ABCD"[gold_index]
return ordered, gold_letter

def _format_prompt(self, row: dict[str, Any], choices: list[str]) -> str:
"""Render the simple-evals template byte-equal to the recipe."""
return _PROMPT_TEMPLATE.format(
Question=row[QUESTION_FIELD],
A=choices[0],
B=choices[1],
C=choices[2],
D=choices[3],
)
10 changes: 6 additions & 4 deletions src/aiperf/plugin/plugins.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -1517,12 +1517,14 @@ accuracy_benchmark:
gpqa_diamond:
class: aiperf.accuracy.benchmarks.gpqa_diamond:GPQADiamondBenchmark
description: |
GPQA Diamond benchmark with graduate-level science questions in physics,
chemistry, and biology requiring expert-level reasoning.
GPQA Diamond benchmark, aligned with the trt-llm benchmark recipe's
lighteval-backed configuration (Idavidrein/gpqa subset gpqa_diamond
+ lighteval ``gpqa_metric``). Uses the simple-evals prompt template
with SHA-256-seeded deterministic A/B/C/D shuffling so gold
positions are reproducible across runs.
metadata:
default_grader: multiple_choice
default_grader: lighteval_gpqa
default_n_shots: 0
is_implemented: false

lcb_codegeneration:
class: aiperf.accuracy.benchmarks.lcb_codegeneration:LCBCodeGenerationBenchmark
Expand Down
5 changes: 1 addition & 4 deletions tests/unit/accuracy/test_accuracy_config.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,10 +22,7 @@
# implementation (and remove the ``is_implemented: false`` from the YAML).
# This branch (AIP-874) implements ``aime``, ``math``, and ``code_execution``,
# so those names are absent from the stub lists.
STUB_BENCHMARKS = (
"gpqa_diamond",
"lcb_codegeneration",
)
STUB_BENCHMARKS = ("lcb_codegeneration",)
STUB_GRADERS: tuple[str, ...] = ()


Expand Down
Loading
Loading