ai-dynamo · FrankD412 · Jun 3, 2026 · May 26, 2026 · Jun 2, 2026 · Jun 3, 2026
diff --git a/docs/accuracy/accuracy-benchmarking.md b/docs/accuracy/accuracy-benchmarking.md
@@ -77,6 +77,7 @@ system message).
 | `aime24` | `lighteval_expr` | 0 | `HuggingFaceH4/aime_2024` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |
 | `aime25` | `lighteval_expr` | 0 | `yentinglin/aime_2025` (trt-llm/lighteval reference, bare problem text, `expr_gold_metric`) |
 | `math_500` | `lighteval_latex` | 0 | `HuggingFaceH4/MATH-500` (trt-llm/lighteval reference, gold is full solution containing `\boxed{answer}`, `latex_gold_metric`) |
+| `gpqa_diamond` | `lighteval_gpqa` | 0 | `Idavidrein/gpqa` subset `gpqa_diamond` (trt-llm/lighteval reference, simple-evals template with SHA-256-seeded deterministic A/B/C/D shuffling, `gpqa_metric`) |
 
 ## CLI Flags
 

diff --git a/docs/accuracy/accuracy_stubs.md b/docs/accuracy/accuracy_stubs.md
@@ -7,7 +7,7 @@
 
 This document catalogs every stubbed method in the accuracy benchmarking scaffolding. The scaffolding is fully integrated into the plugin system, CLI, and config pipeline — the performance benchmarking path is unaffected.
 
-**Status summary:** With the MATH-500 loader landing on top of the AIME25 / AIME24 / BigBench / HellaSwag stack, `MultipleChoiceGrader`, `MathGrader`, `CodeExecutionGrader`, `LightevalExprGrader`, `LightevalLatexGrader`, `LightevalGPQAGrader`, `ExactMatchGrader`, `MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`, and `Math500Benchmark` are fully implemented; the remaining benchmarks (`gpqa_diamond`, `lcb_codegeneration`) are still stubs and ship behind `NotImplementedError` until each follow-up branch lands. Use the implemented classes as canonical references when filling in the remaining stubs.
+**Status summary:** With the GPQA-Diamond loader landing on top of the MATH-500 / AIME25 / AIME24 / BigBench / HellaSwag stack, all seven graders and eight benchmark loaders are now implemented. Only `lcb_codegeneration` (LiveCodeBench code generation) remains stubbed; it ships behind `NotImplementedError` until the AIP-881 branch lands. Use the implemented classes as canonical references when filling in the remaining stub.
 
 ## Table of Contents
 
@@ -176,13 +176,13 @@ All benchmarks use `AIPerfLoggerMixin` and must implement 1 method.
 | 5 | `AIME24Benchmark` | `benchmarks/aime24.py` | `aime24` | `lighteval_expr` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/aime_2024` (train split) and emits the bare problem text as a single user message — no instruction prefix, no few-shot priming. Mirrors the trt-llm benchmark recipe's `acc_bench_lighteval.py` configuration (`few_shots_split=None`, `generation_size=32768`). Pairs with `lighteval_expr` for the recipe's `expr_gold_metric` extraction. |
 | 6 | `AIME25Benchmark` | `benchmarks/aime25.py` | `aime25` | `lighteval_expr` | 0 | **IMPLEMENTED.** Same lighteval-aligned shape as `AIME24Benchmark` but pointed at `yentinglin/aime_2025` (the recipe's `aime25` task config). Identical prompt rendering, generation size, and grader pairing. |
 | 7 | `Math500Benchmark` | `benchmarks/math_500.py` | `math_500` | `lighteval_latex` | 0 | **IMPLEMENTED.** Loads `HuggingFaceH4/MATH-500` (test split). Same lighteval-aligned shape as AIME24/25, but `ground_truth` is the full `solution` text (containing `\boxed{answer}`); `LightevalLatexGrader` extracts the boxed expression at grade time. Per-row `task` = `subject` so the accuracy CSV breaks down by MATH subject. |
+| 8 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `lighteval_gpqa` | 0 | **IMPLEMENTED.** Loads `Idavidrein/gpqa` (subset `gpqa_diamond`, train split). Renders the simple-evals prompt template with **SHA-256-seeded deterministic A/B/C/D shuffling** of the correct answer and three distractors. This is the one intentional deviation from the recipe's stochastic `random.randint(0, 3)`, so gold positions reproduce across runs. Per-row `task` is `gpqa_diamond`; `High-level domain` and `Subdomain` are stored in `metadata` (as `domain` and `subdomain`) for post-run breakdown by physics/chemistry/biology. |
 
 ### Still Stubbed
 
 | # | Class | File | Plugin Key | Default Grader | Default N-Shots |
 |---|-------|------|------------|----------------|-----------------|
-| 1 | `GPQADiamondBenchmark` | `benchmarks/gpqa_diamond.py` | `gpqa_diamond` | `multiple_choice` | 0 |
-| 2 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
+| 1 | `LCBCodeGenerationBenchmark` | `benchmarks/lcb_codegeneration.py` | `lcb_codegeneration` | `code_execution` | 0 |
 
 **Each benchmark has 1 method to implement:**
 
@@ -309,25 +309,24 @@ All stubs are registered in `src/aiperf/plugin/plugins.yaml` and `src/aiperf/plu
 | Component | Implemented | Still Stubbed | Methods per Stub | Remaining Methods |
 |-----------|-------------|---------------|------------------|-------------------|
 | Graders | 7 (all) | 0 | — | 0 |
-| Benchmarks | 7 (incl. MMLU, AIME, HellaSwag, BigBench, AIME24, AIME25, Math500) | 2 | 1 (`load_problems`) | 2 |
+| Benchmarks | 8 (incl. MMLU, AIME, HellaSwag, BigBench, AIME24, AIME25, Math500, GPQADiamond) | 1 | 1 (`load_problems`) | 1 |
 | Record Processor | 1 (`AccuracyRecordProcessor`) | 0 | — | 0 |
 | Results Processor | 1 (`AccuracyResultsProcessor`) | 0 | — | 0 |
 | Console Exporter | 1 (`AccuracyConsoleExporter`) | 0 | — | 0 |
 | Data Exporter | 1 (`AccuracyDataExporter`) | 0 | — | 0 |
 | Stub-plugin Validator | 0 | 1 | 1 (`AccuracyConfig._reject_stub_plugins`) | 1 |
-| **Total** | **18** | **3** | | **3** |
+| **Total** | **19** | **2** | | **2** |
 
 ### Self-Disabling Pattern
 
 Processors and exporters raise their `Disabled` exception **in `__init__`** when accuracy is off. The existing framework catches these and silently skips the plugin. No code changes needed to support this — it uses the same pattern as `RawRecordWriterProcessor` and `ServerMetricsCsvExporter`.
 
 ### Suggested Implementation Order
 
-The processors, exporters, all graders, and seven benchmarks (`MMLUBenchmark`, `AIMEBenchmark`, `HellaSwagBenchmark`, `BigBenchBenchmark`, `AIME24Benchmark`, `AIME25Benchmark`, `Math500Benchmark`) are already wired end-to-end. The remaining work is the two stub benchmarks; mirror the existing loader whose grader matches:
+The processors, exporters, all graders, and eight benchmarks are already wired end-to-end. The remaining work is the single stub benchmark:
 
-1. **`gpqa_diamond`** — mirror `MMLUBenchmark` (`benchmarks/mmlu.py`); pair with the `lighteval_gpqa` grader.
-2. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
-3. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` whenever a benchmark moves from stubbed to supported.
+1. **`lcb_codegeneration`** — mirror `MMLUBenchmark`'s scaffolding; pair with the `code_execution` grader.
+2. **Stub-plugin validator** — update `AccuracyConfig._reject_stub_plugins()` when `lcb_codegeneration` lands so the validator no longer rejects it.
 
 ### Key Files for Reference
 

diff --git a/src/aiperf/accuracy/benchmarks/gpqa_diamond.py b/src/aiperf/accuracy/benchmarks/gpqa_diamond.py
@@ -1,31 +1,218 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 
+"""GPQA-Diamond benchmark loader, aligned with the trt-llm lighteval reference.
+
+Mirrors ``acc_bench_lighteval.py:gpqa_diamond``:
+
+    gpqa_diamond = LightevalTaskConfig(
+        name="gpqa:diamond",
+        prompt_function=gpqa_prompt_fn,
+        hf_repo="Idavidrein/gpqa",
+        hf_subset="gpqa_diamond",
+        evaluation_splits=["train"],
+        few_shots_split=None,
+        generation_size=32768,
+        metric=[gpqa_metric],
+        stop_sequence=[],
+        trust_dataset=True,
+    )
+
+The recipe's ``gpqa_prompt_fn`` builds the simple-evals template:
+
+    Answer the following multiple choice question. The last line of
+    your response should be of the following format: 'Answer: $LETTER'
+    (without quotes) where LETTER is one of ABCD. Think step by step
+    before answering.
+
+    {Question}
+
+    A) {A}
+    B) {B}
+    C) {C}
+    D) {D}
+
+The recipe's prompt_fn shuffles options with ``random.randint(0, 3)``
+(stochastic, different per call). Instead, aiperf uses **SHA-256-seeded
+deterministic shuffling** (as decided during the alignment review) so
+gold positions are reproducible across runs while still spread
+approximately uniformly across A/B/C/D. This is the one intentional
+deviation from the trt-llm reference, documented in
+``docs/accuracy/accuracy-benchmarking.md``.
+
+Pair with ``LightevalGPQAGrader`` (default), which extracts via
+``IndicesExtractionConfig(prefix_for_extraction="NativeLetters")`` to
+match the recipe's ``gpqa_metric``.
+
+Reference:
+    trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py:108,170
+"""
+
 from __future__ import annotations
 
-from typing import TYPE_CHECKING
+import asyncio
+import hashlib
+import random
+from typing import TYPE_CHECKING, Any
 
-from aiperf.accuracy.models import BenchmarkProblem
+from datasets import Dataset, load_dataset
+
+from aiperf.accuracy.models import AccuracyChatMessage, BenchmarkProblem
 from aiperf.common.mixins import AIPerfLoggerMixin
 
 if TYPE_CHECKING:
     from aiperf.config.resolution.plan import BenchmarkRun
 
+DATASET_NAME = "Idavidrein/gpqa"
+DATASET_CONFIG = "gpqa_diamond"
+TASK_NAME = "gpqa_diamond"
+
+# lighteval's gpqa_diamond task config: ``generation_size=32768``.
+DEFAULT_GENERATION_SIZE = 32768
+
+# 4 choices per question (1 correct + 3 distractors).
+NUM_CHOICES = 4
+
+# Modulus that reduces the SHA-256 digest to a 32-bit seed for
+# Python's ``random.Random``.
+_SEED_MODULUS = 2**32
+
+# Schema field names in the Idavidrein/gpqa dataset (Title Case with
+# spaces — the upstream's choice).
+QUESTION_FIELD = "Question"
+CORRECT_ANSWER_FIELD = "Correct Answer"
+INCORRECT_ANSWER_FIELDS = (
+    "Incorrect Answer 1",
+    "Incorrect Answer 2",
+    "Incorrect Answer 3",
+)
+DOMAIN_FIELD = "High-level domain"
+SUBDOMAIN_FIELD = "Subdomain"
+
+# Recipe's ``gpqa_prompt_fn`` template. The model is told to emit
+# ``Answer: $LETTER`` so ``LightevalGPQAGrader`` (with
+# ``IndicesExtractionConfig(prefix_for_extraction="NativeLetters")``)
+# can extract the letter cleanly.
+_PROMPT_TEMPLATE = (
+    "Answer the following multiple choice question. The last line of "
+    "your response should be of the following format: 'Answer: $LETTER' "
+    "(without quotes) where LETTER is one of ABCD. Think step by step "
+    "before answering.\n\n"
+    "{Question}\n\n"
+    "A) {A}\n"
+    "B) {B}\n"
+    "C) {C}\n"
+    "D) {D}"
+)
+
+
+def _seeded_shuffle_indices(key: str, n: int) -> list[int]:
+    """Return a deterministic permutation of ``range(n)`` seeded by ``key``.
+
+    Uses the low-order 32 bits of SHA-256(key) (the digest modulo
+    ``2**32``) as the seed for Python's ``random.Random``. The seed is
+    stable across machines, locales, and ``PYTHONHASHSEED`` values, so
+    regenerating prompts on a new machine yields the same orderings.
+
+    The recipe shuffles via ``random.randint(0, 3)`` (stochastic per
+    call) — see the module docstring for why aiperf chose
+    determinism instead.
+    """
+    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
+    seed = int(digest, 16) % _SEED_MODULUS
+    rng = random.Random(seed)
+    indices = list(range(n))
+    rng.shuffle(indices)
+    return indices
+
 
 class GPQADiamondBenchmark(AIPerfLoggerMixin):
-    """Registered placeholder for a future GPQA Diamond loader.
+    """GPQA-Diamond lighteval-aligned benchmark loader.
 
-    `load_problems()` intentionally raises NotImplementedError in this release;
-    use the MMLU benchmark when a working accuracy loader is required.
+    Loads ``Idavidrein/gpqa`` (config ``gpqa_diamond``, train split).
+    Each row's correct + 3 incorrect answers are deterministically
+    shuffled into A/B/C/D positions and rendered with the simple-evals
+    template (matching ``gpqa_prompt_fn``). Pair with
+    ``LightevalGPQAGrader`` for grading parity with the recipe.
     """
 
-    def __init__(self, run: BenchmarkRun, **kwargs) -> None:
+    def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
         super().__init__(**kwargs)
         self.run = run
 
     async def load_problems(
         self, tasks: list[str] | None, n_shots: int, enable_cot: bool
     ) -> list[BenchmarkProblem]:
-        raise NotImplementedError(
-            "gpqa_diamond benchmark is not yet implemented; only 'mmlu' is available in this release."
+        """Load GPQA-Diamond problems lighteval-style.
+
+        Args:
+            tasks: Ignored — lighteval's gpqa_diamond task has no
+                subtask filtering (per-row High-level domain is in
+                metadata for post-run reporting).
+            n_shots: Ignored — the lighteval reference is zero-shot
+                (``few_shots_split=None``).
+            enable_cot: Ignored — the simple-evals template already
+                includes "Think step by step before answering."
+
+        Returns:
+            One ``BenchmarkProblem`` per dataset row, in dataset order.
+            ``ground_truth`` is the gold letter ("A", "B", "C", or
+            "D") so ``LightevalGPQAGrader`` can pass it directly into
+            its ``Doc.choices=["A","B","C","D"], gold_index=...``
+            shape.
+        """
+        ds: Dataset = await asyncio.to_thread(
+            load_dataset, DATASET_NAME, DATASET_CONFIG, split="train"
+        )
+        return await asyncio.to_thread(self._build_problems, ds)
+
+    def _build_problems(self, ds: Dataset) -> list[BenchmarkProblem]:
+        problems: list[BenchmarkProblem] = []
+        for row in ds:
+            choices, gold_letter = self._build_choices(row)
+            prompt = self._format_prompt(row, choices)
+            messages: list[AccuracyChatMessage] = [{"role": "user", "content": prompt}]
+            problems.append(
+                BenchmarkProblem(
+                    prompt=prompt,
+                    ground_truth=gold_letter,
+                    task=TASK_NAME,
+                    metadata={
+                        "domain": row.get(DOMAIN_FIELD, ""),
+                        "subdomain": row.get(SUBDOMAIN_FIELD, ""),
+                        "generation_size": DEFAULT_GENERATION_SIZE,
+                    },
+                    raw_messages=messages,
+                )
+            )
+        return problems
+
+    @staticmethod
+    def _build_choices(row: dict[str, Any]) -> tuple[list[str], str]:
+        """Assemble 4 lettered choices and report the gold letter.
+
+        Uses SHA-256-seeded permutation (see ``_seeded_shuffle_indices``)
+        — deterministic per-question shuffle, distinct from the
+        recipe's stochastic ``random.randint(0, 3)``.
+        """
+        raw = [
+            row[CORRECT_ANSWER_FIELD],
+            row[INCORRECT_ANSWER_FIELDS[0]],
+            row[INCORRECT_ANSWER_FIELDS[1]],
+            row[INCORRECT_ANSWER_FIELDS[2]],
+        ]
+        order = _seeded_shuffle_indices(row[QUESTION_FIELD], len(raw))
+        ordered = [raw[i] for i in order]
+        gold_index = order.index(0)
+        gold_letter = "ABCD"[gold_index]
+        return ordered, gold_letter
+
+    def _format_prompt(self, row: dict[str, Any], choices: list[str]) -> str:
+        """Render the simple-evals template byte-equal to the recipe."""
+        return _PROMPT_TEMPLATE.format(
+            Question=row[QUESTION_FIELD],
+            A=choices[0],
+            B=choices[1],
+            C=choices[2],
+            D=choices[3],
         )
diff --git a/src/aiperf/plugin/plugins.yaml b/src/aiperf/plugin/plugins.yaml
@@ -1517,12 +1517,14 @@ accuracy_benchmark:
   gpqa_diamond:
     class: aiperf.accuracy.benchmarks.gpqa_diamond:GPQADiamondBenchmark
     description: |
-      GPQA Diamond benchmark with graduate-level science questions in physics,
-      chemistry, and biology requiring expert-level reasoning.
+      GPQA Diamond benchmark, aligned with the trt-llm benchmark recipe's
+      lighteval-backed configuration (Idavidrein/gpqa subset gpqa_diamond
+      + lighteval ``gpqa_metric``). Uses the simple-evals prompt template
+      with SHA-256-seeded deterministic A/B/C/D shuffling so gold
+      positions are reproducible across runs.
     metadata:
-      default_grader: multiple_choice
+      default_grader: lighteval_gpqa
       default_n_shots: 0
-      is_implemented: false
 
   lcb_codegeneration:
     class: aiperf.accuracy.benchmarks.lcb_codegeneration:LCBCodeGenerationBenchmark
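
Reviewer note on the `is_implemented: false` removal above: one way a stub check could key off that metadata flag is sketched below. This is a hypothetical illustration, not the actual `AccuracyConfig._reject_stub_plugins` implementation, which this diff does not show. The `registry` dict shape, `find_stub_plugins`, and `reject_if_stub` are assumed names, the `lcb_codegeneration` entry carrying `is_implemented: false` is an assumption, and the YAML is represented as an already-parsed dict so the snippet needs no YAML library.

```python
from typing import Any

# Assumed shape: plugin key -> entry with a "metadata" mapping, as if parsed
# from the accuracy_benchmark section of plugins.yaml.
registry: dict[str, dict[str, Any]] = {
    "gpqa_diamond": {
        "metadata": {"default_grader": "lighteval_gpqa", "default_n_shots": 0},
    },
    "lcb_codegeneration": {
        "metadata": {
            "default_grader": "code_execution",
            "default_n_shots": 0,
            "is_implemented": False,  # assumed to still be flagged as a stub
        },
    },
}


def find_stub_plugins(entries: dict[str, dict[str, Any]]) -> list[str]:
    """List plugin keys whose metadata explicitly sets is_implemented to False."""
    return sorted(
        name
        for name, entry in entries.items()
        if entry.get("metadata", {}).get("is_implemented", True) is False
    )


def reject_if_stub(name: str, entries: dict[str, dict[str, Any]]) -> None:
    """Raise ValueError when the requested plugin is still a stub."""
    if name in find_stub_plugins(entries):
        raise ValueError(f"{name!r} is registered but not yet implemented")


stubs = find_stub_plugins(registry)
```

Under these assumptions, dropping the flag from an entry is enough to take it off the stub list, which matches the intent of this hunk.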

diff --git a/tests/unit/accuracy/test_accuracy_config.py b/tests/unit/accuracy/test_accuracy_config.py
@@ -22,10 +22,7 @@
 # implementation (and remove the ``is_implemented: false`` from the YAML).
 # This branch (AIP-874) implements ``aime``, ``math``, and ``code_execution``,
 # so those names are absent from the stub lists.
-STUB_BENCHMARKS = (
-    "gpqa_diamond",
-    "lcb_codegeneration",
-)
+STUB_BENCHMARKS = ("lcb_codegeneration",)
 STUB_GRADERS: tuple[str, ...] = ()
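
Possible follow-up test (a sketch, not part of this PR): now that `gpqa_diamond` is off the stub list, the claim that gold positions are reproducible and spread across A/B/C/D could be pinned with a check like the one below. The helper is re-declared so the snippet runs standalone. The synthetic question keys, the `gold_letter_for` name, and the sample size of 400 are arbitrary assumptions.

```python
import hashlib
import random
from collections import Counter


def seeded_shuffle_indices(key: str, n: int) -> list[int]:
    """Copy of the PR's seeding scheme: SHA-256(key) mod 2**32 seeds the RNG."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16) % 2**32)
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


def gold_letter_for(question: str) -> str:
    """Letter assigned to the correct answer (index 0 before shuffling)."""
    return "ABCD"[seeded_shuffle_indices(question, 4).index(0)]


# Synthetic keys standing in for question text.
questions = [f"synthetic question {i}" for i in range(400)]

first_pass = [gold_letter_for(q) for q in questions]
second_pass = [gold_letter_for(q) for q in questions]

# Tally how often each letter holds the gold answer.
counts = Counter(first_pass)
```

Comparing `first_pass` with `second_pass` checks reproducibility, and `counts` shows how the gold letter is distributed over the sample.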