feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) by debermudez · Pull Request #928 · ai-dynamo/aiperf

debermudez · 2026-05-12T23:08:44Z

Plugin: registers gpqa_diamond benchmark in plugins.yaml (default_grader: multiple_choice, default_n_shots: 0); scaffold loader raises NotImplementedError until the full lighteval-backed implementation lands.
Graduate-level science questions in physics, chemistry, and biology requiring expert-level reasoning; multiple-choice (A/B/C/D format).
Uses LightevalGPQAGrader (lighteval_gpqa, IndicesExtractionConfig(prefix_for_extraction="NativeLetters")) introduced in AIP-874.
Depends on AIP-879 (lighteval sub-stack ordering).

Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py (gpqa_diamond task)
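
To illustrate what grading the A/B/C/D format involves, here is a minimal sketch of pulling the final answer letter out of a model response. It is not lighteval's extraction logic and not the `LightevalGPQAGrader` implementation; the regex and the helper names `extract_letter` and `is_correct` are hypothetical and only assume that the response ends with a line of the form `Answer: X`.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption, not lighteval's): matches "Answer: C"
# or "Answer: (C)" and refuses to match words such as "Answer: Both".
_ANSWER_RE = re.compile(r"Answer\s*:\s*\(?([ABCD])\)?(?![A-Za-z])")


def extract_letter(response: str) -> Optional[str]:
    """Return the last letter announced as 'Answer: X', or None if absent."""
    matches = _ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def is_correct(response: str, gold_letter: str) -> bool:
    """Compare the extracted letter with the gold letter."""
    return extract_letter(response) == gold_letter
```

Taking the last match rather than the first lets a response revise an earlier guess while still being graded on its final line.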

Summary by CodeRabbit

Release Notes

New Features
- GPQA-Diamond benchmark is now fully implemented with lighteval alignment and deterministic shuffling to ensure reproducible results.
Documentation
- Updated accuracy benchmarking documentation reflecting GPQA-Diamond implementation status and updated benchmark configuration.
Tests
- Added comprehensive test coverage validating GPQA-Diamond benchmark prompt generation, shuffling behavior, and dataset loading.

copy-pr-bot · 2026-05-12T23:08:47Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


github-actions · 2026-05-12T23:08:52Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2c3073941a666c1b9a437f876ef26821ef8bdc24

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2c3073941a666c1b9a437f876ef26821ef8bdc24

Last updated for commit: 2c30739

github-actions · 2026-05-12T23:09:24Z

Fern Docs Preview: https://nvidia-preview-9cf17bdf-fb99-4a99-846a-e48a9639868d.docs.buildwithfern.com/aiperf/dev

debermudez · 2026-05-12T23:11:43Z

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

AIP-874 — feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
AIP-877 — feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
AIP-878 — feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
AIP-875 — feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
AIP-876 — feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
AIP-879 — feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
AIP-880 — feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
AIP-881 — feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 7 of 8 — base branch is dbermudez/aip-879-implement-math500-benchmark-loader,
depends on #927 (AIP-879) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

codecov · 2026-05-13T21:40:57Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


…880)

Implement ``GPQADiamondBenchmark`` mirroring the trt-llm benchmark recipe's ``acc_bench_lighteval.py:gpqa_diamond`` configuration: loads ``Idavidrein/gpqa`` (subset ``gpqa_diamond``, train split) and renders the simple-evals prompt template:

    Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering.

    {Question}

    A) {A}
    B) {B}
    C) {C}
    D) {D}

The four answer choices (1 correct + 3 distractors) are shuffled into A/B/C/D positions via **SHA-256-seeded deterministic shuffling**, the one intentional deviation from the recipe's stochastic ``random.randint(0, 3)``. Seeding off the question text gives a stable, locale-independent, Python-version-independent permutation so gold positions reproduce across runs while still distributed uniformly. Pair with ``LightevalGPQAGrader`` (the recipe's ``gpqa_metric``).

Per-row ``task`` is set to ``row["High-level domain"]`` so the accuracy CSV breaks down by physics/chemistry/biology; ``metadata.subdomain`` and ``metadata.gold_letter`` are carried for post-hoc analysis.

Built on top of AIP-879 in the lighteval sub-stack (875 → 876 → 879 → 880). No heavy optional dep (``datasets`` is core), so CI gets 100% line + branch coverage out of the box.

Updates the stub registry: drop ``gpqa_diamond`` from ``test_accuracy_config.STUB_BENCHMARKS``, drop ``is_implemented: false`` from the ``gpqa_diamond`` plugins.yaml entry, switch ``default_grader`` to ``lighteval_gpqa``, add the ``gpqa_diamond`` row to ``docs/accuracy/accuracy-benchmarking.md``, and move it from "Still Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing the Status Summary, Method Count Summary, and Suggested Implementation Order accordingly).

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
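
The shuffling and prompt rendering described in this commit message can be sketched as follows. This is one possible construction under stated assumptions, not the code in `gpqa_diamond.py`: the helper names (`seeded_shuffle_indices`, `build_choices`, `format_prompt`), the way the SHA-256 digest is turned into a permutation (sorting indices by a per-index hash), and the exact line breaks in the template are all assumptions. Only the template wording and the idea of seeding from the question text come from the message itself.

```python
import hashlib

LETTERS = "ABCD"

# Template wording is from the commit message; the line breaks are assumed.
PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'Answer: $LETTER' (without "
    "quotes) where LETTER is one of ABCD. Think step by step before answering."
    "\n\n{Question}\n\nA) {A}\nB) {B}\nC) {C}\nD) {D}"
)


def seeded_shuffle_indices(question: str, n: int = 4) -> list[int]:
    """Derive a permutation of range(n) from the question text alone.

    Each index is ranked by SHA-256(question, index), so the order depends
    only on the hash and not on Python's random module or the locale.
    """
    def rank(i: int) -> bytes:
        return hashlib.sha256(f"{question}\x00{i}".encode("utf-8")).digest()

    return sorted(range(n), key=rank)


def build_choices(
    question: str, correct: str, distractors: list[str]
) -> tuple[list[str], str]:
    """Place the correct answer and distractors into A/B/C/D positions."""
    options = [correct, *distractors]  # index 0 is always the correct answer
    order = seeded_shuffle_indices(question, len(options))
    ordered = [options[i] for i in order]
    gold_letter = LETTERS[order.index(0)]  # where index 0 ended up
    return ordered, gold_letter


def format_prompt(question: str, ordered: list[str]) -> str:
    """Render the prompt with the already-shuffled choices."""
    return PROMPT_TEMPLATE.format(Question=question, **dict(zip(LETTERS, ordered)))
```

Because the permutation is a pure function of the question text, calling `build_choices` twice on the same row yields the same gold letter, which is the reproducibility property the commit message describes.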

``ruff format`` collapses the now-single-element ``STUB_BENCHMARKS`` tuple in ``test_accuracy_config.py`` onto a single line. Pre-AIP-879 the tuple had two entries (``math_500`` + ``lcb_codegeneration``) which justified the multi-line layout; after 879 landed there's only ``lcb_codegeneration`` left and ruff's formatter would otherwise flag this on CI as a needed reformat.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

coderabbitai · 2026-06-02T23:59:22Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0f0228c7-6e46-4346-bb18-9242d91cfa17

📥 Commits

Reviewing files that changed from the base of the PR and between 94ae26e and a4e9834.

📒 Files selected for processing (6)

docs/accuracy/accuracy-benchmarking.md
docs/accuracy/accuracy_stubs.md
src/aiperf/accuracy/benchmarks/gpqa_diamond.py
src/aiperf/plugin/plugins.yaml
tests/unit/accuracy/test_accuracy_config.py
tests/unit/accuracy/test_gpqa_diamond_benchmark.py

Walkthrough

This PR implements GPQADiamondBenchmark, converting it from a stub to a functional benchmark loader using the Idavidrein/gpqa dataset with deterministic SHA-256-seeded option shuffling. The implementation is registered in plugins.yaml with the lighteval_gpqa grader, accompanied by comprehensive test coverage and updated documentation.

Changes

GPQADiamondBenchmark Implementation

| Layer / File(s) | Summary |
| --- | --- |
| Benchmark Implementation with Deterministic Shuffling `src/aiperf/accuracy/benchmarks/gpqa_diamond.py` | Replaces stub with full module docstring, imports, constants, and complete loader: `_seeded_shuffle_indices` produces deterministic permutations via SHA-256 hashing; `load_problems` loads the dataset asynchronously and delegates to `_build_problems`, which creates `BenchmarkProblem` objects; `_build_choices` assembles ordered options with correct gold letters; `_format_prompt` renders the simple-evals template with question and choices. |
| Plugin Configuration Registration `src/aiperf/plugin/plugins.yaml` | Updates `accuracy_benchmark.gpqa_diamond` to use `lighteval_gpqa` as default grader, removes `is_implemented: false` flag, and documents the deterministic SHA-256-seeded shuffling and simple-evals template alignment. |
| Comprehensive Test Suite `tests/unit/accuracy/test_accuracy_config.py`, `tests/unit/accuracy/test_gpqa_diamond_benchmark.py` | Removes `gpqa_diamond` from stub validation list and adds full test module covering prompt template compliance (response format and A/B/C/D formatting), ground-truth letter correctness, deterministic shuffling reproducibility and uniformity, parameter invariance, load_problems cardinality and metadata correctness, and edge cases (empty datasets, missing fields, Unicode preservation). |
| Documentation Status Updates `docs/accuracy/accuracy-benchmarking.md`, `docs/accuracy/accuracy_stubs.md` | Adds `gpqa_diamond` entry to Available Benchmarks table with grader and dataset details; removes from remaining stub list; updates method counts from 18/3 to 19/2; removes from suggested implementation order. |
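
The loading flow summarized in the table (an asynchronous `load_problems` that delegates to a row-to-problem builder) might look roughly like the sketch below. It is a simplified illustration, not the PR's code: `Problem` is a hypothetical stand-in for aiperf's `BenchmarkProblem`, the column names `Question`, `Correct Answer`, and `Subdomain` are assumptions about the dataset schema (only `High-level domain` is named in the commit message), and shuffling and prompt rendering are left out. The blocking fetch is injectable so the builder can be exercised without network access; the dataset may also require authentication on the Hugging Face Hub.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Callable

Row = dict[str, Any]


@dataclass
class Problem:
    """Hypothetical stand-in for aiperf's BenchmarkProblem."""

    prompt: str
    ground_truth: str
    task: str
    metadata: dict[str, Any] = field(default_factory=dict)


def fetch_rows() -> list[Row]:
    """Blocking fetch of the dataset named in this PR (needs `datasets`)."""
    from datasets import load_dataset  # imported lazily; only used here

    return list(load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train"))


def build_problems(rows: list[Row]) -> list[Problem]:
    """Map raw rows to problems; `task` carries the high-level domain."""
    return [
        Problem(
            prompt=row["Question"],  # assumed column name
            ground_truth=row["Correct Answer"],  # assumed column name
            task=row["High-level domain"],
            metadata={"subdomain": row.get("Subdomain")},  # assumed column name
        )
        for row in rows
    ]


async def load_problems(fetch: Callable[[], list[Row]] = fetch_rows) -> list[Problem]:
    """Run the blocking fetch in a worker thread, then build problems."""
    rows = await asyncio.to_thread(fetch)
    return build_problems(rows)
```

Passing a stub such as `lambda: fake_rows` for `fetch` is one way to cover the empty-dataset and missing-field cases mentioned in the test suite row without touching the network.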

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A diamond benchmark now shines bright,
With SHA-256 seeding done right,
Prompts rendered true, choices all squared,
No more NotImplementedError feared!
From stub to shining state—the PR prepared.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 17.39% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: implementing a GPQA-Diamond benchmark with lighteval support, which is the primary objective of the PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


github-actions Bot added the feat label May 12, 2026

debermudez mentioned this pull request May 12, 2026

feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

Draft

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from e4b5d58 to 28c1485 Compare May 12, 2026 23:25

debermudez force-pushed the dbermudez/aip-880-implement-gpqa-diamond-benchmark-loader branch from 7d0633c to 2a1fb77 Compare May 12, 2026 23:25

debermudez marked this pull request as draft May 12, 2026 23:27

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 28c1485 to 9e0491c Compare May 13, 2026 00:39

debermudez force-pushed the dbermudez/aip-880-implement-gpqa-diamond-benchmark-loader branch from 2a1fb77 to fc53006 Compare May 13, 2026 00:39

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 9e0491c to 3770e19 Compare May 13, 2026 21:23

debermudez force-pushed the dbermudez/aip-880-implement-gpqa-diamond-benchmark-loader branch from fc53006 to b2fb5de Compare May 13, 2026 21:29

debermudez force-pushed the dbermudez/aip-880-implement-gpqa-diamond-benchmark-loader branch from b2fb5de to 3c311f8 Compare May 13, 2026 23:32

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 3770e19 to c2ac1d4 Compare May 13, 2026 23:32

debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch 2 times, most recently from eae6f8a to d5d733f Compare June 2, 2026 17:14

Base automatically changed from dbermudez/aip-879-implement-math500-benchmark-loader to main June 2, 2026 23:42

debermudez added 2 commits June 2, 2026 16:44

debermudez force-pushed the dbermudez/aip-880-implement-gpqa-diamond-benchmark-loader branch from 3c311f8 to a4e9834 Compare June 2, 2026 23:53

debermudez marked this pull request as ready for review June 2, 2026 23:54

Merge branch 'main' into dbermudez/aip-880-implement-gpqa-diamond-benchmark-loader (2c30739)

FrankD412 approved these changes Jun 3, 2026

debermudez wants to merge 3 commits into main from dbermudez/aip-880-implement-gpqa-diamond-benchmark-loader

2 participants
