feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880)#928

Open: debermudez wants to merge 3 commits into main from dbermudez/aip-880-implement-gpqa-diamond-benchmark-loader

@debermudez debermudez commented May 12, 2026

  • Plugin: registers gpqa_diamond benchmark in plugins.yaml (default_grader: multiple_choice, default_n_shots: 0); scaffold loader raises NotImplementedError until the full lighteval-backed implementation lands.
  • Graduate-level science questions in physics, chemistry, and biology requiring expert-level reasoning; multiple-choice (A/B/C/D format).
  • Uses LightevalGPQAGrader (lighteval_gpqa, IndicesExtractionConfig(prefix_for_extraction="NativeLetters")) introduced in AIP-874.
  • Depends on AIP-879 (lighteval sub-stack ordering).
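
For orientation, the grading step that pairs with the A/B/C/D format can be sketched as below. This is a simplified regex stand-in, not lighteval's extraction logic and not the `LightevalGPQAGrader` code; `extract_letter` and `grade` are hypothetical names, and the assumption is only that the model ends its response with a line like `Answer: C`.

```python
import re
from typing import Optional

# Assumption: the response follows the 'Answer: $LETTER' convention.
# The negative lookahead rejects words such as "Answer: Apple".
ANSWER_RE = re.compile(r"Answer:\s*\(?([ABCD])\)?(?![A-Za-z])", re.IGNORECASE)


def extract_letter(response: str) -> Optional[str]:
    """Return the last answer letter found in the response, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def grade(response: str, gold_letter: str) -> bool:
    """True when the extracted letter equals the gold letter."""
    return extract_letter(response) == gold_letter
```

Taking the last match rather than the first tolerates responses that mention a tentative answer mid-reasoning before committing to a final one.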

Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py (gpqa_diamond task)

Summary by CodeRabbit

Release Notes

  • New Features

    • GPQA-Diamond benchmark is now fully implemented with lighteval alignment and deterministic shuffling to ensure reproducible results.
  • Documentation

    • Updated accuracy benchmarking documentation reflecting GPQA-Diamond implementation status and updated benchmark configuration.
  • Tests

    • Added comprehensive test coverage validating GPQA-Diamond benchmark prompt generation, shuffling behavior, and dataset loading.


copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


github-actions Bot commented May 12, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2c3073941a666c1b9a437f876ef26821ef8bdc24

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2c3073941a666c1b9a437f876ef26821ef8bdc24

Last updated for commit: 2c30739

@github-actions github-actions Bot added the feat label May 12, 2026
@debermudez
Contributor Author

Stack dependency

This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
the trt-llm-benchmark-recipe reference implementation. The branches were
rebased together; each PR depends on its parent landing first.

Merge order:

  1. AIP-874 — feat(accuracy): implement AIME accuracy benchmark #849 ← foundation (base: main)
  2. AIP-877 — feat(accuracy): HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) #923 (DeepEval stack)
  3. AIP-878 — feat(accuracy): BigBench-Hard DeepEval-backed benchmark (AIP-878) #924
  4. AIP-875 — feat(accuracy): AIME 2024 lighteval-backed benchmark (AIP-875) #925 (lighteval stack)
  5. AIP-876 — feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
  6. AIP-879 — feat(accuracy): MATH-500 lighteval-backed benchmark (AIP-879) #927
  7. AIP-880 — feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
  8. AIP-881 — feat(accuracy): LCB CodeGen lighteval-backed benchmark + code_execution grader (AIP-881) #929

This PR: position 7 of 8 — base branch is dbermudez/aip-879-implement-math500-benchmark-loader,
depends on #927 (AIP-879) landing first.

After each upstream PR merges, the downstream PR's branch will be rebased
onto the updated parent before its own merge.

@debermudez debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from e4b5d58 to 28c1485 Compare May 12, 2026 23:25
@debermudez debermudez force-pushed the dbermudez/aip-880-implement-gpqa-diamond-benchmark-loader branch from 7d0633c to 2a1fb77 Compare May 12, 2026 23:25
@debermudez debermudez marked this pull request as draft May 12, 2026 23:27
@debermudez debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 28c1485 to 9e0491c Compare May 13, 2026 00:39
@debermudez debermudez force-pushed the dbermudez/aip-880-implement-gpqa-diamond-benchmark-loader branch from 2a1fb77 to fc53006 Compare May 13, 2026 00:39
@debermudez debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 9e0491c to 3770e19 Compare May 13, 2026 21:23
@debermudez debermudez force-pushed the dbermudez/aip-880-implement-gpqa-diamond-benchmark-loader branch from fc53006 to b2fb5de Compare May 13, 2026 21:29

codecov Bot commented May 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@debermudez debermudez force-pushed the dbermudez/aip-880-implement-gpqa-diamond-benchmark-loader branch from b2fb5de to 3c311f8 Compare May 13, 2026 23:32
@debermudez debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch from 3770e19 to c2ac1d4 Compare May 13, 2026 23:32
@debermudez debermudez force-pushed the dbermudez/aip-879-implement-math500-benchmark-loader branch 2 times, most recently from eae6f8a to d5d733f Compare June 2, 2026 17:14
Base automatically changed from dbermudez/aip-879-implement-math500-benchmark-loader to main June 2, 2026 23:42
…880)

Implement ``GPQADiamondBenchmark`` mirroring the trt-llm benchmark
recipe's ``acc_bench_lighteval.py:gpqa_diamond`` configuration:
loads ``Idavidrein/gpqa`` (subset ``gpqa_diamond``, train split) and
renders the simple-evals prompt template:

    Answer the following multiple choice question. The last line of
    your response should be of the following format: 'Answer: $LETTER'
    (without quotes) where LETTER is one of ABCD. Think step by step
    before answering.

    {Question}

    A) {A}
    B) {B}
    C) {C}
    D) {D}
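
A minimal sketch of rendering this template, assuming the four choices have already been placed in their final A/B/C/D order. The template text is copied from above; `PROMPT_TEMPLATE` and `format_prompt` are illustrative names, not necessarily those used in `gpqa_diamond.py`.

```python
# Template text taken from the commit message; names are hypothetical.
PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. The last line of "
    "your response should be of the following format: 'Answer: $LETTER' "
    "(without quotes) where LETTER is one of ABCD. Think step by step "
    "before answering.\n"
    "\n"
    "{Question}\n"
    "\n"
    "A) {A}\n"
    "B) {B}\n"
    "C) {C}\n"
    "D) {D}"
)


def format_prompt(question: str, choices: list) -> str:
    """Fill the template with a question and exactly four ordered choices."""
    if len(choices) != 4:
        raise ValueError("expected exactly four choices")
    a, b, c, d = choices
    # str.format does not re-parse substituted values, so braces inside
    # the question or choices (e.g. LaTeX) pass through unchanged.
    return PROMPT_TEMPLATE.format(Question=question, A=a, B=b, C=c, D=d)
```

Because the substituted values are not re-parsed, questions that contain LaTeX braces should render without escaping.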

The four answer choices (1 correct + 3 distractors) are shuffled into
A/B/C/D positions via **SHA-256-seeded deterministic shuffling**, the
one intentional deviation from the recipe's stochastic
``random.randint(0, 3)``. Seeding off the question text gives a
stable, locale-independent, Python-version-independent permutation, so
gold positions reproduce across runs while still being distributed
uniformly across the four letters. Pair with ``LightevalGPQAGrader``
(the recipe's ``gpqa_metric``).
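
One way such a shuffle could look is sketched below. The exact seed derivation in this PR is not shown on this page, so the Fisher-Yates walk driven directly by SHA-256 digests is an assumption chosen to avoid depending on the `random` module; the function names are illustrative.

```python
import hashlib

LETTERS = "ABCD"


def seeded_shuffle_indices(question: str, n: int = 4) -> list:
    """Return a permutation of range(n) that depends only on the question text."""
    indices = list(range(n))
    # Fisher-Yates: each swap position comes from a fresh SHA-256 digest,
    # so the result does not depend on the `random` module or the locale.
    for i in range(n - 1, 0, -1):
        digest = hashlib.sha256(f"{i}:{question}".encode("utf-8")).digest()
        j = int.from_bytes(digest[:8], "big") % (i + 1)
        indices[i], indices[j] = indices[j], indices[i]
    return indices


def build_choices(question: str, correct: str, distractors: list):
    """Shuffle the options and return (ordered choices, gold letter)."""
    options = [correct, *distractors]  # index 0 is the gold answer
    order = seeded_shuffle_indices(question, len(options))
    shuffled = [options[i] for i in order]
    gold_letter = LETTERS[order.index(0)]  # where the gold answer landed
    return shuffled, gold_letter
```

The same question text always yields the same permutation, while different questions spread the gold answer across all four letters.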

Per-row ``task`` is set to ``row["High-level domain"]`` so the
accuracy CSV breaks down by physics/chemistry/biology;
``metadata.subdomain`` and ``metadata.gold_letter`` are carried for
post-hoc analysis.
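
A sketch of how the per-row ``task`` enables a per-domain breakdown. ``Problem`` is a hypothetical stand-in for aiperf's ``BenchmarkProblem`` (its real fields are not shown on this page), and the ``Subdomain`` column name is an assumption; only ``High-level domain`` is confirmed above.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Problem:
    """Hypothetical stand-in for BenchmarkProblem; field names are assumed."""
    prompt: str
    ground_truth: str
    task: str
    metadata: dict = field(default_factory=dict)


def build_problem(row: dict, prompt: str, gold_letter: str) -> Problem:
    """Attach the domain as `task` and carry extra fields in metadata."""
    return Problem(
        prompt=prompt,
        ground_truth=gold_letter,
        task=row["High-level domain"],
        metadata={
            "subdomain": row.get("Subdomain", ""),  # assumed column name
            "gold_letter": gold_letter,
        },
    )


def accuracy_by_task(problems: list, predicted_letters: list) -> dict:
    """Group correctness by task, mirroring a per-domain accuracy breakdown."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for problem, predicted in zip(problems, predicted_letters):
        totals[problem.task] += 1
        correct[problem.task] += int(predicted == problem.ground_truth)
    return {task: correct[task] / totals[task] for task in totals}
```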

Built on top of AIP-879 in the lighteval sub-stack
(875 → 876 → 879 → 880). No heavy optional dep — ``datasets`` is core
— so CI gets 100% line + branch coverage out of the box.

Updates the stub registry: drop ``gpqa_diamond`` from
``test_accuracy_config.STUB_BENCHMARKS``, drop ``is_implemented:
false`` from the ``gpqa_diamond`` plugins.yaml entry, switch
``default_grader`` to ``lighteval_gpqa``, add the ``gpqa_diamond``
row to ``docs/accuracy/accuracy-benchmarking.md``, and move it from
"Still Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing
the Status Summary, Method Count Summary, and Suggested
Implementation Order accordingly).

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

``ruff format`` collapses the now-single-element ``STUB_BENCHMARKS``
tuple in ``test_accuracy_config.py`` onto a single line. Pre-AIP-879
the tuple had two entries (``math_500`` + ``lcb_codegeneration``)
which justified the multi-line layout; after 879 landed there's only
``lcb_codegeneration`` left and ruff's formatter would otherwise flag
this on CI as a needed reformat.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
@debermudez debermudez force-pushed the dbermudez/aip-880-implement-gpqa-diamond-benchmark-loader branch from 3c311f8 to a4e9834 Compare June 2, 2026 23:53
@debermudez debermudez marked this pull request as ready for review June 2, 2026 23:54

coderabbitai Bot commented Jun 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0f0228c7-6e46-4346-bb18-9242d91cfa17

📥 Commits

Reviewing files that changed from the base of the PR and between 94ae26e and a4e9834.

📒 Files selected for processing (6)
  • docs/accuracy/accuracy-benchmarking.md
  • docs/accuracy/accuracy_stubs.md
  • src/aiperf/accuracy/benchmarks/gpqa_diamond.py
  • src/aiperf/plugin/plugins.yaml
  • tests/unit/accuracy/test_accuracy_config.py
  • tests/unit/accuracy/test_gpqa_diamond_benchmark.py

Walkthrough

This PR implements GPQADiamondBenchmark, converting it from a stub to a functional benchmark loader using the Idavidrein/gpqa dataset with deterministic SHA-256-seeded option shuffling. The implementation is registered in plugins.yaml with the lighteval_gpqa grader, accompanied by comprehensive test coverage and updated documentation.

Changes

GPQADiamondBenchmark Implementation

| Layer | File(s) | Summary |
|---|---|---|
| Benchmark Implementation with Deterministic Shuffling | src/aiperf/accuracy/benchmarks/gpqa_diamond.py | Replaces stub with full module docstring, imports, constants, and complete loader: _seeded_shuffle_indices produces deterministic permutations via SHA-256 hashing; load_problems loads the dataset asynchronously and delegates to _build_problems, which creates BenchmarkProblem objects; _build_choices assembles ordered options with correct gold letters; _format_prompt renders the simple-evals template with question and choices. |
| Plugin Configuration Registration | src/aiperf/plugin/plugins.yaml | Updates accuracy_benchmark.gpqa_diamond to use lighteval_gpqa as default grader, removes is_implemented: false flag, and documents the deterministic SHA-256-seeded shuffling and simple-evals template alignment. |
| Comprehensive Test Suite | tests/unit/accuracy/test_accuracy_config.py, tests/unit/accuracy/test_gpqa_diamond_benchmark.py | Removes gpqa_diamond from stub validation list and adds full test module covering prompt template compliance (response format and A/B/C/D formatting), ground-truth letter correctness, deterministic shuffling reproducibility and uniformity, parameter invariance, load_problems cardinality and metadata correctness, and edge cases (empty datasets, missing fields, Unicode preservation). |
| Documentation Status Updates | docs/accuracy/accuracy-benchmarking.md, docs/accuracy/accuracy_stubs.md | Adds gpqa_diamond entry to Available Benchmarks table with grader and dataset details; removes from remaining stub list; updates method counts from 18/3 to 19/2; removes from suggested implementation order. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

🐰 A diamond benchmark now shines bright,
With SHA-256 seeding done right,
Prompts rendered true, choices all squared,
No more NotImplementedError feared!
From stub to shining state—the PR prepared.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 17.39% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: implementing a GPQA-Diamond benchmark with lighteval support, which is the primary objective of the PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
