feat(accuracy): GPQA-Diamond lighteval-backed benchmark (AIP-880) #928
debermudez wants to merge 3 commits into
Try out this PR
Quick install:
    pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2c3073941a666c1b9a437f876ef26821ef8bdc24
Recommended with virtual environment (using uv):
    uv venv --python 3.12 && source .venv/bin/activate
    uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@2c3073941a666c1b9a437f876ef26821ef8bdc24
Stack dependency
This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
Merge order:
This PR: position 7 of 8, base branch is
After each upstream PR merges, the downstream PR's branch will be rebased
Codecov Report: All modified and coverable lines are covered by tests.
…880)
Implement ``GPQADiamondBenchmark`` mirroring the trt-llm benchmark
recipe's ``acc_bench_lighteval.py:gpqa_diamond`` configuration:
loads ``Idavidrein/gpqa`` (subset ``gpqa_diamond``, train split) and
renders the simple-evals prompt template:
    Answer the following multiple choice question. The last line of
    your response should be of the following format: 'Answer: $LETTER'
    (without quotes) where LETTER is one of ABCD. Think step by step
    before answering.

    {Question}

    A) {A}
    B) {B}
    C) {C}
    D) {D}

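As a rough illustration of how the template above might be filled in, here is a minimal sketch. The names `QUERY_TEMPLATE` and `render_prompt` are hypothetical and not taken from the PR, and the blank lines between the instruction, the question, and the choices are an assumption about the original layout.

```python
# Template text copied from the commit message; the blank-line layout is an
# assumption, because the page extraction flattened the original spacing.
QUERY_TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'Answer: $LETTER' (without "
    "quotes) where LETTER is one of ABCD. Think step by step before answering.\n"
    "\n"
    "{Question}\n"
    "\n"
    "A) {A}\n"
    "B) {B}\n"
    "C) {C}\n"
    "D) {D}"
)


def render_prompt(question: str, choices: list[str]) -> str:
    """Fill the template with a question and exactly four ordered choices."""
    if len(choices) != 4:
        raise ValueError("expected exactly four answer choices")
    a, b, c, d = choices
    return QUERY_TEMPLATE.format(Question=question, A=a, B=b, C=c, D=d)


# Hypothetical example row, used only to show the rendered shape.
prompt = render_prompt("What is 2 + 2?", ["3", "4", "5", "22"])
```

The choices are passed in their final A/B/C/D order, so any shuffling has to happen before rendering.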
The four answer choices (1 correct + 3 distractors) are shuffled into
A/B/C/D positions via **SHA-256-seeded deterministic shuffling**. This is
the one intentional deviation from the recipe's stochastic
``random.randint(0, 3)``. Seeding off the question text gives a
stable, locale-independent, Python-version-independent permutation, so
gold positions reproduce across runs while still being distributed
uniformly. Pair with ``LightevalGPQAGrader`` (the recipe's
``gpqa_metric``).
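One way such a question-seeded shuffle could work is sketched below. This is not the PR's actual code: the function name `shuffle_choices` and the choice to decode the digest into one of the 24 permutations (instead of seeding `random.Random`) are assumptions made so that the result depends only on SHA-256 and integer arithmetic.

```python
import hashlib

LETTERS = "ABCD"


def shuffle_choices(
    question: str, correct: str, distractors: list[str]
) -> tuple[list[str], str]:
    """Return the four choices in a question-derived order plus the gold letter."""
    if len(distractors) != 3:
        raise ValueError("expected exactly three distractors")
    # Hash the question text; UTF-8 encoding keeps this locale-independent.
    digest = hashlib.sha256(question.encode("utf-8")).digest()
    # Map the digest onto one of the 4! = 24 possible permutations.
    index = int.from_bytes(digest, "big") % 24
    pool = [0, 1, 2, 3]  # position 0 stands for the correct answer
    order = []
    for radix in (4, 3, 2, 1):
        index, pick = divmod(index, radix)
        order.append(pool.pop(pick))
    choices = [correct, *distractors]
    shuffled = [choices[i] for i in order]
    # The gold letter is wherever position 0 ended up.
    gold_letter = LETTERS[order.index(0)]
    return shuffled, gold_letter


# Hypothetical inputs, used only to exercise the function.
shuffled, gold = shuffle_choices("Q?", "right", ["w1", "w2", "w3"])
```

Tracking positions instead of strings keeps the gold letter correct even if two choice texts happen to be identical.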
Per-row ``task`` is set to ``row["High-level domain"]`` so the
accuracy CSV breaks down by physics/chemistry/biology;
``metadata.subdomain`` and ``metadata.gold_letter`` are carried for
post-hoc analysis.
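To make the per-domain breakdown concrete, here is a small sketch of how a row could be tagged and later aggregated. The record shape, the helper names `build_record` and `accuracy_by_task`, and the ``Subdomain`` column name are assumptions for illustration; only ``row["High-level domain"]``, ``metadata.subdomain``, and ``metadata.gold_letter`` come from the commit message.

```python
from collections import defaultdict


def build_record(row: dict, gold_letter: str) -> dict:
    """Attach the task label and analysis metadata to one dataset row."""
    return {
        "task": row["High-level domain"],
        "metadata": {
            # "Subdomain" is an assumed column name for this sketch.
            "subdomain": row.get("Subdomain"),
            "gold_letter": gold_letter,
        },
    }


def accuracy_by_task(records: list[dict], is_correct: list[bool]) -> dict:
    """Group per-row correctness flags by task and return accuracy per task."""
    totals = defaultdict(lambda: [0, 0])  # task -> [hits, count]
    for record, ok in zip(records, is_correct):
        totals[record["task"]][0] += int(ok)
        totals[record["task"]][1] += 1
    return {task: hits / count for task, (hits, count) in totals.items()}


# Hypothetical rows, not real GPQA data.
rows = [
    {"High-level domain": "Physics", "Subdomain": "Quantum Mechanics"},
    {"High-level domain": "Physics", "Subdomain": "Optics"},
    {"High-level domain": "Biology", "Subdomain": "Genetics"},
]
records = [build_record(row, "A") for row in rows]
breakdown = accuracy_by_task(records, [True, False, True])
```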
Built on top of AIP-879 in the lighteval sub-stack
(875 → 876 → 879 → 880). There is no heavy optional dependency
(``datasets`` is a core dependency), so CI gets 100% line and branch
coverage out of the box.
Updates the stub registry: drop ``gpqa_diamond`` from
``test_accuracy_config.STUB_BENCHMARKS``, drop ``is_implemented:
false`` from the ``gpqa_diamond`` plugins.yaml entry, switch
``default_grader`` to ``lighteval_gpqa``, add the ``gpqa_diamond``
row to ``docs/accuracy/accuracy-benchmarking.md``, and move it from
"Still Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing
the Status Summary, Method Count Summary, and Suggested
Implementation Order accordingly).
Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
``ruff format`` collapses the now-single-element ``STUB_BENCHMARKS`` tuple in ``test_accuracy_config.py`` onto a single line. Pre-AIP-879 the tuple had two entries (``math_500`` + ``lcb_codegeneration``) which justified the multi-line layout; after 879 landed there's only ``lcb_codegeneration`` left and ruff's formatter would otherwise flag this on CI as a needed reformat.
Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
No actionable comments were generated in the recent review.
Run configuration: path .coderabbit.yaml, review profile CHILL, plan Pro. Files selected for processing: 6.
Changes: GPQADiamondBenchmark Implementation
Estimated code review effort: 3 (Moderate), ~25 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
gpqa_diamond benchmark in plugins.yaml (default_grader: multiple_choice, default_n_shots: 0); scaffold loader raises NotImplementedError until the full lighteval-backed implementation lands.
LightevalGPQAGrader (lighteval_gpqa, IndicesExtractionConfig(prefix_for_extraction="NativeLetters")) introduced in AIP-874.
Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py (gpqa_diamond task)
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests