Conversation

@rogthefrog (Contributor) commented Dec 10, 2025

This generates the list of annotations for a given job, for consumption by modelrunner.

I still need to double-check that the job ID is what we expect.

Questions:

  • is this the right place to pull the annotations?
  • is the source_id sufficient to reconcile these with their matching prompts or responses?
  • is there anything else that would be useful to output?

Related to https://github.com/mlcommons/sugar/pull/303

@github-actions bot commented Dec 10, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@superdosh (Contributor) left a comment


I remember we had a conversation about how these annotations would be filtered to only those specifically marked as "exportable" in some way. Is that still correct?

@wpietri wpietri self-requested a review December 10, 2025 17:09
@rogthefrog (Contributor, Author)

> I remember we had a conversation about how these annotations would be filtered to only those specifically marked as "exportable" in some way. Is that still correct?

That's right, good reminder. IIRC only the annotations from a run using the prompts in the demo set would be available?

@superdosh (Contributor)

> That's right, good reminder. IIRC only the annotations from a run using the prompts in the demo set would be available?

@wpietri had the idea that we would add a column to our prompt datasets that would indicate exportability, and then we'd just check against that?

@wpietri (Contributor) commented Dec 10, 2025

> @wpietri had the idea that we would add a column to our prompt datasets that would indicate exportability, and then we'd just check against that?

Yes, I think that's the best solution. I have little reason to think that we will have a demo prompt set for security or other future benchmarks, but we'll want to be able to offer annotations on a stable fraction of the non-official prompts. So I think annotation exportability should just be made explicit for every prompt and be part of the prompt's metadata.
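The exportability check discussed above could be sketched roughly as follows. This is a hypothetical illustration, not the actual modelbench schema: the `exportable` column, the `prompt_uid` field, and the annotation shape are all assumptions.

```python
import csv
import io

# Hypothetical prompt dataset with an "exportable" column, as suggested above.
# Column names are illustrative assumptions, not the real schema.
PROMPTS_CSV = """prompt_uid,text,exportable
p1,Example prompt one,true
p2,Example prompt two,false
p3,Example prompt three,true
"""


def exportable_prompt_uids(csv_text: str) -> set[str]:
    """Return the UIDs of prompts explicitly marked exportable.

    Anything missing the flag, or with any value other than "true",
    defaults to NOT exportable.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["prompt_uid"]
        for row in reader
        if row.get("exportable", "").strip().lower() == "true"
    }


# Filter a list of annotations against the exportable set.
annotations = [
    {"prompt_uid": "p1", "annotation": {"is_safe": True}},
    {"prompt_uid": "p2", "annotation": {"is_safe": False}},
]
allowed = exportable_prompt_uids(PROMPTS_CSV)
export = [a for a in annotations if a["prompt_uid"] in allowed]
```

Defaulting to "not exportable" when the flag is absent matches the conservative direction of the discussion: a prompt must opt in to have its annotations exported.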

@wpietri (Contributor) left a comment


This looks like a plausible start. I'm glad to see it wasn't too hard. My thoughts:

  • We need to be very sure we never give anybody annotations outside the set of prompts for which that is allowed. Doing so is an error on the order of a $50k cost. So modelrunner should never put anything in the file that isn't explicitly allowed.
  • I'm not sure we should be exposing our source_id fields generally, as they may contain information about vendors and such. Kurt would know.
  • We should include the prompt and response along with the annotation, so it's easy for them to understand what about their model they need to improve. Probably also the hazard as long as we're at it.
  • It looks like compile_annotations could go on the BenchmarkRun object nicely, so it becomes part of the API.
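Taken together, the bullets above could look something like this sketch: `compile_annotations` lives on a BenchmarkRun-like object, exports only explicitly allowed prompts, and includes the prompt, response, and hazard with each annotation. All class, field, and method names here are illustrative assumptions, not the real modelbench API.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedItem:
    # Illustrative fields, not the actual modelbench data model.
    prompt_uid: str
    prompt: str
    response: str
    hazard: str
    annotation: dict


@dataclass
class BenchmarkRun:
    items: list[AnnotatedItem] = field(default_factory=list)

    def compile_annotations(self, allowed_uids: set[str]) -> list[dict]:
        """Export annotations only for prompts on an explicit allow-list.

        Anything not explicitly allowed is dropped, so nothing leaks into
        the export file by default. Prompt, response, and hazard travel
        with each annotation so consumers can see context.
        """
        return [
            {
                "prompt_uid": item.prompt_uid,
                "prompt": item.prompt,
                "response": item.response,
                "hazard": item.hazard,
                "annotation": item.annotation,
            }
            for item in self.items
            if item.prompt_uid in allowed_uids
        ]


# Usage: only the allow-listed item appears in the export.
run = BenchmarkRun(items=[
    AnnotatedItem("p1", "prompt one", "response one", "hazard_a", {"is_safe": True}),
    AnnotatedItem("p2", "prompt two", "response two", "hazard_b", {"is_safe": False}),
])
exported = run.compile_annotations({"p1"})
```

The allow-list-only design makes the expensive failure mode (exporting an annotation that should not leave the system) impossible by construction rather than something a filter has to catch.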

@rogthefrog rogthefrog force-pushed the 226-return-annotations branch from 2be3f3a to 8cca1cd on December 11, 2025 00:50
@rogthefrog rogthefrog marked this pull request as ready for review December 11, 2025 00:51
@rogthefrog rogthefrog requested a review from a team as a code owner December 11, 2025 00:51
@wpietri (Contributor) left a comment


Looks good, and I like the choice of only doing demo for now. Should have some tests eventually, though, at least for the extractor.

…d an always-passing test caused by missing () in a function call
@rogthefrog (Contributor, Author)

> Looks good, and I like the choice of only doing demo for now. Should have some tests eventually, though, at least for the extractor.

@wpietri do you mind taking a look at the tests I added?

@rogthefrog rogthefrog merged commit 3e8782d into main Dec 12, 2025
2 checks passed
@rogthefrog rogthefrog deleted the 226-return-annotations branch December 12, 2025 22:21
@github-actions github-actions bot locked and limited conversation to collaborators Dec 12, 2025