Conversation
Looks like a good start. We need to update the environment to use the medarc_verifiers helpers for randomization. Do the question format and lack of a system prompt match the author's code? If there's no system prompt, we should add the boxed/XML system prompts (with thinking and non-thinking options) and parsers so it's easier to grab the correct model output.
Thanks for the feedback! Good flag on using the original MedHALT prompts and adding boxed/XML variants. I have a few clarifying questions about data filtering and scope; I'll DM you on Discord to avoid cluttering the PR, and I'll implement the changes once I understand your preferences. Working on the medarc_verifiers integration now!
Adds support for the MedHALT dataset for evaluating medical LLMs on multiple-choice questions.
Summary
New environment: medhalt
Two configs: reasoning_FCT and reasoning_nota
~18,866 examples per config
Supports answer shuffling
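Answer shuffling for a multiple-choice eval can be sketched as below: shuffle each example's options with a per-example seed and remap the gold index so scoring stays correct. This is a minimal illustration, not the medarc_verifiers helper the reviewer asked to use; the function name and signature are assumptions.

```python
import random

def shuffle_options(
    options: list[str], answer_idx: int, seed: int
) -> tuple[list[str], int]:
    """Deterministically shuffle MCQ options and return the new index
    of the correct answer. A fixed per-example seed keeps eval runs
    reproducible across shuffled presentations.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_idx)
```

Deterministic shuffling guards against position bias (e.g. models preferring option A) without making accuracy numbers change from run to run.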
Testing
Tested on qwen2.5:3b (1000 examples):
reasoning_FCT: 50.7% accuracy
reasoning_nota: 33.1% accuracy
See environments/medhalt/README.md for full documentation and usage.