Add MedHALT evaluation environment#65

Open
geetua wants to merge 2 commits into MedARC-AI:main from geetua:add-medhalt-environment

geetua commented Oct 30, 2025

Adds support for the MedHALT dataset for evaluating medical LLMs on multiple-choice questions.

Summary

New environment: medhalt
Two configs: reasoning_FCT and reasoning_nota
~18,866 examples per config
Supports answer shuffling

Testing
Tested on qwen2.5:3b (1000 examples):

reasoning_FCT: 50.7% accuracy
reasoning_nota: 33.1% accuracy

See environments/medhalt/README.md for full documentation and usage.

CLAassistant commented Oct 30, 2025

CLA assistant check
All committers have signed the CLA.

@warner-benjamin
Collaborator

Looks like a good start. We need to update the environment to use the medarc_verifiers helpers introduced in #63: randomize_multiple_choice for randomization, and multiple_choice_accuracy.
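
For illustration only, here is a minimal standalone sketch of the idea behind seeded answer shuffling and accuracy scoring. It does not use or reproduce the medarc_verifiers API: the names `shuffle_options` and `accuracy` and their signatures are hypothetical, and the real helpers from #63 may behave differently.

```python
import random


def shuffle_options(options, answer_idx, seed):
    """Shuffle answer options deterministically (hypothetical helper).

    Returns the shuffled options and the new index of the correct answer.
    """
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct answer moves to wherever its old index landed in `order`.
    return shuffled, order.index(answer_idx)


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

Using a per-example seed keeps the shuffle reproducible across runs, so a reported accuracy can be recomputed on the same option order.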

Do the question format and the lack of a system prompt match the author's code? If there's no system prompt, then we should add the boxed/XML system prompts, with thinking and non-thinking options and parsers, so it's easier to grab the correct model output.
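
As a rough sketch of what such parsers might do, the snippet below pulls the final answer out of a boxed or XML-style completion using only the standard library. The function names, the `<answer>` tag, and the `<think>` tag are assumptions made for illustration; they are not taken from medarc_verifiers or from the MedHALT authors' code.

```python
import re

# Assumed formats: \boxed{X} for boxed prompts, <answer>X</answer> for XML prompts.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def _last_match(pattern, completion):
    """Return the last match outside any <think> block, or None."""
    visible = THINK_RE.sub("", completion)  # ignore answers drafted while thinking
    matches = pattern.findall(visible)
    return matches[-1].strip() if matches else None


def extract_boxed_answer(completion):
    """Hypothetical parser for the boxed format."""
    return _last_match(BOXED_RE, completion)


def extract_xml_answer(completion):
    """Hypothetical parser for the XML format."""
    return _last_match(ANSWER_RE, completion)
```

Taking the last match after stripping the thinking block means a model that revises its answer mid-response is scored on its final choice.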

geetua commented Nov 1, 2025

Thanks for the feedback! Good flag on using the original MedHALT prompts and adding boxed/XML variants.

I have a few clarifying questions about data filtering and scope. I'll DM you on Discord to avoid cluttering the PR, and will implement once I understand your preferences.

Working on the medarc_verifiers integration now!
