Conversation
Looks like a good start. We need to update the environment to use the medarc_verifiers helpers for randomization. Do the question format and lack of a system prompt match the author's code? If there's no system prompt, we should add the boxed/XML system prompts (with thinking and non-thinking options) and parsers so it's easier to grab the correct model output.
Thanks for the feedback! Good flag on using the original MedHALT prompts and adding boxed/XML variants. I have a few clarifying questions about data filtering and scope; I'll DM you on Discord to avoid cluttering the PR, and I'll implement the changes once I understand your preferences. Working on the medarc_verifiers integration now!
Adds support for the MedHALT dataset for evaluating medical LLMs on multiple-choice questions.
Summary
New environment: medhalt
Two configs: reasoning_FCT and reasoning_nota
~18,866 examples per config
Supports answer shuffling
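Answer shuffling for a multiple-choice eval can be sketched as below: shuffle each example's options with a per-example seed and remap the gold index so scoring stays correct. This is a minimal illustration, not the medarc_verifiers helper the reviewer asked to use; the function name and signature are assumptions.

```python
import random

def shuffle_options(
    options: list[str], answer_idx: int, seed: int
) -> tuple[list[str], int]:
    """Deterministically shuffle MCQ options and return the new index
    of the correct answer. A fixed per-example seed keeps eval runs
    reproducible across shuffled presentations.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_idx)
```

Deterministic shuffling guards against position bias (e.g. models preferring option A) without making accuracy numbers change from run to run.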
Testing
Tested on qwen2.5:3b (1000 examples):
reasoning_FCT: 50.7% accuracy
reasoning_nota: 33.1% accuracy
See environments/medhalt/README.md for full documentation and usage.