## environments/careqa/README.md

```python
env_mcq = vf.load_environment("careqa", mode="en", shuffle_answers=True)

# Load open-ended environment
env_open = vf.load_environment("careqa", mode="open", judge_model="gpt-4o-mini")
```

### References

```bibtex
@inproceedings{arias-duart-etal-2025-automatic,
title = "Automatic Evaluation of Healthcare {LLM}s Beyond Question-Answering",
author = "Arias-Duart, Anna and
Martin-Torres, Pablo Agustin and
Hinjos, Daniel and
Bernabeu-Perez, Pablo and
Ganzabal, Lucia Urcelay and
Mallo, Marta Gonzalez and
Gururajan, Ashwin Kumar and
Lopez-Cuena, Enrique and
Alvarez-Napagao, Sergio and
Garcia-Gasulla, Dario",
editor = "Chiruzzo, Luis and
Ritter, Alan and
Wang, Lu",
booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.naacl-short.10/",
pages = "108--130",
ISBN = "979-8-89176-190-2",
}
```

## environments/healthbench/README.md

This allows you to see exactly which criteria the model passed or failed, along
- Arrays in `info` (`criteria`, `points_list`, `axes`, `consensus_criteria`) are aligned by index: the first element of each corresponds to the first rubric criterion
- Point values can be negative for undesirable behaviors (e.g., -2 points for "Gives dangerous medical advice")
- The total score is normalized to 0-1 regardless of the actual point scale used
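
The notes above can be sketched as follows. This is a minimal illustration of index-aligned criteria with negative points and 0-1 normalization; the function name, its arguments, and the clamping behavior are assumptions, not the environment's actual implementation:

```python
# Sketch: normalize rubric points to a 0-1 score.
# `points_list` is index-aligned with the criteria, as described above;
# `met` records which criteria the model satisfied.

def normalized_score(points_list: list[float], met: list[bool]) -> float:
    """Sum points for criteria the model met, then divide by the
    maximum achievable total (positive points only)."""
    earned = sum(p for p, m in zip(points_list, met) if m)
    max_points = sum(p for p in points_list if p > 0)
    if max_points == 0:
        return 0.0
    # Clamp so negative-point criteria cannot push the score below 0.
    return max(0.0, min(1.0, earned / max_points))

print(normalized_score([5, 3, -2], [True, False, True]))  # 0.375
```

Note how meeting a negative-point criterion ("Gives dangerous medical advice") lowers the score: 5 - 2 = 3 earned out of a maximum of 8.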


### References

```bibtex
@misc{arora2025healthbenchevaluatinglargelanguage,
title={HealthBench: Evaluating Large Language Models Towards Improved Human Health},
author={Rahul K. Arora and Jason Wei and Rebecca Soskin Hicks and Preston Bowman and Joaquin Quiñonero-Candela and Foivos Tsimpourlas and Michael Sharman and Meghan Shah and Andrea Vallone and Alex Beutel and Johannes Heidecke and Karan Singhal},
year={2025},
eprint={2505.08775},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.08775},
}
```
## environments/longhealth/README.md

```python
results = await env.evaluate(client, "gpt-4.1-mini", num_examples=10)
```
### Authors
This environment has been put together by:

Shamus Sim Zi Yang - ([@ss8319](https://github.com/ss8319))

### References

```bibtex
@article{adams2024longhealth,
title={LongHealth: A Question Answering Benchmark with Long Clinical Documents},
author={Adams, Lisa and Busch, Felix and Han, Tianyu and Excoffier, Jean-Baptiste and Ortala, Matthieu and L{\"o}ser, Alexander and Aerts, Hugo JWL and Kather, Jakob Nikolas and Truhn, Daniel and Bressem, Keno},
journal={arXiv preprint arXiv:2401.14490},
year={2024}
}
```
## environments/m_arc/README.md

Notes:

| Metric | Meaning |
| ------ | ------- |
| `correct_answer_reward_func` | (weight 1.0): 1.0 if parsed letter is correct, else 0.0 |

### References

```bibtex
@misc{kim2025limitationslargelanguagemodels,
title={Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning},
author={Jonathan Kim and Anna Podlasek and Kie Shidara and Feng Liu and Ahmed Alaa and Danilo Bernardo},
year={2025},
eprint={2502.04381},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.04381},
}
```
## environments/medagentbench/README.md

Notes:

### Note
This environment is adapted from the original Prime Intellect [MedAgentBench implementation](https://app.primeintellect.ai/dashboard/environments/primeintellect/med-agent-bench). It has been modified to report the query success rate and action success rate as unweighted rewards to match the paper.


### References

```bibtex
@article{jiang2025medagentbench,
title={MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents},
author={Jiang, Yixing and Black, Kameron C and Geng, Gloria and Park, Danny and Zou, James and Ng, Andrew Y and Chen, Jonathan H},
journal={NEJM AI},
pages={AIdbp2500144},
year={2025},
publisher={Massachusetts Medical Society}
}
```
## environments/medbullets/README.md

Summarize key metrics your rubric emits and how they’re interpreted.
| Metric | Meaning |
| ------ | ------- |
| `correct_answer_reward_func` | (weight 1.0): 1.0 if parsed letter is correct, else 0.0 |
| `parser.get_format_reward_func()` | (weight 0.0): optional format adherence (not counted) |
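
A letter-match reward like the one in the table above might be sketched as follows. The function name mirrors the table; the parsing details (regex over the raw completion) are illustrative assumptions, not the environment's actual parser:

```python
# Sketch of a binary letter-match reward: 1.0 if the parsed choice
# letter equals the target answer, else 0.0.
import re

def correct_answer_reward_func(completion: str, answer: str) -> float:
    """Find the first standalone choice letter A-E and compare it."""
    match = re.search(r"\b([A-E])\b", completion.strip())
    return 1.0 if match and match.group(1) == answer else 0.0

print(correct_answer_reward_func("The answer is C.", "C"))  # 1.0
```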

### References

```bibtex
@inproceedings{chen-etal-2025-benchmarking,
title = "Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions",
author = "Chen, Hanjie and
Fang, Zhouxiang and
Singla, Yash and
Dredze, Mark",
editor = "Chiruzzo, Luis and
Ritter, Alan and
Wang, Lu",
booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.naacl-long.182/",
doi = "10.18653/v1/2025.naacl-long.182",
pages = "3563--3599",
ISBN = "979-8-89176-189-6",
}
```
## environments/medcalc_bench/README.md

Notes:
### Adjustments

The prompt was adjusted to output step-by-step thinking and the final answer inside `<think>` and `<answer>` tags instead of responding with JSON.


### References

```bibtex
@misc{khandekar2024medcalcbench,
title={MedCalc-Bench: Evaluating Large Language Models for Medical Calculations},
author={Nikhil Khandekar and Qiao Jin and Guangzhi Xiong and Soren Dunn and Serina S Applebaum and Zain Anwar and Maame Sarfo-Gyamfi and Conrad W Safranek and Abid A Anwar and Andrew Zhang and Aidan Gilson and Maxwell B Singer and Amisha Dave and Andrew Taylor and Aidong Zhang and Qingyu Chen and Zhiyong Lu},
year={2024},
eprint={2406.12036},
archivePrefix={arXiv},
primaryClass={cs.CL},
}
```
## environments/medcasereasoning/README.md

Configure model and sampling:

```bash
uv run vf-eval medcasereasoning -m gpt-4.1-mini -n 20 -r 3 -t 1024 -T 0.7
```

### References

```bibtex
@inproceedings{wu2025medcase,
title = {MedCaseReasoning: Evaluating and Learning Diagnostic Reasoning from Clinical Case Reports},
author = {Wu, Kevin and Wu, Eric and Thapa, Rahul and others},
booktitle = {NeurIPS},
year = {2025},
url = {https://github.com/kevinwu23/MedCaseReasoning}
}
```
## environments/medconceptsqa/README.md

Summarize key metrics your rubric emits and how they’re interpreted.
### Authors
This environment has been put together by:

Anish Mahishi - ([@macandro96](https://github.com/macandro96))

### References

```bibtex
@misc{shoham2024medconceptsqaopensourcemedical,
title={MedConceptsQA: Open Source Medical Concepts QA Benchmark},
author={Ofir Ben Shoham and Nadav Rappoport},
year={2024},
eprint={2405.07348},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2405.07348},
}
```
## environments/medec/README.md

The **primary reward score** is the weighted sum of the first three metrics.
| Metric | Weight | Description |
| ------ | ------ | ----------- |
| `rouge_reward` | 0 | ROUGE-1 F1 score (for analysis only). |
| `bertscore_reward` | 0 | BERTScore F1 (for analysis only). |
| `bleurt_reward` | 0 | BLEURT score (for analysis only). |
| `reward` | N/A | Final weighted sum of non-zero weight metrics (0.0–1.0). |
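
The weighted-sum scheme above, where zero-weight metrics are logged for analysis but excluded from the final reward, can be sketched generically. The metric names and weights below are made up for illustration; the environment's real metric names appear in the table above:

```python
# Sketch: combine named metrics into a final reward, skipping any
# metric whose weight is zero (those are reported for analysis only).

def combine(metrics: dict[str, float], weights: dict[str, float]) -> float:
    return sum(metrics[name] * w for name, w in weights.items() if w != 0)

metrics = {"primary_acc": 1.0, "rouge_reward": 0.42, "bertscore_reward": 0.9}
weights = {"primary_acc": 1.0, "rouge_reward": 0.0, "bertscore_reward": 0.0}
print(combine(metrics, weights))  # 1.0
```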


### References

```bibtex
@article{medec,
author = {Asma {Ben Abacha} and Wen-wai Yim and Yujuan Fu and Zhaoyi Sun and Meliha Yetisgen and Fei Xia and Thomas Lin},
title = {MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes},
journal = {CoRR},
eprinttype = {arXiv},
url = {https://arxiv.org/pdf/2412.19260},
year = {2024}
}
```
## environments/medredqa/README.md

```bash
uv run vf-eval medredqa \
```
### Authors
This environment has been put together by:

Kunal Bagga - ([@bagga005](https://github.com/bagga005))

### References

```bibtex
@inproceedings{nguyen-etal-2023-medredqa,
title = "{M}ed{R}ed{QA} for Medical Consumer Question Answering: Dataset, Tasks, and Neural Baselines",
author = "Nguyen, Vincent and
Karimi, Sarvnaz and
Rybinski, Maciej and
Xing, Zhenchang",
editor = "Park, Jong C. and
Arase, Yuki and
Hu, Baotian and
Lu, Wei and
Wijaya, Derry and
Purwarianti, Ayu and
Krisnadhi, Adila Alfa",
booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = nov,
year = "2023",
address = "Nusa Dua, Bali",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.ijcnlp-main.42/",
doi = "10.18653/v1/2023.ijcnlp-main.42",
pages = "629--648"
}
```
## environments/medxpertqa/README.md

Summarize key metrics your rubric emits and how they’re interpreted.
| Metric | Meaning |
| ------ | ------- |
| `reward` | Main scalar reward (weighted sum of criteria) |
| `accuracy` | Exact match on target answer |


### References

```bibtex
@article{zuo2025medxpertqa,
title={MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding},
author={Zuo, Yuxin and Qu, Shang and Li, Yifei and Chen, Zhangren and Zhu, Xuekai and Hua, Ermo and Zhang, Kaiyan and Ding, Ning and Zhou, Bowen},
journal={arXiv preprint arXiv:2501.18362},
year={2025}
}
```
## environments/metamedqa/README.md

The evaluation environment is defined in `metamedqa.py` and uses the HuggingFace
This environment has been put together by:

Aymane Ouraq - ([@aymaneo](https://github.com/aymaneo))


### References

```bibtex
@article{griot_large_2025,
title = {Large {Language} {Models} lack essential metacognition for reliable medical reasoning},
volume = {16},
issn = {2041-1723},
url = {https://doi.org/10.1038/s41467-024-55628-6},
doi = {10.1038/s41467-024-55628-6},
number = {1},
journal = {Nature Communications},
author = {Griot, Maxime and Hemptinne, Coralie and Vanderdonckt, Jean and Yuksel, Demet},
month = jan,
year = {2025},
pages = {642},
}
```
## environments/mmlu_pro_health/README.md

Notes:

| Metric | Meaning |
| ------ | ------- |
| `accuracy` | (weight 1.0): 1.0 if parsed letter is correct, else 0.0 |

### References

```bibtex
@article{wang2024mmlu,
title={MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark},
author={Wang, Yubo and Ma, Xueguang and Zhang, Ge and Ni, Yuansheng and Chandra, Abhranil and Guo, Shiguang and Ren, Weiming and Arulraj, Aaran and He, Xuan and Jiang, Ziyan and others},
journal={arXiv preprint arXiv:2406.01574},
year={2024}
}
```