21 commits:

- `d2146f9` Add BioHopR env (marii-moe, Oct 25, 2025)
- `56da494` Removed nonmarkdown for BioHopR Readme (marii-moe, Oct 25, 2025)
- `cbb3a00` Fix BioHopR metric (marii-moe, Oct 25, 2025)
- `df4e049` Added ability to select task for BioHopR (marii-moe, Oct 26, 2025)
- `83f0fcd` Exactly matched question prompt for BioHopR. Added Docs. (marii-moe, Oct 26, 2025)
- `b49e8c8` Updated biohopr readme to have description (marii-moe, Oct 28, 2025)
- `39c1fbe` added llm-as-a-judge first pass (marii-moe, Nov 5, 2025)
- `c8f4239` Fixed bug where judge was not getting the completion for evaluation (marii-moe, Nov 10, 2025)
- `56ed6d5` Small update to prompt (marii-moe, Nov 16, 2025)
- `d210caf` Added filtering for judge. Added modernbert for filtering. (marii-moe, Dec 23, 2025)
- `c916ec5` Added tests and improved performance (marii-moe, Jan 20, 2026)
- `b963d96` added remote embedding for clinical modern bert, not working correctly (marii-moe, Jan 21, 2026)
- `b303b21` Added ability for embedding model to use cuda. Added embedding batch … (marii-moe, Jan 21, 2026)
- `969c631` Merge branch 'BioHopR' into biohopr_remote_embedding (marii-moe, Jan 23, 2026)
- `aca643c` Added ability to use infinity server for embeddings (marii-moe, Jan 23, 2026)
- `b082903` Fix typing (marii-moe, Jan 23, 2026)
- `6626906` added biohopr citation (marii-moe, Jan 23, 2026)
- `8a9dd06` Biohopr fix readme title (marii-moe, Jan 23, 2026)
- `7b8b736` Added more realistic quick start (marii-moe, Jan 23, 2026)
- `2269d76` Biohopr Fixed pip wanting to change huggingfacehub version. (marii-moe, Jan 24, 2026)
- `b39c743` Biohopr no longer have issue with installing wrong version of hugging… (marii-moe, Jan 28, 2026)
`environments/biohopr/README.md` (85 additions, 0 deletions)
# BioHopR

### Overview
- **Environment ID**: `biohopr`
- **Short description**: BioHopR multiple-hop biomedical question answering evaluation
- **Tags**: biohopr, multi-hop, biomedical, question-answering, retrieval-augmented

### Datasets
- **Primary dataset(s)**: All datasets are derived from the [PrimeKG](https://zitniklab.hms.harvard.edu/projects/PrimeKG/) knowledge graph. The BioHopR paper introduces four evaluation sets based on PrimeKG. Example prompts (quoted verbatim from the benchmark):
  - **1-hop**: Name a disease that is related to effect/phenotype Pain.
  - **2-hop** (default eval): Name a disease that is related to a effect/phenotype that is associated with drug Benzyl benzoate.
  - **1-hop multi**: Name all diseases that are related to effect/phenotype Pain.
  - **2-hop multi**: Name all diseases that are related to a effect/phenotype that is associated with drug Benzyl benzoate.

- **Source links**: [Paper](https://arxiv.org/abs/2505.22240) |
[Dataset](https://huggingface.co/datasets/knowlab-research/BioHopR)
- **Split sizes**: eval: 7,633 examples (2-hop)

### Task
- **Type**: single-turn
- **Parser**: `XMLParser` (default) or `ThinkParser`
- **Rubric overview**: LLM-as-a-judge, or precision based on cosine similarity between the embedded completion and the embeddings of the possible answers

### Quickstart
Run an evaluation with default settings:

```bash
uv run vf-eval biohopr
```

Configure model and sampling:

```bash
uv run vf-eval biohopr -m 'openai/gpt-oss-20b' -n 32 \
  --api-base-url http://localhost:5000/v1/ \
  -a '{"eval_method":"metrics","judge_base_url":"http://localhost:5000/v1/","judge_model":"openai/gpt-oss-20b","embedding_model_url":"http://localhost:8001/"}'
```

Notes:
- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.

### Environment Arguments
The following environment arguments are supported:

| Arg | Type | Default | Description |
| --- | ---- | ------- | ----------- |
| `use_think` | bool | `False` | Use reasoning when set to true. |
| `system_prompt` | str | `None` | Optional custom system prompt. |
| `answer_format` | str | `AnswerFormat.XML` | Determines how the completion is parsed for an answer. Also sets the system prompt if `system_prompt` is not set. |
| `task` | str | `biohopr_hop2` | The task to evaluate. Determines which prompts from the BioHopR paper are used. Valid options: `['biohopr_hop1', 'biohopr_hop2', 'biohopr_hop1_multi', 'biohopr_hop2_multi', 'all']`. `task` is also set on the verifiers dataset for use with `EnvGroup` and `RubricGroup`. |
| `eval_method` | str | `metrics` | Whether to use the metrics from the paper or LLM-as-a-judge. Values: `"judge"`, `"metrics"`, or `"judge-only"`. |
| `judge_model` | str | `gpt-4o-mini` | Model name to use for judging. |
| `judge_base_url` | str | `None` | Optional base URL for a custom OpenAI-compatible API endpoint. |
| `judge_api_key` | str | `None` | Optional API key for the OpenAI-compatible API endpoint. |
| `embeddings_model` | str | `FremyCompany/BioLORD-2023` | SentenceTransformer model name for computing answer-completion similarity. Valid options: `['FremyCompany/BioLORD-2023', 'Simonlee711/Clinical_ModernBERT']`. |
| `tau` | float | `0.9` | Cosine similarity threshold for embedded precision. |
| `use_cuda` | bool | if available | Whether to use CUDA for the embedding model. Defaults to using CUDA when available; set to `false` to disable. |
| `embedding_batch_size` | int | `32` | Maximum batch size for embedding model encoding; lower it in case of out-of-memory errors. |
| `embedding_model_url` | str | `None` | Optional URL for an OpenAI-compatible embedding model API. |
| `embedding_api_key` | str | `None` | Optional API key for the OpenAI-compatible embedding model API. |
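
For instance, a judge-based run might look like the following sketch (the endpoint URL, model names, and sample count are placeholders for your own deployment, not recommended values):

```bash
# Sketch: evaluate the 1-hop task with LLM-as-a-judge instead of the paper metrics.
uv run vf-eval biohopr -m 'openai/gpt-oss-20b' -n 16 \
  --api-base-url http://localhost:5000/v1/ \
  -a '{"task": "biohopr_hop1", "eval_method": "judge", "judge_model": "gpt-4o-mini"}'
```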

### Metrics
Key metrics emitted by the rubric:

| Metric | Meaning |
| ------ | ------- |
| `reward` | Main scalar reward (weighted sum of criteria); identical to `embedded_precision`. |
| `llm_as_a_judge` | Reward function that uses an LLM judge to evaluate medical diagnosis equivalence. |
| `embedded_precision` | Precision based on embedded cosine similarity. Each predicted answer in the completion is embedded and compared against the embeddings of all possible answers; a cosine similarity above `tau` (default 0.9) counts as a true positive. The final score is true_positives / predicted_responses. |
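
For intuition, the `embedded_precision` computation can be sketched as below. This is a minimal re-implementation, not the environment's actual code; it assumes the embedding vectors have already been computed by the configured embedding model, and the function name mirrors the metric name for clarity.

```python
import numpy as np

def embedded_precision(pred_embs, answer_embs, tau=0.9):
    """Illustrative sketch: a predicted answer counts as a true positive
    if its cosine similarity to any gold-answer embedding exceeds tau;
    the score is true_positives / number_of_predictions."""
    def normalize(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    preds = normalize(pred_embs)           # (n_preds, dim)
    golds = normalize(answer_embs)         # (n_answers, dim)
    sims = preds @ golds.T                 # pairwise cosine similarities
    true_positives = (sims.max(axis=1) > tau).sum()
    return true_positives / len(preds)

# Toy usage: one of two predictions matches a gold answer direction.
score = embedded_precision([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```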

### Embedding Server Note
We recommend serving embeddings with the [Infinity](https://github.com/michaelfeil/infinity) embedding server; it was the only server we found compatible with BioLORD embeddings. At the time of writing, the server errors during setup; a workaround is documented in [infinity#649](https://github.com/michaelfeil/infinity/issues/649).
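
A launch command might look like the following sketch (the port matches the `embedding_model_url` example above; exact CLI flags may differ by Infinity version, so check their docs):

```bash
# Sketch: serve the BioLORD embedding model locally with Infinity.
pip install "infinity-emb[all]"
infinity_emb v2 --model-id FremyCompany/BioLORD-2023 --port 8001
```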

### Citation
Dataset:
```bibtex
@misc{kim2025biohoprbenchmarkmultihopmultianswer,
  title={BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain},
  author={Yunsoo Kim and Yusuf Abdulle and Honghan Wu},
  year={2025},
  eprint={2505.22240},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.22240},
}
```