Commit 70f7acd

CareQA env #33 (#48)
* Add careqa mcq eval environment
* add careqa open-ended env
* add careqa open-ended env
* resolving issues
* removing redundant imports
* resolving comments
* resolving commits
* resolving comments
* Update careqa_openended.py
* Update careqa_openended.py
* Update careqa_openended.py
* update careqa to use No Free Labels style prompt

---------

Co-authored-by: Benjamin Warner <me@benjaminwarner.dev>
1 parent ac53fe5 commit 70f7acd

3 files changed: 366 additions & 0 deletions

environments/careqa/README.md

Lines changed: 89 additions & 0 deletions
# careqa

Evaluation environment for the [HPAI-BSC/CareQA](https://huggingface.co/datasets/HPAI-BSC/CareQA) dataset.

### Overview
- **Environment ID**: `careqa`
- **Short description**: CareQA is a healthcare QA dataset with **multiple-choice** and **open-ended clinical reasoning questions**. This environment supports both modes through the `mode` parameter.
- **Tags**: healthcare, medical QA, clinical reasoning, MCQ, single-turn

### Datasets
- **Primary dataset(s)**:
  - `CareQA_en` – multiple-choice clinical questions with 4 options and correct answer labels
  - `CareQA_en_open` – open-ended clinical questions with reference answers
- **Source links**:
  - [Hugging Face CareQA dataset](https://huggingface.co/datasets/HPAI-BSC/CareQA)
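
The snippet below is a quick way to peek at the two configs with `datasets` directly; it mirrors how `careqa.py` loads them, though the exact column names should be confirmed against the dataset card.

```python
from datasets import load_dataset

# The two configs this environment wraps, loaded the same way careqa.py does.
mcq = load_dataset("HPAI-BSC/CareQA", "CareQA_en", split="test")
open_ended = load_dataset("HPAI-BSC/CareQA", "CareQA_en_open", split="test")

# MCQ rows carry the question, four options (op1..op4), and the 1-indexed correct option (cop);
# open-ended rows carry a free-text question with a reference answer.
print(mcq[0]["question"], mcq[0]["cop"])
print(open_ended[0]["question"])
```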

### Task
- **Type**: single-turn
- **Parser**:
  - MCQ mode: `vf.Parser()` or `vf.ThinkParser()` for extracting boxed answers
  - Open-ended mode: `XMLParser()` for judge responses
- **Rubric overview**:
  - **MCQ mode (`en`)**: `vf.Rubric()` measuring **accuracy** (letter match A–D)
  - **Open-ended mode (`open`)**: `vf.JudgeRubric()` using an LLM-as-judge to score free-text answers for correctness and clinical reasoning
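
As a rough sketch of the MCQ scoring path (not the full reward, which also accepts answer-text matches via `multiple_choice_accuracy`), the boxed letter is extracted and compared against the gold letter; the completion string here is invented.

```python
import verifiers as vf
from verifiers.utils.data_utils import extract_boxed_answer

parser = vf.Parser(extract_boxed_answer)

# Invented example completion; the boxed system prompt asks models to end with \boxed{<letter>}.
completion = r"Weight loss with tachycardia points to hyperthyroidism, so the answer is \boxed{C}"
predicted = parser.parse_answer(completion)  # -> "C"
reward = 1.0 if predicted == "C" else 0.0    # simple letter match against the gold answer
```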

### Quickstart

**Multiple-choice evaluation:**
```bash
medarc-eval careqa --mode en --model gpt-4.1-mini --num-examples 10 -s
```

**Open-ended evaluation:**
```bash
medarc-eval careqa --mode open --model gpt-4.1-mini --num-examples 10 -s
```

**With think-mode prompting (MCQ only):**
```bash
medarc-eval careqa --mode en --use-think --model gpt-4.1-mini --num-examples 10 -s
```

**With shuffled answer options (MCQ only):**
```bash
medarc-eval careqa --mode en --shuffle-answers --shuffle-seed 42 --model gpt-4.1-mini -n 10 -s
```

### Configuration Options

#### Common Parameters
- `--mode`: Evaluation mode, `en` (multiple-choice) or `open` (open-ended). Default: `open`
- `--split`: Dataset split to use. Default: `test`
- `--system-prompt`: Custom system prompt (uses a mode-appropriate default if not specified)

#### MCQ-Specific Parameters
- `--use-think`: Enable think-style prompting with boxed answers
- `--shuffle-answers`: Randomly shuffle answer options
- `--shuffle-seed`: Seed for answer shuffling (default: 1618)

#### Open-Ended-Specific Parameters
- `--judge-model`: Model for LLM-as-judge evaluation (default: `gpt-4o-mini`)
- `--judge-base-url`: Base URL for judge API
- `--judge-api-key`: API key for judge (falls back to `OPENAI_API_KEY` env var)
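
For programmatic use, the judge can also be pointed at any OpenAI-compatible endpoint. The sketch below reuses the parameters listed above; the base URL and key are placeholders, not defaults.

```python
import verifiers as vf

# Placeholder endpoint and key; omit judge_api_key to fall back to OPENAI_API_KEY.
env_open = vf.load_environment(
    "careqa",
    mode="open",
    judge_model="gpt-4o-mini",
    judge_base_url="http://localhost:8000/v1",
    judge_api_key="EMPTY",
)
```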

### Metrics

#### MCQ Mode
| Metric | Meaning |
|---------------|---------|
| `reward` | Main scalar reward (weighted sum of rubric criteria) |
| `accuracy` | Exact match on target MCQ answer (letter A–D) |

#### Open-Ended Mode
| Metric | Meaning |
|---------------|---------|
| `reward` | Main scalar reward (weighted sum of rubric criteria) |
| `judge_score` | LLM-assigned score evaluating answer quality, correctness, and clinical reasoning |

### Example Usage

```python
import verifiers as vf

# Load MCQ environment
env_mcq = vf.load_environment("careqa", mode="en", shuffle_answers=True)

# Load open-ended environment
env_open = vf.load_environment("careqa", mode="open", judge_model="gpt-4o-mini")
```

environments/careqa/careqa.py

Lines changed: 251 additions & 0 deletions
import re
from enum import Enum
from typing import Optional

from datasets import load_dataset
from openai import AsyncOpenAI
import verifiers as vf
from medarc_verifiers.rewards.multiple_choice_accuracy import multiple_choice_accuracy
from medarc_verifiers.utils.randomize_multiple_choice import randomize_multiple_choice
from medarc_verifiers.parsers.xml_parser import XMLParser
from verifiers.types import Info, State
from verifiers.utils.data_utils import extract_boxed_answer, BOXED_SYSTEM_PROMPT


class CareQASplit(Enum):
    """Mode selector for CareQA environment."""

    EN = "en"
    OPEN = "open"


# --- MCQ Helpers ---


def _build_mcq_prompt(question: str, options: dict[str, str]) -> str:
    """Create an MCQ prompt."""
    formatted_opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    return f"Question: {question}\nChoices:\n{formatted_opts}\nAnswer:"


def accuracy(completion, answer: str, parser: vf.Parser, info: dict | None = None, **kwargs) -> float:
    """Reward based on shared multiple-choice accuracy grading."""
    parsed = parser.parse_answer(completion) or ""
    answer_text = info.get("answer_text", None) if info else None
    is_correct = multiple_choice_accuracy(llm_answer=parsed, answer_letter=answer, answer_text=answer_text)
    return 1.0 if is_correct else 0.0


# --- Open-Ended Helpers ---


JUDGE_TEMPLATE = """You are grading an AI assistant's answer to a medical/science exam question.

Input:
- <question>: The exam question.
- <reference_answer>: The correct answer.
- <assistant_answer>: The AI's response to grade.

Task: Determine if the assistant's answer is correct or incorrect by comparing it to the reference answer and output your grade in <grade>...</grade> tags.

Grading Rules:
- Assume the reference answer is correct and reflects the expected exam solution.
- Focus on factual content and meaning, not style, length, or confidence.

Correct if the assistant's answer conveys the same essential fact(s) as the reference, including:
- Synonyms, acronyms (expanded or abbreviated), or rephrasing with equivalent meaning
- Slightly more general/specific phrasing that captures the key concept
- Shorter or longer answers that express the tested fact without contradictions
- Additional supporting details that don't contradict the reference

Incorrect if any of these apply:
- Different main concept, mechanism, structure, or relationship
- Contradicts the reference on key points (wrong organ, drug, effect, process, etc.)
- Contains clearly incorrect information
- Too vague/incomplete to match the reference
- Merely repeats question words without the core information from the reference

Be strict: clear mismatches on main concepts or incorrect claims = Incorrect.

<question>{question}</question>
<reference_answer>{answer}</reference_answer>
<assistant_answer>{response}</assistant_answer>

Briefly explain whether the assistant's answer matches or conflicts with the reference. Then output your grade as:

<grade>[Correct or Incorrect]</grade>
""".strip()


def extract_answer_section(completion_text: str) -> str:
    """Extract final answer after think tags."""
    if not completion_text:
        return ""
    if "<think>" in completion_text and "</think>" in completion_text:
        return re.sub(r".*?</think>", "", completion_text, flags=re.DOTALL).strip()
    return completion_text.strip()


def load_environment(
    split: str | CareQASplit,
    system_prompt: Optional[str] = None,
    # MCQ-specific options
    shuffle_answers: bool = False,
    shuffle_seed: int | None = 1618,
    # Open-ended specific options
    judge_model: str = "gpt-4o-mini",
    judge_base_url: str | None = None,
    judge_api_key: str | None = None,
    **kwargs,
) -> vf.Environment:
    """
    CareQA evaluation environment supporting both MCQ and Open-Ended modes.

    Args:
        split: CareQASplit.EN for multiple-choice or CareQASplit.OPEN for open-ended QA.
        system_prompt: Custom system prompt (uses mode-appropriate default if None).
        shuffle_answers: Shuffle MCQ answer options (MCQ mode only).
        shuffle_seed: Seed for answer shuffling (MCQ mode only).
        judge_model: Model to use for LLM-as-judge evaluation (Open-ended mode only).
        judge_base_url: Base URL for judge API (Open-ended mode only).
        judge_api_key: API key for judge (Open-ended mode only).

    Returns:
        A vf.Environment configured for the selected mode.
    """
    split = CareQASplit(split) if isinstance(split, str) else split
    if split == CareQASplit.EN:
        return _load_mcq_environment(
            system_prompt=system_prompt,
            shuffle_answers=shuffle_answers,
            shuffle_seed=shuffle_seed,
        )
    elif split == CareQASplit.OPEN:
        return _load_open_ended_environment(
            system_prompt=system_prompt,
            judge_model=judge_model,
            judge_base_url=judge_base_url,
            judge_api_key=judge_api_key,
        )
    else:
        raise ValueError(f"Invalid mode: {split}")


def _load_mcq_environment(
    system_prompt: Optional[str],
    shuffle_answers: bool,
    shuffle_seed: int | None,
) -> vf.Environment:
    """Load CareQA multiple-choice environment."""
    eval_dataset = load_dataset("HPAI-BSC/CareQA", "CareQA_en", split="test")

    def _map(ex, idx=None):
        options = {"A": ex["op1"], "B": ex["op2"], "C": ex["op3"], "D": ex["op4"]}
        gold_letter = ["A", "B", "C", "D"][ex["cop"] - 1]

        if shuffle_answers and gold_letter in options:
            options, gold_letter, _ = randomize_multiple_choice(
                options=options,
                answer_choice=gold_letter,
                seed=shuffle_seed,
                row_id=ex.get("id", idx),
            )

        return {
            "question": _build_mcq_prompt(ex["question"], options),
            "answer": gold_letter,
            "info": {
                "answer_text": options.get(gold_letter, None),
                **({"options": options} if shuffle_answers else {}),
            },
        }

    load_from_cache_file = not shuffle_answers
    eval_dataset = eval_dataset.map(
        _map,
        with_indices=True,
        remove_columns=eval_dataset.column_names,
        load_from_cache_file=load_from_cache_file,
    )

    parser = vf.Parser(extract_boxed_answer)
    # Prefer a caller-supplied system prompt; otherwise fall back to the boxed-answer default.
    final_system_prompt = system_prompt or BOXED_SYSTEM_PROMPT

    rubric = vf.Rubric(funcs=[accuracy], weights=[1.0], parser=parser)

    return vf.SingleTurnEnv(
        eval_dataset=eval_dataset,
        rubric=rubric,
        parser=parser,
        system_prompt=final_system_prompt,
    )


def _load_open_ended_environment(
    system_prompt: Optional[str],
    judge_model: str,
    judge_base_url: str | None,
    judge_api_key: str | None,
) -> vf.Environment:
    """Load CareQA open-ended environment with LLM-as-judge evaluation."""
    eval_dataset = load_dataset("HPAI-BSC/CareQA", "CareQA_en_open", split="test")

    def _map(ex):
        info = {"question": ex["question"].strip()}
        return {
            "question": ex["question"].strip(),
            "answer": ex.get("answer_explanation", ex.get("answer", "")),
            "task": "careqa_open",
            "info": info,
        }

    eval_dataset = eval_dataset.map(_map, remove_columns=eval_dataset.column_names)

    final_system_prompt = system_prompt or (
        "Instructions: The following text is a medical question. Answer it in the most factual, concise, and informative way possible."
    )

    # Judge client setup
    judge_client = AsyncOpenAI(base_url=judge_base_url, api_key=judge_api_key)
    judge_parser = XMLParser(fields=["grade"], answer_field="grade")

    judge_rubric = vf.JudgeRubric(
        parser=judge_parser,
        judge_client=judge_client,
        judge_model=judge_model,
        judge_prompt="{question}",
    )

    async def accuracy(judge, prompt, completion, answer, state: State, info: Info) -> float:
        """Evaluate medical equivalence using LLM-as-judge."""
        completion_text = completion if isinstance(completion, str) else str(completion)
        response = extract_answer_section(completion_text)

        judge_prompt = JUDGE_TEMPLATE.format(question=info.get("question", ""), answer=answer, response=response)
        judge_response = await judge_rubric.judge(judge_prompt, "", "", state)
        # parse_answer may return None if no <grade> tag is present in the judge response.
        grade = (judge_parser.parse_answer(judge_response) or "").strip().lower()

        info.setdefault("judge_feedback", []).append(
            {
                "grade": grade,
                "raw_judge": str(judge_response),
            }
        )

        if "correct" in grade and "incorrect" not in grade:
            return 1.0
        else:
            return 0.0

    judge_rubric.add_reward_func(accuracy, weight=1.0)

    return vf.SingleTurnEnv(
        eval_dataset=eval_dataset,
        system_prompt=final_system_prompt,
        rubric=judge_rubric,
    )

environments/careqa/pyproject.toml

Lines changed: 26 additions & 0 deletions
[project]
name = "careqa"
description = "Evaluation environment for the HPAI-BSC/CareQA MCQ and open-ended QA datasets"
tags = ["healthcare", "medical-qa", "mcq", "clinical", "single-turn", "open-ended"]
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "verifiers>=0.1.4",
    "datasets>=2.13.0",
    "medarc_verifiers>=0.1.0",
]

[tool.prime.environment]
loader = "careqa:load_environment"
display_name = "CareQA"
visibility = "PUBLIC"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build]
include = ["careqa.py"]

[tool.uv.sources]
medarc_verifiers = { git = "https://github.com/MedARC-AI/med-lm-envs" }
