diff --git a/configs/envs/open_i_summarization.yaml b/configs/envs/open_i_summarization.yaml new file mode 100644 index 00000000..a88700aa --- /dev/null +++ b/configs/envs/open_i_summarization.yaml @@ -0,0 +1,28 @@ +# Open-I Summarization environment configuration +# Radiology findings to impression summarization benchmark + +- id: open_i_summarization + module: open_i_summarization + num_examples: -1 + verbose: false + env_args: + split: test + compute_auto_metrics: true + +# Validation split variant +- id: open_i_summarization_val + module: open_i_summarization + num_examples: -1 + verbose: false + env_args: + split: validation + compute_auto_metrics: true + +# Fast evaluation (no automatic metrics) +- id: open_i_summarization_fast + module: open_i_summarization + num_examples: -1 + verbose: false + env_args: + split: test + compute_auto_metrics: false diff --git a/environments/open_i_summarization/README.md b/environments/open_i_summarization/README.md new file mode 100644 index 00000000..7a5c9b21 --- /dev/null +++ b/environments/open_i_summarization/README.md @@ -0,0 +1,213 @@ +# Open-I Summarization + +Evaluation environment for radiology report summarization: generating impressions from findings. + +### Overview +- **Environment ID**: `open_i_summarization` +- **Short description**: Radiology findings-to-impression summarization benchmark using the Open-I chest X-ray dataset. This environment evaluates how well models can distill radiology findings into concise, clinically accurate impressions. +- **Tags**: medical, radiology, summarization, single-turn, llm-judge, nlg-metrics +- **System Prompt**: "Summarize the radiology report findings into an impression with minimal text." 
+ +--- + +### Dataset +- **Source**: [medarc/open-i-summarization](https://huggingface.co/datasets/medarc/open-i-summarization) +- **Based on**: Indiana University Chest X-ray Collection (Open-I), as used in [Van Veen et al., Nature Medicine 2024](https://www.nature.com/articles/s41591-024-02855-5) - "Adapted large language models can outperform medical experts in clinical text summarization" +- **Split sizes**: + - **Train:** 2,735 examples + - **Validation:** 341 examples + - **Test:** 343 examples +- **Task**: Given radiology findings, generate a concise clinical impression. + +--- + +### Task +- **Type:** Single-Turn Summarization +- **Input:** Radiology findings (chest X-ray report text) +- **Output:** Clinical impression (summary) +- **Evaluation:** Dual evaluation approach following the Nature Medicine paper: + 1. **LLM-as-Judge**: Evaluates correctness, completeness, and conciseness (1-5 scale each) + 2. **Automatic Metrics**: BLEU, ROUGE (1/2/L/Lsum), BERTScore (precision/recall/F1) + +The LLM-as-judge criteria are adapted from the **Nature Medicine reader study** (Methods section): +- **Correctness**: "Which summary includes less false information?" — evaluates precision (penalizes fabricated information) +- **Completeness**: "Which summary more completely captures important information?" — evaluates recall (clinically important detail retained) +- **Conciseness**: "Which summary contains less non-important information?" — evaluates brevity (penalizes superfluous information) + +The implementation follows the pattern established in `medicationqa`, using multi-dimensional scoring with a JSONParser. 
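The normalization behind the single reward value can be sketched in a few lines of plain Python: each 1-5 dimension score is clamped, mapped to [0, 1], and the three dimensions are averaged. This is a minimal illustration only; `normalize_judge_scores` is a hypothetical helper name, not the environment's actual API.

```python
# Minimal sketch of the reward normalization: three 1-5 judge scores are
# clamped, mapped to [0, 1], and averaged. Illustrative only; the real
# implementation lives in open_i_summarization.py.

def normalize_judge_scores(scores: dict[str, float]) -> float:
    """Average per-dimension 1-5 scores onto a single 0-1 reward."""
    dims = ("correctness", "completeness", "conciseness")
    total = 0.0
    for dim in dims:
        clamped = max(0.0, min(5.0, scores.get(dim, 0.0)))  # clamp to [0, 5]
        total += clamped / 5.0
    return total / len(dims)

print(normalize_judge_scores({"correctness": 5, "completeness": 3, "conciseness": 4}))  # ~0.8
```

For example, a judge response of 5/3/4 normalizes to (1.0 + 0.6 + 0.8) / 3 = 0.8, which is the `reward` value reported for that rollout.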
+ +--- + +### Quickstart + +**Basic evaluation with default settings:** +```bash +python -m medarc_verifiers.cli.main open_i_summarization -m gpt-4.1-mini -n 5 -r 1 --judge-model gpt-4.1-mini -s +``` + +**Example output:** +``` +--- Evaluation --- +Environment: open_i_summarization +Model: gpt-4.1-mini +Provider: https://api.openai.com/v1/ +Examples: 5 +Rollouts per example: 1 +--- Example --- +╭──────────────────────────────────── Step 0 ────────────────────────────────────╮ +│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ │ +│ ┃ Prompt ┃ Completion ┃ Reward ┃ │ +│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ │ +│ │ system: Summarize the │ assistant: Normal heart size; │ 1.00 │ │ +│ │ radiology report findings into │ no acute pulmonary findings; │ │ │ +│ │ an impression with minimal │ mediastinal calcification and │ │ │ +│ │ text. │ dense right upper lung nodule │ │ │ +│ │ │ consistent with prior │ │ │ +│ │ user: Heart size within normal │ granulomatous disease. │ │ │ +│ │ limits. No focal alveolar │ │ │ │ +│ │ consolidation, no definite │ │ │ │ +│ │ pleural effusion seen... 
│ │ │ │ +│ └────────────────────────────────┴─────────────────────────────────┴────────┘ │ +╰────────────────────────────────────────────────────────────────────────────────╯ +--- All --- +Rewards: +reward: avg - 1.000, std - 0.000 +r1: [1.0, 1.0, 1.0, 1.0, 1.0] +``` + +**Run on validation split:** +```bash +python -m medarc_verifiers.cli.main open_i_summarization --split validation -m gpt-4.1-mini -n 10 -r 1 --judge-model gpt-4.1-mini -s +``` + +**Fast evaluation (without automatic metrics):** +```bash +python -m medarc_verifiers.cli.main open_i_summarization -m gpt-4.1-mini -n 10 -r 1 --judge-model gpt-4.1-mini --no-compute-auto-metrics -s +``` + +**Using a local model (e.g., Ollama):** +```bash +python -m medarc_verifiers.cli.main open_i_summarization \ + -m llama3 \ + --api-base-url http://localhost:11434/v1 \ + --env-args '{"judge_model":"llama3","judge_base_url":"http://localhost:11434/v1","judge_api_key":"ollama"}' \ + -n 5 -r 1 -s +``` + +--- + +### Environment Arguments + +| Arg | Type | Default | Description | +| --- | ---- | ------- | ----------- | +| `split` | `str` | `"test"` | Dataset split to use (`train`, `validation`, `test`). | +| `judge_model` | `str` | `"gpt-4o-mini"` | Model identifier for the LLM judge. | +| `judge_base_url` | `str \| None` | `None` | Custom API base URL (e.g., for Ollama or local models). | +| `judge_api_key` | `str \| None` | `None` | API key for the judge model (falls back to `OPENAI_API_KEY`). | +| `compute_auto_metrics` | `bool` | `True` | Whether to compute BLEU/ROUGE/BERTScore metrics. | +| `system_prompt` | `str \| None` | `None` | Custom system prompt (uses default if not provided). 
| + +--- + +### Metrics + +#### Primary Metric (Reward) +| Metric | Meaning | +|--------|---------| +| `reward` | Normalized LLM-judge score (0-1), averaged across correctness, completeness, and conciseness | + +#### LLM-Judge Dimensions (1-5 scale) + +Criteria adapted from [Van Veen et al., Nature Medicine 2024](https://doi.org/10.1038/s41591-024-02855-5) (Methods - Reader study): + +| Dimension | Description | +|-----------|-------------| +| `correctness` | Does the summary include false information? Evaluates precision—penalizes fabricated or incorrect information. | +| `completeness` | Does the summary completely capture important information? Evaluates recall—clinically important detail retained. | +| `conciseness` | Does the summary contain non-important information? Evaluates brevity—penalizes superfluous information. Compares output length to reference. | + +#### Automatic Metrics (following Van Veen et al.) +| Metric | Description | +|--------|-------------| +| `bleu` | BLEU score (n-gram precision) | +| `rouge1` | ROUGE-1 (unigram overlap) | +| `rouge2` | ROUGE-2 (bigram overlap) | +| `rougeL` | ROUGE-L (longest common subsequence) | +| `rougeLsum` | ROUGE-Lsum (sentence-level LCS) | +| `bertscore_precision` | BERTScore precision | +| `bertscore_recall` | BERTScore recall | +| `bertscore_f1` | BERTScore F1 | + +--- + +### Results Dataset Structure + +#### Core Evaluation Fields +- **`prompt`** – The radiology findings presented to the model. +- **`completion`** – The model-generated impression. +- **`reward`** – Normalized LLM-judge score in `[0, 1]`. + +#### Example Metadata (`info`) +- **`idx`** – Original dataset index. +- **`findings`** – The input radiology findings text. +- **`judge_feedback`** – Detailed LLM-judge evaluation with scores and reasoning. +- **`auto_metrics`** – Dictionary containing BLEU, ROUGE, and BERTScore values. 
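The conciseness dimension is anchored to the model/reference character-length ratio, which the environment pre-computes and passes to the judge. A rough sketch of that banding (the function name is illustrative; in the environment the LLM judge assigns the actual score, guided by the ratio in its prompt):

```python
# Sketch of the length-ratio banding that guides conciseness scoring.
# Illustrative only: in the environment the LLM judge assigns the score,
# with the pre-computed character-length ratio supplied in its prompt.

def conciseness_band(model_text: str, reference_text: str) -> int:
    """Map the model/reference character-length ratio to a 1-5 band."""
    ratio = len(model_text) / max(len(reference_text), 1)  # guard divide-by-zero
    if ratio <= 1.0:
        return 5  # same length or shorter than the reference
    if ratio <= 1.5:
        return 4  # up to 50% longer
    if ratio <= 2.0:
        return 3  # 50-100% longer
    if ratio <= 3.0:
        return 2  # 2-3x longer
    return 1      # more than 3x longer

print(conciseness_band("a" * 9, "b" * 10))   # ratio 0.9 -> 5
print(conciseness_band("a" * 25, "b" * 10))  # ratio 2.5 -> 2
```

The `length_metrics` entry stored in `info` records the same `model_length`, `reference_length`, and `length_ratio` values for later analysis.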
+ +--- + +### Example Results + +The LLM-as-judge now applies stricter conciseness scoring based on the Nature Medicine paper criteria: + +``` +=== Example 1 === +Model output: Normal heart size; no alveolar consolidation, pleural effusion, + or pulmonary edema; mediastinal calcification and right upper + lung nodule indicate prior granulomatous disease. +Reference: No acute cardiopulmonary findings +Length ratio: 5.3x +Reward: 0.933 + +LLM-as-Judge Scores: + correctness: 5/5 (no false information) + completeness: 5/5 (all key findings captured) + conciseness: 4/5 (longer than reference, but clinically relevant) + +=== Example 2 === +Model output: Right middle and lower lobe opacity; mediastinal contours normal; + no fissure displacement or pneumothorax. +Reference: Opacification of the right middle and lower lobes. +Length ratio: 2.1x +Reward: 0.867 + +LLM-as-Judge Scores: + correctness: 5/5 + completeness: 5/5 + conciseness: 3/5 (approximately 2x longer with additional details) +``` + +--- + +### References + +**Dataset Source** +- Van Veen, D., Van Uden, C., Blankemeier, L. et al. "Adapted large language models can outperform medical experts in clinical text summarization." *Nature Medicine* 30, 1134–1142 (2024). 
https://doi.org/10.1038/s41591-024-02855-5
+
+**Open-I Dataset**
+```bibtex
+@article{demner2016preparing,
+  title={Preparing a collection of radiology examinations for distribution and retrieval},
+  author={Demner-Fushman, Dina and Kohli, Marc D and Rosenman, Marc B and Shooshan, Sonya E and Rodriguez, Laritza and Antani, Sameer and Thoma, George R and McDonald, Clement J},
+  journal={Journal of the American Medical Informatics Association},
+  volume={23},
+  number={2},
+  pages={304--310},
+  year={2016},
+  publisher={Oxford University Press}
+}
+```
+
+**Evaluation Metrics**
+- BLEU: Papineni et al., "BLEU: a Method for Automatic Evaluation of Machine Translation" (ACL 2002)
+- ROUGE: Lin, "ROUGE: A Package for Automatic Evaluation of Summaries" (ACL 2004)
+- BERTScore: Zhang et al., "BERTScore: Evaluating Text Generation with BERT" (ICLR 2020)
diff --git a/environments/open_i_summarization/open_i_summarization.py b/environments/open_i_summarization/open_i_summarization.py
new file mode 100644
index 00000000..f4531ce0
--- /dev/null
+++ b/environments/open_i_summarization/open_i_summarization.py
@@ -0,0 +1,364 @@
+"""
+Open-I Summarization Environment
+
+Evaluation environment for radiology report summarization (findings → impression).
+Uses LLM-as-judge evaluation plus automatic metrics (BLEU, ROUGE, BERTScore).
+
+Dataset: medarc/open-i-summarization
+"""
+
+import re
+from typing import Any
+
+import evaluate
+import verifiers as vf
+from datasets import load_dataset
+from datasets.utils.logging import disable_progress_bar
+from medarc_verifiers.parsers import JSONParser
+from medarc_verifiers.utils import default_judge_api_key, judge_sampling_args_and_headers
+from openai import AsyncOpenAI
+from verifiers.types import Info, Messages, State
+
+disable_progress_bar()  # suppress datasets progress bar
+
+# --- Judge Prompt Template ---
+# Evaluation criteria adapted from:
+# Van Veen D. et al., "Adapted large language models can outperform medical experts
+# in clinical text summarization," Nature Medicine (2024).
+# https://doi.org/10.1038/s41591-024-02855-5
+# See: Methods - Reader study section
+
+JUDGE_TEMPLATE = """\
+You are a radiology expert tasked with evaluating how well a model summarized radiology findings into an impression.
+
+Your goal is to assess the quality of the model's generated impression compared to the reference impression.
+You will rate the impression along three dimensions: correctness, completeness, and conciseness.
+
+Definitions (from Nature Medicine reader study criteria):
+- Correctness: Does the summary include false information? Evaluate precision—penalize any fabricated or incorrect information not supported by the findings.
+- Completeness: Does the summary completely capture important information? Evaluate recall—the amount of clinically important detail retained from the input findings.
+- Conciseness: Does the summary contain non-important information? Evaluate brevity—the value of a summary decreases with superfluous information. A good summary should be as brief as possible while capturing key findings.
+
+The radiology findings will be provided in these tags:
+<findings>
+{findings}
+</findings>
+
+The model's generated impression will be provided in these tags:
+<response>
+{response}
+</response>
+
+The reference impression will be provided in these tags:
+<reference>
+{reference}
+</reference>
+
+Length information (pre-computed):
+- Model impression length: {model_length} characters
+- Reference impression length: {reference_length} characters
+- Length ratio: {length_ratio}x
+
+For each dimension:
+1. Provide a brief explanation (1–3 sentences) describing why you assigned the score.
+2. Then assign a score from 1 to 5:
+
+Correctness scoring:
+- 5: No false or fabricated information
+- 4: Minor inaccuracy that doesn't affect clinical interpretation
+- 3: Some inaccurate information present
+- 2: Multiple inaccuracies or one significant error
+- 1: Major fabrications or errors
+
+Completeness scoring:
+- 5: All clinically important information captured
+- 4: Most important information captured, minor omissions
+- 3: Key findings present but some important details missing
+- 2: Several important findings omitted
+- 1: Critical findings missing
+
+Conciseness scoring (STRICTLY based on length ratio above):
+- 5: Length ratio ≤ 1.0 (same length or shorter than reference)
+- 4: Length ratio > 1.0 and ≤ 1.5 (up to 50% longer)
+- 3: Length ratio > 1.5 and ≤ 2.0 (50-100% longer)
+- 2: Length ratio > 2.0 and ≤ 3.0 (2-3x longer)
+- 1: Length ratio > 3.0 (more than 3x longer)
+
+IMPORTANT: The conciseness score MUST follow the length ratio guidelines above. Do not override based on content quality.
+
+{output_format}
+"""
+
+JUDGE_OUTPUT_JSON = """
+Output your evaluation as a single valid JSON object matching the following structure:
+{
+  "correctness": {
+    "reason": "Brief explanation of why this score was given.",
+    "score": 0
+  },
+  "completeness": {
+    "reason": "Brief explanation of why this score was given.",
+    "score": 0
+  },
+  "conciseness": {
+    "reason": "Brief explanation of why this score was given.",
+    "score": 0
+  }
+}
+
+Ensure the output is valid JSON:
+- Use double quotes (") for all keys and string values.
+- Escape any internal quotes inside the reason fields.
+- Do not include any additional text outside the JSON object.
+- Do not explain your reasoning outside the JSON object; all justification must appear only in the "reason" fields.
+"""
+
+# Scored dimensions must match the keys emitted by JUDGE_OUTPUT_JSON
+# Note: "correctness" replaces "accuracy" per Nature Medicine paper terminology
+JUDGE_DIMENSIONS = ["correctness", "completeness", "conciseness"]
+
+
+def _extract_completion_text(completion: Messages) -> str:
+    """Extract the assistant's text content from a chat-style completion."""
+    if isinstance(completion, list) and completion:
+        last_msg = completion[-1]
+        if isinstance(last_msg, dict):
+            return str(last_msg.get("content", ""))
+    return str(completion)
+
+
+def extract_answer_section(completion_text: str) -> str:
+    """Extract the final answer after <think> tags if present."""
+    if not completion_text:
+        return ""
+    if "<think>" in completion_text and "</think>" in completion_text:
+        return re.sub(r"<think>.*?</think>", "", completion_text, flags=re.DOTALL).strip()
+    return completion_text.strip()
+
+
+def _coerce_score(value: Any) -> float | None:
+    """Best-effort conversion of a score value to a float, or None if not possible."""
+    if value is None:
+        return None
+    if isinstance(value, (int, float)):
+        return float(value)
+    if isinstance(value, str):
+        value = value.strip()
+        if not value:
+            return None
+        try:
+            return float(value)
+        except ValueError:
+            return None
+    return None
+
+
+def _compute_normalized_judge_reward(scores: dict[str, dict[str, Any]]) -> float:
+    """Normalize per-dimension judge scores to a single value in [0.0, 1.0].
+
+    Each dimension is expected to be on a 1–5 scale. Scores are clamped to
+    [0, 5], divided by 5 to map to [0, 1], and then averaged across dimensions.
+    """
+    total_dims = len(JUDGE_DIMENSIONS)
+    if total_dims == 0:
+        return 0.0
+
+    accumulated = 0.0
+    for dimension in JUDGE_DIMENSIONS:
+        score = _coerce_score(scores.get(dimension, {}).get("score"))
+        if score is None:
+            continue
+        clamped = max(0.0, min(5.0, score))
+        accumulated += clamped / 5.0
+
+    return max(0.0, min(1.0, accumulated / total_dims))
+
+
+def load_environment(
+    split: str = "test",
+    judge_model: str = "gpt-4o-mini",
+    judge_base_url: str | None = None,
+    judge_api_key: str | None = None,
+    compute_auto_metrics: bool = True,
+    system_prompt: str | None = None,
+    **kwargs: Any,
+) -> vf.SingleTurnEnv:
+    """
+    Load the Open-I Summarization evaluation environment.
+
+    This environment evaluates radiology findings → impression summarization using:
+    1. LLM-as-judge evaluation (correctness, completeness, conciseness)
+    2. Automatic metrics: BLEU, ROUGE, BERTScore (optional)
+
+    Args:
+        split: Dataset split to use ('train', 'validation', 'test'). Default: 'test'.
+        judge_model: Model identifier for the LLM judge. Default: 'gpt-4o-mini'.
+        judge_base_url: Base URL for judge API (for non-OpenAI endpoints).
+        judge_api_key: API key for judge model. Falls back to env vars if not provided.
+        compute_auto_metrics: Whether to compute BLEU/ROUGE/BERTScore. Default: True.
+        system_prompt: Custom system prompt. Uses default if not provided.
+        **kwargs: Additional arguments forwarded to vf.SingleTurnEnv.
+
+    Returns:
+        A configured vf.SingleTurnEnv for radiology summarization evaluation.
+ """ + # Load dataset + eval_dataset = load_dataset("medarc/open-i-summarization", split=split) + + def _map(ex: dict) -> dict: + """Map dataset example to environment format.""" + return { + "question": ex["inputs"].strip(), # radiology findings + "answer": ex["target"].strip(), # reference impression + "info": { + "idx": ex["idx"], + "findings": ex["inputs"].strip(), + }, + } + + eval_dataset = eval_dataset.map(_map, remove_columns=eval_dataset.column_names) + + # Default system prompt for summarization + final_system_prompt = system_prompt or ( + "Summarize the radiology report findings into an impression with minimal text." + ) + + # Initialize automatic metrics if enabled + bleu_metric = None + rouge_metric = None + bertscore_metric = None + + if compute_auto_metrics: + try: + bleu_metric = evaluate.load("bleu") + rouge_metric = evaluate.load("rouge") + bertscore_metric = evaluate.load("bertscore") + except Exception as e: + print(f"Warning: Could not load automatic metrics: {e}") + compute_auto_metrics = False + + # Judge client setup + api_key = default_judge_api_key(judge_base_url) if judge_api_key is None else judge_api_key + sampling_args, default_headers = judge_sampling_args_and_headers(judge_model, judge_base_url) + # Remove extra_body as OpenAI doesn't support the usage tracking parameter + sampling_args.pop("extra_body", None) + judge_client = AsyncOpenAI(base_url=judge_base_url, api_key=api_key, default_headers=default_headers) + judge_parser = JSONParser(fields=list(JUDGE_DIMENSIONS)) + + judge_rubric = vf.JudgeRubric( + parallelize_scoring=True, + judge_client=judge_client, + judge_model=judge_model, + judge_prompt="{question}", + judge_sampling_args=sampling_args, + ) + + async def reward_open_i_summarization( + prompt: Messages, + completion: Messages, + info: Info, + state: State, + ) -> float: + """Evaluate radiology summarization using LLM-judge and automatic metrics.""" + findings = str(info.get("findings", state.get("question", ""))) + 
reference = str(state.get("answer", "")) + model_response = extract_answer_section(_extract_completion_text(completion)) + + # Compute length metrics for conciseness scoring + model_length = len(model_response) + reference_length = len(reference) if reference else 1 # avoid division by zero + length_ratio = model_length / reference_length + + # Store length info for analysis + info["length_metrics"] = { + "model_length": model_length, + "reference_length": reference_length, + "length_ratio": round(length_ratio, 2), + } + + # --- LLM-as-Judge Evaluation --- + judge_prompt = JUDGE_TEMPLATE.format( + findings=findings, + response=model_response, + reference=reference, + model_length=model_length, + reference_length=reference_length, + length_ratio=f"{length_ratio:.1f}", + output_format=JUDGE_OUTPUT_JSON, + ) + + try: + judge_raw = await judge_rubric.judge(judge_prompt, model_response, reference, state) + parsed = judge_parser.parse(str(judge_raw), strip=True) + except AttributeError: + judge_raw = await judge_rubric.judge(judge_prompt, "", "", state) + parsed = judge_parser.parse(str(judge_raw), strip=True) + + if parsed is None: + parsed = {dim: {"score": None, "reason": None} for dim in JUDGE_DIMENSIONS} + + judge_reward = _compute_normalized_judge_reward(parsed) + + # Store judge feedback + info.setdefault("judge_feedback", []).append( + { + "scores": parsed, + "normalized_reward": judge_reward, + "raw_judge": str(judge_raw), + } + ) + + # --- Automatic Metrics (BLEU, ROUGE, BERTScore) --- + auto_metrics: dict[str, Any] = {} + + if compute_auto_metrics and model_response and reference: + predictions = [model_response] + references = [reference] + + # BLEU + try: + bleu_scores = bleu_metric.compute(predictions=predictions, references=references) + auto_metrics["bleu"] = bleu_scores.get("bleu", 0.0) + except Exception: + auto_metrics["bleu"] = 0.0 + + # ROUGE + try: + rouge_scores = rouge_metric.compute(predictions=predictions, references=references) + 
auto_metrics["rouge1"] = rouge_scores.get("rouge1", 0.0) + auto_metrics["rouge2"] = rouge_scores.get("rouge2", 0.0) + auto_metrics["rougeL"] = rouge_scores.get("rougeL", 0.0) + auto_metrics["rougeLsum"] = rouge_scores.get("rougeLsum", 0.0) + except Exception: + auto_metrics["rouge1"] = 0.0 + auto_metrics["rouge2"] = 0.0 + auto_metrics["rougeL"] = 0.0 + auto_metrics["rougeLsum"] = 0.0 + + # BERTScore + try: + bert_scores = bertscore_metric.compute(predictions=predictions, references=references, lang="en") + auto_metrics["bertscore_precision"] = ( + bert_scores["precision"][0] if bert_scores.get("precision") else 0.0 + ) + auto_metrics["bertscore_recall"] = bert_scores["recall"][0] if bert_scores.get("recall") else 0.0 + auto_metrics["bertscore_f1"] = bert_scores["f1"][0] if bert_scores.get("f1") else 0.0 + except Exception: + auto_metrics["bertscore_precision"] = 0.0 + auto_metrics["bertscore_recall"] = 0.0 + auto_metrics["bertscore_f1"] = 0.0 + + info["auto_metrics"] = auto_metrics + + # Return the LLM-judge reward as the primary metric + return judge_reward + + judge_rubric.add_reward_func(reward_open_i_summarization, weight=1.0) + + return vf.SingleTurnEnv( + dataset=eval_dataset, + eval_dataset=eval_dataset, + system_prompt=final_system_prompt, + rubric=judge_rubric, + name="open_i_summarization", + **kwargs, + ) diff --git a/environments/open_i_summarization/pyproject.toml b/environments/open_i_summarization/pyproject.toml new file mode 100644 index 00000000..e71a3e8b --- /dev/null +++ b/environments/open_i_summarization/pyproject.toml @@ -0,0 +1,30 @@ +[project] +name = "open_i_summarization" +description = "Radiology findings to impression summarization benchmark using Open-I dataset" +tags = ["medical", "radiology", "summarization", "single-turn", "llm-judge", "nlg-metrics"] +version = "0.1.0" +requires-python = ">=3.11" +dependencies = [ + "verifiers>=0.1.4", + "datasets>=2.13.0", + "medarc_verifiers>=0.1.0", + "evaluate>=0.4.0", + "bert_score>=0.3.13", + 
"rouge_score>=0.1.2",
+    "sacrebleu>=2.3.0",
+]
+
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[tool.hatch.build]
+include = ["open_i_summarization.py"]
+
+[tool.uv.sources]
+medarc_verifiers = { git = "https://github.com/MedARC-AI/med-lm-envs" }
+
+[tool.prime.environment]
+loader = "open_i_summarization:load_environment"
+display_name = "Open-I Summarization"
+visibility = "PUBLIC"
diff --git a/tests/test_open_i_summarization.py b/tests/test_open_i_summarization.py
new file mode 100644
index 00000000..789a5c46
--- /dev/null
+++ b/tests/test_open_i_summarization.py
@@ -0,0 +1,205 @@
+"""Tests for the open_i_summarization environment."""
+
+import pytest
+
+
+def test_extract_completion_text_from_list() -> None:
+    """Test extracting text from chat-style completion."""
+    from open_i_summarization import _extract_completion_text
+
+    completion = [{"role": "assistant", "content": "No acute findings"}]
+    assert _extract_completion_text(completion) == "No acute findings"
+
+
+def test_extract_completion_text_from_string() -> None:
+    """Test extracting text from string completion."""
+    from open_i_summarization import _extract_completion_text
+
+    completion = "No acute findings"
+    assert _extract_completion_text(completion) == "No acute findings"
+
+
+def test_extract_completion_text_empty_list() -> None:
+    """Test handling empty completion list."""
+    from open_i_summarization import _extract_completion_text
+
+    completion = []
+    assert _extract_completion_text(completion) == "[]"
+
+
+def test_extract_answer_section_with_think_tags() -> None:
+    """Test extracting answer after think tags."""
+    from open_i_summarization import extract_answer_section
+
+    text = "<think>Some reasoning here...</think>The actual answer"
+    assert extract_answer_section(text) == "The actual answer"
+
+
+def test_extract_answer_section_without_think_tags() -> None:
+    """Test extracting answer without think tags."""
+    from open_i_summarization import extract_answer_section
+
+
text = "Just a plain answer" + assert extract_answer_section(text) == "Just a plain answer" + + +def test_extract_answer_section_empty() -> None: + """Test handling empty string.""" + from open_i_summarization import extract_answer_section + + assert extract_answer_section("") == "" + + +def test_compute_normalized_judge_reward_perfect_scores() -> None: + """Test normalized reward with perfect scores.""" + from open_i_summarization import _compute_normalized_judge_reward + + scores = { + "correctness": {"score": 5, "reason": "Perfect"}, + "completeness": {"score": 5, "reason": "Perfect"}, + "conciseness": {"score": 5, "reason": "Perfect"}, + } + assert _compute_normalized_judge_reward(scores) == pytest.approx(1.0) + + +def test_compute_normalized_judge_reward_mixed_scores() -> None: + """Test normalized reward with mixed scores.""" + from open_i_summarization import _compute_normalized_judge_reward + + scores = { + "correctness": {"score": 5, "reason": "Good"}, + "completeness": {"score": 3, "reason": "Partial"}, + "conciseness": {"score": 4, "reason": "Good"}, + } + # (5/5 + 3/5 + 4/5) / 3 = (1.0 + 0.6 + 0.8) / 3 = 0.8 + assert _compute_normalized_judge_reward(scores) == pytest.approx(0.8) + + +def test_compute_normalized_judge_reward_zero_scores() -> None: + """Test normalized reward with zero scores.""" + from open_i_summarization import _compute_normalized_judge_reward + + scores = { + "correctness": {"score": 0, "reason": "Poor"}, + "completeness": {"score": 0, "reason": "Poor"}, + "conciseness": {"score": 0, "reason": "Poor"}, + } + assert _compute_normalized_judge_reward(scores) == pytest.approx(0.0) + + +def test_compute_normalized_judge_reward_missing_scores() -> None: + """Test normalized reward with missing scores.""" + from open_i_summarization import _compute_normalized_judge_reward + + scores = { + "correctness": {"score": 5, "reason": "Good"}, + # completeness and conciseness missing + } + # Only correctness is counted: (5/5) / 3 = 0.333... 
+ assert _compute_normalized_judge_reward(scores) == pytest.approx(1 / 3) + + +def test_compute_normalized_judge_reward_string_scores() -> None: + """Test normalized reward handles string scores.""" + from open_i_summarization import _compute_normalized_judge_reward + + scores = { + "correctness": {"score": "4", "reason": "Good"}, + "completeness": {"score": "3", "reason": "Partial"}, + "conciseness": {"score": "5", "reason": "Excellent"}, + } + # (4/5 + 3/5 + 5/5) / 3 = (0.8 + 0.6 + 1.0) / 3 = 0.8 + assert _compute_normalized_judge_reward(scores) == pytest.approx(0.8) + + +def test_judge_template_contains_placeholders() -> None: + """Test that judge template has required placeholders.""" + from open_i_summarization import JUDGE_TEMPLATE + + assert "{findings}" in JUDGE_TEMPLATE + assert "{response}" in JUDGE_TEMPLATE + assert "{reference}" in JUDGE_TEMPLATE + assert "{output_format}" in JUDGE_TEMPLATE + + +def test_judge_dimensions_defined() -> None: + """Test that judge dimensions are properly defined.""" + from open_i_summarization import JUDGE_DIMENSIONS + + # Note: "correctness" replaces "accuracy" per Nature Medicine paper terminology + assert "correctness" in JUDGE_DIMENSIONS + assert "completeness" in JUDGE_DIMENSIONS + assert "conciseness" in JUDGE_DIMENSIONS + assert len(JUDGE_DIMENSIONS) == 3 + + +def test_dataset_loading() -> None: + """Test that the dataset can be loaded.""" + from datasets import load_dataset + + ds = load_dataset("medarc/open-i-summarization", split="test") + assert len(ds) == 343 + assert "idx" in ds.features + assert "inputs" in ds.features + assert "target" in ds.features + + +def test_dataset_sample_structure() -> None: + """Test sample structure from dataset.""" + from datasets import load_dataset + + ds = load_dataset("medarc/open-i-summarization", split="test") + sample = ds[0] + + assert isinstance(sample["idx"], int) + assert isinstance(sample["inputs"], str) + assert isinstance(sample["target"], str) + assert 
len(sample["inputs"]) > 0 + assert len(sample["target"]) > 0 + + +@pytest.fixture +def mock_api_key(monkeypatch: pytest.MonkeyPatch) -> None: + """Set a mock API key for testing.""" + monkeypatch.setenv("OPENAI_API_KEY", "test-api-key") + + +def test_environment_loading(mock_api_key: None) -> None: + """Test that environment loads successfully with mock API key.""" + import verifiers as vf + + env = vf.load_environment("open_i_summarization", compute_auto_metrics=False) + + assert env.name == "open_i_summarization" + assert len(env.eval_dataset) == 343 + assert "Summarize the radiology report findings" in env.system_prompt + + +def test_environment_dataset_mapping(mock_api_key: None) -> None: + """Test that dataset is properly mapped to environment format.""" + import verifiers as vf + + env = vf.load_environment("open_i_summarization", compute_auto_metrics=False) + sample = env.eval_dataset[0] + + assert "question" in sample + assert "answer" in sample + assert "info" in sample + assert "findings" in sample["info"] + assert "idx" in sample["info"] + + +def test_environment_with_validation_split(mock_api_key: None) -> None: + """Test loading validation split.""" + import verifiers as vf + + env = vf.load_environment("open_i_summarization", split="validation", compute_auto_metrics=False) + assert len(env.eval_dataset) == 341 + + +def test_environment_with_train_split(mock_api_key: None) -> None: + """Test loading train split.""" + import verifiers as vf + + env = vf.load_environment("open_i_summarization", split="train", compute_auto_metrics=False) + assert len(env.eval_dataset) == 2735