diff --git a/configs/envs/open_i_summarization.yaml b/configs/envs/open_i_summarization.yaml
new file mode 100644
index 00000000..a88700aa
--- /dev/null
+++ b/configs/envs/open_i_summarization.yaml
@@ -0,0 +1,28 @@
+# Open-I Summarization environment configuration
+# Radiology findings to impression summarization benchmark
+
+- id: open_i_summarization
+  module: open_i_summarization
+  num_examples: -1
+  verbose: false
+  env_args:
+    split: test
+    compute_auto_metrics: true
+
+# Validation split variant
+- id: open_i_summarization_val
+  module: open_i_summarization
+  num_examples: -1
+  verbose: false
+  env_args:
+    split: validation
+    compute_auto_metrics: true
+
+# Fast evaluation (no automatic metrics)
+- id: open_i_summarization_fast
+  module: open_i_summarization
+  num_examples: -1
+  verbose: false
+  env_args:
+    split: test
+    compute_auto_metrics: false
diff --git a/environments/open_i_summarization/README.md b/environments/open_i_summarization/README.md
new file mode 100644
index 00000000..7a5c9b21
--- /dev/null
+++ b/environments/open_i_summarization/README.md
@@ -0,0 +1,213 @@
+# Open-I Summarization
+
+Evaluation environment for radiology report summarization: generating impressions from findings.
+
+### Overview
+- **Environment ID**: `open_i_summarization`
+- **Short description**: Radiology findings-to-impression summarization benchmark using the Open-I chest X-ray dataset. This environment evaluates how well models can distill radiology findings into concise, clinically accurate impressions.
+- **Tags**: medical, radiology, summarization, single-turn, llm-judge, nlg-metrics
+- **System Prompt**: "Summarize the radiology report findings into an impression with minimal text."
+
+---
+
+### Dataset
+- **Source**: [medarc/open-i-summarization](https://huggingface.co/datasets/medarc/open-i-summarization)
+- **Based on**: Indiana University Chest X-ray Collection (Open-I), as used in [Van Veen et al., Nature Medicine 2024](https://www.nature.com/articles/s41591-024-02855-5) - "Adapted large language models can outperform medical experts in clinical text summarization"
+- **Split sizes**:
+ - **Train:** 2,735 examples
+ - **Validation:** 341 examples
+ - **Test:** 343 examples
+- **Task**: Given radiology findings, generate a concise clinical impression.
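Each record carries `idx`, `inputs` (the findings text), and `target` (the reference impression). A minimal sketch of how one record maps onto the environment's question/answer format — the record values below are illustrative, not real dataset rows:

```python
# Illustrative record using the dataset's field names (idx / inputs / target).
record = {
    "idx": 0,
    "inputs": "Heart size within normal limits. No focal alveolar consolidation.",
    "target": "No acute cardiopulmonary findings.",
}

def to_env_example(ex: dict) -> dict:
    # Findings become the question; the reference impression is the answer.
    return {
        "question": ex["inputs"].strip(),
        "answer": ex["target"].strip(),
        "info": {"idx": ex["idx"], "findings": ex["inputs"].strip()},
    }
```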
+
+---
+
+### Task
+- **Type:** Single-Turn Summarization
+- **Input:** Radiology findings (chest X-ray report text)
+- **Output:** Clinical impression (summary)
+- **Evaluation:** Dual evaluation approach following the Nature Medicine paper:
+ 1. **LLM-as-Judge**: Evaluates correctness, completeness, and conciseness (1-5 scale each)
+ 2. **Automatic Metrics**: BLEU, ROUGE (1/2/L/Lsum), BERTScore (precision/recall/F1)
+
+The LLM-as-judge criteria are adapted from the **Nature Medicine reader study** (Methods section):
+- **Correctness**: "Which summary includes less false information?" — evaluates precision (penalizes fabricated information)
+- **Completeness**: "Which summary more completely captures important information?" — evaluates recall (clinically important detail retained)
+- **Conciseness**: "Which summary contains less non-important information?" — evaluates brevity (penalizes superfluous information)
+
+The implementation follows the pattern established in `medicationqa`, using multi-dimensional scoring with a JSONParser.
+
+---
+
+### Quickstart
+
+**Basic evaluation with default settings:**
+```bash
+python -m medarc_verifiers.cli.main open_i_summarization -m gpt-4.1-mini -n 5 -r 1 --judge-model gpt-4.1-mini -s
+```
+
+**Example output:**
+```
+--- Evaluation ---
+Environment: open_i_summarization
+Model: gpt-4.1-mini
+Provider: https://api.openai.com/v1/
+Examples: 5
+Rollouts per example: 1
+--- Example ---
+╭──────────────────────────────────── Step 0 ────────────────────────────────────╮
+│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ │
+│ ┃ Prompt ┃ Completion ┃ Reward ┃ │
+│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ │
+│ │ system: Summarize the │ assistant: Normal heart size; │ 1.00 │ │
+│ │ radiology report findings into │ no acute pulmonary findings; │ │ │
+│ │ an impression with minimal │ mediastinal calcification and │ │ │
+│ │ text. │ dense right upper lung nodule │ │ │
+│ │ │ consistent with prior │ │ │
+│ │ user: Heart size within normal │ granulomatous disease. │ │ │
+│ │ limits. No focal alveolar │ │ │ │
+│ │ consolidation, no definite │ │ │ │
+│ │ pleural effusion seen... │ │ │ │
+│ └────────────────────────────────┴─────────────────────────────────┴────────┘ │
+╰────────────────────────────────────────────────────────────────────────────────╯
+--- All ---
+Rewards:
+reward: avg - 1.000, std - 0.000
+r1: [1.0, 1.0, 1.0, 1.0, 1.0]
+```
+
+**Run on validation split:**
+```bash
+python -m medarc_verifiers.cli.main open_i_summarization --split validation -m gpt-4.1-mini -n 10 -r 1 --judge-model gpt-4.1-mini -s
+```
+
+**Fast evaluation (without automatic metrics):**
+```bash
+python -m medarc_verifiers.cli.main open_i_summarization -m gpt-4.1-mini -n 10 -r 1 --judge-model gpt-4.1-mini --no-compute-auto-metrics -s
+```
+
+**Using a local model (e.g., Ollama):**
+```bash
+python -m medarc_verifiers.cli.main open_i_summarization \
+ -m llama3 \
+ --api-base-url http://localhost:11434/v1 \
+ --env-args '{"judge_model":"llama3","judge_base_url":"http://localhost:11434/v1","judge_api_key":"ollama"}' \
+ -n 5 -r 1 -s
+```
+
+---
+
+### Environment Arguments
+
+| Arg | Type | Default | Description |
+| --- | ---- | ------- | ----------- |
+| `split` | `str` | `"test"` | Dataset split to use (`train`, `validation`, `test`). |
+| `judge_model` | `str` | `"gpt-4o-mini"` | Model identifier for the LLM judge. |
+| `judge_base_url` | `str \| None` | `None` | Custom API base URL (e.g., for Ollama or local models). |
+| `judge_api_key` | `str \| None` | `None` | API key for the judge model (falls back to `OPENAI_API_KEY`). |
+| `compute_auto_metrics` | `bool` | `True` | Whether to compute BLEU/ROUGE/BERTScore metrics. |
+| `system_prompt` | `str \| None` | `None` | Custom system prompt (uses default if not provided). |
+
+---
+
+### Metrics
+
+#### Primary Metric (Reward)
+| Metric | Meaning |
+|--------|---------|
+| `reward` | Normalized LLM-judge score (0-1), averaged across correctness, completeness, and conciseness |
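A sketch of how that normalized reward is derived from the three 1–5 judge scores — clamp, scale to [0, 1], then average (function name hypothetical):

```python
def normalized_reward(scores: dict[str, int]) -> float:
    # Clamp each 1-5 score to [0, 5], map to [0, 1], then average the dimensions.
    dims = ("correctness", "completeness", "conciseness")
    return sum(min(max(float(scores.get(d, 0)), 0.0), 5.0) / 5.0 for d in dims) / len(dims)
```

For example, scores of 5/3/4 yield (1.0 + 0.6 + 0.8) / 3 = 0.8.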
+
+#### LLM-Judge Dimensions (1-5 scale)
+
+Criteria adapted from [Van Veen et al., Nature Medicine 2024](https://doi.org/10.1038/s41591-024-02855-5) (Methods - Reader study):
+
+| Dimension | Description |
+|-----------|-------------|
+| `correctness` | Does the summary include false information? Evaluates precision—penalizes fabricated or incorrect information. |
+| `completeness` | Does the summary completely capture important information? Evaluates recall—clinically important detail retained. |
+| `conciseness` | Does the summary contain non-important information? Evaluates brevity—penalizes superfluous information. Compares output length to reference. |
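The judge prompt pins conciseness to fixed character-length-ratio bands. That mapping can be expressed deterministically; a sketch (hypothetical helper, not part of the environment — the judge itself may deviate):

```python
def conciseness_score(model_len: int, ref_len: int) -> int:
    # Bands mirror the strict length-ratio rubric in the judge prompt.
    ratio = model_len / max(ref_len, 1)  # guard against an empty reference
    if ratio <= 1.0:
        return 5
    if ratio <= 1.5:
        return 4
    if ratio <= 2.0:
        return 3
    if ratio <= 3.0:
        return 2
    return 1
```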
+
+#### Automatic Metrics (following Van Veen et al.)
+| Metric | Description |
+|--------|-------------|
+| `bleu` | BLEU score (n-gram precision) |
+| `rouge1` | ROUGE-1 (unigram overlap) |
+| `rouge2` | ROUGE-2 (bigram overlap) |
+| `rougeL` | ROUGE-L (longest common subsequence) |
+| `rougeLsum` | ROUGE-Lsum (sentence-level LCS) |
+| `bertscore_precision` | BERTScore precision |
+| `bertscore_recall` | BERTScore recall |
+| `bertscore_f1` | BERTScore F1 |
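For intuition, ROUGE-1 is unigram overlap between prediction and reference. A toy re-derivation of its F1 (not the `evaluate` implementation, which handles tokenization and options differently):

```python
from collections import Counter

def rouge1_f(pred: str, ref: str) -> float:
    # F1 over the multiset intersection of unigrams.
    p, r = Counter(pred.lower().split()), Counter(ref.lower().split())
    overlap = sum((p & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```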
+
+---
+
+### Results Dataset Structure
+
+#### Core Evaluation Fields
+- **`prompt`** – The radiology findings presented to the model.
+- **`completion`** – The model-generated impression.
+- **`reward`** – Normalized LLM-judge score in `[0, 1]`.
+
+#### Example Metadata (`info`)
+- **`idx`** – Original dataset index.
+- **`findings`** – The input radiology findings text.
+- **`judge_feedback`** – Detailed LLM-judge evaluation with scores and reasoning.
+- **`auto_metrics`** – Dictionary containing BLEU, ROUGE, and BERTScore values.
+
+---
+
+### Example Results
+
+The LLM-as-judge applies strict, length-ratio-based conciseness scoring following the Nature Medicine paper criteria:
+
+```
+=== Example 1 ===
+Model output: Normal heart size; no alveolar consolidation, pleural effusion,
+ or pulmonary edema; mediastinal calcification and right upper
+ lung nodule indicate prior granulomatous disease.
+Reference: No acute cardiopulmonary findings
+Length ratio: 5.3x
+Reward: 0.933
+
+LLM-as-Judge Scores:
+ correctness: 5/5 (no false information)
+ completeness: 5/5 (all key findings captured)
+ conciseness: 4/5 (longer than reference, but clinically relevant)
+
+=== Example 2 ===
+Model output: Right middle and lower lobe opacity; mediastinal contours normal;
+ no fissure displacement or pneumothorax.
+Reference: Opacification of the right middle and lower lobes.
+Length ratio: 2.1x
+Reward: 0.867
+
+LLM-as-Judge Scores:
+ correctness: 5/5
+ completeness: 5/5
+ conciseness: 3/5 (approximately 2x longer with additional details)
+```
+
+---
+
+### References
+
+**Dataset Source**
+- Van Veen, D., Van Uden, C., Blankemeier, L. et al. "Adapted large language models can outperform medical experts in clinical text summarization." *Nature Medicine* 30, 1134–1142 (2024). https://doi.org/10.1038/s41591-024-02855-5
+
+**Open-I Dataset**
+```bibtex
+@article{demner2016preparing,
+ title={Preparing a collection of radiology examinations for distribution and retrieval},
+ author={Demner-Fushman, Dina and Kohli, Marc D and Rosenman, Marc B and Shooshan, Sonya E and Rodriguez, Laritza and Antani, Sameer and Thoma, George R and McDonald, Clement J},
+ journal={Journal of the American Medical Informatics Association},
+ volume={23},
+ number={2},
+ pages={304--310},
+ year={2016},
+ publisher={Oxford University Press}
+}
+```
+
+**Evaluation Metrics**
+- BLEU: Papineni et al., "BLEU: a Method for Automatic Evaluation of Machine Translation" (ACL 2002)
+- ROUGE: Lin, "ROUGE: A Package for Automatic Evaluation of Summaries" (ACL 2004)
+- BERTScore: Zhang et al., "BERTScore: Evaluating Text Generation with BERT" (ICLR 2020)
diff --git a/environments/open_i_summarization/open_i_summarization.py b/environments/open_i_summarization/open_i_summarization.py
new file mode 100644
index 00000000..f4531ce0
--- /dev/null
+++ b/environments/open_i_summarization/open_i_summarization.py
@@ -0,0 +1,364 @@
+"""
+Open-I Summarization Environment
+
+Evaluation environment for radiology report summarization (findings → impression).
+Uses LLM-as-judge evaluation plus automatic metrics (BLEU, ROUGE, BERTScore).
+
+Dataset: medarc/open-i-summarization
+"""
+
+import re
+from typing import Any
+
+import evaluate
+import verifiers as vf
+from datasets import load_dataset
+from datasets.utils.logging import disable_progress_bar
+from medarc_verifiers.parsers import JSONParser
+from medarc_verifiers.utils import default_judge_api_key, judge_sampling_args_and_headers
+from openai import AsyncOpenAI
+from verifiers.types import Info, Messages, State
+
+disable_progress_bar() # suppress datasets progress bar
+
+# --- Judge Prompt Template ---
+# Evaluation criteria adapted from:
+# Buckley T. et al., "Accuracy of a vision-language model on challenging medical cases"
+# Nature Medicine (2024). https://doi.org/10.1038/s41591-024-02855-5
+# See: Methods - Reader study section
+
+JUDGE_TEMPLATE = """\
+You are a radiology expert tasked with evaluating how well a model summarized radiology findings into an impression.
+
+Your goal is to assess the quality of the model's generated impression compared to the reference impression.
+You will rate the impression along three dimensions: correctness, completeness, and conciseness.
+
+Definitions (from Nature Medicine reader study criteria):
+- Correctness: Does the summary include false information? Evaluate precision—penalize any fabricated or incorrect information not supported by the findings.
+- Completeness: Does the summary completely capture important information? Evaluate recall—the amount of clinically important detail retained from the input findings.
+- Conciseness: Does the summary contain non-important information? Evaluate brevity—the value of a summary decreases with superfluous information. A good summary should be as brief as possible while capturing key findings.
+
+The radiology findings will be provided in these tags:
+<findings>
+{findings}
+</findings>
+
+The model's generated impression will be provided in these tags:
+<response>
+{response}
+</response>
+
+The reference impression will be provided in these tags:
+<reference>
+{reference}
+</reference>
+
+Length information (pre-computed):
+- Model impression length: {model_length} characters
+- Reference impression length: {reference_length} characters
+- Length ratio: {length_ratio}x
+
+For each dimension:
+1. Provide a brief explanation (1–3 sentences) describing why you assigned the score.
+2. Then assign a score from 1 to 5:
+
+Correctness scoring:
+- 5: No false or fabricated information
+- 4: Minor inaccuracy that doesn't affect clinical interpretation
+- 3: Some inaccurate information present
+- 2: Multiple inaccuracies or one significant error
+- 1: Major fabrications or errors
+
+Completeness scoring:
+- 5: All clinically important information captured
+- 4: Most important information captured, minor omissions
+- 3: Key findings present but some important details missing
+- 2: Several important findings omitted
+- 1: Critical findings missing
+
+Conciseness scoring (STRICTLY based on length ratio above):
+- 5: Length ratio ≤ 1.0 (same length or shorter than reference)
+- 4: Length ratio 1.0-1.5 (up to 50% longer)
+- 3: Length ratio 1.5-2.0 (50-100% longer)
+- 2: Length ratio 2.0-3.0 (2-3x longer)
+- 1: Length ratio > 3.0 (more than 3x longer)
+
+IMPORTANT: The conciseness score MUST follow the length ratio guidelines above. Do not override based on content quality.
+
+{output_format}
+"""
+
+JUDGE_OUTPUT_JSON = """
+Output your evaluation as a single valid JSON object matching the following structure:
+{
+ "correctness": {
+ "reason": "Brief explanation of why this score was given.",
+ "score": 0
+ },
+ "completeness": {
+ "reason": "Brief explanation of why this score was given.",
+ "score": 0
+ },
+ "conciseness": {
+ "reason": "Brief explanation of why this score was given.",
+ "score": 0
+ }
+}
+
+Ensure the output is valid JSON:
+- Use double quotes (") for all keys and string values.
+- Escape any internal quotes inside the reason fields.
+- Do not include any additional text outside the JSON object.
+- Do not explain your reasoning outside the JSON object; all justification must appear only in the "reason" fields.
+"""
+
+# Scored dimensions must match the keys emitted by JUDGE_TEMPLATE
+# Note: "correctness" replaces "accuracy" per Nature Medicine paper terminology
+JUDGE_DIMENSIONS = ["correctness", "completeness", "conciseness"]
+
+
+def _extract_completion_text(completion: Messages) -> str:
+ """Extract the assistant's text content from a chat-style completion."""
+ if isinstance(completion, list) and completion:
+ last_msg = completion[-1]
+ if isinstance(last_msg, dict):
+ return str(last_msg.get("content", ""))
+ return str(completion)
+
+
+def extract_answer_section(completion_text: str) -> str:
+ """Extract final answer after think tags if present."""
+ if not completion_text:
+ return ""
+ if "" in completion_text and "" in completion_text:
+ return re.sub(r".*?", "", completion_text, flags=re.DOTALL).strip()
+ return completion_text.strip()
+
+
+def _coerce_score(value: Any) -> float | None:
+ """Best-effort conversion of a score value to a float, or None if not possible."""
+ if value is None:
+ return None
+ if isinstance(value, (int, float)):
+ return float(value)
+ if isinstance(value, str):
+ value = value.strip()
+ if not value:
+ return None
+ try:
+ return float(value)
+ except ValueError:
+ return None
+ return None
+
+
+def _compute_normalized_judge_reward(scores: dict[str, dict[str, Any]]) -> float:
+ """Normalize per-dimension judge scores to a single value in [0.0, 1.0].
+
+ Each dimension is expected to be on a 1–5 scale. Scores are clamped to
+ [0, 5], divided by 5 to map to [0, 1], and then averaged across dimensions.
+ """
+ total_dims = len(JUDGE_DIMENSIONS)
+ if total_dims == 0:
+ return 0.0
+
+ accumulated = 0.0
+ for dimension in JUDGE_DIMENSIONS:
+ score = _coerce_score(scores.get(dimension, {}).get("score"))
+ if score is None:
+ continue
+ clamped = max(0.0, min(5.0, score))
+ accumulated += clamped / 5.0
+
+ return max(0.0, min(1.0, accumulated / total_dims))
+
+
+def load_environment(
+ split: str = "test",
+ judge_model: str = "gpt-4o-mini",
+ judge_base_url: str | None = None,
+ judge_api_key: str | None = None,
+ compute_auto_metrics: bool = True,
+ system_prompt: str | None = None,
+ **kwargs: Any,
+) -> vf.SingleTurnEnv:
+ """
+ Load the Open-I Summarization evaluation environment.
+
+ This environment evaluates radiology findings → impression summarization using:
+    1. LLM-as-judge evaluation (correctness, completeness, conciseness)
+ 2. Automatic metrics: BLEU, ROUGE, BERTScore (optional)
+
+ Args:
+ split: Dataset split to use ('train', 'validation', 'test'). Default: 'test'.
+ judge_model: Model identifier for the LLM judge. Default: 'gpt-4o-mini'.
+ judge_base_url: Base URL for judge API (for non-OpenAI endpoints).
+ judge_api_key: API key for judge model. Falls back to env vars if not provided.
+ compute_auto_metrics: Whether to compute BLEU/ROUGE/BERTScore. Default: True.
+ system_prompt: Custom system prompt. Uses default if not provided.
+ **kwargs: Additional arguments forwarded to vf.SingleTurnEnv.
+
+ Returns:
+ A configured vf.SingleTurnEnv for radiology summarization evaluation.
+ """
+ # Load dataset
+ eval_dataset = load_dataset("medarc/open-i-summarization", split=split)
+
+ def _map(ex: dict) -> dict:
+ """Map dataset example to environment format."""
+ return {
+ "question": ex["inputs"].strip(), # radiology findings
+ "answer": ex["target"].strip(), # reference impression
+ "info": {
+ "idx": ex["idx"],
+ "findings": ex["inputs"].strip(),
+ },
+ }
+
+ eval_dataset = eval_dataset.map(_map, remove_columns=eval_dataset.column_names)
+
+ # Default system prompt for summarization
+ final_system_prompt = system_prompt or (
+ "Summarize the radiology report findings into an impression with minimal text."
+ )
+
+ # Initialize automatic metrics if enabled
+ bleu_metric = None
+ rouge_metric = None
+ bertscore_metric = None
+
+ if compute_auto_metrics:
+ try:
+ bleu_metric = evaluate.load("bleu")
+ rouge_metric = evaluate.load("rouge")
+ bertscore_metric = evaluate.load("bertscore")
+ except Exception as e:
+ print(f"Warning: Could not load automatic metrics: {e}")
+ compute_auto_metrics = False
+
+ # Judge client setup
+ api_key = default_judge_api_key(judge_base_url) if judge_api_key is None else judge_api_key
+ sampling_args, default_headers = judge_sampling_args_and_headers(judge_model, judge_base_url)
+ # Remove extra_body as OpenAI doesn't support the usage tracking parameter
+ sampling_args.pop("extra_body", None)
+ judge_client = AsyncOpenAI(base_url=judge_base_url, api_key=api_key, default_headers=default_headers)
+ judge_parser = JSONParser(fields=list(JUDGE_DIMENSIONS))
+
+ judge_rubric = vf.JudgeRubric(
+ parallelize_scoring=True,
+ judge_client=judge_client,
+ judge_model=judge_model,
+ judge_prompt="{question}",
+ judge_sampling_args=sampling_args,
+ )
+
+ async def reward_open_i_summarization(
+ prompt: Messages,
+ completion: Messages,
+ info: Info,
+ state: State,
+ ) -> float:
+ """Evaluate radiology summarization using LLM-judge and automatic metrics."""
+ findings = str(info.get("findings", state.get("question", "")))
+ reference = str(state.get("answer", ""))
+ model_response = extract_answer_section(_extract_completion_text(completion))
+
+ # Compute length metrics for conciseness scoring
+ model_length = len(model_response)
+ reference_length = len(reference) if reference else 1 # avoid division by zero
+ length_ratio = model_length / reference_length
+
+ # Store length info for analysis
+ info["length_metrics"] = {
+ "model_length": model_length,
+ "reference_length": reference_length,
+ "length_ratio": round(length_ratio, 2),
+ }
+
+ # --- LLM-as-Judge Evaluation ---
+ judge_prompt = JUDGE_TEMPLATE.format(
+ findings=findings,
+ response=model_response,
+ reference=reference,
+ model_length=model_length,
+ reference_length=reference_length,
+ length_ratio=f"{length_ratio:.1f}",
+ output_format=JUDGE_OUTPUT_JSON,
+ )
+
+ try:
+ judge_raw = await judge_rubric.judge(judge_prompt, model_response, reference, state)
+ parsed = judge_parser.parse(str(judge_raw), strip=True)
+ except AttributeError:
+ judge_raw = await judge_rubric.judge(judge_prompt, "", "", state)
+ parsed = judge_parser.parse(str(judge_raw), strip=True)
+
+ if parsed is None:
+ parsed = {dim: {"score": None, "reason": None} for dim in JUDGE_DIMENSIONS}
+
+ judge_reward = _compute_normalized_judge_reward(parsed)
+
+ # Store judge feedback
+ info.setdefault("judge_feedback", []).append(
+ {
+ "scores": parsed,
+ "normalized_reward": judge_reward,
+ "raw_judge": str(judge_raw),
+ }
+ )
+
+ # --- Automatic Metrics (BLEU, ROUGE, BERTScore) ---
+ auto_metrics: dict[str, Any] = {}
+
+ if compute_auto_metrics and model_response and reference:
+ predictions = [model_response]
+ references = [reference]
+
+ # BLEU
+ try:
+ bleu_scores = bleu_metric.compute(predictions=predictions, references=references)
+ auto_metrics["bleu"] = bleu_scores.get("bleu", 0.0)
+ except Exception:
+ auto_metrics["bleu"] = 0.0
+
+ # ROUGE
+ try:
+ rouge_scores = rouge_metric.compute(predictions=predictions, references=references)
+ auto_metrics["rouge1"] = rouge_scores.get("rouge1", 0.0)
+ auto_metrics["rouge2"] = rouge_scores.get("rouge2", 0.0)
+ auto_metrics["rougeL"] = rouge_scores.get("rougeL", 0.0)
+ auto_metrics["rougeLsum"] = rouge_scores.get("rougeLsum", 0.0)
+ except Exception:
+ auto_metrics["rouge1"] = 0.0
+ auto_metrics["rouge2"] = 0.0
+ auto_metrics["rougeL"] = 0.0
+ auto_metrics["rougeLsum"] = 0.0
+
+ # BERTScore
+ try:
+ bert_scores = bertscore_metric.compute(predictions=predictions, references=references, lang="en")
+ auto_metrics["bertscore_precision"] = (
+ bert_scores["precision"][0] if bert_scores.get("precision") else 0.0
+ )
+ auto_metrics["bertscore_recall"] = bert_scores["recall"][0] if bert_scores.get("recall") else 0.0
+ auto_metrics["bertscore_f1"] = bert_scores["f1"][0] if bert_scores.get("f1") else 0.0
+ except Exception:
+ auto_metrics["bertscore_precision"] = 0.0
+ auto_metrics["bertscore_recall"] = 0.0
+ auto_metrics["bertscore_f1"] = 0.0
+
+ info["auto_metrics"] = auto_metrics
+
+ # Return the LLM-judge reward as the primary metric
+ return judge_reward
+
+ judge_rubric.add_reward_func(reward_open_i_summarization, weight=1.0)
+
+ return vf.SingleTurnEnv(
+ dataset=eval_dataset,
+ eval_dataset=eval_dataset,
+ system_prompt=final_system_prompt,
+ rubric=judge_rubric,
+ name="open_i_summarization",
+ **kwargs,
+ )
diff --git a/environments/open_i_summarization/pyproject.toml b/environments/open_i_summarization/pyproject.toml
new file mode 100644
index 00000000..e71a3e8b
--- /dev/null
+++ b/environments/open_i_summarization/pyproject.toml
@@ -0,0 +1,30 @@
+[project]
+name = "open_i_summarization"
+description = "Radiology findings to impression summarization benchmark using Open-I dataset"
+tags = ["medical", "radiology", "summarization", "single-turn", "llm-judge", "nlg-metrics"]
+version = "0.1.0"
+requires-python = ">=3.11"
+dependencies = [
+ "verifiers>=0.1.4",
+ "datasets>=2.13.0",
+ "medarc_verifiers>=0.1.0",
+ "evaluate>=0.4.0",
+ "bert_score>=0.3.13",
+ "rouge_score>=0.1.2",
+ "sacrebleu>=2.3.0",
+]
+
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[tool.hatch.build]
+include = ["open_i_summarization.py"]
+
+[tool.uv.sources]
+medarc_verifiers = { git = "https://github.com/MedARC-AI/med-lm-envs" }
+
+[tool.prime.environment]
+loader = "open_i_summarization:load_environment"
+display_name = "Open-I Summarization"
+visibility = "PUBLIC"
diff --git a/tests/test_open_i_summarization.py b/tests/test_open_i_summarization.py
new file mode 100644
index 00000000..789a5c46
--- /dev/null
+++ b/tests/test_open_i_summarization.py
@@ -0,0 +1,205 @@
+"""Tests for the open_i_summarization environment."""
+
+import pytest
+
+
+def test_extract_completion_text_from_list() -> None:
+ """Test extracting text from chat-style completion."""
+ from open_i_summarization import _extract_completion_text
+
+ completion = [{"role": "assistant", "content": "No acute findings"}]
+ assert _extract_completion_text(completion) == "No acute findings"
+
+
+def test_extract_completion_text_from_string() -> None:
+ """Test extracting text from string completion."""
+ from open_i_summarization import _extract_completion_text
+
+ completion = "No acute findings"
+ assert _extract_completion_text(completion) == "No acute findings"
+
+
+def test_extract_completion_text_empty_list() -> None:
+ """Test handling empty completion list."""
+ from open_i_summarization import _extract_completion_text
+
+ completion = []
+ assert _extract_completion_text(completion) == "[]"
+
+
+def test_extract_answer_section_with_think_tags() -> None:
+ """Test extracting answer after think tags."""
+ from open_i_summarization import extract_answer_section
+
+ text = "Some reasoning here...The actual answer"
+ assert extract_answer_section(text) == "The actual answer"
+
+
+def test_extract_answer_section_without_think_tags() -> None:
+ """Test extracting answer without think tags."""
+ from open_i_summarization import extract_answer_section
+
+ text = "Just a plain answer"
+ assert extract_answer_section(text) == "Just a plain answer"
+
+
+def test_extract_answer_section_empty() -> None:
+ """Test handling empty string."""
+ from open_i_summarization import extract_answer_section
+
+ assert extract_answer_section("") == ""
+
+
+def test_compute_normalized_judge_reward_perfect_scores() -> None:
+ """Test normalized reward with perfect scores."""
+ from open_i_summarization import _compute_normalized_judge_reward
+
+ scores = {
+ "correctness": {"score": 5, "reason": "Perfect"},
+ "completeness": {"score": 5, "reason": "Perfect"},
+ "conciseness": {"score": 5, "reason": "Perfect"},
+ }
+ assert _compute_normalized_judge_reward(scores) == pytest.approx(1.0)
+
+
+def test_compute_normalized_judge_reward_mixed_scores() -> None:
+ """Test normalized reward with mixed scores."""
+ from open_i_summarization import _compute_normalized_judge_reward
+
+ scores = {
+ "correctness": {"score": 5, "reason": "Good"},
+ "completeness": {"score": 3, "reason": "Partial"},
+ "conciseness": {"score": 4, "reason": "Good"},
+ }
+ # (5/5 + 3/5 + 4/5) / 3 = (1.0 + 0.6 + 0.8) / 3 = 0.8
+ assert _compute_normalized_judge_reward(scores) == pytest.approx(0.8)
+
+
+def test_compute_normalized_judge_reward_zero_scores() -> None:
+ """Test normalized reward with zero scores."""
+ from open_i_summarization import _compute_normalized_judge_reward
+
+ scores = {
+ "correctness": {"score": 0, "reason": "Poor"},
+ "completeness": {"score": 0, "reason": "Poor"},
+ "conciseness": {"score": 0, "reason": "Poor"},
+ }
+ assert _compute_normalized_judge_reward(scores) == pytest.approx(0.0)
+
+
+def test_compute_normalized_judge_reward_missing_scores() -> None:
+ """Test normalized reward with missing scores."""
+ from open_i_summarization import _compute_normalized_judge_reward
+
+ scores = {
+ "correctness": {"score": 5, "reason": "Good"},
+ # completeness and conciseness missing
+ }
+ # Only correctness is counted: (5/5) / 3 = 0.333...
+ assert _compute_normalized_judge_reward(scores) == pytest.approx(1 / 3)
+
+
+def test_compute_normalized_judge_reward_string_scores() -> None:
+ """Test normalized reward handles string scores."""
+ from open_i_summarization import _compute_normalized_judge_reward
+
+ scores = {
+ "correctness": {"score": "4", "reason": "Good"},
+ "completeness": {"score": "3", "reason": "Partial"},
+ "conciseness": {"score": "5", "reason": "Excellent"},
+ }
+ # (4/5 + 3/5 + 5/5) / 3 = (0.8 + 0.6 + 1.0) / 3 = 0.8
+ assert _compute_normalized_judge_reward(scores) == pytest.approx(0.8)
+
+
+def test_judge_template_contains_placeholders() -> None:
+ """Test that judge template has required placeholders."""
+ from open_i_summarization import JUDGE_TEMPLATE
+
+ assert "{findings}" in JUDGE_TEMPLATE
+ assert "{response}" in JUDGE_TEMPLATE
+ assert "{reference}" in JUDGE_TEMPLATE
+ assert "{output_format}" in JUDGE_TEMPLATE
+
+
+def test_judge_dimensions_defined() -> None:
+ """Test that judge dimensions are properly defined."""
+ from open_i_summarization import JUDGE_DIMENSIONS
+
+ # Note: "correctness" replaces "accuracy" per Nature Medicine paper terminology
+ assert "correctness" in JUDGE_DIMENSIONS
+ assert "completeness" in JUDGE_DIMENSIONS
+ assert "conciseness" in JUDGE_DIMENSIONS
+ assert len(JUDGE_DIMENSIONS) == 3
+
+
+def test_dataset_loading() -> None:
+ """Test that the dataset can be loaded."""
+ from datasets import load_dataset
+
+ ds = load_dataset("medarc/open-i-summarization", split="test")
+ assert len(ds) == 343
+ assert "idx" in ds.features
+ assert "inputs" in ds.features
+ assert "target" in ds.features
+
+
+def test_dataset_sample_structure() -> None:
+ """Test sample structure from dataset."""
+ from datasets import load_dataset
+
+ ds = load_dataset("medarc/open-i-summarization", split="test")
+ sample = ds[0]
+
+ assert isinstance(sample["idx"], int)
+ assert isinstance(sample["inputs"], str)
+ assert isinstance(sample["target"], str)
+ assert len(sample["inputs"]) > 0
+ assert len(sample["target"]) > 0
+
+
+@pytest.fixture
+def mock_api_key(monkeypatch: pytest.MonkeyPatch) -> None:
+ """Set a mock API key for testing."""
+ monkeypatch.setenv("OPENAI_API_KEY", "test-api-key")
+
+
+def test_environment_loading(mock_api_key: None) -> None:
+ """Test that environment loads successfully with mock API key."""
+ import verifiers as vf
+
+ env = vf.load_environment("open_i_summarization", compute_auto_metrics=False)
+
+ assert env.name == "open_i_summarization"
+ assert len(env.eval_dataset) == 343
+ assert "Summarize the radiology report findings" in env.system_prompt
+
+
+def test_environment_dataset_mapping(mock_api_key: None) -> None:
+ """Test that dataset is properly mapped to environment format."""
+ import verifiers as vf
+
+ env = vf.load_environment("open_i_summarization", compute_auto_metrics=False)
+ sample = env.eval_dataset[0]
+
+ assert "question" in sample
+ assert "answer" in sample
+ assert "info" in sample
+ assert "findings" in sample["info"]
+ assert "idx" in sample["info"]
+
+
+def test_environment_with_validation_split(mock_api_key: None) -> None:
+ """Test loading validation split."""
+ import verifiers as vf
+
+ env = vf.load_environment("open_i_summarization", split="validation", compute_auto_metrics=False)
+ assert len(env.eval_dataset) == 341
+
+
+def test_environment_with_train_split(mock_api_key: None) -> None:
+ """Test loading train split."""
+ import verifiers as vf
+
+ env = vf.load_environment("open_i_summarization", split="train", compute_auto_metrics=False)
+ assert len(env.eval_dataset) == 2735