diff --git a/configs/envs/open_i_summarization.yaml b/configs/envs/open_i_summarization.yaml new file mode 100644 index 00000000..a88700aa --- /dev/null +++ b/configs/envs/open_i_summarization.yaml @@ -0,0 +1,28 @@ +# Open-I Summarization environment configuration +# Radiology findings to impression summarization benchmark + +- id: open_i_summarization + module: open_i_summarization + num_examples: -1 + verbose: false + env_args: + split: test + compute_auto_metrics: true + +# Validation split variant +- id: open_i_summarization_val + module: open_i_summarization + num_examples: -1 + verbose: false + env_args: + split: validation + compute_auto_metrics: true + +# Fast evaluation (no automatic metrics) +- id: open_i_summarization_fast + module: open_i_summarization + num_examples: -1 + verbose: false + env_args: + split: test + compute_auto_metrics: false diff --git a/environments/open_i_summarization/README.md b/environments/open_i_summarization/README.md new file mode 100644 index 00000000..7a5c9b21 --- /dev/null +++ b/environments/open_i_summarization/README.md @@ -0,0 +1,213 @@ +# Open-I Summarization + +Evaluation environment for radiology report summarization: generating impressions from findings. + +### Overview +- **Environment ID**: `open_i_summarization` +- **Short description**: Radiology findings-to-impression summarization benchmark using the Open-I chest X-ray dataset. This environment evaluates how well models can distill radiology findings into concise, clinically accurate impressions. +- **Tags**: medical, radiology, summarization, single-turn, llm-judge, nlg-metrics +- **System Prompt**: "Summarize the radiology report findings into an impression with minimal text." 
+ +--- + +### Dataset +- **Source**: [medarc/open-i-summarization](https://huggingface.co/datasets/medarc/open-i-summarization) +- **Based on**: Indiana University Chest X-ray Collection (Open-I), as used in [Van Veen et al., Nature Medicine 2024](https://www.nature.com/articles/s41591-024-02855-5) - "Adapted large language models can outperform medical experts in clinical text summarization" +- **Split sizes**: + - **Train:** 2,735 examples + - **Validation:** 341 examples + - **Test:** 343 examples +- **Task**: Given radiology findings, generate a concise clinical impression. + +--- + +### Task +- **Type:** Single-Turn Summarization +- **Input:** Radiology findings (chest X-ray report text) +- **Output:** Clinical impression (summary) +- **Evaluation:** Dual evaluation approach following the Nature Medicine paper: + 1. **LLM-as-Judge**: Evaluates correctness, completeness, and conciseness (1-5 scale each) + 2. **Automatic Metrics**: BLEU, ROUGE (1/2/L/Lsum), BERTScore (precision/recall/F1) + +The LLM-as-judge criteria are adapted from the **Nature Medicine reader study** (Methods section): +- **Correctness**: "Which summary includes less false information?" — evaluates precision (penalizes fabricated information) +- **Completeness**: "Which summary more completely captures important information?" — evaluates recall (clinically important detail retained) +- **Conciseness**: "Which summary contains less non-important information?" — evaluates brevity (penalizes superfluous information) + +The implementation follows the pattern established in `medicationqa`, using multi-dimensional scoring with a JSONParser. 
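The normalization behind the single reward value can be sketched in a few lines of plain Python: each 1-5 dimension score is clamped, mapped to [0, 1], and the three dimensions are averaged. This is a minimal illustration only; `normalize_judge_scores` is a hypothetical helper name, not the environment's actual API.

```python
# Minimal sketch of the reward normalization: three 1-5 judge scores are
# clamped, mapped to [0, 1], and averaged. Illustrative only; the real
# implementation lives in open_i_summarization.py.

def normalize_judge_scores(scores: dict[str, float]) -> float:
    """Average per-dimension 1-5 scores onto a single 0-1 reward."""
    dims = ("correctness", "completeness", "conciseness")
    total = 0.0
    for dim in dims:
        clamped = max(0.0, min(5.0, scores.get(dim, 0.0)))  # clamp to [0, 5]
        total += clamped / 5.0
    return total / len(dims)

print(normalize_judge_scores({"correctness": 5, "completeness": 3, "conciseness": 4}))  # ~0.8
```

For example, a judge response of 5/3/4 normalizes to (1.0 + 0.6 + 0.8) / 3 = 0.8, which is the `reward` value reported for that rollout.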
+ +--- + +### Quickstart + +**Basic evaluation with default settings:** +```bash +python -m medarc_verifiers.cli.main open_i_summarization -m gpt-4.1-mini -n 5 -r 1 --judge-model gpt-4.1-mini -s +``` + +**Example output:** +``` +--- Evaluation --- +Environment: open_i_summarization +Model: gpt-4.1-mini +Provider: https://api.openai.com/v1/ +Examples: 5 +Rollouts per example: 1 +--- Example --- +╭──────────────────────────────────── Step 0 ────────────────────────────────────╮ +│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ │ +│ ┃ Prompt ┃ Completion ┃ Reward ┃ │ +│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ │ +│ │ system: Summarize the │ assistant: Normal heart size; │ 1.00 │ │ +│ │ radiology report findings into │ no acute pulmonary findings; │ │ │ +│ │ an impression with minimal │ mediastinal calcification and │ │ │ +│ │ text. │ dense right upper lung nodule │ │ │ +│ │ │ consistent with prior │ │ │ +│ │ user: Heart size within normal │ granulomatous disease. │ │ │ +│ │ limits. No focal alveolar │ │ │ │ +│ │ consolidation, no definite │ │ │ │ +│ │ pleural effusion seen... 
│ │ │ │ +│ └────────────────────────────────┴─────────────────────────────────┴────────┘ │ +╰────────────────────────────────────────────────────────────────────────────────╯ +--- All --- +Rewards: +reward: avg - 1.000, std - 0.000 +r1: [1.0, 1.0, 1.0, 1.0, 1.0] +``` + +**Run on validation split:** +```bash +python -m medarc_verifiers.cli.main open_i_summarization --split validation -m gpt-4.1-mini -n 10 -r 1 --judge-model gpt-4.1-mini -s +``` + +**Fast evaluation (without automatic metrics):** +```bash +python -m medarc_verifiers.cli.main open_i_summarization -m gpt-4.1-mini -n 10 -r 1 --judge-model gpt-4.1-mini --no-compute-auto-metrics -s +``` + +**Using a local model (e.g., Ollama):** +```bash +python -m medarc_verifiers.cli.main open_i_summarization \ + -m llama3 \ + --api-base-url http://localhost:11434/v1 \ + --env-args '{"judge_model":"llama3","judge_base_url":"http://localhost:11434/v1","judge_api_key":"ollama"}' \ + -n 5 -r 1 -s +``` + +--- + +### Environment Arguments + +| Arg | Type | Default | Description | +| --- | ---- | ------- | ----------- | +| `split` | `str` | `"test"` | Dataset split to use (`train`, `validation`, `test`). | +| `judge_model` | `str` | `"gpt-4o-mini"` | Model identifier for the LLM judge. | +| `judge_base_url` | `str \| None` | `None` | Custom API base URL (e.g., for Ollama or local models). | +| `judge_api_key` | `str \| None` | `None` | API key for the judge model (falls back to `OPENAI_API_KEY`). | +| `compute_auto_metrics` | `bool` | `True` | Whether to compute BLEU/ROUGE/BERTScore metrics. | +| `system_prompt` | `str \| None` | `None` | Custom system prompt (uses default if not provided). 
| + +--- + +### Metrics + +#### Primary Metric (Reward) +| Metric | Meaning | +|--------|---------| +| `reward` | Normalized LLM-judge score (0-1), averaged across correctness, completeness, and conciseness | + +#### LLM-Judge Dimensions (1-5 scale) + +Criteria adapted from [Van Veen et al., Nature Medicine 2024](https://doi.org/10.1038/s41591-024-02855-5) (Methods - Reader study): + +| Dimension | Description | +|-----------|-------------| +| `correctness` | Does the summary include false information? Evaluates precision—penalizes fabricated or incorrect information. | +| `completeness` | Does the summary completely capture important information? Evaluates recall—clinically important detail retained. | +| `conciseness` | Does the summary contain non-important information? Evaluates brevity—penalizes superfluous information. Compares output length to reference. | + +#### Automatic Metrics (following Van Veen et al.) +| Metric | Description | +|--------|-------------| +| `bleu` | BLEU score (n-gram precision) | +| `rouge1` | ROUGE-1 (unigram overlap) | +| `rouge2` | ROUGE-2 (bigram overlap) | +| `rougeL` | ROUGE-L (longest common subsequence) | +| `rougeLsum` | ROUGE-Lsum (sentence-level LCS) | +| `bertscore_precision` | BERTScore precision | +| `bertscore_recall` | BERTScore recall | +| `bertscore_f1` | BERTScore F1 | + +--- + +### Results Dataset Structure + +#### Core Evaluation Fields +- **`prompt`** – The radiology findings presented to the model. +- **`completion`** – The model-generated impression. +- **`reward`** – Normalized LLM-judge score in `[0, 1]`. + +#### Example Metadata (`info`) +- **`idx`** – Original dataset index. +- **`findings`** – The input radiology findings text. +- **`judge_feedback`** – Detailed LLM-judge evaluation with scores and reasoning. +- **`auto_metrics`** – Dictionary containing BLEU, ROUGE, and BERTScore values. 
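The conciseness dimension is anchored to the model/reference character-length ratio, which the environment pre-computes and passes to the judge. A rough sketch of that banding (the function name is illustrative; in the environment the LLM judge assigns the actual score, guided by the ratio in its prompt):

```python
# Sketch of the length-ratio banding that guides conciseness scoring.
# Illustrative only: in the environment the LLM judge assigns the score,
# with the pre-computed character-length ratio supplied in its prompt.

def conciseness_band(model_text: str, reference_text: str) -> int:
    """Map the model/reference character-length ratio to a 1-5 band."""
    ratio = len(model_text) / max(len(reference_text), 1)  # guard divide-by-zero
    if ratio <= 1.0:
        return 5  # same length or shorter than the reference
    if ratio <= 1.5:
        return 4  # up to 50% longer
    if ratio <= 2.0:
        return 3  # 50-100% longer
    if ratio <= 3.0:
        return 2  # 2-3x longer
    return 1      # more than 3x longer

print(conciseness_band("a" * 9, "b" * 10))   # ratio 0.9 -> 5
print(conciseness_band("a" * 25, "b" * 10))  # ratio 2.5 -> 2
```

The `length_metrics` entry stored in `info` records the same `model_length`, `reference_length`, and `length_ratio` values for later analysis.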
+ +--- + +### Example Results + +The LLM-as-judge now applies stricter conciseness scoring based on the Nature Medicine paper criteria: + +``` +=== Example 1 === +Model output: Normal heart size; no alveolar consolidation, pleural effusion, + or pulmonary edema; mediastinal calcification and right upper + lung nodule indicate prior granulomatous disease. +Reference: No acute cardiopulmonary findings +Length ratio: 5.3x +Reward: 0.933 + +LLM-as-Judge Scores: + correctness: 5/5 (no false information) + completeness: 5/5 (all key findings captured) + conciseness: 4/5 (longer than reference, but clinically relevant) + +=== Example 2 === +Model output: Right middle and lower lobe opacity; mediastinal contours normal; + no fissure displacement or pneumothorax. +Reference: Opacification of the right middle and lower lobes. +Length ratio: 2.1x +Reward: 0.867 + +LLM-as-Judge Scores: + correctness: 5/5 + completeness: 5/5 + conciseness: 3/5 (approximately 2x longer with additional details) +``` + +--- + +### References + +**Dataset Source** +- Van Veen, D., Van Uden, C., Blankemeier, L. et al. "Adapted large language models can outperform medical experts in clinical text summarization." *Nature Medicine* 30, 1134–1142 (2024). 
https://doi.org/10.1038/s41591-024-02855-5
+
+**Open-I Dataset**
+```bibtex
+@article{demner2016preparing,
+  title={Preparing a collection of radiology examinations for distribution and retrieval},
+  author={Demner-Fushman, Dina and Kohli, Marc D and Rosenman, Marc B and Shooshan, Sonya E and Rodriguez, Laritza and Antani, Sameer and Thoma, George R and McDonald, Clement J},
+  journal={Journal of the American Medical Informatics Association},
+  volume={23},
+  number={2},
+  pages={304--310},
+  year={2016},
+  publisher={Oxford University Press}
+}
+```
+
+**Evaluation Metrics**
+- BLEU: Papineni et al., "BLEU: a Method for Automatic Evaluation of Machine Translation" (ACL 2002)
+- ROUGE: Lin, "ROUGE: A Package for Automatic Evaluation of Summaries" (ACL 2004)
+- BERTScore: Zhang et al., "BERTScore: Evaluating Text Generation with BERT" (ICLR 2020)
diff --git a/environments/open_i_summarization/open_i_summarization.py b/environments/open_i_summarization/open_i_summarization.py
new file mode 100644
index 00000000..f4531ce0
--- /dev/null
+++ b/environments/open_i_summarization/open_i_summarization.py
@@ -0,0 +1,364 @@
+"""
+Open-I Summarization Environment
+
+Evaluation environment for radiology report summarization (findings → impression).
+Uses LLM-as-judge evaluation plus automatic metrics (BLEU, ROUGE, BERTScore).
+
+Dataset: medarc/open-i-summarization
+"""
+
+import re
+from typing import Any
+
+import evaluate
+import verifiers as vf
+from datasets import load_dataset
+from datasets.utils.logging import disable_progress_bar
+from medarc_verifiers.parsers import JSONParser
+from medarc_verifiers.utils import default_judge_api_key, judge_sampling_args_and_headers
+from openai import AsyncOpenAI
+from verifiers.types import Info, Messages, State
+
+disable_progress_bar()  # suppress datasets progress bar
+
+# --- Judge Prompt Template ---
+# Evaluation criteria adapted from:
+# Van Veen D. et al., "Adapted large language models can outperform medical experts
+# in clinical text summarization," Nature Medicine (2024).
+# https://doi.org/10.1038/s41591-024-02855-5
+# See: Methods - Reader study section
+
+JUDGE_TEMPLATE = """\
+You are a radiology expert tasked with evaluating how well a model summarized radiology findings into an impression.
+
+Your goal is to assess the quality of the model's generated impression compared to the reference impression.
+You will rate the impression along three dimensions: correctness, completeness, and conciseness.
+
+Definitions (from Nature Medicine reader study criteria):
+- Correctness: Does the summary include false information? Evaluate precision—penalize any fabricated or incorrect information not supported by the findings.
+- Completeness: Does the summary completely capture important information? Evaluate recall—the amount of clinically important detail retained from the input findings.
+- Conciseness: Does the summary contain non-important information? Evaluate brevity—the value of a summary decreases with superfluous information. A good summary should be as brief as possible while capturing key findings.
+
+The radiology findings will be provided in these tags:
+<findings>
+{findings}
+</findings>
+
+The model's generated impression will be provided in these tags:
+<response>
+{response}
+</response>
+
+The reference impression will be provided in these tags:
+<reference>
+{reference}
+</reference>
+
+Length information (pre-computed):
+- Model impression length: {model_length} characters
+- Reference impression length: {reference_length} characters
+- Length ratio: {length_ratio}x
+
+For each dimension:
+1. Provide a brief explanation (1–3 sentences) describing why you assigned the score.
+2. Then assign a score from 1 to 5:
+
+Correctness scoring:
+- 5: No false or fabricated information
+- 4: Minor inaccuracy that doesn't affect clinical interpretation
+- 3: Some inaccurate information present
+- 2: Multiple inaccuracies or one significant error
+- 1: Major fabrications or errors
+
+Completeness scoring:
+- 5: All clinically important information captured
+- 4: Most important information captured, minor omissions
+- 3: Key findings present but some important details missing
+- 2: Several important findings omitted
+- 1: Critical findings missing
+
+Conciseness scoring (STRICTLY based on length ratio above):
+- 5: Length ratio ≤ 1.0 (same length or shorter than reference)
+- 4: Length ratio > 1.0 and ≤ 1.5 (up to 50% longer)
+- 3: Length ratio > 1.5 and ≤ 2.0 (50-100% longer)
+- 2: Length ratio > 2.0 and ≤ 3.0 (2-3x longer)
+- 1: Length ratio > 3.0 (more than 3x longer)
+
+IMPORTANT: The conciseness score MUST follow the length ratio guidelines above. Do not override based on content quality.
+
+{output_format}
+"""
+
+JUDGE_OUTPUT_JSON = """
+Output your evaluation as a single valid JSON object matching the following structure:
+{
+  "correctness": {
+    "reason": "Brief explanation of why this score was given.",
+    "score": 0
+  },
+  "completeness": {
+    "reason": "Brief explanation of why this score was given.",
+    "score": 0
+  },
+  "conciseness": {
+    "reason": "Brief explanation of why this score was given.",
+    "score": 0
+  }
+}
+
+Ensure the output is valid JSON:
+- Use double quotes (") for all keys and string values.
+- Escape any internal quotes inside the reason fields.
+- Do not include any additional text outside the JSON object.
+- Do not explain your reasoning outside the JSON object; all justification must appear only in the "reason" fields.
+"""
+
+# Scored dimensions must match the keys emitted by JUDGE_OUTPUT_JSON
+# Note: "correctness" replaces "accuracy" per Nature Medicine paper terminology
+JUDGE_DIMENSIONS = ["correctness", "completeness", "conciseness"]
+
+
+def _extract_completion_text(completion: Messages) -> str:
+    """Extract the assistant's text content from a chat-style completion."""
+    if isinstance(completion, list) and completion:
+        last_msg = completion[-1]
+        if isinstance(last_msg, dict):
+            return str(last_msg.get("content", ""))
+    return str(completion)
+
+
+def extract_answer_section(completion_text: str) -> str:
+    """Extract the final answer after <think> tags if present."""
+    if not completion_text:
+        return ""
+    if "<think>" in completion_text and "</think>" in completion_text:
+        return re.sub(r"<think>.*?</think>", "", completion_text, flags=re.DOTALL).strip()
+    return completion_text.strip()
+
+
+def _coerce_score(value: Any) -> float | None:
+    """Best-effort conversion of a score value to a float, or None if not possible."""
+    if value is None:
+        return None
+    if isinstance(value, (int, float)):
+        return float(value)
+    if isinstance(value, str):
+        value = value.strip()
+        if not value:
+            return None
+        try:
+            return float(value)
+        except ValueError:
+            return None
+    return None
+
+
+def _compute_normalized_judge_reward(scores: dict[str, dict[str, Any]]) -> float:
+    """Normalize per-dimension judge scores to a single value in [0.0, 1.0].
+
+    Each dimension is expected to be on a 1–5 scale. Scores are clamped to
+    [0, 5], divided by 5 to map to [0, 1], and then averaged across dimensions.
+    """
+    total_dims = len(JUDGE_DIMENSIONS)
+    if total_dims == 0:
+        return 0.0
+
+    accumulated = 0.0
+    for dimension in JUDGE_DIMENSIONS:
+        score = _coerce_score(scores.get(dimension, {}).get("score"))
+        if score is None:
+            continue
+        clamped = max(0.0, min(5.0, score))
+        accumulated += clamped / 5.0
+
+    return max(0.0, min(1.0, accumulated / total_dims))
+
+
+def load_environment(
+    split: str = "test",
+    judge_model: str = "gpt-4o-mini",
+    judge_base_url: str | None = None,
+    judge_api_key: str | None = None,
+    compute_auto_metrics: bool = True,
+    system_prompt: str | None = None,
+    **kwargs: Any,
+) -> vf.SingleTurnEnv:
+    """
+    Load the Open-I Summarization evaluation environment.
+
+    This environment evaluates radiology findings → impression summarization using:
+    1. LLM-as-judge evaluation (correctness, completeness, conciseness)
+    2. Automatic metrics: BLEU, ROUGE, BERTScore (optional)
+
+    Args:
+        split: Dataset split to use ('train', 'validation', 'test'). Default: 'test'.
+        judge_model: Model identifier for the LLM judge. Default: 'gpt-4o-mini'.
+        judge_base_url: Base URL for judge API (for non-OpenAI endpoints).
+        judge_api_key: API key for judge model. Falls back to env vars if not provided.
+        compute_auto_metrics: Whether to compute BLEU/ROUGE/BERTScore. Default: True.
+        system_prompt: Custom system prompt. Uses default if not provided.
+        **kwargs: Additional arguments forwarded to vf.SingleTurnEnv.
+
+    Returns:
+        A configured vf.SingleTurnEnv for radiology summarization evaluation.
+ """ + # Load dataset + eval_dataset = load_dataset("medarc/open-i-summarization", split=split) + + def _map(ex: dict) -> dict: + """Map dataset example to environment format.""" + return { + "question": ex["inputs"].strip(), # radiology findings + "answer": ex["target"].strip(), # reference impression + "info": { + "idx": ex["idx"], + "findings": ex["inputs"].strip(), + }, + } + + eval_dataset = eval_dataset.map(_map, remove_columns=eval_dataset.column_names) + + # Default system prompt for summarization + final_system_prompt = system_prompt or ( + "Summarize the radiology report findings into an impression with minimal text." + ) + + # Initialize automatic metrics if enabled + bleu_metric = None + rouge_metric = None + bertscore_metric = None + + if compute_auto_metrics: + try: + bleu_metric = evaluate.load("bleu") + rouge_metric = evaluate.load("rouge") + bertscore_metric = evaluate.load("bertscore") + except Exception as e: + print(f"Warning: Could not load automatic metrics: {e}") + compute_auto_metrics = False + + # Judge client setup + api_key = default_judge_api_key(judge_base_url) if judge_api_key is None else judge_api_key + sampling_args, default_headers = judge_sampling_args_and_headers(judge_model, judge_base_url) + # Remove extra_body as OpenAI doesn't support the usage tracking parameter + sampling_args.pop("extra_body", None) + judge_client = AsyncOpenAI(base_url=judge_base_url, api_key=api_key, default_headers=default_headers) + judge_parser = JSONParser(fields=list(JUDGE_DIMENSIONS)) + + judge_rubric = vf.JudgeRubric( + parallelize_scoring=True, + judge_client=judge_client, + judge_model=judge_model, + judge_prompt="{question}", + judge_sampling_args=sampling_args, + ) + + async def reward_open_i_summarization( + prompt: Messages, + completion: Messages, + info: Info, + state: State, + ) -> float: + """Evaluate radiology summarization using LLM-judge and automatic metrics.""" + findings = str(info.get("findings", state.get("question", ""))) + 
reference = str(state.get("answer", "")) + model_response = extract_answer_section(_extract_completion_text(completion)) + + # Compute length metrics for conciseness scoring + model_length = len(model_response) + reference_length = len(reference) if reference else 1 # avoid division by zero + length_ratio = model_length / reference_length + + # Store length info for analysis + info["length_metrics"] = { + "model_length": model_length, + "reference_length": reference_length, + "length_ratio": round(length_ratio, 2), + } + + # --- LLM-as-Judge Evaluation --- + judge_prompt = JUDGE_TEMPLATE.format( + findings=findings, + response=model_response, + reference=reference, + model_length=model_length, + reference_length=reference_length, + length_ratio=f"{length_ratio:.1f}", + output_format=JUDGE_OUTPUT_JSON, + ) + + try: + judge_raw = await judge_rubric.judge(judge_prompt, model_response, reference, state) + parsed = judge_parser.parse(str(judge_raw), strip=True) + except AttributeError: + judge_raw = await judge_rubric.judge(judge_prompt, "", "", state) + parsed = judge_parser.parse(str(judge_raw), strip=True) + + if parsed is None: + parsed = {dim: {"score": None, "reason": None} for dim in JUDGE_DIMENSIONS} + + judge_reward = _compute_normalized_judge_reward(parsed) + + # Store judge feedback + info.setdefault("judge_feedback", []).append( + { + "scores": parsed, + "normalized_reward": judge_reward, + "raw_judge": str(judge_raw), + } + ) + + # --- Automatic Metrics (BLEU, ROUGE, BERTScore) --- + auto_metrics: dict[str, Any] = {} + + if compute_auto_metrics and model_response and reference: + predictions = [model_response] + references = [reference] + + # BLEU + try: + bleu_scores = bleu_metric.compute(predictions=predictions, references=references) + auto_metrics["bleu"] = bleu_scores.get("bleu", 0.0) + except Exception: + auto_metrics["bleu"] = 0.0 + + # ROUGE + try: + rouge_scores = rouge_metric.compute(predictions=predictions, references=references) + 
auto_metrics["rouge1"] = rouge_scores.get("rouge1", 0.0) + auto_metrics["rouge2"] = rouge_scores.get("rouge2", 0.0) + auto_metrics["rougeL"] = rouge_scores.get("rougeL", 0.0) + auto_metrics["rougeLsum"] = rouge_scores.get("rougeLsum", 0.0) + except Exception: + auto_metrics["rouge1"] = 0.0 + auto_metrics["rouge2"] = 0.0 + auto_metrics["rougeL"] = 0.0 + auto_metrics["rougeLsum"] = 0.0 + + # BERTScore + try: + bert_scores = bertscore_metric.compute(predictions=predictions, references=references, lang="en") + auto_metrics["bertscore_precision"] = ( + bert_scores["precision"][0] if bert_scores.get("precision") else 0.0 + ) + auto_metrics["bertscore_recall"] = bert_scores["recall"][0] if bert_scores.get("recall") else 0.0 + auto_metrics["bertscore_f1"] = bert_scores["f1"][0] if bert_scores.get("f1") else 0.0 + except Exception: + auto_metrics["bertscore_precision"] = 0.0 + auto_metrics["bertscore_recall"] = 0.0 + auto_metrics["bertscore_f1"] = 0.0 + + info["auto_metrics"] = auto_metrics + + # Return the LLM-judge reward as the primary metric + return judge_reward + + judge_rubric.add_reward_func(reward_open_i_summarization, weight=1.0) + + return vf.SingleTurnEnv( + dataset=eval_dataset, + eval_dataset=eval_dataset, + system_prompt=final_system_prompt, + rubric=judge_rubric, + name="open_i_summarization", + **kwargs, + ) diff --git a/environments/open_i_summarization/pyproject.toml b/environments/open_i_summarization/pyproject.toml new file mode 100644 index 00000000..e71a3e8b --- /dev/null +++ b/environments/open_i_summarization/pyproject.toml @@ -0,0 +1,30 @@ +[project] +name = "open_i_summarization" +description = "Radiology findings to impression summarization benchmark using Open-I dataset" +tags = ["medical", "radiology", "summarization", "single-turn", "llm-judge", "nlg-metrics"] +version = "0.1.0" +requires-python = ">=3.11" +dependencies = [ + "verifiers>=0.1.4", + "datasets>=2.13.0", + "medarc_verifiers>=0.1.0", + "evaluate>=0.4.0", + "bert_score>=0.3.13", + 
"rouge_score>=0.1.2",
+    "sacrebleu>=2.3.0",
+]
+
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[tool.hatch.build]
+include = ["open_i_summarization.py"]
+
+[tool.uv.sources]
+medarc_verifiers = { git = "https://github.com/MedARC-AI/med-lm-envs" }
+
+[tool.prime.environment]
+loader = "open_i_summarization:load_environment"
+display_name = "Open-I Summarization"
+visibility = "PUBLIC"
diff --git a/tests/test_open_i_summarization.py b/tests/test_open_i_summarization.py
new file mode 100644
index 00000000..789a5c46
--- /dev/null
+++ b/tests/test_open_i_summarization.py
@@ -0,0 +1,205 @@
+"""Tests for the open_i_summarization environment."""
+
+import pytest
+
+
+def test_extract_completion_text_from_list() -> None:
+    """Test extracting text from chat-style completion."""
+    from open_i_summarization import _extract_completion_text
+
+    completion = [{"role": "assistant", "content": "No acute findings"}]
+    assert _extract_completion_text(completion) == "No acute findings"
+
+
+def test_extract_completion_text_from_string() -> None:
+    """Test extracting text from string completion."""
+    from open_i_summarization import _extract_completion_text
+
+    completion = "No acute findings"
+    assert _extract_completion_text(completion) == "No acute findings"
+
+
+def test_extract_completion_text_empty_list() -> None:
+    """Test handling empty completion list."""
+    from open_i_summarization import _extract_completion_text
+
+    completion = []
+    assert _extract_completion_text(completion) == "[]"
+
+
+def test_extract_answer_section_with_think_tags() -> None:
+    """Test extracting answer after think tags."""
+    from open_i_summarization import extract_answer_section
+
+    text = "<think>Some reasoning here...</think>The actual answer"
+    assert extract_answer_section(text) == "The actual answer"
+
+
+def test_extract_answer_section_without_think_tags() -> None:
+    """Test extracting answer without think tags."""
+    from open_i_summarization import extract_answer_section
+
+
text = "Just a plain answer" + assert extract_answer_section(text) == "Just a plain answer" + + +def test_extract_answer_section_empty() -> None: + """Test handling empty string.""" + from open_i_summarization import extract_answer_section + + assert extract_answer_section("") == "" + + +def test_compute_normalized_judge_reward_perfect_scores() -> None: + """Test normalized reward with perfect scores.""" + from open_i_summarization import _compute_normalized_judge_reward + + scores = { + "correctness": {"score": 5, "reason": "Perfect"}, + "completeness": {"score": 5, "reason": "Perfect"}, + "conciseness": {"score": 5, "reason": "Perfect"}, + } + assert _compute_normalized_judge_reward(scores) == pytest.approx(1.0) + + +def test_compute_normalized_judge_reward_mixed_scores() -> None: + """Test normalized reward with mixed scores.""" + from open_i_summarization import _compute_normalized_judge_reward + + scores = { + "correctness": {"score": 5, "reason": "Good"}, + "completeness": {"score": 3, "reason": "Partial"}, + "conciseness": {"score": 4, "reason": "Good"}, + } + # (5/5 + 3/5 + 4/5) / 3 = (1.0 + 0.6 + 0.8) / 3 = 0.8 + assert _compute_normalized_judge_reward(scores) == pytest.approx(0.8) + + +def test_compute_normalized_judge_reward_zero_scores() -> None: + """Test normalized reward with zero scores.""" + from open_i_summarization import _compute_normalized_judge_reward + + scores = { + "correctness": {"score": 0, "reason": "Poor"}, + "completeness": {"score": 0, "reason": "Poor"}, + "conciseness": {"score": 0, "reason": "Poor"}, + } + assert _compute_normalized_judge_reward(scores) == pytest.approx(0.0) + + +def test_compute_normalized_judge_reward_missing_scores() -> None: + """Test normalized reward with missing scores.""" + from open_i_summarization import _compute_normalized_judge_reward + + scores = { + "correctness": {"score": 5, "reason": "Good"}, + # completeness and conciseness missing + } + # Only correctness is counted: (5/5) / 3 = 0.333... 
+ assert _compute_normalized_judge_reward(scores) == pytest.approx(1 / 3) + + +def test_compute_normalized_judge_reward_string_scores() -> None: + """Test normalized reward handles string scores.""" + from open_i_summarization import _compute_normalized_judge_reward + + scores = { + "correctness": {"score": "4", "reason": "Good"}, + "completeness": {"score": "3", "reason": "Partial"}, + "conciseness": {"score": "5", "reason": "Excellent"}, + } + # (4/5 + 3/5 + 5/5) / 3 = (0.8 + 0.6 + 1.0) / 3 = 0.8 + assert _compute_normalized_judge_reward(scores) == pytest.approx(0.8) + + +def test_judge_template_contains_placeholders() -> None: + """Test that judge template has required placeholders.""" + from open_i_summarization import JUDGE_TEMPLATE + + assert "{findings}" in JUDGE_TEMPLATE + assert "{response}" in JUDGE_TEMPLATE + assert "{reference}" in JUDGE_TEMPLATE + assert "{output_format}" in JUDGE_TEMPLATE + + +def test_judge_dimensions_defined() -> None: + """Test that judge dimensions are properly defined.""" + from open_i_summarization import JUDGE_DIMENSIONS + + # Note: "correctness" replaces "accuracy" per Nature Medicine paper terminology + assert "correctness" in JUDGE_DIMENSIONS + assert "completeness" in JUDGE_DIMENSIONS + assert "conciseness" in JUDGE_DIMENSIONS + assert len(JUDGE_DIMENSIONS) == 3 + + +def test_dataset_loading() -> None: + """Test that the dataset can be loaded.""" + from datasets import load_dataset + + ds = load_dataset("medarc/open-i-summarization", split="test") + assert len(ds) == 343 + assert "idx" in ds.features + assert "inputs" in ds.features + assert "target" in ds.features + + +def test_dataset_sample_structure() -> None: + """Test sample structure from dataset.""" + from datasets import load_dataset + + ds = load_dataset("medarc/open-i-summarization", split="test") + sample = ds[0] + + assert isinstance(sample["idx"], int) + assert isinstance(sample["inputs"], str) + assert isinstance(sample["target"], str) + assert 
len(sample["inputs"]) > 0 + assert len(sample["target"]) > 0 + + +@pytest.fixture +def mock_api_key(monkeypatch: pytest.MonkeyPatch) -> None: + """Set a mock API key for testing.""" + monkeypatch.setenv("OPENAI_API_KEY", "test-api-key") + + +def test_environment_loading(mock_api_key: None) -> None: + """Test that environment loads successfully with mock API key.""" + import verifiers as vf + + env = vf.load_environment("open_i_summarization", compute_auto_metrics=False) + + assert env.name == "open_i_summarization" + assert len(env.eval_dataset) == 343 + assert "Summarize the radiology report findings" in env.system_prompt + + +def test_environment_dataset_mapping(mock_api_key: None) -> None: + """Test that dataset is properly mapped to environment format.""" + import verifiers as vf + + env = vf.load_environment("open_i_summarization", compute_auto_metrics=False) + sample = env.eval_dataset[0] + + assert "question" in sample + assert "answer" in sample + assert "info" in sample + assert "findings" in sample["info"] + assert "idx" in sample["info"] + + +def test_environment_with_validation_split(mock_api_key: None) -> None: + """Test loading validation split.""" + import verifiers as vf + + env = vf.load_environment("open_i_summarization", split="validation", compute_auto_metrics=False) + assert len(env.eval_dataset) == 341 + + +def test_environment_with_train_split(mock_api_key: None) -> None: + """Test loading train split.""" + import verifiers as vf + + env = vf.load_environment("open_i_summarization", split="train", compute_auto_metrics=False) + assert len(env.eval_dataset) == 2735