`configs/envs/open_i_summarization.yaml` (+28 lines)
# Open-I Summarization environment configuration
# Radiology findings to impression summarization benchmark

- id: open_i_summarization
module: open_i_summarization
num_examples: -1
verbose: false
env_args:
split: test
compute_auto_metrics: true

# Validation split variant
- id: open_i_summarization_val
module: open_i_summarization
num_examples: -1
verbose: false
env_args:
split: validation
compute_auto_metrics: true

# Fast evaluation (no automatic metrics)
- id: open_i_summarization_fast
module: open_i_summarization
num_examples: -1
verbose: false
env_args:
split: test
compute_auto_metrics: false
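The config entries above can be loaded and selected by `id` with standard YAML tooling. A minimal sketch, assuming PyYAML is available (the config is inlined here so the snippet is self-contained; the real file lives at `configs/envs/open_i_summarization.yaml`):

```python
# Sketch: selecting one environment entry from a YAML config list by id.
# Assumes PyYAML; two of the three entries are inlined for illustration.
import yaml

CONFIG = """
- id: open_i_summarization
  module: open_i_summarization
  env_args:
    split: test
    compute_auto_metrics: true
- id: open_i_summarization_fast
  module: open_i_summarization
  env_args:
    split: test
    compute_auto_metrics: false
"""

def select_env(config_text: str, env_id: str) -> dict:
    """Return the config entry whose id matches env_id."""
    entries = yaml.safe_load(config_text)
    matches = [e for e in entries if e["id"] == env_id]
    if not matches:
        raise KeyError(f"no environment with id {env_id!r}")
    return matches[0]

fast = select_env(CONFIG, "open_i_summarization_fast")
print(fast["env_args"]["compute_auto_metrics"])  # False
```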
`environments/open_i_summarization/README.md` (+213 lines)
# Open-I Summarization

Evaluation environment for radiology report summarization: generating impressions from findings.

### Overview
- **Environment ID**: `open_i_summarization`
- **Short description**: Radiology findings-to-impression summarization benchmark using the Open-I chest X-ray dataset. This environment evaluates how well models can distill radiology findings into concise, clinically accurate impressions.
- **Tags**: medical, radiology, summarization, single-turn, llm-judge, nlg-metrics
- **System Prompt**: "Summarize the radiology report findings into an impression with minimal text."

---

### Dataset
- **Source**: [medarc/open-i-summarization](https://huggingface.co/datasets/medarc/open-i-summarization)
- **Based on**: Indiana University Chest X-ray Collection (Open-I), as used in [Van Veen et al., Nature Medicine 2024](https://www.nature.com/articles/s41591-024-02855-5) - "Adapted large language models can outperform medical experts in clinical text summarization"
- **Split sizes**:
- **Train:** 2,735 examples
- **Validation:** 341 examples
- **Test:** 343 examples
- **Task**: Given radiology findings, generate a concise clinical impression.

---

### Task
- **Type:** Single-Turn Summarization
- **Input:** Radiology findings (chest X-ray report text)
- **Output:** Clinical impression (summary)
- **Evaluation:** Dual evaluation approach following the Nature Medicine paper:
1. **LLM-as-Judge**: Evaluates correctness, completeness, and conciseness (1-5 scale each)
2. **Automatic Metrics**: BLEU, ROUGE (1/2/L/Lsum), BERTScore (precision/recall/F1)

The LLM-as-judge criteria are adapted from the **Nature Medicine reader study** (Methods section):
- **Correctness**: "Which summary includes less false information?" — evaluates precision (penalizes fabricated information)
- **Completeness**: "Which summary more completely captures important information?" — evaluates recall (clinically important detail retained)
- **Conciseness**: "Which summary contains less non-important information?" — evaluates brevity (penalizes superfluous information)

The implementation follows the pattern established in `medicationqa`, using multi-dimensional scoring with a JSONParser.
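The three-dimensional scoring can be sketched as follows. This is an illustration, not the environment's actual code: the JSON field names are assumptions, and the sum-over-15 normalization is inferred from the example rewards shown later in this README (scores of 5/5/4 yielding 0.933).

```python
# Sketch of multi-dimensional judge scoring, assuming the judge returns a
# JSON object with three 1-5 scores. Field names and the sum/15
# normalization are inferred from this README's example rewards.
import json

def parse_judge_reward(judge_response: str) -> float:
    """Parse the judge's JSON reply and normalize the three scores to [0, 1]."""
    scores = json.loads(judge_response)
    total = sum(scores[k] for k in ("correctness", "completeness", "conciseness"))
    return total / 15.0  # three dimensions, 5 points each

reward = parse_judge_reward('{"correctness": 5, "completeness": 5, "conciseness": 4}')
print(round(reward, 3))  # 0.933
```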

---

### Quickstart

**Basic evaluation with default settings:**
```bash
python -m medarc_verifiers.cli.main open_i_summarization -m gpt-4.1-mini -n 5 -r 1 --judge-model gpt-4.1-mini -s
```

**Example output:**
```
--- Evaluation ---
Environment: open_i_summarization
Model: gpt-4.1-mini
Provider: https://api.openai.com/v1/
Examples: 5
Rollouts per example: 1
--- Example ---
╭──────────────────────────────────── Step 0 ────────────────────────────────────╮
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ │
│ ┃ Prompt ┃ Completion ┃ Reward ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ │
│ │ system: Summarize the │ assistant: Normal heart size; │ 1.00 │ │
│ │ radiology report findings into │ no acute pulmonary findings; │ │ │
│ │ an impression with minimal │ mediastinal calcification and │ │ │
│ │ text. │ dense right upper lung nodule │ │ │
│ │ │ consistent with prior │ │ │
│ │ user: Heart size within normal │ granulomatous disease. │ │ │
│ │ limits. No focal alveolar │ │ │ │
│ │ consolidation, no definite │ │ │ │
│ │ pleural effusion seen... │ │ │ │
│ └────────────────────────────────┴─────────────────────────────────┴────────┘ │
╰────────────────────────────────────────────────────────────────────────────────╯
--- All ---
Rewards:
reward: avg - 1.000, std - 0.000
r1: [1.0, 1.0, 1.0, 1.0, 1.0]
```

**Run on validation split:**
```bash
python -m medarc_verifiers.cli.main open_i_summarization --split validation -m gpt-4.1-mini -n 10 -r 1 --judge-model gpt-4.1-mini -s
```

**Fast evaluation (without automatic metrics):**
```bash
python -m medarc_verifiers.cli.main open_i_summarization -m gpt-4.1-mini -n 10 -r 1 --judge-model gpt-4.1-mini --no-compute-auto-metrics -s
```

**Using a local model (e.g., Ollama):**
```bash
python -m medarc_verifiers.cli.main open_i_summarization \
-m llama3 \
--api-base-url http://localhost:11434/v1 \
--env-args '{"judge_model":"llama3","judge_base_url":"http://localhost:11434/v1","judge_api_key":"ollama"}' \
-n 5 -r 1 -s
```

---

### Environment Arguments

| Arg | Type | Default | Description |
| --- | ---- | ------- | ----------- |
| `split` | `str` | `"test"` | Dataset split to use (`train`, `validation`, `test`). |
| `judge_model` | `str` | `"gpt-4o-mini"` | Model identifier for the LLM judge. |
| `judge_base_url` | `str \| None` | `None` | Custom API base URL (e.g., for Ollama or local models). |
| `judge_api_key` | `str \| None` | `None` | API key for the judge model (falls back to `OPENAI_API_KEY`). |
| `compute_auto_metrics` | `bool` | `True` | Whether to compute BLEU/ROUGE/BERTScore metrics. |
| `system_prompt` | `str \| None` | `None` | Custom system prompt (uses default if not provided). |

---

### Metrics

#### Primary Metric (Reward)
| Metric | Meaning |
|--------|---------|
| `reward` | Normalized LLM-judge score (0-1), averaged across correctness, completeness, and conciseness |

#### LLM-Judge Dimensions (1-5 scale)

Criteria adapted from [Van Veen et al., Nature Medicine 2024](https://doi.org/10.1038/s41591-024-02855-5) (Methods - Reader study):

| Dimension | Description |
|-----------|-------------|
| `correctness` | Does the summary include false information? Evaluates precision—penalizes fabricated or incorrect information. |
| `completeness` | Does the summary completely capture important information? Evaluates recall—clinically important detail retained. |
| `conciseness` | Does the summary contain non-important information? Evaluates brevity—penalizes superfluous information. Compares output length to reference. |
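The length comparison behind the conciseness dimension can be illustrated with a simple word-level ratio. The environment's exact length computation is not specified here, so this is a sketch only:

```python
# Illustration: word-level length ratio of model output vs. reference,
# as discussed for the conciseness dimension. The environment's actual
# length computation may differ (e.g. token- or character-based).
def length_ratio(output: str, reference: str) -> float:
    return len(output.split()) / max(len(reference.split()), 1)

ratio = length_ratio("Opacity in both lower lobes without effusion.",
                     "Bilateral lower lobe opacities.")
print(round(ratio, 2))  # 1.75
```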

#### Automatic Metrics (following Van Veen et al.)
| Metric | Description |
|--------|-------------|
| `bleu` | BLEU score (n-gram precision) |
| `rouge1` | ROUGE-1 (unigram overlap) |
| `rouge2` | ROUGE-2 (bigram overlap) |
| `rougeL` | ROUGE-L (longest common subsequence) |
| `rougeLsum` | ROUGE-Lsum (sentence-level LCS) |
| `bertscore_precision` | BERTScore precision |
| `bertscore_recall` | BERTScore recall |
| `bertscore_f1` | BERTScore F1 |
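To make concrete what the overlap metrics measure, here is a simplified pure-Python sketch of ROUGE-1 F1 (clipped unigram overlap). This is for illustration only; the environment presumably uses standard metric libraries rather than this code:

```python
# Simplified ROUGE-1 F1: clipped unigram overlap between candidate and
# reference. Illustrative only, not the library implementation.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("no acute findings", "no acute cardiopulmonary findings")
print(round(score, 3))  # 0.857
```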

---

### Results Dataset Structure

#### Core Evaluation Fields
- **`prompt`** – The radiology findings presented to the model.
- **`completion`** – The model-generated impression.
- **`reward`** – Normalized LLM-judge score in `[0, 1]`.

#### Example Metadata (`info`)
- **`idx`** – Original dataset index.
- **`findings`** – The input radiology findings text.
- **`judge_feedback`** – Detailed LLM-judge evaluation with scores and reasoning.
- **`auto_metrics`** – Dictionary containing BLEU, ROUGE, and BERTScore values.

---

### Example Results

The LLM-as-judge applies strict conciseness scoring based on the Nature Medicine paper criteria, as these examples show:

```
=== Example 1 ===
Model output: Normal heart size; no alveolar consolidation, pleural effusion,
or pulmonary edema; mediastinal calcification and right upper
lung nodule indicate prior granulomatous disease.
Reference: No acute cardiopulmonary findings
Length ratio: 5.3x
Reward: 0.933

LLM-as-Judge Scores:
correctness: 5/5 (no false information)
completeness: 5/5 (all key findings captured)
conciseness: 4/5 (longer than reference, but clinically relevant)

=== Example 2 ===
Model output: Right middle and lower lobe opacity; mediastinal contours normal;
no fissure displacement or pneumothorax.
Reference: Opacification of the right middle and lower lobes.
Length ratio: 2.1x
Reward: 0.867

LLM-as-Judge Scores:
correctness: 5/5
completeness: 5/5
conciseness: 3/5 (approximately 2x longer with additional details)
```

---

### References

**Dataset Source**
- Van Veen, D., Van Uden, C., Blankemeier, L. et al. "Adapted large language models can outperform medical experts in clinical text summarization." *Nature Medicine* 30, 1134–1142 (2024). https://doi.org/10.1038/s41591-024-02855-5

**Open-I Dataset**
```bibtex
@article{demner2016preparing,
title={Preparing a collection of radiology examinations for distribution and retrieval},
author={Demner-Fushman, Dina and Kohli, Marc D and Rosenman, Marc B and Shooshan, Sonya E and Rodriguez, Laritza and Antani, Sameer and Thoma, George R and McDonald, Clement J},
journal={Journal of the American Medical Informatics Association},
volume={23},
number={2},
pages={304--310},
year={2016},
publisher={Oxford University Press}
}
```

**Evaluation Metrics**
- BLEU: Papineni et al., "BLEU: a Method for Automatic Evaluation of Machine Translation" (ACL 2002)
- ROUGE: Lin, "ROUGE: A Package for Automatic Evaluation of Summaries" (ACL 2004)
- BERTScore: Zhang et al., "BERTScore: Evaluating Text Generation with BERT" (ICLR 2020)