RadEval

All-in-one metrics for evaluating AI-generated radiology text

TL;DR

pip install -e .

from RadEval import RadEval
import json

refs = [
    "Mild cardiomegaly with small bilateral pleural effusions and basilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]
hyps = [
    "Mildly enlarged cardiac silhouette with small pleural effusions and dependent bibasilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]

evaluator = RadEval(
    do_radgraph=True,
    do_bleu=True
)

results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))

{
  "radgraph_simple": 0.72,
  "radgraph_partial": 0.61,
  "radgraph_complete": 0.61,
  "bleu": 0.36
}

Installation

pip install RadEval              # from PyPI
pip install 'RadEval[api]'       # include OpenAI/Gemini for LLM-based metrics

Or install from source:

git clone https://github.com/jbdel/RadEval.git && cd RadEval
conda create -n radeval python=3.11 -y && conda activate radeval
pip install -e '.[api]'

Supported Metrics

| Category | Metric | Flag | Modality | Provider | Best For | Usage |
|---|---|---|---|---|---|---|
| Lexical | BLEU | `do_bleu` | -- | -- | Surface-level n-gram overlap | docs |
| | ROUGE | `do_rouge` | -- | -- | Content coverage | docs |
| Semantic | BERTScore | `do_bertscore` | -- | -- | Semantic similarity | docs |
| | RadEval BERTScore | `do_radeval_bertscore` | -- | -- | Domain-adapted radiology semantics | docs |
| Clinical | F1CheXbert | `do_f1chexbert` | CXR | -- | CheXpert finding classification | docs |
| | F1RadBERT-CT | `do_f1radbert_ct` | CT | -- | CT finding classification | docs |
| | F1RadGraph | `do_radgraph` | CXR | -- | Clinical entity/relation accuracy | docs |
| | RaTEScore | `do_ratescore` | CXR | -- | Entity-level synonym-aware scoring | docs |
| Specialized | RadGraph-RadCliQ | `do_radgraph_radcliq` | CXR | -- | Per-pair entity+relation F1 (RadCliQ variant) | docs |
| | RadCliQ-v1 | `do_radcliq` | CXR | -- | Composite clinical relevance | docs |
| | SRRBert | `do_srrbert` | CXR | -- | Structured report evaluation | docs |
| | Temporal F1 | `do_temporal` | CXR | -- | Temporal consistency | docs |
| | GREEN | `do_green` | CXR | Local HF | LLM-based overall quality (7B model) | docs |
| | MammoGREEN | `do_mammo_green` | Mammo | OpenAI / Gemini | Mammography-specific LLM scoring | docs |
| | CRIMSON | `do_crimson` | CXR | OpenAI / HF | LLM-based clinical significance scoring | docs |
| | RadFact-CT | `do_radfact_ct` | CT | OpenAI | LLM-based factual precision/recall | docs |

Modality: CXR = Chest X-Ray, CT = Computed Tomography, Mammo = Mammography, -- = modality-agnostic.

Enable only the metrics you need -- each one is loaded lazily.
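As a rough sketch of enabling a subset of metrics, the flags from the table can be collected into keyword arguments. The helper `metric_flags` and the set `KNOWN_FLAGS` below are hypothetical and not part of RadEval; they only build the `do_*` dictionary that would then be passed to the `RadEval` constructor.

```python
# Hypothetical helper, not part of RadEval: turn metric short names into
# the do_* constructor flags listed in the Supported Metrics table.
KNOWN_FLAGS = {
    "bleu", "rouge", "bertscore", "radeval_bertscore", "f1chexbert",
    "f1radbert_ct", "radgraph", "ratescore", "radgraph_radcliq", "radcliq",
    "srrbert", "temporal", "green", "mammo_green", "crimson", "radfact_ct",
}


def metric_flags(*names):
    """Return {'do_<name>': True} for each requested metric name."""
    unknown = sorted(set(names) - KNOWN_FLAGS)
    if unknown:
        raise ValueError(f"Unknown metric name(s): {unknown}")
    return {f"do_{name}": True for name in names}


flags = metric_flags("bleu", "rouge", "radgraph")
print(flags)
# With RadEval installed, the flags would be passed as:
#   evaluator = RadEval(**flags)
```

Rejecting unknown names early is a design choice of this sketch; it catches typos before any model is loaded.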

API Keys for LLM Metrics

LLM-based metrics (CRIMSON, MammoGREEN, RadFact-CT) share two global API key arguments:

evaluator = RadEval(
    openai_api_key="sk-...",   # used by CRIMSON (openai), MammoGREEN (openai), RadFact-CT
    gemini_api_key="AIza...",  # used by MammoGREEN (gemini)
    do_crimson=True,
    do_mammo_green=True,
    do_radfact_ct=True,
)

If a key is not passed explicitly, RadEval falls back to environment variables: OPENAI_API_KEY for OpenAI, and GEMINI_API_KEY or GOOGLE_API_KEY for Gemini. An error is raised if the chosen provider requires a key that is neither passed nor set in the environment.
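The fallback behaviour described above can be pictured with a small standalone sketch. The function `resolve_api_key` is hypothetical and is not RadEval's implementation; in particular, the order in which the two Gemini-related variables are checked is an assumption, and the key values are placeholders.

```python
import os


def resolve_api_key(explicit_key, env_names):
    """Return the explicit key if given, else the first non-empty env var."""
    if explicit_key:
        return explicit_key
    for name in env_names:
        value = os.environ.get(name)
        if value:
            return value
    raise ValueError(f"No API key passed and none of {list(env_names)} is set")


# Placeholder values only; these are not real keys.
os.environ["GEMINI_API_KEY"] = "placeholder-from-env"

# No explicit key: the value comes from the environment.
print(resolve_api_key(None, ["GEMINI_API_KEY", "GOOGLE_API_KEY"]))

# Explicit key: the environment is not consulted.
print(resolve_api_key("placeholder-explicit", ["OPENAI_API_KEY"]))
```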

Per-Sample Output

Pass do_per_sample=True to get per-sample scores for every enabled metric. The output uses the same flat keys as the default mode, but each value is a list[float] of length n_samples instead of a single aggregate.

evaluator = RadEval(do_bleu=True, do_bertscore=True, do_per_sample=True)
results = evaluator(refs=refs, hyps=hyps)
# results["bleu"]      → [0.85, 0.40, ...]   (one per sample)
# results["bertscore"] → [0.95, 0.89, ...]

See docs/metrics.md for the full list of per-sample output keys for each metric.
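Per-sample lists make it easy to look for the weakest samples. The sketch below works on a stand-in `results` dictionary with made-up numbers, and the helper `summarize` is hypothetical. Note that the mean of per-sample values is not guaranteed to equal the default aggregate, for example when a metric is computed at corpus level.

```python
from statistics import mean

# Illustrative stand-in for `results` returned with do_per_sample=True;
# the numbers are made up for this sketch.
results = {
    "bleu": [0.85, 0.40, 0.10],
    "bertscore": [0.95, 0.89, 0.70],
}


def summarize(per_sample, k=1):
    """Return the mean and the indices of the k lowest-scoring samples."""
    worst = sorted(range(len(per_sample)), key=per_sample.__getitem__)[:k]
    return mean(per_sample), worst


for metric, scores in results.items():
    avg, worst = summarize(scores)
    print(f"{metric}: mean={avg:.3f}, lowest-scoring sample index={worst}")
```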

Detailed Output

Pass do_details=True to get additional aggregate scores beyond the defaults: per-label F1 breakdowns for classifiers, BLEU-1/2/3, and standard deviations for LLM-based metrics. Keys stay flat at the top level, as in the default mode.

evaluator = RadEval(do_bleu=True, do_f1chexbert=True, do_crimson=True, do_details=True)
results = evaluator(refs=refs, hyps=hyps)
# results["bleu"]       → 0.36     (same as default)
# results["bleu_1"]     → 0.55     (extra: BLEU-1)
# results["bleu_2"]     → 0.42     (extra: BLEU-2)
# results["crimson_std"] → 0.15    (extra: std)
# results["f1chexbert_label_scores_f1"] → {"f1chexbert_5": {"Cardiomegaly": 0.59, ...}, ...}

See docs/metrics.md for the full output schema of each metric.
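To read a per-label breakdown, it can help to flatten it into sorted rows. The sketch below assumes the nesting shown in the comment above for `f1chexbert_label_scores_f1`; the helper `flatten_label_scores` is hypothetical, and the second label and all values are made up for illustration.

```python
# Illustrative stand-in for results["f1chexbert_label_scores_f1"];
# the nesting follows the comment in the example above, the values
# and the second label are made up.
label_scores = {
    "f1chexbert_5": {"Cardiomegaly": 0.59, "Edema": 0.71},
}


def flatten_label_scores(nested):
    """Return (group, label, f1) tuples sorted from lowest to highest F1."""
    rows = [
        (group, label, score)
        for group, labels in nested.items()
        for label, score in labels.items()
    ]
    return sorted(rows, key=lambda row: row[2])


for group, label, score in flatten_label_scores(label_scores):
    print(f"{group:15s} {label:15s} {score:.2f}")
```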

Comparing Systems

Use compare_systems to run paired approximate randomization tests between any number of systems:

from RadEval import RadEval, compare_systems

# baseline_reports, improved_reports, and reference_reports are lists of
# report strings of equal length, defined elsewhere.
evaluator = RadEval(do_bleu=True)
signatures, scores = compare_systems(
    systems={
        'baseline': baseline_reports,
        'improved': improved_reports,
    },
    metrics={'bleu': lambda hyps, refs: evaluator(refs=refs, hyps=hyps)['bleu']},
    references=reference_reports,
    n_samples=10000,
)

See docs/hypothesis_testing.md for a full walkthrough and interpretation guide.
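For intuition, here is a minimal standalone sketch of a paired approximate randomization test on per-sample scores. It is a generic textbook version, not RadEval's `compare_systems` implementation, which may differ in its details (for example, how the metric is recomputed after shuffling). The function name, the `seed` parameter, and the scores are all made up for illustration.

```python
import random


def approximate_randomization(scores_a, scores_b, n_samples=10000, seed=0):
    """Two-sided paired approximate randomization test on per-sample scores.

    Returns an estimated p-value for the difference in mean score.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("Both systems must be scored on the same samples")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b))) / n
    at_least_as_extreme = 0
    for _ in range(n_samples):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Swap the pair's system labels with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (n_samples + 1)


# Made-up per-sample scores for two systems on the same eight reports.
baseline = [0.30, 0.42, 0.35, 0.50, 0.28, 0.44, 0.39, 0.31]
improved = [0.41, 0.55, 0.47, 0.58, 0.40, 0.57, 0.49, 0.45]
p = approximate_randomization(baseline, improved, n_samples=2000)
print(f"p-value: {p:.4f}")
```

A small p-value suggests the observed gap would rarely arise if the two systems' outputs were interchangeable on each sample.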

Documentation

| Page | Contents |
|---|---|
| docs/metrics.md | What each metric measures, `do_per_sample` / `do_details` output schemas |
| docs/hypothesis_testing.md | Statistical background, full example, performance notes |
| docs/file_formats.md | Loading data from .tok, .json, and Python lists |

RadEval Expert Dataset

A curated evaluation set annotated by board-certified radiologists for validating automatic metrics. Available on HuggingFace.

Citation

@inproceedings{xu-etal-2025-radeval,
    title = "{R}ad{E}val: A framework for radiology text evaluation",
    author = "Xu, Justin  and
      Zhang, Xi  and
      Abderezaei, Javid  and
      Bauml, Julie  and
      Boodoo, Roger  and
      Haghighi, Fatemeh  and
      Ganjizadeh, Ali  and
      Brattain, Eric  and
      Van Veen, Dave  and
      Meng, Zaiqiao  and
      Eyre, David W  and
      Delbrouck, Jean-Benoit",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-demos.40/",
    doi = "10.18653/v1/2025.emnlp-demos.40",
    pages = "546--557",
}

Contributors

Jean-Benoit Delbrouck

Justin Xu

Xi Zhang

Acknowledgments

Built on the work of the radiology AI community: CheXbert, RadGraph, BERTScore, RaTEScore, SRR-BERT, GREEN, and datasets like MIMIC-CXR.

If you find RadEval useful, please give us a star!
