# RadEval

All-in-one metrics for evaluating AI-generated radiology text


## TL;DR

```shell
pip install RadEval
```

```python
from RadEval import RadEval
import json

refs = [
    "Mild cardiomegaly with small bilateral pleural effusions and basilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]
hyps = [
    "Mildly enlarged cardiac silhouette with small pleural effusions and dependent bibasilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]

evaluator = RadEval(
    do_radgraph=True,
    do_bleu=True,
)

results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))
```

```json
{
  "radgraph_simple": 0.72,
  "radgraph_partial": 0.61,
  "radgraph_complete": 0.61,
  "bleu": 0.36
}
```

## Installation

```shell
pip install RadEval                # from PyPI
pip install 'RadEval[api]'         # include OpenAI/Gemini for LLM-based metrics
```

Or install from source:

```shell
git clone https://github.com/jbdel/RadEval.git && cd RadEval
conda create -n radeval python=3.11 -y && conda activate radeval
pip install -e '.[api]'
```

## Supported Metrics

| Category | Metric | Flag | Modality | Provider | Best For | Usage |
|---|---|---|---|---|---|---|
| Lexical | BLEU | `do_bleu` | -- | -- | Surface-level n-gram overlap | docs |
| | ROUGE | `do_rouge` | -- | -- | Content coverage | docs |
| Semantic | BERTScore | `do_bertscore` | -- | -- | Semantic similarity | docs |
| | RadEval BERTScore | `do_radeval_bertscore` | -- | -- | Domain-adapted radiology semantics | docs |
| Clinical | F1CheXbert | `do_f1chexbert` | CXR | -- | CheXpert finding classification | docs |
| | F1RadBERT-CT | `do_f1radbert_ct` | CT | -- | CT finding classification | docs |
| | F1RadGraph | `do_radgraph` | CXR | -- | Clinical entity/relation accuracy | docs |
| | RaTEScore | `do_ratescore` | CXR | -- | Entity-level synonym-aware scoring | docs |
| Specialized | RadGraph-RadCliQ | `do_radgraph_radcliq` | CXR | -- | Per-pair entity+relation F1 (RadCliQ variant) | docs |
| | RadCliQ-v1 | `do_radcliq` | CXR | -- | Composite clinical relevance | docs |
| | SRRBert | `do_srrbert` | CXR | -- | Structured report evaluation | docs |
| | Temporal F1 | `do_temporal` | CXR | -- | Temporal consistency | docs |
| | GREEN | `do_green` | CXR | Local HF | LLM-based overall quality (7B model) | docs |
| | MammoGREEN | `do_mammo_green` | Mammo | OpenAI / Gemini | Mammography-specific LLM scoring | docs |
| | CRIMSON | `do_crimson` | CXR | OpenAI / HF | LLM-based clinical significance scoring | docs |
| | RadFact-CT | `do_radfact_ct` | CT | OpenAI | LLM-based factual precision/recall | docs |

Modality: CXR = Chest X-Ray, CT = Computed Tomography, Mammo = Mammography, -- = modality-agnostic.

Enable only the metrics you need -- each one is loaded lazily.

## API Keys for LLM Metrics

LLM-based metrics (CRIMSON, MammoGREEN, RadFact-CT) share two global API key arguments:

```python
evaluator = RadEval(
    openai_api_key="sk-...",   # used by CRIMSON (openai), MammoGREEN (openai), RadFact-CT
    gemini_api_key="AIza...",  # used by MammoGREEN (gemini)
    do_crimson=True,
    do_mammo_green=True,
    do_radfact_ct=True,
)
```

If not passed explicitly, keys fall back to the environment variables `OPENAI_API_KEY`, `GEMINI_API_KEY`, or `GOOGLE_API_KEY`. An error is raised if the chosen provider requires a key that is neither passed nor in the environment.
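The resolution order described above can be pictured as a small helper. This is an illustrative sketch only: `resolve_api_key` and its error message are assumptions for clarity, not part of the RadEval API.

```python
import os

def resolve_api_key(explicit_key, env_vars, provider):
    """Return the first available key: explicit argument first, then env vars."""
    if explicit_key:
        return explicit_key
    for var in env_vars:
        value = os.environ.get(var)
        if value:
            return value
    raise ValueError(
        f"{provider} needs an API key: pass it to RadEval or set one of "
        + ", ".join(env_vars)
    )

# An explicit key always wins over the environment.
key = resolve_api_key("sk-...", ["OPENAI_API_KEY"], "CRIMSON (openai)")
```

Gemini-backed metrics would check `GEMINI_API_KEY` first and fall back to `GOOGLE_API_KEY`, matching the order documented above.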

## Per-Sample Output

Pass `do_per_sample=True` to get per-sample scores for every enabled metric. The output uses the same flat keys as the default mode, but each value is a `list[float]` of length `n_samples` instead of a single aggregate.

```python
evaluator = RadEval(do_bleu=True, do_bertscore=True, do_per_sample=True)
results = evaluator(refs=refs, hyps=hyps)
# results["bleu"]      → [0.85, 0.40, ...]   (one per sample)
# results["bertscore"] → [0.95, 0.89, ...]
```

See `docs/metrics.md` for the full list of per-sample output keys for each metric.
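Because per-sample mode returns plain lists under flat keys, aggregation is left to you. A minimal sketch (the `results` dict below is made-up illustrative data, not real RadEval output):

```python
from statistics import mean, stdev

# Shape of a do_per_sample=True result: one score per (ref, hyp) pair.
results = {"bleu": [0.85, 0.40, 0.62], "bertscore": [0.95, 0.89, 0.91]}

# Collapse each metric's list into a mean and standard deviation.
summary = {name: {"mean": mean(v), "std": stdev(v)} for name, v in results.items()}
```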

## Detailed Output

Pass `do_details=True` to get additional aggregate scores beyond the defaults: per-label F1 breakdowns for classifiers, BLEU-1/2/3, and standard deviations for LLM-based metrics. Same flat keys as the default mode, no nesting.

```python
evaluator = RadEval(do_bleu=True, do_f1chexbert=True, do_crimson=True, do_details=True)
results = evaluator(refs=refs, hyps=hyps)
# results["bleu"]        → 0.36   (same as default)
# results["bleu_1"]      → 0.55   (extra: BLEU-1)
# results["bleu_2"]      → 0.42   (extra: BLEU-2)
# results["crimson_std"] → 0.15   (extra: std)
# results["f1chexbert_label_scores_f1"] → {"f1chexbert_5": {"Cardiomegaly": 0.59, ...}, ...}
```

See `docs/metrics.md` for the full output schema of each metric.

## Comparing Systems

Use `compare_systems` to run paired approximate randomization tests between any number of systems:

```python
from RadEval import RadEval, compare_systems

evaluator = RadEval(do_bleu=True)
signatures, scores = compare_systems(
    systems={
        'baseline': baseline_reports,
        'improved': improved_reports,
    },
    metrics={'bleu': lambda hyps, refs: evaluator(refs, hyps)['bleu']},
    references=reference_reports,
    n_samples=10000,
)
```

See `docs/hypothesis_testing.md` for a full walkthrough and interpretation guide.
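`compare_systems` handles the permutation machinery for you. To make the underlying test concrete, here is a minimal self-contained sketch of a paired approximate randomization test; the function name and toy scores are illustrative, not RadEval internals.

```python
import random

def paired_randomization_test(scores_a, scores_b, n_samples=10000, seed=0):
    """Two-sided p-value for the difference in mean per-sample scores."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    count = 0
    for _ in range(n_samples):
        # Randomly swap each paired sample between the two systems.
        sum_a = sum_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            sum_a += a
            sum_b += b
        if abs(sum_a - sum_b) / n >= observed:
            count += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return (count + 1) / (n_samples + 1)

# Identical per-sample scores can never be told apart, so p is exactly 1.0.
p = paired_randomization_test([0.9, 0.8, 0.7], [0.9, 0.8, 0.7], n_samples=100)
```

A small p-value means the observed score gap is unlikely under random reassignment of samples between the two systems.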

## Documentation

| Page | Contents |
|---|---|
| docs/metrics.md | What each metric measures, `do_per_sample` / `do_details` output schemas |
| docs/hypothesis_testing.md | Statistical background, full example, performance notes |
| docs/file_formats.md | Loading data from `.tok`, `.json`, and Python lists |

## RadEval Expert Dataset

A curated evaluation set annotated by board-certified radiologists for validating automatic metrics. Available on HuggingFace.

## Citation

```bibtex
@inproceedings{xu-etal-2025-radeval,
    title = "{R}ad{E}val: A framework for radiology text evaluation",
    author = "Xu, Justin  and
      Zhang, Xi  and
      Abderezaei, Javid  and
      Bauml, Julie  and
      Boodoo, Roger  and
      Haghighi, Fatemeh  and
      Ganjizadeh, Ali  and
      Brattain, Eric  and
      Van Veen, Dave  and
      Meng, Zaiqiao  and
      Eyre, David W  and
      Delbrouck, Jean-Benoit",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-demos.40/",
    doi = "10.18653/v1/2025.emnlp-demos.40",
    pages = "546--557",
}
```

## Contributors

Jean-Benoit Delbrouck
Justin Xu
Xi Zhang

## Acknowledgments

Built on the work of the radiology AI community: CheXbert, RadGraph, BERTScore, RaTEScore, SRR-BERT, GREEN, and datasets like MIMIC-CXR.


If you find RadEval useful, please give us a star!
