A codebase for systematically evaluating machine learning models that predict sepsis onset in intensive care patients.
It implements and compares several evaluation strategies and metrics to analyze how methodological choices affect reported model performance.
📄 Preprint:
Do D-K, Rockenschaub P, Boie S, Kumpf O, Volk H-D, Balzer F, von Dincklage F, Lichtner G.
The impact of evaluation strategy on sepsis prediction model performance metrics in intensive care data.
medRxiv, 2025. DOI: 10.1101/2025.02.20.25322509
This project investigates how the choice of evaluation strategy affects reported model performance for sepsis prediction. We compare three different evaluation approaches to understand their strengths, limitations, and clinical implications. This analysis is carried out on different models (TCN, GRU, Attention, Logistic Regression) and datasets (MIMIC-IV, BerlinICU).
- How does the choice of evaluation strategy affect reported model performance?
- What is the impact of "onset matching" (aligning patient timelines to sepsis onset)?
- How do models trained on MIMIC-IV perform on BerlinICU?
- Do complex deep learning models (TCN, GRU, Attention) outperform traditional approaches (Logistic Regression)?
- Peak Score: Takes the maximum prediction value across a patient's entire stay
  - Clinical question: "Will this patient develop sepsis at any point?"
  - Patient-level evaluation (one prediction per patient)
- Fixed Horizon: Evaluates predictions at specific times before sepsis onset
  - Clinical question: "Will this patient develop sepsis in exactly N hours?"
  - Evaluates at 6, 12, 24, 48, 75, and 100 hours before onset
- Continuous Prediction: Evaluates every hourly prediction throughout the stay (see the sketch after this list)
  - Clinical question: "Is this patient developing sepsis right now?"
  - Hour-level evaluation with sample weighting
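To make the three strategies concrete, here is a minimal sketch on toy hourly risk scores. The variable names, the two-hour horizon, and the simple per-patient weighting are illustrative assumptions; the repository's actual implementations live in src/eval_strategies.py.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data: hourly risk scores per patient; onset_hour is None for non-septic patients.
patients = {
    "A": {"scores": np.array([0.1, 0.2, 0.7, 0.9]), "onset_hour": 3},          # septic
    "B": {"scores": np.array([0.2, 0.1, 0.3, 0.2, 0.1]), "onset_hour": None},  # non-septic
}
HORIZON = 2  # hours before onset (illustrative choice)

# 1) Peak Score: one sample per patient, the maximum score over the whole stay.
peak_scores = [p["scores"].max() for p in patients.values()]
peak_labels = [p["onset_hour"] is not None for p in patients.values()]

# 2) Fixed Horizon: the score exactly HORIZON hours before onset
#    (for non-septic patients, a reference hour such as the end of stay).
fh_scores, fh_labels = [], []
for p in patients.values():
    ref = p["onset_hour"] if p["onset_hour"] is not None else len(p["scores"])
    if ref - HORIZON >= 0:
        fh_scores.append(p["scores"][ref - HORIZON])
        fh_labels.append(p["onset_hour"] is not None)

# 3) Continuous Prediction: every hour is a sample; weights make each patient
#    contribute equally regardless of length of stay.
c_scores, c_labels, c_weights = [], [], []
for p in patients.values():
    for hour, score in enumerate(p["scores"]):
        c_scores.append(score)
        c_labels.append(p["onset_hour"] is not None and hour >= p["onset_hour"] - HORIZON)
        c_weights.append(1.0 / len(p["scores"]))

# Each strategy yields its own metric value for the same model, e.g.:
print(roc_auc_score(peak_labels, peak_scores))
print(roc_auc_score(c_labels, c_scores, sample_weight=c_weights))
```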
- AUROC: Overall discrimination ability (0.5 = random, 1.0 = perfect)
- AUPRC: Performance on imbalanced data (baseline = prevalence)
- Brier Score: Combined calibration and discrimination (0 = perfect)
- Precision@80%Recall: Precision at the operating point where recall reaches 80% (clinical utility metric)
- Recall@15%Precision: Recall (sensitivity) at the operating point where precision reaches 15%
- NPV@80%Specificity: Negative predictive value at the operating point where specificity reaches 80%
- Lift-related metrics: AUPRC Lift, Precision@80%Recall Lift, Recall@Double Prevalence (performance expressed relative to the outcome prevalence; see the sketch after this list)
- Bootstrap Hypothesis Tests: Statistical significance of differences between strategies
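As an illustration of the operating-point and lift metrics, the following sketch shows one common way to compute Precision@80%Recall and AUPRC Lift with scikit-learn. It is a generic reconstruction for illustration, not the code in src/metrics.py.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def precision_at_recall(y_true, y_score, target_recall=0.80):
    """Highest precision among operating points whose recall is at least target_recall."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return precision[recall >= target_recall].max()

# Toy patient-level labels and scores
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.20, 0.15, 0.30, 0.80, 0.40, 0.60, 0.90, 0.35, 0.55])

prevalence = y_true.mean()                        # baseline for AUPRC and the lift metrics
auprc = average_precision_score(y_true, y_score)
auprc_lift = auprc / prevalence                   # assumed lift definition: ratio to the prevalence baseline
p_at_80r = precision_at_recall(y_true, y_score)

print(f"AUPRC={auprc:.2f}  AUPRC lift={auprc_lift:.2f}  Precision@80%Recall={p_at_80r:.2f}")
```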
sepsis_eval/
├── data/ # Model predictions (input data)
│
├── src/ # Core functionality modules
│ ├── bootstrap.py # Bootstrap orchestration and multiprocessing
│ ├── eval_strategies.py # Evaluation strategy implementations
│ ├── metrics.py # Metric calculation functions
│ ├── plotting.py # Plotting utilities
│ └── preprocessing.py # Data loading and preprocessing
│
├── analysis/ # Analysis workflows
│ ├── run_bootstrap/ # Bootstrap confidence intervals
│ │ ├── config.py # Scenario configurations
│ │ ├── run.py # Main bootstrap runner
│ │ └── run_all_scenarios.sh # Run all scenarios sequentially
│ │
│ ├── calibrate_models/ # Model calibration with isotonic regression
│ │ ├── run.py # Run calibration pipeline
│ │ ├── plot.py # Generate calibration plots
│ │ ├── definitions.py # Core calibration logic and plotting functions
│ │ └── config.py # Calibration scenarios configuration
│ │
│ └── visualize/ # Visualization and analysis
│ ├── metrics/ # Metric visualizations
│ │ ├── plots.py # Generate individual line plots
│ │ ├── panel_single_model.py # Single model panels
│ │ ├── panel_neural_nets.py # Neural network panels
│ │ └── panel_multiple_metrics.py # Multi-metric panels
│ │
│ ├── hypothesis_test/ # Statistical hypothesis testing
│ │ ├── run.py # Run hypothesis tests
│ │ ├── plot.py # Generate test visualizations
│ │ ├── definitions.py # Core testing logic
│ │ └── config.py # Test configurations
│ │
│ ├── onset_matching_effect/ # Onset matching impact analysis
│ └── model_params/ # Model parameter analysis
│
├── results/ # Output directory
│ ├── bootstrap_results/ # Bootstrap analysis results (pickle files)
│ ├── metrics/ # Metric visualizations and panels
│ ├── calibration/ # Calibration plots and results
│ ├── hypothesis_test/ # Hypothesis test results and plots
│ └── onset_matching_effect/ # Onset matching analysis results
│
├── set_paths.py # Centralized path configuration
├── environment.yml # Conda environment specification
└── README.md # This file
- Clone the repository:
git clone https://github.com/physiciandata/sepsis_eval_private.git
cd sepsis_eval_private
- Create and activate the conda environment:
conda env create -f environment.yml
conda activate sepsis_eval
- Place your model prediction files in the data/ directory
All commands should be run from the project root directory.
Run bootstrap analysis to generate confidence intervals for all metrics:
# Run all 20 scenarios sequentially (takes very long locally)
bash analysis/run_bootstrap/run_all_scenarios.sh
# Or run a specific scenario
python analysis/run_bootstrap/run.py test_case
python analysis/run_bootstrap/run.py peak_score_tcn_yes
python analysis/run_bootstrap/run.py continuous_logreg_no
# See analysis/run_bootstrap/config.py for all available scenarios
Note: For HPC/SLURM environments, use slurm/create_scripts/create_scripts.py to generate parallel job scripts.
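Conceptually, the bootstrap runner resamples patients with replacement and recomputes each metric on every resample (1,000 iterations; results land in results/bootstrap_results/). The following simplified sketch shows the basic idea with a percentile interval for a single metric; the actual orchestration, scenario handling, and multiprocessing live in src/bootstrap.py.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05):
    """Patient-level bootstrap: resample (label, score) pairs with replacement."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), size=len(labels))
        if labels[idx].min() == labels[idx].max():
            continue  # AUROC is undefined if a resample contains only one class
        stats.append(roc_auc_score(labels[idx], scores[idx]))
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(labels, scores), (lower, upper)

# Toy patient-level predictions (e.g. peak scores)
auroc, ci = bootstrap_auroc_ci([0, 1, 0, 1, 1, 0, 0, 1],
                               [0.2, 0.7, 0.4, 0.9, 0.6, 0.1, 0.3, 0.8])
print(f"AUROC={auroc:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```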
# Generate all plots and analyses at once (recommended)
python analysis/plot_all.py
This will generate:
- Individual metric plots (AUROC, AUPRC, Precision@80%Recall, etc.)
- Panel layouts (single model, multi-metric, neural network comparisons)
- Hypothesis test results (Fig 4)
- Calibration analysis (Suppl. Fig 4)
- Onset matching effect analysis (Suppl. Fig 8)
- Model parameters analysis (Suppl. Table 3)
Advanced: Run Individual Components
# Individual metric plots
python analysis/visualize/metrics/plots.py auroc
python analysis/visualize/metrics/plots.py auprc
# Panel layouts
python analysis/visualize/metrics/panel_single_model.py los_filter_yes
python analysis/visualize/metrics/panel_multiple_metrics.py standard
python analysis/visualize/metrics/panel_neural_nets.py auroc
# Calibration
python analysis/calibrate_models/run.py all
python analysis/calibrate_models/plot.py
# Hypothesis testing
python analysis/visualize/hypothesis_test/run.py
python analysis/visualize/hypothesis_test/run.py --model tcn
python analysis/visualize/hypothesis_test/plot.py
# Onset matching effect & model params
python analysis/visualize/onset_matching_effect/run.py full_tcn
python analysis/visualize/model_params/analyze.py
See script docstrings for all available options.
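The calibration step above applies isotonic regression to the raw model scores. Conceptually, that looks like the following scikit-learn sketch on made-up data; the project's actual calibration logic and plotting live in analysis/calibrate_models/definitions.py.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

# Hypothetical split: fit the calibrator on one set of predictions,
# evaluate calibration (Brier score) on held-out predictions.
y_cal = np.array([0, 0, 1, 0, 1, 1])
p_cal = np.array([0.20, 0.40, 0.60, 0.30, 0.70, 0.50])
y_test = np.array([0, 1, 1, 0])
p_test = np.array([0.35, 0.65, 0.80, 0.25])

calibrator = IsotonicRegression(out_of_bounds="clip")  # monotone mapping: raw score -> probability
calibrator.fit(p_cal, y_cal)
p_test_cal = calibrator.predict(p_test)

print("Brier before calibration:", brier_score_loss(y_test, p_test))
print("Brier after calibration: ", brier_score_loss(y_test, p_test_cal))
```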
All results are saved in the results/ directory:
results/
├── bootstrap_results/ # Bootstrap analysis (1000 iterations, pickle files)
├── calibration/ # Brier scores
├── hypothesis_test/ # Statistical significance tests
├── metrics/ # All plots (individual + panels)
├── onset_matching_effect/ # Onset matching impact analysis results
└── model_params/ # Model parameters analysis
Paper figures mapping:
- Fig 4: Hypothesis tests → results/hypothesis_test/
- Fig 5: TCN evaluation → results/metrics/
- Suppl. Fig 1: LogReg with LOS filter → results/metrics/
- Suppl. Fig 2: LogReg without LOS filter → results/metrics/
- Suppl. Fig 4: Calibration analysis → results/calibration/
- Suppl. Fig 5: Multi-metric (standard) → results/metrics/
- Suppl. Fig 6: Multi-metric (lift) → results/metrics/
- Suppl. Fig 7: Neural nets comparison → results/metrics/
- Suppl. Fig 8: Onset matching effect → results/onset_matching_effect/
- Suppl. Table 3: Model parameters → results/model_params/
When contributing, please:
- Follow the existing code structure
- Run scripts from the project root directory
- Ensure all paths use set_paths.py
If you use this repository or reproduce results from our analysis, please cite our preprint:
Do D-K, Rockenschaub P, Boie S, Kumpf O, Volk H-D, Balzer F, von Dincklage F, Lichtner G. The impact of evaluation strategy on sepsis prediction model performance metrics in intensive care data. medRxiv, 2025. DOI: 10.1101/2025.02.20.25322509
This repository is released under the BSD 3-Clause License. You are free to use, modify, and distribute the code with proper attribution. See the LICENSE file for details.
For questions or issues, please open an issue on GitHub.
