Sepsis Evaluation Codebase

A codebase for systematically evaluating machine learning models that predict sepsis onset in intensive care patients.
It implements and compares several evaluation strategies and metrics to analyze how methodological choices affect reported model performance.

📄 Preprint:
Do D-K, Rockenschaub P, Boie S, Kumpf O, Volk H-D, Balzer F, von Dincklage F, Lichtner G.
The impact of evaluation strategy on sepsis prediction model performance metrics in intensive care data.
medRxiv, 2025. DOI: 10.1101/2025.02.20.25322509

Overview

This project investigates how the choice of evaluation strategy affects reported model performance for sepsis prediction. We compare three evaluation approaches to understand their strengths, limitations, and clinical implications. The analysis is carried out across four model types (TCN, GRU, Attention, Logistic Regression) and two datasets (MIMIC-IV and BerlinICU).

Key Research Questions

  • How does the choice of evaluation strategy affect reported model performance?
  • What is the impact of "onset matching" (aligning patient timelines to sepsis onset)?
  • How do models trained on MIMIC-IV perform on BerlinICU?
  • Do complex deep learning models (TCN, GRU, Attention) outperform traditional approaches (Logistic Regression)?

Evaluation Strategies

  1. Peak Score: Takes the maximum prediction value across a patient's entire stay

    • Clinical question: "Will this patient develop sepsis at any point?"
    • Patient-level evaluation (one prediction per patient)
  2. Fixed Horizon: Evaluates predictions at specific times before sepsis onset

    • Clinical question: "Will this patient develop sepsis in exactly N hours?"
    • Evaluates at: 6, 12, 24, 48, 75, and 100 hours before onset
  3. Continuous Prediction: Evaluates every hourly prediction throughout the stay

    • Clinical question: "Is this patient developing sepsis right now?"
    • Hour-level evaluation with sample weighting (see the code sketch below)
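
The sketch below illustrates how the three strategies reduce a table of hourly predictions to the values that get scored. It is a minimal, illustrative example using pandas; the column names are hypothetical and do not reflect the actual format of the prediction files in data/ (see src/eval_strategies.py for the real implementations):

import pandas as pd

# Illustrative hourly predictions. Column names are hypothetical and do not
# match the actual prediction files in data/.
df = pd.DataFrame({
    "stay_id":           [1, 1, 1, 2, 2, 2],
    "hour":              [0, 1, 2, 0, 1, 2],
    "prediction":        [0.1, 0.4, 0.8, 0.2, 0.1, 0.3],
    "sepsis_onset_hour": [2, 2, 2, None, None, None],  # None = no sepsis
})

# 1. Peak Score: one value per patient, the maximum prediction over the stay.
peak = df.groupby("stay_id")["prediction"].max()

# 2. Fixed Horizon: the prediction made exactly N hours before onset
#    (N = 1 here; the study evaluates 6, 12, 24, 48, 75, and 100 hours).
horizon = 1
fixed = df[df["hour"] == df["sepsis_onset_hour"] - horizon]

# 3. Continuous Prediction: every hourly prediction is scored; here the label
#    simply marks whether the patient eventually develops sepsis (the actual
#    evaluation additionally applies sample weighting).
continuous = df.assign(label=df["sepsis_onset_hour"].notna().astype(int))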

Metrics Evaluated

  • AUROC: Overall discrimination ability (0.5 = random, 1.0 = perfect)
  • AUPRC: Performance on imbalanced data (baseline = prevalence)
  • Brier Score: Combined calibration and discrimination (0 = perfect)
  • Precision@80%Recall: Clinical utility metric
  • Recall@15%Precision: Sensitivity at clinically relevant precision
  • NPV@80%Specificity: Negative predictive value at high specificity
  • Lift-related metrics: AUPRC Lift, Precision@80%Recall Lift, Recall@Double Prevalence
  • Bootstrap Hypothesis Tests: Statistical significance of differences between strategies
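
To make these definitions concrete, here is a minimal sketch of how the headline metrics could be computed with scikit-learn from labels and predicted probabilities. Variable names are illustrative, and the lift shown (AUPRC divided by prevalence) is one common convention that may differ from the exact definition used in the paper; see src/metrics.py for the implementations actually used:

import numpy as np
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    precision_recall_curve,
    roc_auc_score,
)

# Illustrative labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.3, 0.7, 0.8, 0.2, 0.4, 0.6, 0.1])

auroc = roc_auc_score(y_true, y_prob)            # 0.5 = random, 1.0 = perfect
auprc = average_precision_score(y_true, y_prob)  # baseline = prevalence
brier = brier_score_loss(y_true, y_prob)         # 0 = perfect

# Precision@80%Recall: best precision attainable while keeping recall >= 80%.
precision, recall, _ = precision_recall_curve(y_true, y_prob)
prec_at_80_recall = precision[recall >= 0.8].max()

# AUPRC Lift (as sketched here): discrimination relative to the prevalence
# baseline, i.e. how much better than a no-skill model the AUPRC is.
prevalence = y_true.mean()
auprc_lift = auprc / prevalence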

Repository Structure

sepsis_eval/
├── data/                           # Model predictions (input data)
│
├── src/                            # Core functionality modules
│   ├── bootstrap.py               # Bootstrap orchestration and multiprocessing
│   ├── eval_strategies.py         # Evaluation strategy implementations
│   ├── metrics.py                 # Metric calculation functions
│   ├── plotting.py                # Plotting utilities
│   └── preprocessing.py           # Data loading and preprocessing
│
├── analysis/                       # Analysis workflows
│   ├── run_bootstrap/             # Bootstrap confidence intervals
│   │   ├── config.py              # Scenario configurations
│   │   ├── run.py                 # Main bootstrap runner
│   │   └── run_all_scenarios.sh   # Run all scenarios sequentially
│   │
│   ├── calibrate_models/          # Model calibration with isotonic regression
│   │   ├── run.py                 # Run calibration pipeline
│   │   ├── plot.py                # Generate calibration plots
│   │   ├── definitions.py         # Core calibration logic and plotting functions
│   │   └── config.py              # Calibration scenarios configuration
│   │
│   └── visualize/                 # Visualization and analysis
│       ├── metrics/               # Metric visualizations
│       │   ├── plots.py           # Generate individual line plots
│       │   ├── panel_single_model.py      # Single model panels
│       │   ├── panel_neural_nets.py       # Neural network panels
│       │   └── panel_multiple_metrics.py  # Multi-metric panels
│       │
│       ├── hypothesis_test/       # Statistical hypothesis testing
│       │   ├── run.py             # Run hypothesis tests
│       │   ├── plot.py            # Generate test visualizations
│       │   ├── definitions.py     # Core testing logic
│       │   └── config.py          # Test configurations
│       │
│       ├── onset_matching_effect/ # Onset matching impact analysis
│       └── model_params/          # Model parameter analysis
│
├── results/                       # Output directory
│   ├── bootstrap_results/         # Bootstrap analysis results (pickle files)
│   ├── metrics/                   # Metric visualizations and panels
│   ├── calibration/              # Calibration plots and results
│   ├── hypothesis_test/          # Hypothesis test results and plots
│   ├── onset_matching_effect/    # Onset matching analysis results
│   └── model_params/             # Model parameter analysis results
│
├── set_paths.py                  # Centralized path configuration
├── environment.yml               # Conda environment specification
└── README.md                     # This file

Installation

  1. Clone the repository:

git clone https://github.com/physiciandata/sepsis_eval_private.git
cd sepsis_eval_private

  2. Create and activate the conda environment:

conda env create -f environment.yml
conda activate sepsis_eval

  3. Place your model prediction files in the data/ directory

Usage

All commands should be run from the project root directory.

1. Bootstrap Analysis (Confidence Intervals)

Run bootstrap analysis to generate confidence intervals for all metrics:

# Run all 20 scenarios sequentially (takes a very long time locally)
bash analysis/run_bootstrap/run_all_scenarios.sh

# Or run a specific scenario
python analysis/run_bootstrap/run.py test_case
python analysis/run_bootstrap/run.py peak_score_tcn_yes
python analysis/run_bootstrap/run.py continuous_logreg_no
# See analysis/run_bootstrap/config.py for all available scenarios

Note: For HPC/SLURM environments, use slurm/create_scripts/create_scripts.py to generate parallel job scripts.

2. Generate All Visualizations

# Generate all plots and analyses at once (recommended)
python analysis/plot_all.py

This will generate:

  • Individual metric plots (AUROC, AUPRC, Precision@80%Recall, etc.)
  • Panel layouts (single model, multi-metric, neural network comparisons)
  • Hypothesis test results (Fig 4)
  • Calibration analysis (Suppl. Fig 4)
  • Onset matching effect analysis (Suppl. Fig 8)
  • Model parameters analysis (Suppl. Table 3)

Advanced: Run Individual Components

# Individual metric plots
python analysis/visualize/metrics/plots.py auroc
python analysis/visualize/metrics/plots.py auprc

# Panel layouts
python analysis/visualize/metrics/panel_single_model.py los_filter_yes
python analysis/visualize/metrics/panel_multiple_metrics.py standard
python analysis/visualize/metrics/panel_neural_nets.py auroc

# Calibration
python analysis/calibrate_models/run.py all
python analysis/calibrate_models/plot.py

# Hypothesis testing
python analysis/visualize/hypothesis_test/run.py
python analysis/visualize/hypothesis_test/run.py --model tcn
python analysis/visualize/hypothesis_test/plot.py

# Onset matching effect & model params
python analysis/visualize/onset_matching_effect/run.py full_tcn
python analysis/visualize/model_params/analyze.py

See script docstrings for all available options.

Output Files

All results are saved in the results/ directory:

results/
├── bootstrap_results/    # Bootstrap analysis (1000 iterations, pickle files)
├── calibration/          # Calibration plots and Brier scores
├── hypothesis_test/      # Statistical significance tests
├── metrics/              # All plots (individual + panels)
├── onset_matching_effect/ # Onset matching impact analysis results
└── model_params/         # Model parameters analysis
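
Bootstrap results are stored as pickle files. The minimal sketch below shows how one might load a result and summarize a 95% confidence interval; the file name and the dictionary layout are assumptions for illustration, so check analysis/run_bootstrap/run.py for the structure that is actually written:

import pickle
from pathlib import Path

import numpy as np

# Hypothetical file name; the actual naming follows the scenario names in
# analysis/run_bootstrap/config.py.
result_path = Path("results/bootstrap_results/peak_score_tcn_yes.pkl")

with result_path.open("rb") as f:
    results = pickle.load(f)

# Assumption: each metric maps to an array of per-iteration bootstrap values.
auroc_samples = np.asarray(results["auroc"])
lower, upper = np.percentile(auroc_samples, [2.5, 97.5])
print(f"AUROC: {auroc_samples.mean():.3f} (95% CI {lower:.3f} to {upper:.3f})")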

Mapping of paper figures to output locations:

  • Fig 4: Hypothesis tests → results/hypothesis_test/
  • Fig 5: TCN evaluation → results/metrics/
  • Suppl. Fig 1: LogReg with LOS filter → results/metrics/
  • Suppl. Fig 2: LogReg without LOS filter → results/metrics/
  • Suppl. Fig 4: Calibration analysis → results/calibration/
  • Suppl. Fig 5: Multi-metric (standard) → results/metrics/
  • Suppl. Fig 6: Multi-metric (lift) → results/metrics/
  • Suppl. Fig 7: Neural nets comparison → results/metrics/
  • Suppl. Fig 8: Onset matching effect → results/onset_matching_effect/
  • Suppl. Table 3: Model parameters → results/model_params/

Contributing

When contributing, please:

  1. Follow the existing code structure
  2. Run scripts from the project root directory
  3. Ensure all paths use set_paths.py

Citation

If you use this repository or reproduce results from our analysis, please cite our preprint:

Do D-K, Rockenschaub P, Boie S, Kumpf O, Volk H-D, Balzer F, von Dincklage F, Lichtner G. The impact of evaluation strategy on sepsis prediction model performance metrics in intensive care data. medRxiv, 2025. DOI: 10.1101/2025.02.20.25322509

License

This repository is released under the BSD 3-Clause License. You are free to use, modify, and distribute the code with proper attribution. See the LICENSE file for details.

Contact

For questions or issues, please open an issue on GitHub.
