A codebase for systematically evaluating machine learning models that predict sepsis onset in intensive care patients.
It implements and compares several evaluation strategies and metrics to analyze how methodological choices affect reported model performance.
📄 Preprint:
Do D-K, Rockenschaub P, Boie S, Kumpf O, Volk H-D, Balzer F, von Dincklage F, Lichtner G.
The impact of evaluation strategy on sepsis prediction model performance metrics in intensive care data.
medRxiv, 2025. DOI: 10.1101/2025.02.20.25322509
This project investigates how the choice of evaluation strategy affects reported model performance for sepsis prediction. We compare three different evaluation approaches to understand their strengths, limitations, and clinical implications. This analysis is carried out on different models (TCN, GRU, Attention, Logistic Regression) and datasets (MIMIC-IV, BerlinICU).
- How does the choice of evaluation strategy affect reported model performance?
- What is the impact of "onset matching" (aligning patient timelines to sepsis onset)?
- How do models trained on MIMIC-IV perform on BerlinICU?
- Do complex deep learning models (TCN, GRU, Attention) outperform traditional approaches (Logistic Regression)?
- Peak Score: Takes the maximum prediction value across a patient's entire stay
  - Clinical question: "Will this patient develop sepsis at any point?"
  - Patient-level evaluation (one prediction per patient)
- Fixed Horizon: Evaluates predictions at specific times before sepsis onset
  - Clinical question: "Will this patient develop sepsis in exactly N hours?"
  - Evaluates at 6, 12, 24, 48, 75, and 100 hours before onset
- Continuous Prediction: Evaluates every hourly prediction throughout the stay (see the sketch after this list)
  - Clinical question: "Is this patient developing sepsis right now?"
  - Hour-level evaluation with sample weighting
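To make the three strategies concrete, here is a minimal sketch on toy hourly risk scores. The variable names, the two-hour horizon, and the simple per-patient weighting are illustrative assumptions; the repository's actual implementations live in src/eval_strategies.py.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data: hourly risk scores per patient; onset_hour is None for non-septic patients.
patients = {
    "A": {"scores": np.array([0.1, 0.2, 0.7, 0.9]), "onset_hour": 3},          # septic
    "B": {"scores": np.array([0.2, 0.1, 0.3, 0.2, 0.1]), "onset_hour": None},  # non-septic
}
HORIZON = 2  # hours before onset (illustrative choice)

# 1) Peak Score: one sample per patient, the maximum score over the whole stay.
peak_scores = [p["scores"].max() for p in patients.values()]
peak_labels = [p["onset_hour"] is not None for p in patients.values()]

# 2) Fixed Horizon: the score exactly HORIZON hours before onset
#    (for non-septic patients, a reference hour such as the end of stay).
fh_scores, fh_labels = [], []
for p in patients.values():
    ref = p["onset_hour"] if p["onset_hour"] is not None else len(p["scores"])
    if ref - HORIZON >= 0:
        fh_scores.append(p["scores"][ref - HORIZON])
        fh_labels.append(p["onset_hour"] is not None)

# 3) Continuous Prediction: every hour is a sample; weights make each patient
#    contribute equally regardless of length of stay.
c_scores, c_labels, c_weights = [], [], []
for p in patients.values():
    for hour, score in enumerate(p["scores"]):
        c_scores.append(score)
        c_labels.append(p["onset_hour"] is not None and hour >= p["onset_hour"] - HORIZON)
        c_weights.append(1.0 / len(p["scores"]))

# Each strategy yields its own metric value for the same model, e.g.:
print(roc_auc_score(peak_labels, peak_scores))
print(roc_auc_score(c_labels, c_scores, sample_weight=c_weights))
```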
- AUROC: Overall discrimination ability (0.5 = random, 1.0 = perfect)
- AUPRC: Performance on imbalanced data (baseline = prevalence)
- Brier Score: Combined calibration and discrimination (0 = perfect)
- Precision@80%Recall: Precision at the operating point where recall reaches 80% (clinical utility metric)
- Recall@15%Precision: Recall (sensitivity) at the operating point where precision reaches 15%
- NPV@80%Specificity: Negative predictive value at the operating point where specificity reaches 80%
- Lift-related metrics: AUPRC Lift, Precision@80%Recall Lift, Recall@Double Prevalence (performance expressed relative to the outcome prevalence; see the sketch after this list)
- Bootstrap Hypothesis Tests: Statistical significance of differences between strategies
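As an illustration of the operating-point and lift metrics, the following sketch shows one common way to compute Precision@80%Recall and AUPRC Lift with scikit-learn. It is a generic reconstruction for illustration, not the code in src/metrics.py.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def precision_at_recall(y_true, y_score, target_recall=0.80):
    """Highest precision among operating points whose recall is at least target_recall."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return precision[recall >= target_recall].max()

# Toy patient-level labels and scores
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.20, 0.15, 0.30, 0.80, 0.40, 0.60, 0.90, 0.35, 0.55])

prevalence = y_true.mean()                        # baseline for AUPRC and the lift metrics
auprc = average_precision_score(y_true, y_score)
auprc_lift = auprc / prevalence                   # assumed lift definition: ratio to the prevalence baseline
p_at_80r = precision_at_recall(y_true, y_score)

print(f"AUPRC={auprc:.2f}  AUPRC lift={auprc_lift:.2f}  Precision@80%Recall={p_at_80r:.2f}")
```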
sepsis_eval/
├── data/ # Model predictions (input data)
│
├── src/ # Core functionality modules
│ ├── bootstrap.py # Bootstrap orchestration and multiprocessing
│ ├── eval_strategies.py # Evaluation strategy implementations
│ ├── metrics.py # Metric calculation functions
│ ├── plotting.py # Plotting utilities
│ └── preprocessing.py # Data loading and preprocessing
│
├── analysis/ # Analysis workflows
│ ├── run_bootstrap/ # Bootstrap confidence intervals
│ │ ├── config.py # Scenario configurations
│ │ ├── run.py # Main bootstrap runner
│ │ └── run_all_scenarios.sh # Run all scenarios sequentially
│ │
│ ├── calibrate_models/ # Model calibration with isotonic regression
│ │ ├── run.py # Run calibration pipeline
│ │ ├── plot.py # Generate calibration plots
│ │ ├── definitions.py # Core calibration logic and plotting functions
│ │ └── config.py # Calibration scenarios configuration
│ │
│ └── visualize/ # Visualization and analysis
│ ├── metrics/ # Metric visualizations
│ │ ├── plots.py # Generate individual line plots
│ │ ├── panel_single_model.py # Single model panels
│ │ ├── panel_neural_nets.py # Neural network panels
│ │ └── panel_multiple_metrics.py # Multi-metric panels
│ │
│ ├── hypothesis_test/ # Statistical hypothesis testing
│ │ ├── run.py # Run hypothesis tests
│ │ ├── plot.py # Generate test visualizations
│ │ ├── definitions.py # Core testing logic
│ │ └── config.py # Test configurations
│ │
│ ├── onset_matching_effect/ # Onset matching impact analysis
│ └── model_params/ # Model parameter analysis
│
├── results/ # Output directory
│ ├── bootstrap_results/ # Bootstrap analysis results (pickle files)
│ ├── metrics/ # Metric visualizations and panels
│ ├── calibration/ # Calibration plots and results
│ ├── hypothesis_test/ # Hypothesis test results and plots
│ └── onset_matching_effect/ # Onset matching analysis results
│
├── set_paths.py # Centralized path configuration
├── environment.yml # Conda environment specification
└── README.md # This file
- Clone the repository:
git clone https://github.com/physiciandata/sepsis_eval_private.git
cd sepsis_eval_private
- Create and activate the conda environment:
conda env create -f environment.yml
conda activate sepsis_eval
- Place your model prediction files in the data/ directory
All commands should be run from the project root directory.
Run bootstrap analysis to generate confidence intervals for all metrics:
# Run all 20 scenarios sequentially (takes very long locally)
bash analysis/run_bootstrap/run_all_scenarios.sh
# Or run a specific scenario
python analysis/run_bootstrap/run.py test_case
python analysis/run_bootstrap/run.py peak_score_tcn_yes
python analysis/run_bootstrap/run.py continuous_logreg_no
# See analysis/run_bootstrap/config.py for all available scenarios
Note: For HPC/SLURM environments, use slurm/create_scripts/create_scripts.py to generate parallel job scripts.
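Conceptually, the bootstrap runner resamples patients with replacement and recomputes each metric on every resample (1,000 iterations; results land in results/bootstrap_results/). The following simplified sketch shows the basic idea with a percentile interval for a single metric; the actual orchestration, scenario handling, and multiprocessing live in src/bootstrap.py.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05):
    """Patient-level bootstrap: resample (label, score) pairs with replacement."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), size=len(labels))
        if labels[idx].min() == labels[idx].max():
            continue  # AUROC is undefined if a resample contains only one class
        stats.append(roc_auc_score(labels[idx], scores[idx]))
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(labels, scores), (lower, upper)

# Toy patient-level predictions (e.g. peak scores)
auroc, ci = bootstrap_auroc_ci([0, 1, 0, 1, 1, 0, 0, 1],
                               [0.2, 0.7, 0.4, 0.9, 0.6, 0.1, 0.3, 0.8])
print(f"AUROC={auroc:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```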
# Generate all plots and analyses at once (recommended)
python analysis/plot_all.py
This will generate:
- Individual metric plots (AUROC, AUPRC, Precision@80%Recall, etc.)
- Panel layouts (single model, multi-metric, neural network comparisons)
- Hypothesis test results (Fig 4)
- Calibration analysis (Suppl. Fig 4)
- Onset matching effect analysis (Suppl. Fig 8)
- Model parameters analysis (Suppl. Table 3)
Advanced: Run Individual Components
# Individual metric plots
python analysis/visualize/metrics/plots.py auroc
python analysis/visualize/metrics/plots.py auprc
# Panel layouts
python analysis/visualize/metrics/panel_single_model.py los_filter_yes
python analysis/visualize/metrics/panel_multiple_metrics.py standard
python analysis/visualize/metrics/panel_neural_nets.py auroc
# Calibration
python analysis/calibrate_models/run.py all
python analysis/calibrate_models/plot.py
# Hypothesis testing
python analysis/visualize/hypothesis_test/run.py
python analysis/visualize/hypothesis_test/run.py --model tcn
python analysis/visualize/hypothesis_test/plot.py
# Onset matching effect & model params
python analysis/visualize/onset_matching_effect/run.py full_tcn
python analysis/visualize/model_params/analyze.py
See script docstrings for all available options.
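The calibration step above applies isotonic regression to the raw model scores. Conceptually, that looks like the following scikit-learn sketch on made-up data; the project's actual calibration logic and plotting live in analysis/calibrate_models/definitions.py.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

# Hypothetical split: fit the calibrator on one set of predictions,
# evaluate calibration (Brier score) on held-out predictions.
y_cal = np.array([0, 0, 1, 0, 1, 1])
p_cal = np.array([0.20, 0.40, 0.60, 0.30, 0.70, 0.50])
y_test = np.array([0, 1, 1, 0])
p_test = np.array([0.35, 0.65, 0.80, 0.25])

calibrator = IsotonicRegression(out_of_bounds="clip")  # monotone mapping: raw score -> probability
calibrator.fit(p_cal, y_cal)
p_test_cal = calibrator.predict(p_test)

print("Brier before calibration:", brier_score_loss(y_test, p_test))
print("Brier after calibration: ", brier_score_loss(y_test, p_test_cal))
```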
All results are saved in the results/ directory:
results/
├── bootstrap_results/ # Bootstrap analysis (1000 iterations, pickle files)
├── calibration/ # Brier scores
├── hypothesis_test/ # Statistical significance tests
├── metrics/ # All plots (individual + panels)
├── onset_matching_effect/ # Onset matching impact analysis results
└── model_params/ # Model parameters analysis
Paper figures mapping:
- Fig 4: Hypothesis tests → results/hypothesis_test/
- Fig 5: TCN evaluation → results/metrics/
- Suppl. Fig 1: LogReg with LOS filter → results/metrics/
- Suppl. Fig 2: LogReg without LOS filter → results/metrics/
- Suppl. Fig 4: Calibration analysis → results/calibration/
- Suppl. Fig 5: Multi-metric (standard) → results/metrics/
- Suppl. Fig 6: Multi-metric (lift) → results/metrics/
- Suppl. Fig 7: Neural nets comparison → results/metrics/
- Suppl. Fig 8: Onset matching effect → results/onset_matching_effect/
- Suppl. Table 3: Model parameters → results/model_params/
When contributing, please:
- Follow the existing code structure
- Run scripts from the project root directory
- Ensure all paths use set_paths.py
If you use this repository or reproduce results from our analysis, please cite our preprint:
Do D-K, Rockenschaub P, Boie S, Kumpf O, Volk H-D, Balzer F, von Dincklage F, Lichtner G. The impact of evaluation strategy on sepsis prediction model performance metrics in intensive care data. medRxiv, 2025. DOI: 10.1101/2025.02.20.25322509
This repository is released under the BSD 3-Clause License. You are free to use, modify, and distribute the code with proper attribution. See the LICENSE file for details.
For questions or issues, please open an issue on GitHub.
