MoPaDi combines Diffusion Autoencoders with multiple instance learning (MIL) for explainability of deep learning classifiers in histopathology.
> [!IMPORTANT]
> This repository contains an updated version of the codebase. For the experiments described in the preprint, please refer to version 0.0.1 of MoPaDi.
Pretrained DeepCMorph models were used to segment six cell types in order to quantify changes between original and counterfactual images.
For preprocessing of whole slide images (WSIs), please refer to KatherLab's STAMP protocol.
Table of Contents:
- Getting started
- Training the Models from Scratch
- Pretrained Models
- Datasets
- Acknowledgements
- Reference
## Getting Started

Clone the repository and create a virtual environment to install the required packages, e.g., with uv (instructions below), conda, or mamba.

1. Install uv:

   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

2. Sync dependencies. In the root of this repository, run:

   ```bash
   uv sync
   ```

   This creates a virtual environment at `.venv/` and installs all necessary Python dependencies.

3. Activate the environment:

   ```bash
   source .venv/bin/activate
   ```
You can obtain access to the pretrained models on Hugging Face. Once the environment is set up and access to the models has been granted, you can run the example notebooks (all the necessary data for these examples is provided) or train your own models.
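If you prefer a script to the notebooks, the model weights can also be fetched programmatically. Below is a minimal sketch using the `huggingface_hub` client; the `repo_id` and target directory are placeholders, so substitute the values used in the example notebooks.

```python
# Minimal sketch: download pretrained MoPaDi weights from Hugging Face.
# Assumes access has been granted and you are logged in (`huggingface-cli login`).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="KatherLab/MoPaDi",     # placeholder; use the repo id from the notebooks
    local_dir="pretrained_models",  # where the checkpoints will be stored
)
print(f"Downloaded pretrained models to: {local_dir}")
```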
## Training the Models from Scratch

To train the models from scratch, follow these steps:
1. **Prepare the environment:** Ensure you have set up the virtual environment and installed the required packages.

2. **Download datasets:** Obtain the datasets used in the preprint or use your own.

3. **Preprocess the data:** If the dataset consists of WSIs rather than tiles, use the STAMP protocol for preprocessing as needed. The starting point for MoPaDi is folders of tiles (color normalized or not). Multiple cohorts can be used; all tiles do not need to be in the same folder. Resizing, if needed, is done automatically during training. Accepted image formats: JPEG, TIFF, and PNG.

   > [!TIP]
   > ZIP files containing tiles for each patient (STAMP's output) are also accepted and do not need to be extracted beforehand.

4. **Configure training:** Modify the `conf.yaml` file to match your dataset, and define the output path and desired training parameters.

5. **Run trainings:** Execute the training scripts for the desired models.
- **Diffusion autoencoder:** the core component of MoPaDi; encodes and decodes the images. Training this model is the longest step in the pipeline: depending on the data and hardware, training time may vary from a couple of days to a few weeks.

  ```bash
  mopadi autoenc --config conf.yaml
  ```

  You can evaluate the trained autoencoder by adapting `src/mopadi/utils/reconstruct_1k_images.py` to your data to reconstruct images from the test set and compute the corresponding metrics: SSIM, MS-SSIM, and MSE (a sketch of this evaluation follows after this list).
- **Latent DPM** (optional, not required for counterfactual generation): for unconditional synthetic image generation. Enables sampling feature vectors from the latent space of the semantic encoder, which are then decoded into synthetic histopathology tiles.

  ```bash
  mopadi latent --config conf.yaml
  ```
- **Linear classifier:** the simplest classifier for linearly separable classes, based on the original DiffAE method. Ground truth labels are needed for each tile. Enables counterfactual image generation.

  ```bash
  mopadi linear_classifier --config conf.yaml
  ```
- **MIL classifier:** a more complex approach to guide counterfactual image generation when a label is given at the patient level rather than for each tile, introduced in our preprint.

  ```bash
  mopadi mil --config conf.yaml --mode crossval
  mopadi mil --config conf.yaml --mode train
  mopadi mil --config conf.yaml --mode manipulate
  ```
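As mentioned above, the trained autoencoder can be evaluated by reconstructing held-out tiles and comparing them to the originals. The following is a minimal sketch of that idea using `scikit-image`; the folder names are placeholders, and the actual script (`reconstruct_1k_images.py`) may compute the metrics differently. MS-SSIM is omitted here, as it needs an additional dependency such as `torchmetrics`.

```python
# Minimal sketch: SSIM and MSE between original and reconstructed tiles.
# Assumes matching filenames in two folders; adapt the paths to your data.
from pathlib import Path

import numpy as np
from skimage.io import imread
from skimage.metrics import mean_squared_error, structural_similarity

originals = Path("test_tiles")    # placeholder: original test-set tiles
recons = Path("reconstructions")  # placeholder: tiles decoded by the autoencoder

ssims, mses = [], []
for orig_path in sorted(originals.glob("*.png")):
    orig = imread(orig_path)
    recon = imread(recons / orig_path.name)
    # channel_axis=-1 tells SSIM that the last axis holds the RGB channels
    ssims.append(structural_similarity(orig, recon, channel_axis=-1))
    mses.append(mean_squared_error(orig, recon))

print(f"Mean SSIM: {np.mean(ssims):.4f} | Mean MSE: {np.mean(mses):.2f}")
```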
## Pretrained Models

Pretrained models can be found on Hugging Face. If you have already obtained access to the models in that repository, automatic download is set up in the example notebooks. The included models are:
- Tissue classes autoencoding diffusion model (trained on 224 x 224 px tiles from the NCT-CRC-HE-100K dataset (Kather et al., 2018)) + linear 9-class classifier (adipose [ADI], background [BACK], debris [DEB], lymphocytes [LYM], mucus [MUC], smooth muscle [MUS], normal colon mucosa [NORM], cancer-associated stroma [STR], colorectal adenocarcinoma epithelium [TUM]);
- Colorectal cancer (CRC) autoencoding diffusion model (trained on 512 x 512 px tiles (0.5 microns per pixel, MPP) from tumor regions of the TCGA CRC cohort) + microsatellite instability (MSI) status MIL classifier (MSI-high [MSIH] vs. nonMSIH);
- Breast cancer (BRCA) autoencoding diffusion model (trained on 512 x 512 px tiles (0.5 MPP) from tumor regions of the TCGA BRCA cohort) + breast cancer type (invasive lobular carcinoma [ILC] vs. invasive ductal carcinoma [IDC]) and E2 center MIL classifiers;
- Pancancer autoencoding diffusion model (trained on 256 x 256 px tiles (varying MPP) from histology images from uniform tumor regions in TCGA WSIs (Komura & Ishikawa, 2021)) + liver cancer types (hepatocellular carcinoma [HCC] vs. cholangiocarcinoma [CCA]) MIL & linear classifiers, and lung cancer types (lung adenocarcinoma [LUAD] vs. lung squamous cell carcinoma [LUSC]) MIL & linear classifiers.
If you want to process multiple images/folders, you can use `conf.yaml` with the `use_pretrained` parameter set to `True` (which triggers automatic download of the selected model) and run the following command:

```bash
mopadi mil --config conf.yaml --mode manipulate
```
For this to work, make sure you are logged in to your Hugging Face account:

```bash
huggingface-cli login
```

You can check whether you are already logged in by running:

```bash
huggingface-cli whoami
```
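The same check can also be done from Python; a small sketch using the `huggingface_hub` client, whose `whoami()` call fails when no token is stored locally:

```python
# Minimal sketch: verify Hugging Face authentication from Python.
from huggingface_hub import whoami

try:
    user = whoami()  # raises an error if no valid token is found
    print(f"Logged in as: {user['name']}")
except Exception:
    print("Not logged in - run `huggingface-cli login` first.")
```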
Examples of counterfactual images generated with the corresponding models are shown below (please refer to the preprint for more examples).
## Datasets

- Diagnostic WSIs from The Cancer Genome Atlas (TCGA)
- Histology images from uniform tumor regions in TCGA whole slide images (Komura & Ishikawa, 2021)
- 100,000 histological images of human colorectal cancer and healthy tissue (Kather et al., 2018)
## Acknowledgements

This project was built upon the DiffAE repository (MIT license). We thank the developers for making their code open source.
## Reference

If you find our work useful for your research, or if you use parts of this code, please consider citing our preprint:

Žigutytė, L., Lenz, T., Han, T., Hewitt, K. J., Reitsam, N. G., Foersch, S., Carrero, Z. I., Unger, M., Pearson, A. T., Truhn, D., & Kather, J. N. (2024). Counterfactual Diffusion Models for Mechanistic Explainability of Artificial Intelligence Models in Pathology. bioRxiv, 2024.10.29.620913.

```bibtex
@misc{zigutyte2024mopadi,
      title={Counterfactual Diffusion Models for Mechanistic Explainability of Artificial Intelligence Models in Pathology},
      author={Laura Žigutytė and Tim Lenz and Tianyu Han and Katherine Jane Hewitt and Nic Gabriel Reitsam and Sebastian Foersch and Zunamys I Carrero and Michaela Unger and Alexander T Pearson and Daniel Truhn and Jakob Nikolas Kather},
      year={2024},
      eprint={2024.10.29.620913},
      archivePrefix={bioRxiv},
      url={https://www.biorxiv.org/content/10.1101/2024.10.29.620913v1},
}
```