This repository contains the code and resources for deep learning-based patient-note identification using clinical documents. The project addresses the challenge of patient-note identification: accurately associating a single clinical note with its corresponding patient. This task highlights the importance of learning robust patient-level representations. We propose different embedding models and experiment with various aggregation techniques to build these representations. Our models are implemented in PyTorch.
project-root/
│
├── data_processing.py           # Script to obtain and preprocess the dataset
│
├── data/
│   ├── raw/                     # Original dataset files and CSVs
│   └── processed/               # Files with embeddings
│
├── models/                      # Embedding models to learn document-level representations
│
├── patient_repr_aggregation/    # Code for learning patient-level embeddings
│
├── training.py                  # Script to train and evaluate the classifier
│
├── results/                     # Results from model evaluations
│
├── .gitattributes               # Configures repository attributes like line endings
├── .gitignore                   # Specifies files and directories to ignore
└── README.md                    # Project description and instructions
The MIMIC-III database analyzed in this study is available in the PhysioNet repository. Follow these steps to prepare the dataset:
- To request access to MIMIC-III, follow https://mimic.physionet.org/gettingstarted/access/. Once you have the data, place the .csv files under data/mimic/ (a quick sanity check is shown after this list).
- With access to MIMIC-III, build the MIMIC-III database locally using Postgres by following the instructions at https://github.com/MIT-LCP/mimic-code/tree/master/buildmimic/postgres.
- Run the SQL queries that generate the necessary views by following https://github.com/onlyzdd/clinical-fusion/tree/master/query.
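The exact set of CSV files you need depends on the SQL views you generate. As a quick sanity check, the sketch below (an illustrative script, not part of the repository; the listed table names are an assumed minimal set) verifies that the expected files are in place:

```python
# check_mimic_files.py -- illustrative sanity check, not part of the repository.
# Verifies that core MIMIC-III CSVs (assumed names) are present under data/mimic/.
from pathlib import Path

MIMIC_DIR = Path("data/mimic")
# Assumed minimal set of tables; adjust to whatever your SQL views require.
EXPECTED = ["NOTEEVENTS.csv", "ADMISSIONS.csv", "PATIENTS.csv"]

missing = [name for name in EXPECTED if not (MIMIC_DIR / name).exists()]
if missing:
    print(f"Missing files under {MIMIC_DIR}: {', '.join(missing)}")
else:
    print("All expected MIMIC-III CSVs found.")
```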
Install Anaconda (or Miniconda to save storage space).
Then create a conda environment (for example, patient_repr) and install the dependencies using the following commands:
$ conda create --name patient_repr python=3.9
$ conda activate patient_repr
$ conda install -y pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch -c conda-forge
$ conda install -y numpy scipy pandas scikit-learn
$ conda install -y tqdm gensim nltk
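To verify that PyTorch was installed correctly and can see your GPU, you can run:
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"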
Data Processing:
Run the data processing script to prepare the dataset.
python data_processing.py
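data_processing.py implements the full pipeline; the snippet below is only a minimal sketch of the kind of cleaning typically applied to MIMIC-III notes (lowercasing, stripping de-identification placeholders, tokenizing). The function name and the exact rules are illustrative assumptions, not the script's actual API.

```python
import re
from nltk.tokenize import word_tokenize  # requires nltk's 'punkt' tokenizer data

def clean_note(text: str) -> list[str]:
    """Illustrative cleaning of a single clinical note (assumed steps)."""
    text = text.lower()
    # MIMIC-III de-identification placeholders look like [**Hospital1 18**]
    text = re.sub(r"\[\*\*.*?\*\*\]", " ", text)
    # Drop non-alphabetic characters and collapse the remaining text into tokens
    text = re.sub(r"[^a-z\s]", " ", text)
    return word_tokenize(text)

print(clean_note("Pt admitted to [**Hospital1 18**] on [**2145-3-2**]; BP 120/80."))
```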
Model Training:
Train and evaluate the classifier using the processed data.
python training.py
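training.py contains the actual training and evaluation code; the sketch below only illustrates the general shape of training a classifier on precomputed patient embeddings with PyTorch. The tensors, dimensions, and architecture are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Assumed shapes: precomputed patient embeddings (N, 256) and integer labels (N,)
X = torch.randn(1000, 256)          # placeholder for embeddings loaded from data/processed/
y = torch.randint(0, 2, (1000,))    # placeholder labels

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()

# Simple evaluation on the same (placeholder) data
with torch.no_grad():
    accuracy = (model(X).argmax(dim=1) == y).float().mean()
print(f"final loss {loss.item():.4f}, accuracy {accuracy:.2%}")
```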
Results:
The results of the model evaluation will be saved in the results/ directory.
- `data_processing.py`: contains the clinical-note preprocessing steps used to create the final dataset
- `training.py`: contains the training and evaluation code for our classifiers
- `data` folder: contains the raw and processed datasets
- `models` folder: contains all embedding models used to learn document-level representations
- `patient_repr_aggregation` folder: contains all code used to experiment with different aggregation methods for learning patient-level representations (see the sketch after this list)
- `results` folder: contains the results obtained by our classifier
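As a rough illustration of what note-to-patient aggregation looks like (the actual methods live in `patient_repr_aggregation/`), mean and max pooling over a patient's note embeddings could be sketched as follows; the tensor shapes are assumptions.

```python
import torch

# Assumed input: one patient's note embeddings, shape (num_notes, embedding_dim)
note_embeddings = torch.randn(7, 256)

# Two simple aggregation baselines producing a single patient-level vector
patient_mean = note_embeddings.mean(dim=0)         # shape (256,)
patient_max = note_embeddings.max(dim=0).values    # shape (256,)

print(patient_mean.shape, patient_max.shape)
```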