This is the repository for the journal paper titled "Integrating Explainable Artificial Intelligence and Carcinogenic Potency Characterization for Safer Nitrosamine Risk Assessment in Drug Synthesis."
This project is licensed under the terms of the MIT license.
The scripts were run with Python 3 in a Miniconda environment. The following packages are required:
joblib 1.5.1
matplotlib-base 3.10.5
numpy 1.26.4
pandas 2.3.1
rdkit 2022.09.3
scikit-learn 1.7.1
tqdm 4.67.1
All scripts and their input files should be placed in the same folder to work as expected. The input file name(s) in each Python script must be edited manually as needed.
"Data_N.csv" is the base dataset of Ames test compounds, including both nitrosamines and non-nitrosamines.
"NitroNeg.csv" contains the Ames-negative nitrosamines from "Data_N.csv".
"PsuedoData.csv" is the dataset of structural analogs.
"Metabolites.csv" includes the nitrosamines generated by BioTransformer 3.0 (https://biotransformer.ca/) and their fingerprints.
The following describes how to use the code and datasets in this repository to reproduce the results presented in the paper.
The three scripts "Oversampling.py", "PsuedoData.py", and "PsuedoData+matebolic.py" increase the number of Ames-negative nitrosamines to balance the data.
The steps to implement the five data synthesis strategies from the paper are described below.
For "Oversampling.py", the input file is "Data_N.csv" and the default output is "Data_N_Oversampling.csv".
"Metabolites.csv" must be merged manually with "Data_N.csv" to form the augmented dataset, named "Data_N_Metabolites.csv" by default.
Use "Data_N_Metabolites.csv" as the input file of "Oversampling.py".
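The manual merge could also be done with pandas. The following is only a minimal sketch: it uses small in-memory tables in place of the real files, and the column names "ID" and "Label" are placeholders, not the actual columns of "Data_N.csv" or "Metabolites.csv". It assumes both tables share the same columns.

```python
import pandas as pd

# Toy stand-ins for "Data_N.csv" and "Metabolites.csv".
# "ID" and "Label" are hypothetical column names used only for illustration.
data_n = pd.DataFrame({"ID": ["A", "B"], "Label": [1, 0]})
metabolites = pd.DataFrame({"ID": ["C"], "Label": [0]})

# Stack the metabolite rows under the base dataset and renumber the index.
merged = pd.concat([data_n, metabolites], ignore_index=True)

# With the real files, the result would then be written out, for example:
# merged.to_csv("Data_N_Metabolites.csv", index=False)
print(merged)
```

With the real data, the two tables would be loaded with pd.read_csv first, and any columns present in only one file would need to be reconciled by hand.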
"PsuedoData.py" takes "Data_N.csv" and "PsuedoData.csv" as inputs and removes duplicate compounds. One output file is generated for each Tanimoto coefficient (Tc) threshold value from 0.60 to 0.85.
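To illustrate what a Tc threshold does, the sketch below computes the Tanimoto coefficient between fingerprints represented as sets of "on" bit positions and keeps the candidates that reach the threshold against at least one reference compound. This is one plausible reading of the filtering step, not the actual logic of "PsuedoData.py"; the function names and toy fingerprints are hypothetical, and only the two endpoint thresholds named above are shown.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def select_analogs(reference_fps, candidate_fps, threshold):
    """Indices of candidates whose Tc to any reference is >= threshold."""
    return [
        i
        for i, cand in enumerate(candidate_fps)
        if any(tanimoto(cand, ref) >= threshold for ref in reference_fps)
    ]


# Toy fingerprints (hypothetical bit positions).
reference = [{1, 2, 3, 4}]
candidates = [{1, 2, 3, 4, 5}, {1, 2, 9, 10}, {1, 2, 3}]

for tc in (0.60, 0.85):
    print(tc, select_analogs(reference, candidates, tc))
```

A higher threshold keeps fewer, more similar analogs, which is why each threshold yields a separate output file.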
"PsuedoData+matebolic.py" takes "Data_N_Metabolites.csv" and "PsuedoData.csv" as inputs. Before running it, "Data_N_Metabolites.csv" must be manually reorganized by
- removing Column H ("Metabolites") and the last column ("Outer") from "Data_N_Metabolites.csv", and
- entering "Metabolites" in the "Source" column for the metabolite rows.
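The two reorganization steps above can be sketched in pandas as follows. This is an illustration under assumptions: the toy table stands in for "Data_N_Metabolites.csv", the "ID" column and all cell values are hypothetical, and the sketch assumes that metabolite rows are the ones with a non-empty value in the "Metabolites" column.

```python
import pandas as pd

# Toy stand-in for "Data_N_Metabolites.csv"; "ID" and all values are hypothetical.
df = pd.DataFrame(
    {
        "ID": ["A", "B", "C"],
        "Source": ["Ames", "Ames", None],
        "Metabolites": [None, None, "yes"],  # assumed flag for metabolite rows
        "Outer": [0, 0, 1],
    }
)

# Step 2 first, while the flag column still exists:
# mark metabolite rows in the "Source" column.
is_metabolite = df["Metabolites"].notna()
df.loc[is_metabolite, "Source"] = "Metabolites"

# Step 1: drop the "Metabolites" and "Outer" columns.
df = df.drop(columns=["Metabolites", "Outer"])
print(df)
```

The "Source" column is filled before the columns are dropped because the row selection in this sketch depends on the "Metabolites" column.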
"RF_Allkind.py" and "RF_NitroOnly.py" are the scripts for building the RF models.
"RF_Allkind.py" builds a model from the Ames dataset including non-nitrosamines. The fingerprint, data augmentation method, and training set must be changed manually to generate different models.
"RF_NitroOnly.py" builds a model from the nitrosamine subset of the Ames dataset or from the analog dataset. The fingerprint, data augmentation method, and training set must be changed manually to generate different models.
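For orientation, a random forest classifier on binary fingerprints can be trained with scikit-learn roughly as shown below. This is a generic sketch, not the code in "RF_Allkind.py" or "RF_NitroOnly.py": it assumes "RF" refers to a random forest, the fingerprints and labels are randomly generated toy data, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: 40 random 16-bit fingerprints (not real compounds).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(40, 16))
y = X[:, 0]  # toy label that depends only on bit 0

# Hyperparameters here are arbitrary placeholders.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Class probabilities for the first five fingerprints.
proba = model.predict_proba(X[:5])
print(proba)
```

A fitted model like this can be saved with joblib.dump(model, "model.pkl") to obtain a ".pkl" file; the file name here is only an example.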
"Analyze_LIME.py" takes "NitroNeg.csv" and the model to be analyzed (in ".pkl" format) as input, and outputs the influence coefficient for every fingerprint.