Nitrosamine datasets and codes

This is the repository for the journal paper titled "Integrating Explainable Artificial Intelligence and Carcinogenic Potency Characterization for Safer Nitrosamine Risk Assessment in Drug Synthesis."

This project is licensed under the terms of the MIT license.

Requirements

The scripts were run using Python 3 in a Miniconda environment. The following packages are required for this project.

joblib 1.5.1

matplotlib-base 3.10.5

numpy 1.26.4

pandas 2.3.1

rdkit 2022.09.3

scikit-learn 1.7.1

tqdm 4.67.1

All scripts and their input files should be placed in the same folder to work as expected. The input file name(s) in each Python script should be edited manually as needed.

Description of Datasets

"Data_N.csv" is the base dataset of Ames test compounds, including both nitrosamines and non-nitrosamines.

"NitroNeg.csv" contains the Ames-negative nitrosamines from "Data_N.csv".

"PsuedoData.csv" is the dataset of structural analogs.

"Metabolites.csv" includes the nitrosamines generated by BioTransformer 3.0 (https://biotransformer.ca/) and their fingerprints.

Code Usage

The sections below describe how to use the code and datasets in this repository to reproduce the results presented in the paper.

Data Augmentation

The three scripts "Oversampling.py", "PsuedoData.py", and "PsuedoData+matebolic.py" increase the number of Ames-negative nitrosamines to balance the data.

The steps to implement the five data synthesis strategies from the paper are described below.

Oversampling

The input file is "Data_N.csv", and the default output of "Oversampling.py" is "Data_N_Oversampling.csv".
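As a rough illustration of what oversampling a minority class involves, the sketch below duplicates randomly chosen rows of the smaller class until both classes have the same size. It uses only the standard library. The column name `Label`, the helper `oversample_minority`, the placeholder records, and the random-duplication strategy itself are assumptions for illustration; "Oversampling.py" may work differently.

```python
import random
from collections import defaultdict


def oversample_minority(rows, label_key="Label", seed=0):
    """Randomly duplicate rows of smaller classes until all match the largest.

    `label_key` is a hypothetical column name; adjust it to the real header.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap to the majority class size.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Placeholder records standing in for rows read from the input CSV.
toy_rows = [
    {"ID": "mol_1", "Label": "1"},
    {"ID": "mol_2", "Label": "1"},
    {"ID": "mol_3", "Label": "1"},
    {"ID": "mol_4", "Label": "0"},
]
balanced_rows = oversample_minority(toy_rows)
```

With real data, the rows could be read with `csv.DictReader` and written back with `csv.DictWriter`.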

Metabolite addition

"Metabolites.csv" should be merged manually with "Data_N.csv" to form the augmented dataset, named "Data_N_Metabolites.csv" by default.
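The manual merge could also be scripted. The following is a minimal sketch, assuming that both files are plain CSV tables and that a simple row-wise append is what is wanted. The helper `merge_csv`, the column names, and the in-memory stand-ins for the two files are hypothetical.

```python
import csv
import io


def merge_csv(base_file, extra_file, out_file):
    """Append the rows of `extra_file` to those of `base_file`.

    The output header is the union of both headers (base columns first);
    cells missing from one input are left empty.
    """
    base = csv.DictReader(base_file)
    extra = csv.DictReader(extra_file)
    base_rows, extra_rows = list(base), list(extra)
    fields = list(base.fieldnames or [])
    fields += [name for name in (extra.fieldnames or []) if name not in fields]

    writer = csv.DictWriter(out_file, fieldnames=fields, restval="")
    writer.writeheader()
    writer.writerows(base_rows + extra_rows)
    return len(base_rows) + len(extra_rows)


# In-memory stand-ins for the two input files; headers and values are made up.
base_csv = io.StringIO("ID,Label\nmol_1,1\nmol_2,0\n")
extra_csv = io.StringIO("ID,Label,Parent\nmet_1,0,mol_1\n")
merged_csv = io.StringIO()
n_rows = merge_csv(base_csv, extra_csv, merged_csv)
```

For files on disk, the three arguments would be handles opened with `open(path, newline="")`, with the output opened in write mode.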

Metabolites + Oversampling

Use "Data_N_Metabolites.csv" as the input file of "Oversampling.py".

Analog addition

"PsuedoData.py" takes "Data_N.csv" and "PsuedoData.csv" as inputs and removes duplicate compounds. One output file is generated for each Tanimoto coefficient (Tc) threshold value from 0.60 to 0.85.
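For reference, the Tanimoto coefficient of two binary fingerprints A and B is |A ∩ B| / |A ∪ B|. The sketch below shows one way a Tc threshold could be used to select analogs, with fingerprints represented as sets of on-bit indices. The selection rule (keep an analog if its best Tc against any reference compound reaches the threshold), duplicate detection by identifier, the helper names, and the toy fingerprints are all assumptions; the actual script may apply a different rule and fingerprint representation.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_analogs(analogs, references, threshold):
    """Keep analogs whose best Tc against any reference reaches `threshold`.

    Analogs whose identifier already appears among the references are skipped
    as duplicates. Both arguments map an identifier to a set of on-bit indices.
    """
    kept = {}
    for name, fp in analogs.items():
        if name in references:
            continue  # duplicate compound, already in the base dataset
        best = max((tanimoto(fp, ref) for ref in references.values()), default=0.0)
        if best >= threshold:
            kept[name] = best
    return kept


# Hypothetical toy fingerprints, not real compounds.
references = {"ref_1": {1, 2, 3, 4, 5}, "ref_2": {10, 11, 12}}
analogs = {
    "ana_1": {1, 2, 3, 4, 6},     # Tc against ref_1 is 4/6
    "ana_2": {1, 2, 3, 4, 5, 6},  # Tc against ref_1 is 5/6
    "ref_2": {10, 11, 12},        # duplicate of a reference
}
for threshold in (0.60, 0.85):  # the two ends of the range quoted above
    print(threshold, sorted(select_analogs(analogs, references, threshold)))
```

Raising the threshold keeps fewer, more similar analogs, which is why each threshold yields a different augmented dataset.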

Metabolites + Analogs

"PsuedoData+matebolic.py" takes "Data_N_Metabolites.csv" and "PsuedoData.csv" as inputs. Before running it, "Data_N_Metabolites.csv" should be manually reorganized by

removing Column H ("Metabolites") and the last column ("Outer"), and
filling in "Metabolites" in the "Source" column for the metabolite rows.
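The reorganization above could be sketched in code as follows. The column names "Metabolites", "Outer", and "Source" come from the description above; the helper `reorganize`, the toy rows, and in particular the rule that a non-empty "Metabolites" cell identifies a metabolite row are assumptions that should be checked against the real file.

```python
def reorganize(rows, drop=("Metabolites", "Outer"), flag="Metabolites"):
    """Drop the listed columns and mark flagged rows as metabolites in "Source".

    Assumption: a non-empty `flag` cell identifies a metabolite row.
    """
    cleaned = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key not in drop}
        if row.get(flag):
            new_row["Source"] = "Metabolites"
        cleaned.append(new_row)
    return cleaned


# Placeholder rows standing in for records read from the merged CSV.
toy_rows = [
    {"ID": "mol_1", "Source": "Ames", "Metabolites": "", "Outer": "x"},
    {"ID": "met_1", "Source": "", "Metabolites": "1", "Outer": "y"},
]
reorganized = reorganize(toy_rows)
```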

Model construction

"RF_Allkind.py" and "RF_NitroOnly.py" are the scripts that build the random forest (RF) models.

"RF_Allkind.py" builds a model from the full Ames dataset, including non-nitrosamines. The fingerprint, data augmentation method, and training set must be changed manually in the script to generate different models.

"RF_NitroOnly.py" builds a model from the nitrosamine subset of the Ames dataset or from the analog dataset. The fingerprint, data augmentation method, and training set must be changed manually in the script to generate different models.
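For orientation, a generic random forest workflow with scikit-learn and joblib (both listed under Requirements) might look like the sketch below. It is not the code in "RF_Allkind.py" or "RF_NitroOnly.py": the synthetic fingerprint matrix, the toy label rule, the hyperparameters, and the file name `rf_model.pkl` are all placeholders.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a fingerprint matrix: 200 compounds x 64 bits.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 64))
y = (X[:, 0] | X[:, 1]).astype(int)  # toy label rule, not real Ames data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hyperparameters are placeholders, not the values used in the paper.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Persist the model as ".pkl" so it can be reloaded for later analysis.
model_path = os.path.join(tempfile.mkdtemp(), "rf_model.pkl")
joblib.dump(model, model_path)
reloaded = joblib.load(model_path)
```

Fixing `random_state` makes the split and the forest reproducible, which helps when comparing models built with different fingerprints or augmentation methods.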

Explainability via LIME

"Analyze_LIME.py" takes "NitroNeg.csv" and the model to be analyzed (in ".pkl" format) as input, and outputs the influence coefficient for every fingerprint.
