This repository was archived by the owner on Nov 20, 2025. It is now read-only.

The repository of datasets and code for the journal paper titled "Integrating Explainable Artificial Intelligence and Carcinogenic Potency Characterization for Safer Nitrosamine Risk Assessment in Drug Synthesis."


CMDM-Lab/nitrosamine


Nitrosamine datasets and codes

This is the repository of the journal paper titled "Integrating Explainable Artificial Intelligence and Carcinogenic Potency Characterization for Safer Nitrosamine Risk Assessment in Drug Synthesis."

This project is licensed under the terms of the MIT license.

Requirements

The scripts were run with Python 3 in a Miniconda environment. The project requires the following packages:

joblib 1.5.1

matplotlib-base 3.10.5

numpy 1.26.4

pandas 2.3.1

rdkit 2022.09.3

scikit-learn 1.7.1

tqdm 4.67.1
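
To compare an existing environment against the versions listed above, a small check along the following lines may help. It is only a sketch: the keys are PyPI distribution names, which is an assumption (the conda package "matplotlib-base" is installed as "matplotlib"), and find_mismatches is a hypothetical helper, not part of this repository. Some packages may report a differently formatted version string than the one listed here, so a mismatch is a prompt to look closer rather than proof of a problem.

```python
from importlib import metadata

# Versions listed in this README, keyed by PyPI distribution name (assumption:
# conda's "matplotlib-base" corresponds to the "matplotlib" distribution).
REQUIRED = {
    "joblib": "1.5.1",
    "matplotlib": "3.10.5",
    "numpy": "1.26.4",
    "pandas": "2.3.1",
    "rdkit": "2022.09.3",
    "scikit-learn": "1.7.1",
    "tqdm": "4.67.1",
}


def find_mismatches(required, lookup=metadata.version):
    """Return {name: (expected, found)} for packages that differ or are missing."""
    mismatches = {}
    for name, expected in required.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Calling find_mismatches(REQUIRED) in the target environment returns an empty dictionary when every listed version matches exactly.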

All scripts and their input files must be placed in the same folder to work as expected. Input file names are set inside each Python script and should be edited by hand as needed.

Description of Datasets

"Data_N.csv" is the base dataset of Ames test compounds, including both nitrosamines and non-nitrosamines.

"NitroNeg.csv" contains the Ames-negative nitrosamines from "Data_N.csv".

"PsuedoData.csv" is the structural analog dataset.

"Metabolites.csv" includes the nitrosamines generated by BioTransformer 3.0 (https://biotransformer.ca/) and their fingerprints.

Code Usage

This section describes how to use the code and datasets in this repository to reproduce the results presented in the paper.

Data Augmentation

Three scripts, "Oversampling.py", "PsuedoData.py", and "PsuedoData+matebolic.py", increase the number of Ames-negative nitrosamines to balance the data.

The steps for implementing the five data augmentation strategies described in the paper are given below.

Oversampling

The input file is "Data_N.csv", and the default output of "Oversampling.py" is "Data_N_Oversampling.csv".
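
For readers unfamiliar with the idea, the sketch below shows one common form of oversampling: duplicating randomly chosen rows of the under-represented group until it reaches a target size. It is not taken from "Oversampling.py", whose exact method may differ. The function name oversample and the column names "SMILES" and "Ames" are hypothetical.

```python
import random


def oversample(rows, is_target, target_count, seed=0):
    """Return rows plus random duplicates of target rows, up to target_count targets."""
    targets = [r for r in rows if is_target(r)]
    missing = target_count - len(targets)
    if not targets or missing <= 0:
        return list(rows)  # nothing to duplicate, or already large enough
    rng = random.Random(seed)
    # Sample with replacement from the existing target rows.
    return list(rows) + rng.choices(targets, k=missing)


# Toy rows with hypothetical column names (not the real dataset).
rows = [
    {"SMILES": "A", "Ames": 1},
    {"SMILES": "B", "Ames": 1},
    {"SMILES": "C", "Ames": 1},
    {"SMILES": "D", "Ames": 0},
]
positives = sum(r["Ames"] == 1 for r in rows)
balanced = oversample(rows, lambda r: r["Ames"] == 0, target_count=positives)
```

A fixed seed keeps the duplicated rows reproducible between runs.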

Metabolite addition

"Metabolites.csv" should be merged manually with "Data_N.csv" to form the augmented dataset, named "Data_N_Metabolites.csv" by default.

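The manual merge can also be scripted. The following is a minimal sketch with pandas, assuming both files share the columns that matter and that simply appending the metabolite rows is what is wanted. merge_metabolites is a hypothetical helper, and the toy column names are placeholders for the real layout.

```python
import pandas as pd


def merge_metabolites(base, metabolites):
    """Append metabolite rows below the base table, keeping the base columns first."""
    return pd.concat([base, metabolites], ignore_index=True, sort=False)


# Toy tables with hypothetical column names (not the real layout).
base = pd.DataFrame({"SMILES": ["A", "B"], "Ames": [1, 0]})
metabolites = pd.DataFrame({"SMILES": ["C"], "Ames": [0]})
merged = merge_metabolites(base, metabolites)

# With the real files, the same call could be used as:
# base = pd.read_csv("Data_N.csv")
# metabolites = pd.read_csv("Metabolites.csv")
# merge_metabolites(base, metabolites).to_csv("Data_N_Metabolites.csv", index=False)
```

Columns present in only one of the two files are kept and left empty for rows from the other file, so the result should be inspected before use.
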
Metabolites + Oversampling

Use "Data_N_Metabolites.csv" as the input file of "Oversampling.py".

Analog addition

"PsuedoData.py" takes "Data_N.csv" and "PsuedoData.csv" as inputs and removes duplicate compounds. One output file is generated for each Tanimoto coefficient (Tc) threshold value from 0.60 to 0.85.

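To illustrate what a Tc threshold means, the sketch below computes the Tanimoto coefficient on fingerprints represented as sets of on-bit indices and keeps analogs that are not already present and whose best Tc against any reference compound reaches the threshold. That selection rule is an assumption about how such filtering is typically done, not a description of "PsuedoData.py". The names tanimoto and select_analogs are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def select_analogs(reference, analogs, threshold):
    """Keep analogs that are new and similar enough to at least one reference compound.

    reference, analogs: dicts mapping a compound identifier to its on-bit set.
    """
    kept = {}
    for name, bits in analogs.items():
        if name in reference:  # duplicate of a compound already in the dataset
            continue
        best = max((tanimoto(bits, ref) for ref in reference.values()), default=0.0)
        if best >= threshold:
            kept[name] = bits
    return kept


# Toy fingerprints (not real compounds).
reference = {"ref": {1, 2, 3, 4}}
analogs = {"ref": {1, 2, 3, 4}, "near": {1, 2, 3}, "far": {7, 8}}
loose = select_analogs(reference, analogs, threshold=0.60)
strict = select_analogs(reference, analogs, threshold=0.85)
```

Here "near" shares three of four bits with "ref" (Tc = 0.75), so it passes the 0.60 threshold but not the 0.85 one.
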
Metabolites + Analogs

"PsuedoData+matebolic.py" takes "Data_N_Metabolites.csv" and "PsuedoData.csv" as inputs. "Data_N_Metabolites.csv" should be manually reorganized by

  1. removing Column H ("Metabolites") and the last column ("Outer"), and
  2. entering "Metabolites" in the "Source" column for the metabolite rows.
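
The two manual steps above can be sketched with pandas as follows. The column names "Metabolites", "Outer", and "Source" come from the description above. How metabolite rows are identified is not stated, so the caller supplies a boolean mask. In the toy example the mask is a non-empty "Metabolites" cell, which is purely an assumption. reorganize is a hypothetical helper.

```python
import pandas as pd


def reorganize(df, is_metabolite):
    """Drop the "Metabolites" and "Outer" columns and label metabolite rows in "Source"."""
    out = df.drop(columns=["Metabolites", "Outer"])
    out.loc[is_metabolite, "Source"] = "Metabolites"
    return out


# Toy table; "SMILES" and the cell values are hypothetical.
df = pd.DataFrame(
    {
        "SMILES": ["A", "B"],
        "Source": ["Ames", None],
        "Metabolites": [None, "yes"],
        "Outer": [0, 1],
    }
)
mask = df["Metabolites"].notna()  # assumption: metabolite rows have this cell filled
cleaned = reorganize(df, mask)
```

The mask is computed before the "Metabolites" column is dropped, so the information it carries is not lost.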

Model construction

"RF_Allkind.py" and "RF_NitroOnly.py" build the random forest (RF) models.

"RF_Allkind.py" builds a model from the Ames dataset, including non-nitrosamines. The fingerprint, data augmentation method, and training set should be changed manually to generate different models.

"RF_NitroOnly.py" builds a model from the nitrosamine subset of the Ames dataset or from the Analog dataset. The fingerprint, data augmentation method, and training set should be changed manually to generate different models.

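As a rough picture of what building and saving an RF model involves, the sketch below fits scikit-learn's RandomForestClassifier on a toy 0/1 fingerprint matrix and stores it as a ".pkl" file with joblib. The hyperparameters, the toy data, and the helper name train_rf are assumptions. The settings used in "RF_Allkind.py" and "RF_NitroOnly.py" may differ.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def train_rf(X, y, path, seed=0):
    """Fit a random forest on fingerprint bits X and labels y, then save it to path."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X, y)
    joblib.dump(model, path)
    return model


# Toy stand-in data: 40 compounds with 16-bit fingerprints (not the real dataset).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(40, 16))
y = X[:, 0]  # label equals bit 0, so the forest has something to learn

path = os.path.join(tempfile.mkdtemp(), "rf_toy.pkl")
model = train_rf(X, y, path)
reloaded = joblib.load(path)  # the saved file can be loaded again for later analysis
```

Fixing random_state makes repeated runs produce the same forest, which helps when comparing fingerprints or augmentation strategies.
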
Explainability via LIME

"Analyze_LIME.py" takes "NitroNeg.csv" and the model to be analyzed (in ".pkl" format) as input, and outputs the influence coefficient for every fingerprint.

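The general idea behind LIME can be sketched in a few lines: perturb one fingerprint by switching bits off at random, score the perturbed copies with the model, and fit a proximity-weighted linear model whose coefficients indicate each bit's local influence. The code below is a simplified illustration of that idea using only numpy. It is not the implementation in "Analyze_LIME.py", and the function name, kernel width, and sampling settings are assumptions.

```python
import numpy as np


def lime_coefficients(predict, x, n_samples=500, keep_prob=0.7, ridge=1e-3, seed=0):
    """Fit a local linear surrogate around one binary fingerprint x.

    predict: function mapping an (n, d) 0/1 array to n scores.
    Returns one coefficient per fingerprint bit.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    d = x.size
    # Interpretable samples: 1 keeps the bit as in x, 0 switches it off.
    z = (rng.random((n_samples, d)) < keep_prob).astype(float)
    z[0] = 1.0  # include the unperturbed instance
    scores = np.asarray(predict(z * x), dtype=float)
    # Proximity weights: samples that keep more bits count more.
    distance = 1.0 - z.mean(axis=1)
    weights = np.exp(-(distance ** 2) / 0.25)
    # Weighted ridge regression with an intercept column.
    design = np.hstack([np.ones((n_samples, 1)), z])
    gram = design.T @ (weights[:, None] * design) + ridge * np.eye(d + 1)
    beta = np.linalg.solve(gram, design.T @ (weights * scores))
    return beta[1:]  # drop the intercept


# Toy scorer whose output depends only on bit 0.
coef = lime_coefficients(lambda a: a[:, 0], np.ones(5))
```

With a saved classifier loaded through joblib, predict could be something like lambda a: model.predict_proba(a)[:, 1], assuming the model accepts a 0/1 fingerprint matrix.
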