The array-type meta-predictor based classifier was developed for the prediction of harmful agents.
The following packages must be installed to process the data, generate fingerprints, and train and test your model.
- KNIME analytics platform (> 3.5)
- Python (> 3.0)
- Keras
- TensorFlow
- scikit-learn
- Jupyter Notebook
- R / RStudio
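
A quick way to confirm that the Python-side requirements are present is sketched below. It is only a convenience check, not part of the original pipeline. The import names (`keras`, `tensorflow`, `sklearn`, `notebook`) are the usual ones for the packages listed above, and the helper names are illustrative. KNIME and R must be checked separately.

```python
import sys
from importlib.util import find_spec

# Import names (not pip names) of the Python packages listed above.
REQUIRED_MODULES = ["keras", "tensorflow", "sklearn", "notebook"]


def check_environment(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be located."""
    return {name: find_spec(name) is not None for name in modules}


def python_is_recent_enough(minimum=(3, 0)):
    """Check the interpreter version against the stated requirement (> 3.0)."""
    return sys.version_info[:2] > minimum


if __name__ == "__main__":
    print("Python version OK:", python_is_recent_enough())
    for name, found in check_environment().items():
        print(f"{name:12s} {'found' if found else 'MISSING'}")
```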
- Chemical data were collected from ChEMBL (version 24) for five targets: nicotinic acetylcholine receptors (nAChR), muscarinic acetylcholine receptors (mAChR), the vesicular acetylcholine transporter (VAChT), acetylcholinesterase (AChE), and butyrylcholinesterase (BuChE), using the SQL file provided by ChEMBL.
- Negative data: We used Decoy Finder to collect negative data for each target, in order to keep the active and inactive classes balanced.
- External data: To evaluate the performance of the classifier, we used external datasets of Chemical Warfare Agents (CWA) and New Psychoactive Substances (NPS).
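
| Targets | Collected Data | Train (Active) | Train (Inactive) | Test (Active) | Test (Inactive) | Total |
|---------|----------------|----------------|------------------|---------------|-----------------|-------|
| nAChR   | 3175           | 637            | 637              | 272           | 272             | 1818  |
| mAChR   | 24223          | 2431           | 2431             | 1041          | 1041            | 6944  |
| BuChE   | 2414           | 484            | 484              | 207           | 207             | 1382  |
| AChE    | 8212           | 1086           | 1086             | 464           | 464             | 3098  |
| VAChT   | 231            | 106            | 106              | 45            | 45              | 302   |
| CWA     | 95             |                |                  |               |                 | 95    |
| NPS     | 3126           |                |                  |               |                 | 3126  |
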
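
The train and test counts in the table are consistent with a split of roughly 70/30 within each class (for example 637 / 272 for nAChR). The exact procedure is not stated here, so the following is only a sketch of such a per-class split. The function name, the 0.7 fraction, the seed, and the compound identifiers are all assumptions made for illustration.

```python
import random


def split_class(items, train_fraction=0.7, seed=0):
    """Shuffle the members of one class and cut them into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical identifiers; 909 per class matches train + test for nAChR.
actives = [f"active_{i}" for i in range(909)]
decoys = [f"decoy_{i}" for i in range(909)]

train_act, test_act = split_class(actives)
train_dec, test_dec = split_class(decoys)

# Splitting each class separately keeps both subsets balanced.
print(len(train_act), len(train_dec), len(test_act), len(test_dec))
```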
- We used the CDK Descriptor Kit to calculate eight fingerprints (ECFP0, ECFP2, ECFP4, ECFP6, FCFP0, FCFP2, FCFP4, FCFP6), each with a fixed bit-vector length of 1024. All input data, along with the calculated features, are available under the `data` folder.
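
The fingerprints themselves are computed by CDK. The sketch below only illustrates the general idea of folding a set of substructure identifiers into a fixed-length 1024-bit vector. It is not the ECFP/FCFP algorithm: the feature strings and the use of CRC32 as the hash are placeholders chosen for the example.

```python
import zlib

N_BITS = 1024  # fixed bit-vector length stated above


def fold_features(features, n_bits=N_BITS):
    """Map arbitrary feature identifiers onto a fixed-length 0/1 vector."""
    bits = [0] * n_bits
    for feature in features:
        # A stable hash modulo the vector length picks the bit to switch on.
        index = zlib.crc32(feature.encode("utf-8")) % n_bits
        bits[index] = 1
    return bits


# Hypothetical substructure identifiers for a single molecule.
features = ["C(=O)O", "c1ccccc1", "N(C)(C)C"]
fingerprint = fold_features(features)
print(len(fingerprint), "bits,", sum(fingerprint), "set")
```

Because different features can hash to the same position, the number of set bits may be smaller than the number of features.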
- To build the classification models, we used four machine learning methods: Random Forest, Decision Tree, Support Vector Machine, and k-Nearest Neighbor.
- We used R for machine learning classification; the R code for one target (nAChR) is provided under the `script` folder. The same code was used for the other targets. We built 10 models (M1-M10) for each ML classifier and each target, which results in 4 ML methods × 5 targets × 10 models (M1-M10) = 200 models.
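
The model count can be checked by enumerating every combination of method, target, and model index. The short labels used for the methods are abbreviations chosen for this example.

```python
from itertools import product

METHODS = ["RF", "DT", "SVM", "kNN"]                      # 4 ML methods
TARGETS = ["nAChR", "mAChR", "VAChT", "AChE", "BuChE"]    # 5 targets
MODELS = [f"M{i}" for i in range(1, 11)]                  # M1 .. M10

# One entry per trained base model: (method, target, model).
model_grid = list(product(METHODS, TARGETS, MODELS))
print(len(model_grid))  # 200
```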
Figure 2: The statistical performance based on AUC in external and internal validation for respective targets.
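
For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one. The sketch below implements that definition directly in plain Python. It is meant as an illustration of the metric only and is not the R code used to produce the figure.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    Labels are 1 for active and 0 for inactive; tied scores count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))  # 0.75
```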
- To build the array-type meta-predictor based classifier, we used a CNN architecture with multiple layers for learning, validation, and testing.
- After training, testing, and validating the ML classifiers for the different targets, the external set was used as a third set to evaluate ML performance. The predicted values on the external sets for each target, each ML method, and each model (M1-M10) were used as the input feature matrix, in different shapes (i.e. CNN, CNN-3D, CNN-3D Reshaped), for the array-type meta-predictor based CNN classification.
- The array-type meta-predictor CNN code is available under the `script` folder.
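
To make the input to the meta-predictor concrete, the sketch below arranges the 200 base-model outputs for one compound as a targets × methods × models array and also flattens it to a single vector. The axis order, the helper names, and the random stand-in for the base-model predictions are assumptions. The layouts actually used for the CNN, CNN-3D, and CNN-3D Reshaped variants are defined in the code under the script folder.

```python
import random

N_TARGETS, N_METHODS, N_MODELS = 5, 4, 10


def build_feature_array(predict):
    """Arrange base-model outputs for one compound as targets x methods x models."""
    return [
        [[predict(t, m, k) for k in range(N_MODELS)] for m in range(N_METHODS)]
        for t in range(N_TARGETS)
    ]


def flatten(array_3d):
    """Collapse the 3D arrangement into one flat feature vector."""
    return [value for plane in array_3d for row in plane for value in row]


rng = random.Random(0)


def fake_predict(target, method, model):
    """Stand-in for a base model's predicted probability (hypothetical)."""
    return rng.random()


features_3d = build_feature_array(fake_predict)
features_flat = flatten(features_3d)
print(len(features_3d), len(features_3d[0]), len(features_3d[0][0]))  # 5 4 10
print(len(features_flat))  # 200
```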