From bdb9b494c1be66cf5d71c8c9000289b71d302e32 Mon Sep 17 00:00:00 2001
From: Koh Jia Xuan <51515698+kohjiaxuan@users.noreply.github.com>
Date: Sat, 8 Feb 2020 00:31:05 +0800
Subject: [PATCH 1/2] Update README.md
More information on the types of evaluation metrics output, and images showing examples of the model pipeline running
---
README.md | 45 ++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 40 insertions(+), 5 deletions(-)
diff --git a/README.md b/README.md
index 183e2bb..333628d 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,9 @@
# Fraud-Detection-Pipeline
-A structured data pipeline for classification problems that does multiple purposes like scaling, sampling, k-fold cross validation with evaluation metrics.
+A structured data pipeline for classification problems that serves multiple purposes, such as scaling, sampling, and k-fold stratified cross validation (CV) with evaluation metrics.
It reduces the need for users to rewrite a lot of code, as its reusability is very high.
-Refer to sklearn_classification_pipeline.py for the full code
+Refer to sklearn_classification_pipeline.py for the full code.
+For a pipeline without k-fold cross validation, which allows faster testing, use sklearn_classifier_pipeline_optionalCV.py. This file still supports sklearn's internal cross validation method, which can be switched on or off via a parameter. To use the custom k-fold stratified cross validation method, use sklearn_classification_pipeline.py instead.
+
# Strengths of using a data pipeline:
1. Customized pipeline works for all forms of classification problems, including fraud detection problems that require oversampling techniques.
2. Data pipeline allows users to do prior data cleansing and feature engineering, as long as df is in DataFrame format with both the features and the response
@@ -13,9 +15,11 @@ Refer to sklearn_classification_pipeline.py for the full code
8. High customizability and reusability of the code, reducing the need to rewrite a lot of code (which often leads to spaghetti code)
# Instructions:
-To run pipeline, create new class object modelpipeline()
-Next, execute modelpipeline.runmodel(...)
-## Parameters are:
+To run the pipeline, import sklearn_classification_pipeline.py (stratified CV) or sklearn_classifier_pipeline_optionalCV.py (CV can be switched off).
+Create a new class object with modelpipeline().
+Next, execute modelpipeline.runmodel(...) with the required parameters; a usage sketch is shown after the parameter list below.
+
+## Parameters for sklearn_classification_pipeline.py are:
1. df - DataFrame that has gone through data cleaning, processing, and feature engineering, containing both features and the response
No standardization/scaling is required as there is a built-in function for that
2. varlist - List of all variables/features to use in the model, including the response variable.
@@ -25,3 +29,34 @@ Next, execute modelpipeline.runmodel(...)
5. modelname - Choose the type of model to run - users can add more models as required by extending the if/else clause that checks this string input in the buildmodel function
6. text - Title text for the confusion matrix output in each iteration of the n-fold stratified cross validation
7. n-fold - Number of folds for the stratified cross validation
+8. Note that sklearn_classifier_pipeline_optionalCV.py has an additional parameter at the end called CV. If CV=False, then cross validation will be switched off.
+9. Remember to assign the returned dictionary object to a variable, e.g. results = modelpipeline.runmodel(...), so that the evaluation results can be kept and reused.
+
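+A minimal usage sketch is shown below. Everything in it is an illustrative placeholder rather than the definitive API: the import style, the column names, the "LOGISTIC" model string, and param3/param4 (which simply stand in for parameters 3 and 4 of the list above) should be replaced with your own values.
+
+```python
+import pandas as pd
+from sklearn_classification_pipeline import modelpipeline  # stratified CV version
+
+# df must already be cleaned and feature engineered, containing features + response
+df = pd.read_csv("transactions.csv")                  # placeholder dataset
+varlist = ["amount", "age", "txn_count", "is_fraud"]  # features plus the response column
+
+param3 = None  # placeholder standing in for documented parameter 3
+param4 = None  # placeholder standing in for documented parameter 4
+
+pipeline = modelpipeline()
+results = pipeline.runmodel(df, varlist, param3, param4,
+                            "LOGISTIC",            # modelname - placeholder string
+                            "Fraud Detection CM",  # text for the confusion matrix titles
+                            5)                     # n-fold: number of stratified CV folds
+```
+
+For sklearn_classifier_pipeline_optionalCV.py, import modelpipeline from that file instead and pass the additional CV parameter described in point 8.
+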
+# Results:
+
+
+After the tests have finished, you can read the dictionary object storing the evaluation metric results. Here, results['final'] stores the averaged results for the k-fold cross validation, while the other key-value pairs store the evaluation metric result of each individual iteration in a list.
+
+
+The dictionary object returned will have results for each fold of k-fold cross validation. Evaluation metrics include:
+1. Accuracy
+2. Actual Accuracy (Optional - can be a hold out dataset for testing and can be other metrics other than accuracy
+3. Sensitivity
+4. Specificity
+5. Precision
+6. f1 score
+7. (ROC) AUC value
+8. (PR) AUC value
+9. Averaged values for 1-8, stored in the dictionary object under the 'final' key
+
+Object Template = {"accuracy": [...], "actual_accuracy": [...], "sensitivity": [...], "specificity": [...],
+ "precision": [...], "f1": [...], "auc": [...], "pr_auc": [...], "final": {...}}
+
+For sklearn_classifier_pipeline_optionalCV.py, it returns results in a string format instead of a list of strings, as there is only one round of train-test. There is also the option to tweak the code to export the best model and transformed train-test dataset for further usage. By default, this is not done, in order to free up memory.
+
+## Graphs for classification problems
+
+In the last iteration, the ROC-AUC curve and PR-AUC curve will be plotted for users to analyze. For the individual AUC values, users can refer to the dictionary object output.
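+
+The pipeline produces these plots by itself. Purely as a reference, a generic sklearn/matplotlib sketch of how ROC and PR curves are typically drawn from held-out labels and predicted probabilities is shown below (y_test and y_prob are synthetic placeholders here, not variables exposed by the pipeline):
+
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn.metrics import roc_curve, precision_recall_curve, auc
+
+# synthetic labels and scores purely for illustration
+rng = np.random.default_rng(0)
+y_test = rng.integers(0, 2, size=500)
+y_prob = np.clip(y_test * 0.6 + rng.random(500) * 0.5, 0, 1)
+
+fpr, tpr, _ = roc_curve(y_test, y_prob)
+precision, recall, _ = precision_recall_curve(y_test, y_prob)
+
+fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
+ax1.plot(fpr, tpr, label="ROC AUC = %.3f" % auc(fpr, tpr))
+ax1.plot([0, 1], [0, 1], linestyle="--")  # chance line
+ax1.set_xlabel("False Positive Rate")
+ax1.set_ylabel("True Positive Rate")
+ax1.legend()
+
+ax2.plot(recall, precision, label="PR AUC = %.3f" % auc(recall, precision))
+ax2.set_xlabel("Recall")
+ax2.set_ylabel("Precision")
+ax2.legend()
+plt.show()
+```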
+
+
+
From 5d0cd821a916df174ea785474f27273a440c5bb8 Mon Sep 17 00:00:00 2001
From: Koh Jia Xuan <51515698+kohjiaxuan@users.noreply.github.com>
Date: Sat, 8 Feb 2020 00:34:13 +0800
Subject: [PATCH 2/2] Update README.md
Formatting of README.md
---
README.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 333628d..26dec6d 100644
--- a/README.md
+++ b/README.md
@@ -40,7 +40,7 @@ After the tests have finished, you can read the dictionary object storing the ev
The dictionary object returned will have results for each fold of k-fold cross validation. Evaluation metrics include:
1. Accuracy
-2. Actual Accuracy (Optional - can be a hold out dataset for testing and can be other metrics other than accuracy
+2. Actual Accuracy (Optional - can be a hold-out dataset for testing, and metrics other than accuracy can also be used)
3. Sensitivity
4. Specificity
5. Precision