# Fraud-Detection-Pipeline
A structured data pipeline for classification problems that serves multiple purposes, such as scaling, sampling, and k-fold stratified cross validation (CV) with evaluation metrics. <br>
It reduces the need for users to rewrite a lot of code, as its reusability is very high.<br>
Refer to <b>sklearn_classification_pipeline.py</b> for the full code <br>
For a pipeline without the custom k-fold cross validation (which makes testing faster), use sklearn_classifier_pipeline_optionalCV.py. This file still supports sklearn's internal cross validation method, which can be switched on or off via a parameter. To use the custom k-fold stratified cross validation method, use sklearn_classification_pipeline.py instead.
<br><br>
# Strengths of using a data pipeline:
1. The customized pipeline works for all forms of classification problems, including fraud detection problems that require oversampling techniques. <br>
2. The data pipeline allows users to do prior data cleansing and feature engineering, as long as df is in DataFrame format with both features and response <br>
8. High customizability and reusability of the code, reducing the need to rewrite a lot of code (which leads to spaghetti code)

# Instructions:
To run the pipeline, import sklearn_classification_pipeline.py (stratified CV) or sklearn_classifier_pipeline_optionalCV.py (CV can be switched off). <br>
Create a new class object modelpipeline() <br>
Next, execute modelpipeline.runmodel(...) with the required parameters (see the usage sketch after the parameter list below) <br>

## Parameters for sklearn_classification_pipeline.py are:
1. df - DataFrame that has gone through data cleaning, processing, and feature engineering, containing both features and response
<br> No standardization/scaling is required as there is a built-in function for that <br>
2. varlist - List of all variables/features to use in the model, including the response variable. <br>
5. modelname - Choose the type of model to run; the user can add more models as required via the if/else clause that checks this string input in the buildmodel function <br>
6. text - Text title for the confusion matrix output in each iteration of the n-fold stratified cross validation <br>
7. n-fold - Number of folds for the stratified cross validation <br>
8. Note that sklearn_classifier_pipeline_optionalCV.py has an additional parameter at the end called CV. If CV=False, cross validation is switched off. <br>
9. Remember to save the dictionary object into a variable, e.g. results = modelpipeline.runmodel(...), so that the evaluation results can be saved and reused.
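
As a rough illustration, a minimal usage sketch follows. It assumes the modelpipeline()/runmodel() interface described above; the keyword names, dataset, and column names are placeholders, and not every parameter from the list is shown, so consult the source file for the exact signature.

```python
# Minimal usage sketch - NOT the exact signature. modelpipeline()/runmodel()
# come from the instructions above; the dataset, column names, and keyword
# names are hypothetical placeholders.
import pandas as pd
import sklearn_classification_pipeline as scp  # stratified CV variant

# df: cleaned/engineered DataFrame containing both features and the response
df = pd.read_csv('fraud_data.csv')
varlist = ['amount', 'txn_count', 'is_fraud']  # features plus response column

pipeline = scp.modelpipeline()

# Save the returned dictionary so the evaluation metrics can be reused (item 9)
results = pipeline.runmodel(df, varlist,
                            modelname='logistic',        # checked in buildmodel
                            text='Fraud detection CV run',
                            n_fold=5)                    # folds for stratified CV
print(results['final'])  # averaged metrics across the k folds
```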

# Results:
![Confusion Matrix returned from each iteration of k-fold cross validation](https://github.com/kohjiaxuan/Fraud-Detection-Pipeline/blob/master/Confusion_Matrix.PNG)
<br><br>
After the tests have finished, you can read the dictionary object storing the evaluation metrics results. In this case, results['final'] stores the averaged results for k-fold cross validation, while the other key-value pairs store the evaluation metric results of each individual iteration in a list. <br>
![Sample Results returned from the modelpipeline object](https://github.com/kohjiaxuan/Fraud-Detection-Pipeline/blob/master/results.PNG)
<br>
The dictionary object returned will have results for each fold of k-fold cross validation. Evaluation metrics include: <br>
1. Accuracy
2. Actual Accuracy (Optional - can be computed on a hold-out test dataset and can be a metric other than accuracy)
3. Sensitivity
4. Specificity
5. Precision
6. f1 score
7. (ROC) AUC value
8. (PR) AUC value
9. Averaged values for 1-8, stored in a dictionary object under the 'final' key
<br>
Object Template = {"accuracy": [...], "actual_accuracy": [...], "sensitivity": [...], "specificity": [...],
"precision": [...], "f1": [...], "auc": [...], "pr_auc": [...], "final": {...}}
<br>
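To make the template concrete, here is a small sketch of how the per-fold lists relate to the averaged 'final' entry (the numbers are dummy placeholders, not real results):

```python
from statistics import mean

# Dummy per-fold values, only to illustrate the template above; real runs
# return one entry per CV fold for each metric.
results = {
    "accuracy": [0.97, 0.95, 0.96],
    "f1": [0.62, 0.58, 0.60],
}
# 'final' holds the values averaged over all folds, as described above
results["final"] = {metric: mean(scores) for metric, scores in results.items()}
print(results["f1"])     # per-fold f1 scores
print(results["final"])  # averaged metrics
```
<br>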
sklearn_classifier_pipeline_optionalCV.py returns each result as a single string instead of a list of strings, as there is only one round of train-test. There is also the option to tweak the code to export the best model/transformed train-test dataset for further use. By default, this is not done, to free up memory. <br><br>
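A similarly hedged sketch of the optionalCV variant with cross validation switched off (the CV parameter is per item 8 above; data and keyword names are again placeholders):

```python
import pandas as pd
import sklearn_classifier_pipeline_optionalCV as scpo  # faster, optional CV

df = pd.read_csv('fraud_data.csv')                     # hypothetical dataset
varlist = ['amount', 'txn_count', 'is_fraud']

pipeline = scpo.modelpipeline()
# CV=False switches off sklearn's internal cross validation, so a single
# train-test round runs and the metrics come back as strings.
results = pipeline.runmodel(df, varlist,
                            modelname='logistic',
                            text='Fraud detection - single split',
                            CV=False)
```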

## Graphs for classification problems

In the last iteration, the ROC-AUC and PR-AUC curves are plotted for users to analyze. For the individual AUC values, users can refer to the dictionary object output. <br>
![ROC Curve that plots True Positive Rate against False Positive Rate](https://github.com/kohjiaxuan/Fraud-Detection-Pipeline/blob/master/ROC_AUC_Curve.PNG)
<br><br>
![PR Curve that plots Precision against Recall](https://github.com/kohjiaxuan/Fraud-Detection-Pipeline/blob/master/PR_AUC_Curve.PNG)
