- Introduction - Machine Learning and Risk mitigation in Finance
- Data, Technology and Coding Standards
- Basic Statistical Analysis
- Machine Learning Analysis and procedures
- References
The financial sector is arguably the most predominant and influential industry in the world.
As in any industry and any society, no matter how advanced, there are always ways and opportunities to cheat the system — exploiting misinformation or poor security and operations for financial gain.
As the world evolves and technology advances, we see more and more cases of fraud and data theft within the financial sector each year, costing businesses, individuals, and the economy billions of dollars.
The purpose of this project is to develop machine learning models that demonstrate the importance and effectiveness of this technology in the industry, and make the case for implementing it across the board.
-
Credit Card Data For ML model (Kaggle)
The dataset used for the machine learning model is a credit card transactions dataset. It has no obfuscation, as its more than 24 million transactions were generated from a virtual world simulation. The data covers 2,000 synthetic consumer residents in the US who travel the world. For this analysis, only 2019 data has been extracted, because fraud data is missing in some other years.
-
Fraud Statistics
- The Ascent
- Federal Trade Commission
- Google Colab
- python 3.8.3
- pandas 1.0.5
- numpy 1.18.5
- requests 2.24.0
- json 2.0.9
- panel 0.9.7
- plotly 5.3.1
- hvplot 0.7.3
- seaborn 0.10.1
- matplotlib 3.2.2
- scikit-learn 1.0.0 (NOTE scikit-learn 1.0.1 will not work!)
- imblearn 0.8.1
- xgboost 1.5.1
- jupyter lab 2.1.5
The following rules have been applied during code development and testing:
- All variables must reflect their purpose. Underscores are to be used as and when required.
- Each step of the code must contain comments explaining its purpose.
- A GitHub repository called project2 must be set up with branches for each developer.
- Each developer must use their own GitHub branch to code and unit-test developed code.
- The lead developer must review code prior to merge.
- The lead developer is responsible for merging all code.
- Each developer must pull the most recent code from the main branch before commencing code changes.
- Each commit must include a brief message describing the changes made.
The largest fraudulent transaction is 1244 and the largest fraudulent refund is -475, which suggests that the fraudsters are able to issue refund requests as well as make fraudulent purchases; pinpointing how requires further investigation. Overall, according to the box plot, fraudulent transactions exhibit a higher average amount per transaction (79.42 vs 42.21) with a greater degree of deviation (143 vs 80).
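The per-class figures behind the box plot can be reproduced with a pandas `groupby`. The sketch below uses a tiny illustrative frame with assumed column names (`amount`, `is_fraud`); the real analysis runs the same aggregation on the full 2019 extract.

```python
import pandas as pd

# Illustrative stand-in for the 2019 extract; "amount" and "is_fraud"
# are assumed column names, not necessarily the dataset's own.
df = pd.DataFrame({
    "amount":   [1244.0, -475.0, 79.42, 42.21, 10.0, 5.0],
    "is_fraud": [1, 1, 1, 0, 0, 0],
})

# Per-class summary statistics (the numbers the box plot visualises).
stats = df.groupby("is_fraud")["amount"].agg(["mean", "std", "max", "min"])
print(stats)
```

The same `agg` call also yields quartiles (via `describe()`) if the box plot whiskers need to be checked numerically.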
The data was extracted from the Kaggle platform (see References for more details) as CSV files containing about 24 million samples of fraudulent and non-fraudulent transaction details. However, due to limited capacity for processing this much data, the decision was made to extract a subset of the original dataset. To avoid selection bias, a full year of data (2019) was extracted without losing any information. Excel and Python were used as tools to manipulate the data and make it ready for analysis.
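One way to pull the 2019 subset without loading all 24 million rows into memory is to stream the CSV in chunks and keep only matching rows. This is a sketch, not the project's actual extraction code; the inline CSV and the `Year` column name are assumptions standing in for the real Kaggle file.

```python
import io
import pandas as pd

# Tiny stand-in for the 24M-row Kaggle export; the real file would be
# read with pd.read_csv("<path>", chunksize=...) instead.
csv_data = io.StringIO(
    "Year,Amount,Is Fraud?\n"
    "2018,10.00,No\n"
    "2019,42.21,No\n"
    "2019,1244.00,Yes\n"
)

# Stream in chunks and keep only 2019 rows, so the full file never
# has to fit in memory at once.
parts = [c[c["Year"] == 2019] for c in pd.read_csv(csv_data, chunksize=2)]
df_2019 = pd.concat(parts, ignore_index=True)
print(len(df_2019))  # rows kept
```

Because the filter is on year only, every 2019 record survives, so no information is lost from the chosen slice.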
The following data cleansing techniques were used to process the data before going further.
1. Drop Duplicates
2. Handling Missing Values
3. Correct the data types
4. Drop unwanted columns
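The four cleansing steps above map directly onto pandas operations. The sketch below applies them to a toy frame seeded with the kinds of issues listed (a duplicate row, a missing value, a number stored as text, an unused column); column names are illustrative only.

```python
import pandas as pd

# Toy frame exhibiting each problem the cleansing steps address.
df = pd.DataFrame({
    "amount": ["10.5", "10.5", None, "99.0"],
    "date":   ["2019-01-01", "2019-01-01", "2019-02-03", "2019-03-04"],
    "unused": [1, 1, 2, 3],
})

df = df.drop_duplicates()                  # 1. drop duplicates
df = df.dropna(subset=["amount"])          # 2. handle missing values
df["amount"] = df["amount"].astype(float)  # 3. correct the data types
df["date"] = pd.to_datetime(df["date"])    #    (dates from text too)
df = df.drop(columns=["unused"])           # 4. drop unwanted columns
print(df.dtypes)
```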
Next, the following feature engineering techniques were applied to get a better sense of the data and to choose the most relevant features from the raw data for the machine learning model.
1. Feature Transformation
2. Feature Splitting
3. Feature Encoding
4. Feature Scaling
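A minimal sketch of how splitting, encoding, and scaling might look with pandas and scikit-learn, assuming a timestamp column, a merchant category column, and an amount column (names are illustrative, not the dataset's own):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "time":     ["2019-01-01 13:05:00", "2019-06-15 02:30:00"],
    "category": ["grocery", "travel"],
    "amount":   [42.21, 1244.00],
})

# Feature splitting: break the timestamp into hour/month components.
t = pd.to_datetime(df["time"])
df["hour"], df["month"] = t.dt.hour, t.dt.month

# Feature encoding: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["category"])

# Feature scaling: standardise the numeric amount column.
df["amount_scaled"] = StandardScaler().fit_transform(df[["amount"]]).ravel()
print(df.columns.tolist())
```

In practice the scaler is fit on the training split only and reused on the test split, to avoid leaking test statistics into training.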
Before training, the data was split into 60% for training and 40% for testing. The training data was then fed into the machine learning algorithms, the predictions were validated against metrics, and the models were improved further by tuning hyper-parameters. This is an iterative process that continues until the model trains well enough, reducing the cost while increasing accuracy. Since this is a fraud detection use case in the financial domain, the focus is on reducing False Negatives rather than relying purely on accuracy.
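The 60/40 split and the False-Negative-focused evaluation can be sketched as follows. Synthetic imbalanced data stands in for the transaction features here; the split parameters match the text, everything else is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the transaction features
# (roughly 5% positives, mimicking rare fraud).
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=1)

# 60/40 split as described above; stratify keeps the fraud ratio
# the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Recall on the fraud class directly tracks False Negatives, the
# metric that matters most here, rather than raw accuracy.
print(recall_score(y_test, model.predict(X_test)))
```

Hyper-parameter tuning then repeats this loop (e.g. via `GridSearchCV` scored on recall) until the False Negative rate is acceptable.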
1. Logistic Regression
Confusion Matrix
Classification Report
2. Easy Ensemble Classifier
Confusion Matrix
Classification Report
3. XGBoost Classifier
Confusion Matrix
Classification Report
4. Random Forest Classifier
Confusion Matrix
Classification Report
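Each of the four models above is evaluated with the same two artifacts, a confusion matrix and a classification report. A minimal sketch with hand-made labels (in the actual notebooks, `y_pred` comes from each fitted model):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative true labels and predictions; 1 = fraud.
y_test = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

# Rows are true classes, columns are predictions; the bottom-left
# cell counts False Negatives (fraud predicted as legitimate).
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```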
Google Colab has been chosen as the desired cloud-based platform for model deployment for the following reasons:
1. Cost effectiveness
2. Reliability
3. No infrastructure overhead
4. Usability and accessibility
Hosted files can be found at the links below (open with Google Colaboratory):
https://drive.google.com/file/d/1GRwbiNPk_BRxBJpy5GpH7GHHIuS-t5-5/view?usp=sharing
https://drive.google.com/file/d/1CRn9pSCsjJ5W0YZcEpq6Sslk-JdbGVig/view?usp=sharing
https://drive.google.com/file/d/1X_-51IZRfOtzNyUspLVn-6dY8ggUzCFh/view?usp=sharing
https://www.kaggle.com/ealtman2019/credit-card-transactions
https://www.fool.com/the-ascent/research/identity-theft-credit-card-fraud-statistics/
https://www.projectpro.io/article/8-feature-engineering-techniques-for-machine-learning/423