- Introduction - Machine Learning and Risk mitigation in Finance
- Data, Technology and Coding Standards
- Basic Statistical Analysis
- Machine Learning Analysis and procedures
- References
The financial sector is arguably the most predominant and influential industry in the world.
As in any industry and any society, no matter how advanced, there are always ways and opportunities to cheat the system — exploiting misinformation or poor security and operations for financial gain.
As the world evolves and technology advances, we see more and more cases of fraud and data theft within the financial sector each year, costing businesses, individuals, and the economy billions of dollars.
The purpose of this project is to develop machine learning models that demonstrate the importance and effectiveness of this technology in the industry, and make the case for implementing it across the board.
-
Credit Card Data For ML model (Kaggle)
The dataset used for the machine learning model is a credit card transactions dataset. It has no obfuscation, as its more than 24 million transactions were generated from a virtual world simulation. The data covers 2,000 synthetic consumer residents in the US who travel the world. For this analysis, only 2019 data has been extracted, because fraud data is missing in some other years.
-
Fraud Statistics
- The Ascent
- Federal Trade Commission
- Google Colab
- python 3.8.3
- pandas 1.0.5
- numpy 1.18.5
- requests 2.24.0
- json 2.0.9
- panel 0.9.7
- plotly 5.3.1
- hvplot 0.7.3
- seaborn 0.10.1
- matplotlib 3.2.2
- scikit-learn 1.0.0 (NOTE scikit-learn 1.0.1 will not work!)
- imblearn 0.8.1
- xgboost 1.5.1
- jupyter lab 2.1.5
The following rules have been applied during code development and testing:
- All variables must reflect their purpose. Underscores are to be used as and when required.
- Each step of the code must contain comments explaining its purpose.
- A GitHub repository called project2 must be set up with branches for each developer.
- Each developer must use their own GitHub branch to code and unit-test developed code.
- The lead developer must review code prior to merge.
- The lead developer is responsible for merging all code.
- Each developer must pull the most recent code from the main branch before commencing code changes.
- Each commit must include a brief message describing the changes made.
The largest fraudulent transaction is 1244 and the largest fraudulent refund is -475, which suggests that the fraudsters are able to issue refund requests as well as make fraudulent purchases; pinpointing how requires further investigation. Overall, according to the box plot, fraudulent transactions exhibit a higher average amount per transaction (79.42 vs 42.21) with a greater degree of deviation (143 vs 80).
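The per-class figures behind the box plot can be reproduced with a pandas `groupby`. The sketch below uses a tiny illustrative frame with assumed column names (`amount`, `is_fraud`); the real analysis runs the same aggregation on the full 2019 extract.

```python
import pandas as pd

# Illustrative stand-in for the 2019 extract; "amount" and "is_fraud"
# are assumed column names, not necessarily the dataset's own.
df = pd.DataFrame({
    "amount":   [1244.0, -475.0, 79.42, 42.21, 10.0, 5.0],
    "is_fraud": [1, 1, 1, 0, 0, 0],
})

# Per-class summary statistics (the numbers the box plot visualises).
stats = df.groupby("is_fraud")["amount"].agg(["mean", "std", "max", "min"])
print(stats)
```

The same `agg` call also yields quartiles (via `describe()`) if the box plot whiskers need to be checked numerically.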
The data was extracted from the Kaggle platform (see References for more details) as CSV files containing about 24 million samples of fraudulent and non-fraudulent transaction details. However, due to limited capacity for processing this much data, the decision was made to extract a subset of the original dataset. To avoid selection bias, a full year of data (2019) was extracted without losing any information. Excel and Python were used as tools to manipulate the data and make it ready for analysis.
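One way to pull the 2019 subset without loading all 24 million rows into memory is to stream the CSV in chunks and keep only matching rows. This is a sketch, not the project's actual extraction code; the inline CSV and the `Year` column name are assumptions standing in for the real Kaggle file.

```python
import io
import pandas as pd

# Tiny stand-in for the 24M-row Kaggle export; the real file would be
# read with pd.read_csv("<path>", chunksize=...) instead.
csv_data = io.StringIO(
    "Year,Amount,Is Fraud?\n"
    "2018,10.00,No\n"
    "2019,42.21,No\n"
    "2019,1244.00,Yes\n"
)

# Stream in chunks and keep only 2019 rows, so the full file never
# has to fit in memory at once.
parts = [c[c["Year"] == 2019] for c in pd.read_csv(csv_data, chunksize=2)]
df_2019 = pd.concat(parts, ignore_index=True)
print(len(df_2019))  # rows kept
```

Because the filter is on year only, every 2019 record survives, so no information is lost from the chosen slice.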
The following data cleansing techniques were used to process the data before going further.
1. Drop Duplicates
2. Handling Missing Values
3. Correct the data types
4. Drop unwanted columns
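The four cleansing steps above map directly onto pandas operations. The sketch below applies them to a toy frame seeded with the kinds of issues listed (a duplicate row, a missing value, a number stored as text, an unused column); column names are illustrative only.

```python
import pandas as pd

# Toy frame exhibiting each problem the cleansing steps address.
df = pd.DataFrame({
    "amount": ["10.5", "10.5", None, "99.0"],
    "date":   ["2019-01-01", "2019-01-01", "2019-02-03", "2019-03-04"],
    "unused": [1, 1, 2, 3],
})

df = df.drop_duplicates()                  # 1. drop duplicates
df = df.dropna(subset=["amount"])          # 2. handle missing values
df["amount"] = df["amount"].astype(float)  # 3. correct the data types
df["date"] = pd.to_datetime(df["date"])    #    (dates from text too)
df = df.drop(columns=["unused"])           # 4. drop unwanted columns
print(df.dtypes)
```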
Next, the following feature engineering techniques were applied to get a better sense of the data and to choose the most relevant features from the raw data for the machine learning model.
1. Feature Transformation
2. Feature Splitting
3. Feature Encoding
4. Feature Scaling
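A minimal sketch of how splitting, encoding, and scaling might look with pandas and scikit-learn, assuming a timestamp column, a merchant category column, and an amount column (names are illustrative, not the dataset's own):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "time":     ["2019-01-01 13:05:00", "2019-06-15 02:30:00"],
    "category": ["grocery", "travel"],
    "amount":   [42.21, 1244.00],
})

# Feature splitting: break the timestamp into hour/month components.
t = pd.to_datetime(df["time"])
df["hour"], df["month"] = t.dt.hour, t.dt.month

# Feature encoding: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["category"])

# Feature scaling: standardise the numeric amount column.
df["amount_scaled"] = StandardScaler().fit_transform(df[["amount"]]).ravel()
print(df.columns.tolist())
```

In practice the scaler is fit on the training split only and reused on the test split, to avoid leaking test statistics into training.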
Before training, the data was split into 60% for training and 40% for testing. The training data was then fed into the machine learning algorithms, the predictions were validated against metrics, and the models were improved further by tuning hyper-parameters. This is an iterative process that continues until the model trains well enough, reducing the cost while increasing accuracy. Since this is a fraud detection use case in the financial domain, the focus is on reducing False Negatives rather than relying purely on accuracy.
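The 60/40 split and the False-Negative-focused evaluation can be sketched as follows. Synthetic imbalanced data stands in for the transaction features here; the split parameters match the text, everything else is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the transaction features
# (roughly 5% positives, mimicking rare fraud).
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=1)

# 60/40 split as described above; stratify keeps the fraud ratio
# the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Recall on the fraud class directly tracks False Negatives, the
# metric that matters most here, rather than raw accuracy.
print(recall_score(y_test, model.predict(X_test)))
```

Hyper-parameter tuning then repeats this loop (e.g. via `GridSearchCV` scored on recall) until the False Negative rate is acceptable.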
1. Logistic Regression
Confusion Matrix
Classification Report
2. Easy Ensemble Classifier
Confusion Matrix
Classification Report
3. XGBoost Classifier
Confusion Matrix
Classification Report
4. Random Forest Classifier
Confusion Matrix
Classification Report
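Each of the four models above is evaluated with the same two artifacts, a confusion matrix and a classification report. A minimal sketch with hand-made labels (in the actual notebooks, `y_pred` comes from each fitted model):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative true labels and predictions; 1 = fraud.
y_test = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

# Rows are true classes, columns are predictions; the bottom-left
# cell counts False Negatives (fraud predicted as legitimate).
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```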
Google Colab has been chosen as the desired cloud-based platform for model deployment for the following reasons:
1. Cost effectiveness
2. Reliability
3. No infrastructure overhead
4. Usability and accessibility
Hosted files can be found at the links below (open with Google Colaboratory):
https://drive.google.com/file/d/1GRwbiNPk_BRxBJpy5GpH7GHHIuS-t5-5/view?usp=sharing
https://drive.google.com/file/d/1CRn9pSCsjJ5W0YZcEpq6Sslk-JdbGVig/view?usp=sharing
https://drive.google.com/file/d/1X_-51IZRfOtzNyUspLVn-6dY8ggUzCFh/view?usp=sharing
https://www.kaggle.com/ealtman2019/credit-card-transactions
https://www.fool.com/the-ascent/research/identity-theft-credit-card-fraud-statistics/
https://www.projectpro.io/article/8-feature-engineering-techniques-for-machine-learning/423