Loan-Classification

Note: the dataset used in this project is not included in the repository.

Project overview

  • Predict whether a customer will default on a loan
  • Important features
    • Debt-to-income ratio: the more debt and the less income a person has, the lower the chance they will be able to pay back a loan
    • Delinquent credit lines: payments that are well past due show that a person is not keeping up with their existing credit, and so they may have a hard time paying back a loan

Models Created

Decision Tree Model

  • The training data performance for the first decision tree model (image)

  • The test data performance for the first decision tree model (image)

Decision Tree Model Tuned

  • The training data performance for the tuned decision tree model (image)

  • The test data performance for the tuned decision tree model (image)

Random Forest Model

  • The training data performance for the random forest model (image)

  • The test data performance for the random forest model (image)

Random Forest Model Tuned

  • The training data performance for the final tuned random forest model (image)

  • The test data performance for the final tuned random forest model (image)

  • As you can see, the untuned models have perfect performance on the training data, meaning they are overfit, which leads to worse performance on the test data

  • To fix this I tuned the models, which brought training and test performance closer together so that they generalize better to unseen data

  • The models used were a decision tree and a random forest

  • The decision tree was faster to train, but the random forest was better at classifying (a rough, hedged timing sketch follows below)
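
A rough sketch of that speed-versus-accuracy comparison. This is not part of the original notebook: it assumes the x_train/x_test/y_train/y_test split created in the notebook further down, and the exact timings will vary by machine.

# Hedged sketch, not from the original notebook: compare fit time and test
# accuracy of an untuned decision tree vs. an untuned random forest.
import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

for name, model in [("decision tree", DecisionTreeClassifier(random_state=1)),
                    ("random forest", RandomForestClassifier(random_state=1))]:
    start = time.perf_counter()
    model.fit(x_train, y_train)
    elapsed = time.perf_counter() - start
    print(f"{name}: fit in {elapsed:.2f}s, test accuracy {model.score(x_test, y_test):.3f}")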

Data Insights

  • The graph below shows the imbalance between loans that were paid back and loans that were defaulted on.

  • We can see that there is a severe imbalance.

  • In future tuning we may want to correct for this (a hedged sketch of one option appears at the end of this section).

  • (image)

  • The next graph is a density plot of the loan amounts.

  • We can see that most of the loans lie around the $20,000 mark.

  • This matters because the typical loan amount indicates how much money is at stake when a prediction is wrong.

  • (image)

  • This graph shows the number of loans taken for debt consolidation versus the number taken for home improvement.

  • It tells us that many more people take out loans for debt consolidation than for home improvement.

  • (image)
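
One way the imbalance noted above could be handled is to let scikit-learn derive class weights from the observed class frequencies. This is a minimal, hedged sketch and not part of the original notebook, which instead passes explicit weights ({0: 0.20, 1: 0.80}) to its classifiers; it assumes the x_train/y_train split created in the notebook below.

# Hedged sketch: class_weight='balanced' weights each class inversely to its
# frequency, an alternative to the hand-set {0: 0.20, 1: 0.80} used later on.
from sklearn.ensemble import RandomForestClassifier

balanced_forest = RandomForestClassifier(class_weight='balanced', random_state=1)
balanced_forest.fit(x_train, y_train)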

Markdown Version of Jupyter Notebook

Loan Classification Project

# Libraries we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve,recall_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV
df = pd.read_csv("Dataset.csv")
df.head()
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 1 1100 25860.0 39025.0 HomeImp Other 10.5 0.0 0.0 94.366667 1.0 9.0 NaN
1 1 1300 70053.0 68400.0 HomeImp Other 7.0 0.0 2.0 121.833333 0.0 14.0 NaN
2 1 1500 13500.0 16700.0 HomeImp Other 4.0 0.0 0.0 149.466667 1.0 10.0 NaN
3 1 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0 1700 97800.0 112000.0 HomeImp Office 3.0 0.0 0.0 93.333333 0.0 14.0 NaN
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
df.nunique()
BAD           2
LOAN        540
MORTDUE    5053
VALUE      5381
REASON        2
JOB           6
YOJ          99
DEROG        11
DELINQ       14
CLAGE      5314
NINQ         16
CLNO         62
DEBTINC    4693
dtype: int64
  • Above we can see that REASON and BAD are binary variables
  • No columns need to be dropped
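
A quick missing-value count (not part of the original notebook) helps motivate the fillna(df.mean()) imputation performed further down; as the df.info() output above shows, DEBTINC has the most missing values.

# Count missing values per column before imputation.
print(df.isnull().sum())
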
df.describe()
BAD LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
count 5960.000000 5960.000000 5442.000000 5848.000000 5445.000000 5252.000000 5380.000000 5652.000000 5450.000000 5738.000000 4693.000000
mean 0.199497 18607.969799 73760.817200 101776.048741 8.922268 0.254570 0.449442 179.766275 1.186055 21.296096 33.779915
std 0.399656 11207.480417 44457.609458 57385.775334 7.573982 0.846047 1.127266 85.810092 1.728675 10.138933 8.601746
min 0.000000 1100.000000 2063.000000 8000.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.524499
25% 0.000000 11100.000000 46276.000000 66075.500000 3.000000 0.000000 0.000000 115.116702 0.000000 15.000000 29.140031
50% 0.000000 16300.000000 65019.000000 89235.500000 7.000000 0.000000 0.000000 173.466667 1.000000 20.000000 34.818262
75% 0.000000 23300.000000 91488.000000 119824.250000 13.000000 0.000000 0.000000 231.562278 2.000000 26.000000 39.003141
max 1.000000 89900.000000 399550.000000 855909.000000 41.000000 10.000000 15.000000 1168.233561 17.000000 71.000000 203.312149
plt.hist(df['BAD'], bins=3)
plt.show()

png

df['LOAN'].plot(kind='density')
plt.show()

png

plt.pie(df['REASON'].value_counts(), labels=['DebtCon', 'HomeImp'], autopct='%.1f')
plt.show()
df['REASON'].value_counts()

png

DebtCon    3928
HomeImp    1780
Name: REASON, dtype: int64
correlation = df.corr()
sns.heatmap(correlation)
plt.show()

png

df['BAD'].value_counts(normalize=True)
0    0.800503
1    0.199497
Name: BAD, dtype: float64
df.fillna(df.mean(), inplace=True)
one_hot_encoding = pd.get_dummies(df['REASON'])
df = df.drop('REASON', axis=1)
df = df.join(one_hot_encoding)
df
BAD LOAN MORTDUE VALUE JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC DebtCon HomeImp
0 1 1100 25860.0000 39025.000000 Other 10.500000 0.00000 0.000000 94.366667 1.000000 9.000000 33.779915 0 1
1 1 1300 70053.0000 68400.000000 Other 7.000000 0.00000 2.000000 121.833333 0.000000 14.000000 33.779915 0 1
2 1 1500 13500.0000 16700.000000 Other 4.000000 0.00000 0.000000 149.466667 1.000000 10.000000 33.779915 0 1
3 1 1500 73760.8172 101776.048741 NaN 8.922268 0.25457 0.449442 179.766275 1.186055 21.296096 33.779915 0 0
4 0 1700 97800.0000 112000.000000 Office 3.000000 0.00000 0.000000 93.333333 0.000000 14.000000 33.779915 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5955 0 88900 57264.0000 90185.000000 Other 16.000000 0.00000 0.000000 221.808718 0.000000 16.000000 36.112347 1 0
5956 0 89000 54576.0000 92937.000000 Other 16.000000 0.00000 0.000000 208.692070 0.000000 15.000000 35.859971 1 0
5957 0 89200 54045.0000 92924.000000 Other 15.000000 0.00000 0.000000 212.279697 0.000000 15.000000 35.556590 1 0
5958 0 89800 50370.0000 91861.000000 Other 14.000000 0.00000 0.000000 213.892709 0.000000 16.000000 34.340882 1 0
5959 0 89900 48811.0000 88934.000000 Other 15.000000 0.00000 0.000000 219.601002 0.000000 16.000000 34.571519 1 0

5960 rows × 14 columns

one_hot_encoding2 = pd.get_dummies(df['JOB'])
df = df.drop('JOB', axis=1)
df = df.join(one_hot_encoding2)
df
BAD LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC DebtCon HomeImp Mgr Office Other ProfExe Sales Self
0 1 1100 25860.0000 39025.000000 10.500000 0.00000 0.000000 94.366667 1.000000 9.000000 33.779915 0 1 0 0 1 0 0 0
1 1 1300 70053.0000 68400.000000 7.000000 0.00000 2.000000 121.833333 0.000000 14.000000 33.779915 0 1 0 0 1 0 0 0
2 1 1500 13500.0000 16700.000000 4.000000 0.00000 0.000000 149.466667 1.000000 10.000000 33.779915 0 1 0 0 1 0 0 0
3 1 1500 73760.8172 101776.048741 8.922268 0.25457 0.449442 179.766275 1.186055 21.296096 33.779915 0 0 0 0 0 0 0 0
4 0 1700 97800.0000 112000.000000 3.000000 0.00000 0.000000 93.333333 0.000000 14.000000 33.779915 0 1 0 1 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5955 0 88900 57264.0000 90185.000000 16.000000 0.00000 0.000000 221.808718 0.000000 16.000000 36.112347 1 0 0 0 1 0 0 0
5956 0 89000 54576.0000 92937.000000 16.000000 0.00000 0.000000 208.692070 0.000000 15.000000 35.859971 1 0 0 0 1 0 0 0
5957 0 89200 54045.0000 92924.000000 15.000000 0.00000 0.000000 212.279697 0.000000 15.000000 35.556590 1 0 0 0 1 0 0 0
5958 0 89800 50370.0000 91861.000000 14.000000 0.00000 0.000000 213.892709 0.000000 16.000000 34.340882 1 0 0 0 1 0 0 0
5959 0 89900 48811.0000 88934.000000 15.000000 0.00000 0.000000 219.601002 0.000000 16.000000 34.571519 1 0 0 0 1 0 0 0

5960 rows × 19 columns

dependent = df['BAD']
independent = df.drop(['BAD'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(independent, dependent, test_size=0.3, random_state=1)
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8,5))
    sns.heatmap(cm, annot=True,  fmt='.2f', xticklabels=['Not Default', 'Default'], yticklabels=['Not Default', 'Default'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
dtree = DecisionTreeClassifier(class_weight={0:0.20, 1:0.80}, random_state=1)
dtree.fit(x_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=1)
dependent_performance_dt = dtree.predict(x_train)
metrics_score(y_train, dependent_performance_dt)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3355
           1       1.00      1.00      1.00       817

    accuracy                           1.00      4172
   macro avg       1.00      1.00      1.00      4172
weighted avg       1.00      1.00      1.00      4172

png

  • The above is perfect because we are evaluating on the training data, not the test data
  • Let's evaluate on the test data
dependent_test_performance_dt = dtree.predict(x_test)
metrics_score(y_test,dependent_test_performance_dt)
              precision    recall  f1-score   support

           0       0.90      0.92      0.91      1416
           1       0.68      0.61      0.64       372

    accuracy                           0.86      1788
   macro avg       0.79      0.77      0.78      1788
weighted avg       0.85      0.86      0.86      1788

png

  • As we can see, we got decent performance from this model; let's see if we can do better
  • Self-note: look at feature importances next
important = dtree.feature_importances_
columns = independent.columns
important_items_df = pd.DataFrame(important, index=columns, columns=['Importance']).sort_values(by='Importance', ascending=False)
plt.figure(figsize=(13,13))
sns.barplot(x=important_items_df.Importance, y=important_items_df.index)
plt.show()

png

  • I adapted this from a previous project to find the most important features
  • We can see that the most important features are DEBTINC, CLAGE, and CLNO
tree_estimator = DecisionTreeClassifier(class_weight={0:0.20, 1:0.80}, random_state=1)

parameters = {
    'max_depth':np.arange(2,7),
    'criterion':['gini', 'entropy'],
    'min_samples_leaf':[5,10,20,25]
             }
score = metrics.make_scorer(recall_score, pos_label=1)
gridCV= GridSearchCV(tree_estimator, parameters, scoring=score,cv=10)
gridCV = gridCV.fit(x_train, y_train) 
tree_estimator = gridCV.best_estimator_
tree_estimator.fit(x_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, max_depth=6,
                       min_samples_leaf=25, random_state=1)
dependent_performance_dt = tree_estimator.predict(x_train)
metrics_score(y_train, dependent_performance_dt)
              precision    recall  f1-score   support

           0       0.95      0.87      0.91      3355
           1       0.60      0.82      0.69       817

    accuracy                           0.86      4172
   macro avg       0.77      0.84      0.80      4172
weighted avg       0.88      0.86      0.86      4172

png

  • We increased the less harmful error (good loans flagged as risky) but decreased the harmful error (actual defaults that go undetected)
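
To make this trade-off concrete, here is a hedged snippet (not in the original notebook) that reads the two error types directly off the training confusion matrix:

# Row = actual class, column = predicted class. A false negative (an actual
# default predicted as 'Not Default') is the harmful error for a lender; a
# false positive (a good loan flagged as risky) is the less harmful one.
cm = confusion_matrix(y_train, dependent_performance_dt)
tn, fp, fn, tp = cm.ravel()
print(f"harmful errors (missed defaults): {fn}")
print(f"less harmful errors (good loans flagged): {fp}")
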
dependent_test_performance_dt = tree_estimator.predict(x_test)
metrics_score(y_test, dependent_test_performance_dt)
              precision    recall  f1-score   support

           0       0.94      0.86      0.90      1416
           1       0.60      0.77      0.67       372

    accuracy                           0.84      1788
   macro avg       0.77      0.82      0.79      1788
weighted avg       0.87      0.84      0.85      1788

png

  • Although the overall performance is slightly lower, we still reduce the harmful error (missed defaults)
important = tree_estimator.feature_importances_
columns = independent.columns
importance_df = pd.DataFrame(important, index=columns, columns=['Importance']).sort_values(by='Importance', ascending=False)
plt.figure(figsize=(13,13))
sns.barplot(x=importance_df.Importance, y=importance_df.index)
plt.show()

png

features = list(independent.columns)

plt.figure(figsize=(30,20))

tree.plot_tree(dtree,max_depth=4,feature_names=features,filled=True,fontsize=12,node_ids=True,class_names=True)
plt.show()

png

  • A visualization like this is one of the advantages that decision trees offer; we can show it to the client to explain the model's reasoning
forest_estimator = RandomForestClassifier(class_weight={0:0.20, 1:0.80}, random_state=1)
forest_estimator.fit(x_train, y_train)
RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, random_state=1)
y_predict_training_forest = forest_estimator.predict(x_train)
metrics_score(y_train, y_predict_training_forest)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3355
           1       1.00      1.00      1.00       817

    accuracy                           1.00      4172
   macro avg       1.00      1.00      1.00      4172
weighted avg       1.00      1.00      1.00      4172

png

  • A perfect classification
    • This implies overfitting
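
A quick hedged check, not part of the original notebook, that quantifies the train/test gap signalling the overfitting:

# A large gap between train and test accuracy means the forest memorized the
# training data rather than learning patterns that generalize.
train_acc = forest_estimator.score(x_train, y_train)
test_acc = forest_estimator.score(x_test, y_test)
print(f"train accuracy: {train_acc:.3f}  test accuracy: {test_acc:.3f}")
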
y_predict_test_forest = forest_estimator.predict(x_test)
metrics_score(y_test, y_predict_test_forest)
              precision    recall  f1-score   support

           0       0.91      0.98      0.95      1416
           1       0.91      0.65      0.76       372

    accuracy                           0.91      1788
   macro avg       0.91      0.82      0.85      1788
weighted avg       0.91      0.91      0.91      1788

png

  • The performance is a lot better than with the original single tree
  • Let's address the overfitting
forest_estimator_tuned = RandomForestClassifier(class_weight={0:0.20,1:0.80}, random_state=1)

parameters_rf = {  
        "n_estimators": [100,250,500],
        "min_samples_leaf": np.arange(1, 4,1),
        "max_features": [0.7,0.9,'auto'],
}

score = metrics.make_scorer(recall_score, pos_label=1)

# Run the grid search
grid_obj = GridSearchCV(forest_estimator_tuned, parameters_rf, scoring=score, cv=5)
grid_obj = grid_obj.fit(x_train, y_train)

# Set the clf to the best combination of parameters
forest_estimator_tuned = grid_obj.best_estimator_
forest_estimator_tuned.fit(x_train, y_train)
RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, min_samples_leaf=3,
                       n_estimators=500, random_state=1)
y_predict_train_forest_tuned = forest_estimator_tuned.predict(x_train)
metrics_score(y_train, y_predict_train_forest_tuned)
              precision    recall  f1-score   support

           0       1.00      0.98      0.99      3355
           1       0.93      0.99      0.96       817

    accuracy                           0.98      4172
   macro avg       0.96      0.99      0.97      4172
weighted avg       0.98      0.98      0.98      4172

png

y_predict_test_forest_tuned = forest_estimator_tuned.predict(x_test)
metrics_score(y_test, y_predict_test_forest_tuned)
              precision    recall  f1-score   support

           0       0.94      0.96      0.95      1416
           1       0.83      0.75      0.79       372

    accuracy                           0.92      1788
   macro avg       0.88      0.86      0.87      1788
weighted avg       0.91      0.92      0.92      1788

png

  • We now have very good performance
  • We can submit this to the company
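
If the tuned forest were handed off as suggested above, one common way to package it is with joblib. This is a hedged sketch, not part of the original notebook, and the file name is made up for illustration.

# Persist the tuned model and reload it for later use.
import joblib

joblib.dump(forest_estimator_tuned, "loan_default_rf.joblib")   # hypothetical file name
loaded_model = joblib.load("loan_default_rf.joblib")
print(loaded_model.predict(x_test.iloc[[0]]))                   # sanity check on one row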

Conclusion

  • I built several models to get the best results.
    • The first was a decision tree. It is not as strong as a random forest, but it is transparent because we can visualize it. This first model had decent performance.
    • To improve it, I tuned the model, which reduced the harmful error (missed defaults).
    • Then, to improve further, I created a random forest model; it had excellent performance once a tuned second version reduced the overfitting.

Recommendations

  • The biggest factor affecting whether someone defaults on a loan is the debt-to-income ratio. Someone with a lot of debt and a lower income may have a harder time paying back a loan.
  • Another factor is the number of delinquent credit lines. Someone who cannot make their credit card payments will likely have a hard time paying back a loan.
  • Years at the current job is also a driver of a loan's outcome. A large number of years at a job can indicate financial stability.
  • DEROG, a history of serious delinquency or derogatory reports, is another warning sign that a borrower may not pay back a loan.
  • These are some of the warning signs (and good signs) to look for when deciding which candidates to give loans to.

I will now apply SHAP to look more into this model.

!pip install shap
import shap
shap.initjs()
explain = shap.TreeExplainer(forest_estimator_tuned)
shap_vals = explain(x_train)
type(shap_vals)
shap._explanation.Explanation
shap.plots.bar(shap_vals[:, :, 0])

png

shap.plots.heatmap(shap_vals[:, :, 0])

png

shap.summary_plot(shap_vals[:, :, 0], x_train)

png

print(forest_estimator_tuned.predict(x_test.iloc[107].to_numpy().reshape(1,-1))) # Predicts for one row: 0 means the loan is predicted to be repaid, 1 means a predicted default.
[1]
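
A hedged variant of the single-row prediction above (not in the original notebook): passing a one-row DataFrame keeps the column names, and predict_proba exposes the estimated default probability rather than just the label.

# Same row as above, but kept as a 1-row DataFrame, plus class probabilities.
row = x_test.iloc[[107]]                                 # double brackets -> DataFrame
print(forest_estimator_tuned.predict(row))               # 1 = predicted default
print(forest_estimator_tuned.predict_proba(row))         # [P(repaid), P(default)]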
