This project presents a comprehensive analysis of credit risk prediction using multiple machine learning approaches. The study compares the performance of Logistic Regression, Decision Tree, Random Forest, and XGBoost models for predicting problematic loans using the Kaggle Credit Card Approval Prediction dataset.
- Best Model: Random Forest achieved 0.780 AUC with the best precision-recall balance among the models tested
- Business Impact: The model can catch roughly 40% of bad loans while maintaining a ~97% approval rate
- Data Quality: Successfully cleaned and engineered features from 36,457 applications
- Feature Importance: Employment history, age, and income ratios are key predictors
| Model | AUC Score | Precision | Recall | Recommendation |
|---|---|---|---|---|
| Random Forest | 0.780 | 22.5% | 39.6% | Production Ready |
| XGBoost | 0.775 | 18.9% | 45.3% | High Recall Alternative |
| Decision Tree | 0.724 | 20.1% | 35.2% | Interpretable Option |
| Logistic Regression | 0.581 | 15.8% | 28.7% | Baseline Model |
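A comparison like the one in the table above can be reproduced with stratified cross-validated AUC. The sketch below is a minimal, hypothetical version: synthetic imbalanced data stands in for the real Kaggle features, and the hyperparameters are assumptions rather than the notebook's exact settings. XGBoost is omitted only to keep dependencies minimal; it would plug into the same loop.

```python
# Hypothetical sketch: comparing classifiers by stratified cross-validated AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data mimicking the ~1.7% bad-client rate in the real dataset
X, y = make_classification(n_samples=5000, n_features=19, weights=[0.983],
                           random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = {name: cross_val_score(m, X, y, cv=cv, scoring="roc_auc").mean()
          for name, m in models.items()}
for name, auc in scores.items():
    print(f"{name}: mean AUC = {auc:.3f}")
```

The exact scores will differ from the table, since the toy data is not the real dataset; the point is the evaluation protocol (5-fold stratified CV scored by AUC).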
- ✅ Outlier Detection & Handling: IQR-based approach with 83% data retention
- ✅ Feature Engineering: 13 new predictive features, including ratio-based metrics
- ✅ Missing Value Treatment: Median imputation for numerical features
- ✅ Categorical Encoding: Label encoding of categorical variables
- ✅ Cross-Validation: 5-fold stratified validation for robust evaluation
- ✅ Statistical Tests: Box-Tidwell linearity test and VIF multicollinearity analysis
- ✅ Performance Metrics: ROC curves, precision-recall, and business impact analysis
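The IQR-based outlier handling above can be sketched as follows. The column names and the 1.5× multiplier are assumptions for illustration, not the notebook's exact settings:

```python
# Sketch of IQR-based outlier filtering on toy data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(10, 0.5, 1000),
                   "age": rng.integers(21, 68, 1000)})

def iqr_filter(frame, cols, k=1.5):
    """Drop rows where any listed column falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    mask = pd.Series(True, index=frame.index)
    for col in cols:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]

clean = iqr_filter(df, ["income"])
retention = len(clean) / len(df)
print(f"Retention: {retention:.1%}")
```

On the real data this step retained 83% of rows (30,322 of 36,457 applications).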
```
├── logistic-tree-xgboost.ipynb   # Main analysis notebook
├── README.md                     # This file
└── data/                         # Dataset (downloaded via kagglehub)
```
Install the dependencies:

```bash
pip install pandas numpy matplotlib seaborn scikit-learn xgboost statsmodels kagglehub
```

Then:

- Clone this repository
- Open `logistic-tree-xgboost.ipynb` in Jupyter
- Run all cells to reproduce the analysis
The analysis uses the Credit Card Approval Prediction dataset from Kaggle, automatically downloaded via kagglehub.
- AUC: 0.780 - Good discriminative ability
- Business Impact: Identifies 40% of problematic loans
- Efficiency: Rejects only ~3% of applications against a 1.7% base rate of bad clients
- Feature Importance: Age, employment stability, income ratios
- Precision: 22.5% (roughly 1 in 4 clients flagged as bad is actually bad)
- Recall: 39.6% (catches about 40% of actual bad clients)
- F1-Score: ~0.29, balancing precision and recall
- Cross-Validation: Consistent performance across folds
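The F1-score follows directly from the precision and recall reported above (0.225 and 0.396 for the Random Forest model):

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.225, 0.396
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")  # ≈ 0.287
```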
- Conservative: Lower threshold → catches more bad clients at a higher rejection rate
- Balanced: Current threshold → best observed precision-recall trade-off
- Aggressive: Higher threshold → approves more applications, accepting more risk
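The threshold strategies above amount to moving the decision cut-off on predicted probabilities, trading recall against rejection rate. A minimal sketch, using synthetic data and illustrative threshold values rather than the notebook's tuned ones:

```python
# Sweeping the decision threshold on predicted probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=10, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.2, 0.5, 0.8):           # conservative, balanced, aggressive
    pred = (proba >= threshold).astype(int)
    rejection_rate = pred.mean()            # share of applications flagged
    recall = recall_score(y_te, pred, zero_division=0)
    results[threshold] = (recall, rejection_rate)
    print(f"t={threshold}: recall={recall:.2f}, rejection rate={rejection_rate:.2%}")
```

Lowering the threshold can only add to the set of flagged applicants, so recall and rejection rate both rise together; picking the operating point is a business decision, not a modeling one.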
- 40% reduction in bad loan approvals
- Maintained approval rates (~97% of applications)
- Improved portfolio quality through data-driven decisions
- Scalable risk assessment for growing application volumes
- Box-Tidwell Test: Verified logistic regression linearity assumptions
- VIF Analysis: Confirmed low multicollinearity (all VIF < 5)
- Feature Transformations: Applied where linearity was violated
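The VIF check above can be sketched by regressing each feature on the others and computing VIF = 1 / (1 − R²). This version uses scikit-learn rather than the statsmodels helper the notebook may use, with toy data as a stand-in:

```python
# VIF = 1 / (1 - R^2) from regressing each feature on the remaining ones.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X[:, 3] += 0.3 * X[:, 0]          # mild, deliberate correlation

def vif(features):
    out = []
    for j in range(features.shape[1]):
        others = np.delete(features, j, axis=1)
        model = LinearRegression().fit(others, features[:, j])
        r2 = model.score(others, features[:, j])
        out.append(1.0 / (1.0 - r2))
    return out

vifs = vif(X)
print([round(v, 2) for v in vifs])
```

A common rule of thumb treats VIF < 5 as low multicollinearity, which is the criterion the analysis reports all features passing.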
- Data Exploration: Comprehensive EDA with correlation analysis
- Feature Engineering: Created predictive ratio and categorical features
- Model Training: Trained multiple algorithms with proper validation
- Performance Evaluation: Business-focused metric interpretation
- Production Readiness: Final model selection and deployment recommendations
- Age: Older applicants tend to be lower risk
- Employment Stability: Long-term employment reduces default probability
- Income-to-Age Ratio: Financial maturity indicator
- Family Structure: Affects financial stability assessment
- Target Rate: 1.66% overall default rate in the dataset
- Geographic Patterns: Regional risk variations identified
- Demographic Trends: Age and employment status strongly predictive
- Advanced Modeling: Deep learning approaches for complex patterns
- Real-time Scoring: API development for production deployment
- Ensemble Methods: Combine multiple models for improved performance
- Explainable AI: SHAP values for individual prediction explanations
- Monitoring Dashboard: Real-time model performance tracking
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
- Banerjee, P. (2019). Logistic Regression Classifier Tutorial. Kaggle.
- Credit risk assessment methodologies and best practices
- Source: Kaggle Credit Card Approval Prediction
- Size: 30,322 records after cleaning
- Features: 20 columns (19 features + 1 target)
- Target Distribution: 98.3% good clients, 1.7% bad clients
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions about this analysis or collaboration opportunities, please open an issue in this repository.
⭐ Star this repository if you find it helpful!