
Decifra

Fraud Detection MLOps Pipeline with Explainable AI

Decifra is a production-ready machine learning operations (MLOps) pipeline designed for credit card fraud detection. The system not only identifies fraudulent transactions with high accuracy but also provides transparent, interpretable explanations for each prediction using state-of-the-art Explainable AI (XAI) techniques.

The name "Decifra" comes from "to decipher", reflecting the project's core mission: uncovering hidden fraud patterns and making AI decisions transparent and understandable.

Model Performance

| Metric    | Before Tuning | After Optuna (100 trials) |
|-----------|---------------|---------------------------|
| PR-AUC    | 84.91%        | 88.52%                    |
| Precision | 39.91%        | 87%                       |
| Recall    | 86.73%        | 86%                       |
| F1 Score  | 54.66%        | 86%                       |
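As a toy illustration (not the project's evaluation code), the metrics in the table can be computed with scikit-learn from true labels and model scores; the labels and scores below are invented for the example:

```python
# Toy sketch: computing PR-AUC, precision, recall, and F1 with scikit-learn.
from sklearn.metrics import average_precision_score, precision_score, recall_score, f1_score

y_true  = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]                      # ground-truth labels
y_score = [0.1, 0.2, 0.3, 0.9, 0.8, 0.6, 0.7, 0.2, 0.1, 0.6]  # model fraud scores
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]             # thresholded predictions

pr_auc    = average_precision_score(y_true, y_score)  # PR-AUC (average precision)
precision = precision_score(y_true, y_pred)
recall    = recall_score(y_true, y_pred)
f1        = f1_score(y_true, y_pred)
```

Note that PR-AUC is computed from the raw scores, while precision/recall/F1 depend on the chosen decision threshold (0.5 here).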

🤗 Live Demo on Hugging Face


Dashboard Preview

Dashboard Interface - real-time fraud detection with explainability

Model Performance Metrics - PR-AUC: 88.52%

Technology Stack Overview

Model Performance Comparison

Architecture

End-to-end MLOps Pipeline Architecture


Quick Start

Option 1: Docker (Recommended)

```bash
# Clone repository
git clone https://github.com/HarshTomar1234/decifra.git
cd decifra

# Run with Docker Compose
docker-compose up --build

# Access:
# - Dashboard: http://localhost:8501
# - API: http://localhost:3000
# - MLflow: http://localhost:5000
```

Option 2: Local Setup

```bash
# Clone and setup
git clone https://github.com/HarshTomar1234/decifra.git
cd decifra
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # Linux/macOS
pip install -e ".[dev]"

# Run training pipeline
python -m pipelines.training_pipeline

# Start BentoML API
python -m src.serving.save_model
bentoml serve src.serving.service:FraudDetectorService

# Launch Dashboard
streamlit run dashboard/app.py
```

Overview

Financial fraud detection is a critical challenge in the banking and fintech industry. Traditional ML models often operate as "black boxes," providing predictions without explanations. This lack of transparency creates challenges for:

  • Regulatory Compliance: Regulations like GDPR require explanations for automated decisions
  • Fraud Analyst Trust: Investigators need to understand why transactions are flagged
  • Model Debugging: Data scientists need insights to improve model performance
  • Customer Experience: Reducing false positives requires understanding model behavior

Decifra addresses these challenges by combining robust fraud detection with comprehensive explainability.


Key Features

Machine Learning

  • Multi-model training with XGBoost, LightGBM, and Random Forest
  • Automated hyperparameter optimization using Optuna
  • Handling of highly imbalanced datasets using SMOTE and stratified sampling
  • Ensemble methods for improved prediction accuracy
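The stratified-sampling part of the imbalance handling can be sketched with scikit-learn; SMOTE itself comes from the imbalanced-learn package and is omitted here to keep the sketch dependency-light:

```python
# Sketch: a stratified split preserves the rare fraud class's proportion
# in both the train and test partitions.
from collections import Counter
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]
y = [1 if i < 10 else 0 for i in range(100)]   # 10% "fraud": highly imbalanced

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # keep the 10% ratio
)
```

Without `stratify=y`, a random split on a heavily imbalanced dataset can leave the test set with too few (or zero) fraud cases to evaluate recall meaningfully.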

Explainable AI (XAI)

  • SHAP (SHapley Additive exPlanations) for global and local feature importance
  • LIME (Local Interpretable Model-agnostic Explanations) for instance-level explanations
  • Visual explanation reports for each prediction
  • Feature contribution analysis for fraud decisions

MLOps Infrastructure

  • End-to-end pipeline orchestration with ZenML
  • Experiment tracking and model registry with MLflow
  • Data versioning with DVC
  • Model serving and API deployment with BentoML
  • Data validation with Great Expectations

Production Ready

  • RESTful API for real-time predictions
  • Interactive Streamlit dashboard for monitoring
  • Docker containerization for deployment
  • CI/CD pipeline with GitHub Actions
  • Comprehensive logging and monitoring

Technology Stack

| Category | Technology | Purpose |
|---|---|---|
| Orchestration | ZenML | ML pipeline orchestration and workflow management |
| Experiment Tracking | MLflow | Experiment logging, model registry, and artifact storage |
| Data Versioning | DVC | Version control for datasets and model artifacts |
| Model Serving | BentoML | Model packaging and REST API deployment |
| Explainability | SHAP, LIME | Model interpretation and explanation generation |
| ML Models | XGBoost, LightGBM, scikit-learn | Gradient boosting and ensemble methods |
| Hyperparameter Tuning | Optuna | Bayesian optimization for hyperparameters |
| Data Validation | Great Expectations | Data quality checks and schema validation |
| Dashboard | Streamlit | Interactive web interface for monitoring |
| API Framework | FastAPI | High-performance REST API endpoints |
| Containerization | Docker | Application containerization and deployment |

Project Architecture

```
                                    +-------------------+
                                    |   Streamlit       |
                                    |   Dashboard       |
                                    +--------+----------+
                                             |
+------------------+    +-------------------+|+-------------------+
|                  |    |                   |||                   |
|  Raw Data        +--->+  ZenML Pipeline   +-->  MLflow          |
|  (DVC Tracked)   |    |                   |||  Tracking         |
|                  |    +-------------------+|+-------------------+
+------------------+             |           |
                                 |           |
                    +------------+           +------------+
                    |                                     |
           +--------v--------+                   +--------v--------+
           |                 |                   |                 |
           |  Trained Model  |                   |  BentoML API    |
           |  (MLflow)       +------------------>+  Service        |
           |                 |                   |                 |
           +-----------------+                   +--------+--------+
                                                          |
                                                 +--------v--------+
                                                 |                 |
                                                 |  SHAP / LIME    |
                                                 |  Explanations   |
                                                 |                 |
                                                 +-----------------+
```

Project Structure

```
decifra/
├── src/                          # Source code
│   ├── config.py                 # Config loader (reads from configs/)
│   ├── data/                     # Data ingestion and preprocessing
│   │   ├── __init__.py
│   │   ├── ingestion.py          # Data loading from sources
│   │   ├── preprocessing.py      # Feature scaling, encoding
│   │   └── validation.py         # Data quality checks
│   ├── features/                 # Feature engineering
│   │   ├── __init__.py
│   │   └── engineering.py        # Feature transformations
│   ├── models/                   # Model definitions
│   │   ├── __init__.py
│   │   └── tuner.py              # Optuna hyperparameter tuning
│   ├── explainability/           # XAI implementations
│   │   ├── __init__.py
│   │   ├── shap_explainer.py     # SHAP explanations
│   │   └── lime_explainer.py     # LIME explanations
│   └── serving/                  # Model serving
│       ├── __init__.py
│       ├── service.py            # BentoML service
│       └── save_model.py         # Model export to BentoML
├── pipelines/                    # ZenML pipelines
│   ├── __init__.py
│   ├── training_pipeline.py      # Training workflow
│   └── steps/                    # Pipeline steps
│       ├── __init__.py
│       ├── data_loader.py
│       ├── preprocessor.py
│       ├── trainer.py
│       ├── evaluator.py
│       ├── explainer.py
│       └── tuner.py              # Hyperparameter tuning step
├── dashboard/                    # Streamlit application
│   └── app.py                    # Main dashboard
├── configs/                      # Configuration files
│   └── config.yaml               # Hyperparameters & settings (SINGLE SOURCE OF TRUTH)
├── data/                         # Data directory
│   ├── raw/                      # Raw datasets
│   └── processed/                # Processed datasets
├── artifacts/                    # Generated artifacts
│   ├── models/                   # Trained models (including tuned)
│   └── explanations/             # SHAP & LIME plots
├── notes/                        # Learning notes
│   ├── 01_zenml_pipelines.md
│   ├── 02_mlflow_tracking.md
│   ├── 03_bentoml_serving.md
│   ├── 04_streamlit_dashboard.md
│   └── 05_optuna_tuning.md
├── notebooks/                    # Jupyter notebooks
├── tests/                        # Unit and integration tests
├── docker/                       # Docker configurations
├── .dvc/                         # DVC configuration
├── .zen/                         # ZenML configuration
├── pyproject.toml                # Project dependencies
├── bentofile.yaml                # BentoML build configuration
├── .env.example                  # Environment template
└── README.md                     # This file
```

Installation

Prerequisites

  • Python 3.9 or higher
  • Git
  • pip or conda

Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/HarshTomar1234/decifra.git
   cd decifra
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv .venv

   # Windows
   .venv\Scripts\activate

   # Linux/macOS
   source .venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -e ".[dev]"
   ```

4. Initialize ZenML:

   ```bash
   zenml init
   ```

5. Configure environment variables:

   ```bash
   cp .env.example .env
   # Edit .env with your configuration
   ```

Configuration

The main configuration file is located at configs/config.yaml. Key settings include:

Data Settings

```yaml
data:
  raw_path: "data/raw"
  processed_path: "data/processed"
  test_size: 0.2
  random_state: 42
```

Model Settings

```yaml
models:
  xgboost:
    n_estimators: 100
    max_depth: 6
    learning_rate: 0.1
```

Explainability Settings

```yaml
explainability:
  shap:
    max_display: 20
  lime:
    num_features: 10
```
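A config loader along these lines could back `src/config.py`; this is a hedged sketch assuming PyYAML, shown with an inline string rather than the actual `configs/config.yaml` file:

```python
# Minimal sketch of loading YAML settings into nested dicts with PyYAML.
import yaml

raw = """
data:
  test_size: 0.2
  random_state: 42
models:
  xgboost:
    max_depth: 6
"""
config = yaml.safe_load(raw)            # nested dict mirroring the YAML tree
test_size = config["data"]["test_size"]
max_depth = config["models"]["xgboost"]["max_depth"]
```

In the real loader the string would be replaced by `open("configs/config.yaml")`, keeping that file the single source of truth for hyperparameters.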

Usage

Training Pipeline

Run the complete training pipeline:

```bash
python -m pipelines.training_pipeline
```

MLflow UI

View experiment tracking results:

```bash
mlflow ui --port 5000
```

Access at: http://localhost:5000

Start API Server

First, save the model to BentoML:

```bash
python -m src.serving.save_model
```

Then deploy the model as a REST API:

```bash
bentoml serve src.serving.service:FraudDetectorService
```

Access API docs at: http://localhost:3000

Launch Dashboard

Start the Streamlit monitoring dashboard:

```bash
streamlit run dashboard/app.py
```

MLOps Pipeline

The ZenML training pipeline consists of the following steps:

  1. Data Ingestion: Load raw transaction data
  2. Data Validation: Verify data quality using Great Expectations
  3. Preprocessing: Handle missing values, scale features, apply SMOTE
  4. Feature Engineering: Create derived features
  5. Model Training: Train multiple models with hyperparameter tuning
  6. Model Evaluation: Calculate metrics (Precision, Recall, F1, ROC-AUC, PR-AUC)
  7. Model Selection: Select best performing model
  8. Explainability: Generate SHAP values and feature importance
  9. Model Registration: Register model in MLflow
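Stripped of the ZenML `@step`/`@pipeline` decorators and MLflow calls, the core flow of these steps can be sketched as plain Python (toy data and a single model stand in for the real ingestion and multi-model training):

```python
# Plain-Python sketch of the ingest -> preprocess -> train -> evaluate flow.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

def ingest():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    y = (X[:, 0] > 1.2).astype(int)      # rare positive class, like fraud
    return X, y

def preprocess(X, y):
    # Stratified split keeps the fraud ratio intact in both partitions
    return train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

def train(X_tr, y_tr):
    return RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

def evaluate(model, X_te, y_te):
    scores = model.predict_proba(X_te)[:, 1]
    return average_precision_score(y_te, scores)  # PR-AUC

X, y = ingest()
X_tr, X_te, y_tr, y_te = preprocess(X, y)
model = train(X_tr, y_tr)
pr_auc = evaluate(model, X_te, y_te)
```

In the actual pipeline each function becomes a ZenML step, so outputs are cached and versioned between runs and the evaluation metrics land in MLflow.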

Explainability

SHAP Explanations

SHAP provides both global and local explanations:

  • Global: Overall feature importance across all predictions
  • Local: Per-prediction feature contributions

LIME Explanations

LIME generates human-readable explanations by approximating model behavior locally around each prediction.

Example Output

For a flagged transaction, the system provides:

  • Fraud probability score
  • Top contributing features (positive and negative)
  • Visual waterfall chart of feature contributions
  • Natural language explanation

API Reference

Endpoints

| Method | Endpoint | Description |
|---|---|---|
| POST | `/predict` | Get fraud prediction |
| POST | `/predict_with_explanation` | Get prediction with SHAP/LIME explanation |
| GET | `/health` | Health check |
| GET | `/model_info` | Model metadata |

Request Example

```bash
curl -X POST http://localhost:3000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [...]}'
```

Response Example

```json
{
  "prediction": 1,
  "probability": 0.87,
  "is_fraud": true,
  "explanation": {
    "top_features": [
      {"feature": "V14", "contribution": 0.23},
      {"feature": "V4", "contribution": 0.18}
    ]
  }
}
```
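The same call can be made from Python with only the standard library; this is a hedged client sketch (the `predict` helper name is ours, not part of the project), shown parsing a canned response of the shape above:

```python
# Stdlib-only client sketch for the /predict endpoint.
import json
import urllib.request

def predict(features, url="http://localhost:3000/predict"):
    """POST a feature vector and return the decoded JSON response."""
    body = json.dumps({"features": features}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Parsing a canned response of the documented shape (no server needed):
sample = '{"prediction": 1, "probability": 0.87, "is_fraud": true}'
result = json.loads(sample)
```

With the BentoML service running locally, `predict([...])` would return a dict like `result`.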

Dashboard

The Streamlit dashboard provides:

  • Overview: Model performance metrics and confusion matrix
  • Predictions: Real-time fraud scoring interface
  • Explanations: Interactive SHAP and LIME visualizations
  • Monitoring: Data drift detection and prediction distribution
  • Analytics: Historical model performance trends

Testing

Run the test suite:

```bash
pytest tests/ -v
```

Run with coverage:

```bash
pytest tests/ --cov=src --cov-report=html
```

Contributing

  1. Fork the repository
  2. Create a feature branch (`git checkout -b feature/new-feature`)
  3. Commit your changes (`git commit -m 'Add new feature'`)
  4. Push to the branch (`git push origin feature/new-feature`)
  5. Open a Pull Request

License

This project is licensed under the MIT License. See the LICENSE file for details.


Acknowledgments

  • Credit Card Fraud Detection Dataset from Kaggle
  • ZenML, MLflow, DVC, and BentoML teams for excellent MLOps tools
  • SHAP and LIME authors for XAI libraries
