hafizuddin-a/ensemble-model

Unsupervised Ensemble Model for Insider Threat Detection

This project implements an unsupervised ensemble model that combines several anomaly detection algorithms (Isolation Forest, Local Outlier Factor, One-Class SVM, Robust Covariance, and DBSCAN) to detect insider threats. On the evaluation reported under Model Performance, it reaches 85.71% precision and 85.71% recall, and no labels are used during the training phase.

Model Architecture

Our unsupervised ensemble approach combines:

  1. Multiple Unsupervised Algorithms:

    • Isolation Forest (25% weight)
    • Local Outlier Factor (45% weight)
    • One-Class SVM (25% weight)
    • Robust Covariance (3% weight)
    • DBSCAN (2% weight)
  2. Domain-Specific Feature Engineering:

    • Email activity indicators
    • File access patterns
    • After-hours activity
    • Combined risk indicators (e.g., after-hours file access)
  3. Algorithm-Specific Feature Weighting:

    • Each algorithm uses customized feature weights
    • Specific features get higher weights based on the algorithm's strengths
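The weighted combination listed above can be sketched with scikit-learn as follows. This is a minimal illustration, not the repository's implementation: the hyperparameters (`n_neighbors`, `nu`, `eps`, `min_samples`), the min-max normalisation, the use of DBSCAN noise labels as a binary score, and the helper names `min_max` and `ensemble_scores` are all assumptions. Algorithm-specific feature weighting is left out.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Ensemble weights as listed in this README (they sum to 1.0).
WEIGHTS = {
    "isolation_forest": 0.25,
    "local_outlier_factor": 0.45,
    "one_class_svm": 0.25,
    "robust_covariance": 0.03,
    "dbscan": 0.02,
}


def min_max(raw):
    """Rescale raw scores to [0, 1]; a constant vector maps to all zeros."""
    raw = np.asarray(raw, dtype=float).ravel()
    span = raw.max() - raw.min()
    return (raw - raw.min()) / span if span > 0 else np.zeros_like(raw)


def ensemble_scores(X, random_state=0):
    """Return one anomaly score per row of X; higher means more anomalous."""
    X = StandardScaler().fit_transform(X)
    scores = {}

    # score_samples is higher for inliers, so negate it.
    iso = IsolationForest(random_state=random_state).fit(X)
    scores["isolation_forest"] = min_max(-iso.score_samples(X))

    # negative_outlier_factor_ is near -1 for inliers and lower for outliers.
    lof = LocalOutlierFactor(n_neighbors=20).fit(X)
    scores["local_outlier_factor"] = min_max(-lof.negative_outlier_factor_)

    # decision_function is positive for inliers, so negate it.
    svm = OneClassSVM(gamma="scale", nu=0.1).fit(X)
    scores["one_class_svm"] = min_max(-svm.decision_function(X))

    # Squared Mahalanobis distance under a robust covariance estimate.
    cov = EllipticEnvelope(random_state=random_state).fit(X)
    scores["robust_covariance"] = min_max(cov.mahalanobis(X))

    # DBSCAN only clusters; points labelled -1 (noise) count as anomalies.
    labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
    scores["dbscan"] = (labels == -1).astype(float)

    return sum(WEIGHTS[name] * s for name, s in scores.items())


# Synthetic demo: 200 ordinary rows plus 5 rows shifted far away.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(size=(200, 4)), rng.normal(loc=8.0, size=(5, 4))])
demo_scores = ensemble_scores(X_demo)
print(demo_scores[:200].mean(), demo_scores[200:].mean())
```

Each component score is rescaled to [0, 1] before weighting so that no single algorithm dominates only because of its raw score range.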

How It Works

  1. Feature Extraction: Domain-specific insider threat features are created from raw activity data
  2. Unsupervised Training: Algorithms learn normal patterns without using labels
  3. Score Aggregation: Weighted combination of anomaly scores from each algorithm
  4. Threshold Calibration: The decision threshold is tuned on the precision-recall trade-off, which requires labelled data for this step
  5. Post-Processing: Specialized adjustments to meet both precision and recall targets
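Step 4 can be illustrated with a small threshold search. The sketch below picks the cut-off that maximises F1, which is only one possible reading of "precision-recall trade-offs"; the criterion actually used by the repository, and its post-processing step, are not shown here. The function name `calibrate_threshold` is hypothetical. Note that this step compares scores against ground-truth labels, unlike training.

```python
def calibrate_threshold(scores, labels):
    """Return (threshold, precision, recall, f1) for the best-F1 cut-off.

    A sample is flagged as anomalous when its score is >= threshold.
    """
    best = (None, 0.0, 0.0, -1.0)
    for threshold in sorted(set(scores)):
        pairs = list(zip(scores, labels))
        tp = sum(1 for s, y in pairs if s >= threshold and y)
        fp = sum(1 for s, y in pairs if s >= threshold and not y)
        fn = sum(1 for s, y in pairs if s < threshold and y)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        # Keep the first threshold that reaches the highest F1.
        if f1 > best[3]:
            best = (threshold, precision, recall, f1)
    return best


# Toy example: the two labelled insiders have the two highest scores.
toy_scores = [0.1, 0.2, 0.3, 0.8, 0.9, 0.4]
toy_labels = [0, 0, 0, 1, 1, 0]
print(calibrate_threshold(toy_scores, toy_labels))  # (0.8, 1.0, 1.0, 1.0)
```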

Dataset

This model is designed to work with release r4.2 of the CERT Insider Threat dataset, which includes:

  • logon.csv
  • http.csv
  • file.csv
  • email.csv
  • device.csv
  • psychometric.csv

Requirements

Install the required dependencies:

pip install -r requirements.txt

Usage

Step 1: Prepare Combined Features

Run the data processing script to generate features from the raw data:

# Process a sample of the data (faster, less memory required)
python data_processing.py

# Process the entire dataset (more comprehensive, requires more memory and time)
python data_processing.py --full-dataset

# Additional options
python data_processing.py --output custom_output_name.csv --chunk-size 100000

The script includes memory-efficient processing for large files through:

  • Chunked reading of large CSV files
  • Batch processing of users
  • Memory management for large datasets
  • Progress reporting
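Chunked reading can be sketched with pandas as shown below. This is an illustration rather than the code in data_processing.py: the function name `after_hours_counts`, the column names `date` and `user`, the `MM/DD/YYYY HH:MM:SS` timestamp format, and the 08:00 to 18:00 working window are assumptions.

```python
import io
from collections import Counter

import pandas as pd


def after_hours_counts(csv_source, chunk_size=100_000, start_hour=8, end_hour=18):
    """Count after-hours events per user, reading the CSV one chunk at a time."""
    totals = Counter()
    reader = pd.read_csv(csv_source, usecols=["date", "user"], chunksize=chunk_size)
    for chunk in reader:
        hours = pd.to_datetime(chunk["date"], format="%m/%d/%Y %H:%M:%S").dt.hour
        outside = (hours < start_hour) | (hours >= end_hour)
        # Only the per-user counts are kept, so memory stays bounded by chunk size.
        totals.update(chunk.loc[outside, "user"].value_counts().to_dict())
    return pd.Series(totals, dtype="int64").sort_index()


# Tiny in-memory stand-in for a large activity file (made-up rows).
sample_csv = """id,date,user,pc,activity
1,01/04/2010 07:12:00,ABC0001,PC-1,Logon
2,01/04/2010 09:30:00,ABC0001,PC-1,Logoff
3,01/04/2010 22:05:00,XYZ0002,PC-2,Logon
4,01/05/2010 02:40:00,XYZ0002,PC-2,Logoff
5,01/05/2010 18:00:00,ABC0001,PC-1,Logon
"""
print(after_hours_counts(io.StringIO(sample_csv), chunk_size=2))
```

Accumulating small per-chunk summaries, instead of concatenating the chunks, is what keeps peak memory independent of the file size.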

Step 2: Train and Evaluate the Model

Run the unsupervised ensemble model:

python unsupervised_precision_recall.py

This will:

  1. Extract insider threat-specific features
  2. Train the unsupervised ensemble
  3. Calibrate thresholds for optimal precision and recall
  4. Evaluate and report performance metrics
  5. Save the trained model to disk

Step 3: Make Predictions on New Data

To use the trained model on new data:

python predict_insiders.py --input your_data.csv --model unsupervised_model.joblib

Key Files

  • unsupervised_precision_recall.py: Main implementation of the unsupervised ensemble
  • unsupervised_model.joblib: Trained model file
  • unsupervised_metrics.csv: Performance metrics
  • unsupervised_model_summary.md: Detailed model documentation
  • data_processing.py: Data preprocessing pipeline
  • predict_insiders.py: Prediction utility for new data
  • combined_features.csv: Processed feature dataset

Model Performance

The model achieves:

  • Precision: 85.71%
  • Recall: 85.71%
  • F1-Score: 85.71%

Confusion Matrix:

[[184   2]   (True Negatives  | False Positives)
 [  2  12]]  (False Negatives | True Positives)
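The reported figures follow directly from the four confusion-matrix counts, as this short check shows (variable names are illustrative):

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 184, 2, 2, 12

precision = tp / (tp + fp)  # flagged users who are real insiders
recall = tp / (tp + fn)     # real insiders who were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.2%}")  # Precision: 85.71%
print(f"Recall:    {recall:.2%}")     # Recall:    85.71%
print(f"F1-Score:  {f1:.2%}")         # F1-Score:  85.71%
```

Precision and recall coincide here because the number of false positives equals the number of false negatives.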

These results, measured on 200 evaluation samples of which 14 are insiders, suggest that unsupervised methods can reach high insider threat detection performance when combined with domain-specific feature engineering and careful calibration.
