Unsupervised Ensemble Model for Insider Threat Detection

This project implements an unsupervised ensemble model that combines several anomaly detection algorithms (Isolation Forest, Local Outlier Factor, One-Class SVM, Robust Covariance, and DBSCAN) to detect insider threats. The model reaches 85.71% precision and 85.71% recall on the evaluation data without requiring labels during the training phase.

Model Architecture

Our unsupervised ensemble approach combines:

Multiple Unsupervised Algorithms:
- Isolation Forest (25% weight)
- Local Outlier Factor (45% weight)
- One-Class SVM (25% weight)
- Robust Covariance (3% weight)
- DBSCAN (2% weight)

Domain-Specific Feature Engineering:
- Email activity indicators
- File access patterns
- After-hours activity
- Combined risk indicators (e.g., after-hours file access)

Algorithm-Specific Feature Weighting:
- Each algorithm uses customized feature weights
- Specific features get higher weights based on the algorithm's strengths
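
The weighting scheme above can be illustrated with a minimal sketch. It assumes each detector has already produced one anomaly score per user, with higher meaning more anomalous. The function names `min_max_normalize` and `combine_scores`, the min-max normalization step, and the example scores are hypothetical and not taken from unsupervised_precision_recall.py; only the weights come from the list above.

```python
# Weights from the list above; they sum to 1.0.
WEIGHTS = {
    "isolation_forest": 0.25,
    "local_outlier_factor": 0.45,
    "one_class_svm": 0.25,
    "robust_covariance": 0.03,
    "dbscan": 0.02,
}


def min_max_normalize(scores):
    """Rescale scores to [0, 1] so detectors are comparable (assumed step)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine_scores(scores_by_detector, weights=WEIGHTS):
    """Weighted sum of normalized per-detector anomaly scores."""
    n = len(next(iter(scores_by_detector.values())))
    combined = [0.0] * n
    for name, scores in scores_by_detector.items():
        normalized = min_max_normalize(scores)
        for i, value in enumerate(normalized):
            combined[i] += weights[name] * value
    return combined


# Hypothetical raw scores for three users (higher = more anomalous).
example = {
    "isolation_forest": [0.1, 0.2, 0.9],
    "local_outlier_factor": [1.0, 1.1, 3.0],
    "one_class_svm": [-0.5, 0.0, 2.0],
    "robust_covariance": [2.0, 2.5, 40.0],
    "dbscan": [0.0, 0.0, 1.0],
}
ensemble_scores = combine_scores(example)
```

Normalizing before weighting keeps a detector with a large raw score range (here the hypothetical robust covariance distances) from dominating the sum regardless of its assigned weight.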

How It Works

1. Feature Extraction: Domain-specific insider threat features are created from raw activity data
2. Unsupervised Training: Algorithms learn normal patterns without using labels
3. Score Aggregation: Weighted combination of anomaly scores from each algorithm
4. Threshold Calibration: Optimized threshold based on precision-recall trade-offs
5. Post-Processing: Specialized adjustments to meet both precision and recall targets
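
As a rough illustration of the threshold calibration step, the sketch below scans candidate thresholds over the ensemble scores and keeps the one with the best F1. Two assumptions are worth stating: that labels are available at calibration time (training itself stays label-free), and that F1 is the selection criterion. The function names and the example data are hypothetical, and the actual script may calibrate differently.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when flagging every score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def calibrate_threshold(scores, labels):
    """Try each observed score as a threshold; keep the best F1."""
    best_threshold, best_f1 = None, -1.0
    for candidate in sorted(set(scores)):
        p, r = precision_recall(scores, labels, candidate)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = candidate, f1
    return best_threshold, best_f1


# Hypothetical ensemble scores and calibration labels (1 = insider).
scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1]
threshold, f1 = calibrate_threshold(scores, labels)
```

A lower threshold raises recall at the cost of precision, so a criterion such as F1 (or explicit minimum targets for both) is needed to pick a single operating point.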

Dataset

This model is designed to work with the R4.2 CERT dataset, which includes:

- logon.csv
- http.csv
- file.csv
- email.csv
- device.csv
- psychometric.csv

Requirements

Install the required dependencies:

pip install -r requirements.txt

Usage

Step 1: Prepare Combined Features

Run the data processing script to generate features from the raw data:

# Process a sample of the data (faster, less memory required)
python data_processing.py

# Process the entire dataset (more comprehensive, requires more memory and time)
python data_processing.py --full-dataset

# Additional options
python data_processing.py --output custom_output_name.csv --chunk-size 100000

The script includes memory-efficient processing for large files through:

- Chunked reading of large CSV files
- Batch processing of users
- Memory management for large datasets
- Progress reporting
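
To make the chunked-reading idea concrete, here is a minimal standard-library sketch that streams a CSV in fixed-size chunks and counts after-hours events per user. It is not the code in data_processing.py: the column names `user` and `date`, the timestamp format, and the 08:00 to 18:00 working-hours window are assumptions, and `iter_chunks` and `count_after_hours` are hypothetical helper names.

```python
import csv
import io
from collections import Counter
from datetime import datetime
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def count_after_hours(handle, chunk_size=100000, start_hour=8, end_hour=18):
    """Count events per user that fall outside [start_hour, end_hour)."""
    counts = Counter()
    reader = csv.DictReader(handle)
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            # Assumed timestamp format; adjust to match the real files.
            ts = datetime.strptime(row["date"], "%m/%d/%Y %H:%M:%S")
            if not start_hour <= ts.hour < end_hour:
                counts[row["user"]] += 1
    return counts


# Tiny in-memory stand-in for a large CSV file (hypothetical rows).
sample = io.StringIO(
    "id,date,user,pc,activity\n"
    "1,01/04/2010 07:15:00,AAA0001,PC-1,Logon\n"
    "2,01/04/2010 09:30:00,AAA0001,PC-1,Logoff\n"
    "3,01/04/2010 22:05:00,BBB0002,PC-2,Logon\n"
    "4,01/05/2010 02:40:00,BBB0002,PC-2,Logoff\n"
)
after_hours = count_after_hours(sample, chunk_size=2)
```

Only one chunk of rows is held in memory at a time, while the per-user counters stay small, which is the property that matters for the larger log files.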

Step 2: Train and Evaluate the Model

Run the unsupervised ensemble model:

python unsupervised_precision_recall.py

This will:

- Extract insider threat-specific features
- Train the unsupervised ensemble
- Calibrate thresholds for optimal precision and recall
- Evaluate and report performance metrics
- Save the trained model to disk

Step 3: Make Predictions on New Data

To use the trained model on new data:

python predict_insiders.py --input your_data.csv --model unsupervised_model.joblib

Key Files

- unsupervised_precision_recall.py: Main implementation of the unsupervised ensemble
- unsupervised_model.joblib: Trained model file
- unsupervised_metrics.csv: Performance metrics
- unsupervised_model_summary.md: Detailed model documentation
- data_processing.py: Data preprocessing pipeline
- predict_insiders.py: Prediction utility for new data
- combined_features.csv: Processed feature dataset

Model Performance

The model achieves:

Precision: 85.71%
Recall: 85.71%
F1-Score: 85.71%

Confusion Matrix:

[[184   2]   (True Negatives  | False Positives)
 [  2  12]]  (False Negatives | True Positives)

On this evaluation set of 200 samples, 14 of which are positives, these results indicate that unsupervised methods can achieve high-performance insider threat detection when combined with domain-specific feature engineering and careful calibration.
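
The reported metrics follow directly from the confusion matrix above. The short check below recomputes them from the four counts. The variable names are illustrative, and the false positive rate is an extra derived figure rather than one reported by the project.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 184, 2, 2, 12

precision = tp / (tp + fp)            # 12 / 14
recall = tp / (tp + fn)               # 12 / 14
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)  # 2 / 186

print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1-Score:  {f1:.2%}")
print(f"False positive rate: {false_positive_rate:.2%}")
```

Because false positives and false negatives are both 2, precision and recall coincide, and so does their harmonic mean, the F1-score.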
