This project implements an unsupervised ensemble model that combines multiple anomaly detection algorithms (Isolation Forest, Local Outlier Factor, One-Class SVM, Robust Covariance, and DBSCAN) to detect insider threats. The model achieves over 85% precision and 85% recall without requiring labels during the training phase.
Our unsupervised ensemble approach combines:
- Multiple Unsupervised Algorithms:
  - Isolation Forest (25% weight)
  - Local Outlier Factor (45% weight)
  - One-Class SVM (25% weight)
  - Robust Covariance (3% weight)
  - DBSCAN (2% weight)
- Domain-Specific Feature Engineering:
  - Email activity indicators
  - File access patterns
  - After-hours activity
  - Combined risk indicators (e.g., after-hours file access)
- Algorithm-Specific Feature Weighting:
  - Each algorithm uses customized feature weights
  - Specific features get higher weights based on the algorithm's strengths
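
As a rough illustration of how the weighted ensemble above could be assembled, the sketch below fits the five detectors with scikit-learn and blends their anomaly scores using the listed weights. It is not the code in `unsupervised_precision_recall.py`: the function names, the rank-based normalization, the default hyperparameters, and the treatment of DBSCAN noise points as a 0/1 score are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Ensemble weights taken from the list above.
WEIGHTS = {"iforest": 0.25, "lof": 0.45, "ocsvm": 0.25,
           "robust_cov": 0.03, "dbscan": 0.02}


def rank_normalize(scores):
    """Map raw scores to [0, 1] by rank so detectors are comparable (assumed scheme)."""
    ranks = scores.argsort().argsort()
    return ranks / (len(scores) - 1)


def ensemble_scores(X, random_state=0):
    """Return one blended anomaly score per row; higher means more anomalous."""
    X = StandardScaler().fit_transform(X)

    iforest = IsolationForest(random_state=random_state).fit(X)
    lof = LocalOutlierFactor().fit(X)
    ocsvm = OneClassSVM(gamma="scale").fit(X)
    robust = EllipticEnvelope(random_state=random_state).fit(X)

    # score_samples / negative_outlier_factor_ are "higher = more normal",
    # so negate them to obtain anomaly scores.
    raw = {
        "iforest": -iforest.score_samples(X),
        "lof": -lof.negative_outlier_factor_,
        "ocsvm": -ocsvm.score_samples(X),
        "robust_cov": -robust.score_samples(X),
    }
    blended = sum(WEIGHTS[name] * rank_normalize(s) for name, s in raw.items())

    # DBSCAN has no continuous score; noise points (label -1) count as 1.0 here.
    noise = (DBSCAN().fit_predict(X) == -1).astype(float)
    return blended + WEIGHTS["dbscan"] * noise
```

Rank normalization is used here only because the detectors produce scores on very different scales; min-max or z-score scaling would be equally plausible choices.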
The pipeline runs in five stages:
- Feature Extraction: Domain-specific insider threat features are created from raw activity data
- Unsupervised Training: Algorithms learn normal patterns without using labels
- Score Aggregation: Anomaly scores from each algorithm are combined using the ensemble weights
- Threshold Calibration: The decision threshold is optimized based on precision-recall trade-offs
- Post-Processing: Specialized adjustments are applied to meet both precision and recall targets
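
The threshold calibration step can be sketched as follows, assuming labels are available for a held-out evaluation set (training itself stays unlabeled). The function name `calibrate_threshold` and the rule of picking the best-F1 threshold among those meeting both targets are assumptions for illustration, not the project's documented procedure.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def calibrate_threshold(y_true, scores, min_precision=0.85, min_recall=0.85):
    """Pick a score threshold; rows with score >= threshold are flagged."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The final precision/recall pair has no matching threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)

    # Prefer thresholds that satisfy both targets; otherwise fall back to best F1.
    meets = (precision >= min_precision) & (recall >= min_recall)
    candidates = np.where(meets)[0] if meets.any() else np.arange(len(thresholds))
    best = candidates[np.argmax(f1[candidates])]
    return float(thresholds[best]), float(precision[best]), float(recall[best])
```

If no threshold meets both targets, this sketch silently returns the best-F1 threshold, so callers should check the returned precision and recall.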
This model is designed to work with the CERT insider threat dataset (release r4.2), which includes:
- logon.csv
- http.csv
- file.csv
- email.csv
- device.csv
- psychometric.csv
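
To make the "after-hours activity" feature concrete, here is a minimal sketch that derives a per-user after-hours logon ratio from logon records. The column names (`date`, `user`, `activity`), the `Logon` activity value, and the 08:00 to 18:00 weekday definition of business hours are assumptions; check them against your copy of `logon.csv` and the actual logic in `data_processing.py`.

```python
import pandas as pd


def after_hours_ratio(logon, start_hour=8, end_hour=18):
    """Share of each user's logon events that fall outside business hours."""
    # Keep logon events only (assumed value of the `activity` column).
    logon = logon[logon["activity"] == "Logon"]
    ts = pd.to_datetime(logon["date"])
    after = (
        (ts.dt.hour < start_hour)
        | (ts.dt.hour >= end_hour)
        | (ts.dt.dayofweek >= 5)  # Saturday and Sunday
    )
    return after.groupby(logon["user"]).mean().rename("after_hours_ratio")
```

The result is a Series indexed by user, which can be joined onto the other per-user features.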
Install the required dependencies:

    pip install -r requirements.txt

Run the data processing script to generate features from the raw data:

    # Process a sample of the data (faster, less memory required)
    python data_processing.py

    # Process the entire dataset (more comprehensive, requires more memory and time)
    python data_processing.py --full-dataset

    # Additional options
    python data_processing.py --output custom_output_name.csv --chunk-size 100000

The script includes memory-efficient processing for large files through:
- Chunked reading of large CSV files
- Batch processing of users
- Memory management for large datasets
- Progress reporting
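
Chunked reading is what keeps memory use bounded on the larger files. The sketch below shows the general pattern with pandas: only one chunk is held in memory at a time, and a small running aggregate is updated per chunk. The function name `count_events_per_user` and the `user` column are assumptions for illustration; the real script may aggregate differently.

```python
from collections import Counter

import pandas as pd


def count_events_per_user(csv_source, chunk_size=100_000):
    """Count rows per user without loading the whole file into memory."""
    counts = Counter()
    # With chunksize set, read_csv yields DataFrames of at most chunk_size rows.
    for chunk in pd.read_csv(csv_source, usecols=["user"], chunksize=chunk_size):
        counts.update(chunk["user"].value_counts().to_dict())
    return counts
```

`csv_source` can be a file path such as `http.csv` or any file-like object, and `chunk_size` plays the same role as the `--chunk-size` option.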
Run the unsupervised ensemble model:

    python unsupervised_precision_recall.py

This will:
- Extract insider threat-specific features
- Train the unsupervised ensemble
- Calibrate thresholds for optimal precision and recall
- Evaluate and report performance metrics
- Save the trained model to disk
To use the trained model on new data:

    python predict_insiders.py --input your_data.csv --model unsupervised_model.joblib

The project consists of the following files:
- unsupervised_precision_recall.py: Main implementation of the unsupervised ensemble
- unsupervised_model.joblib: Trained model file
- unsupervised_metrics.csv: Performance metrics
- unsupervised_model_summary.md: Detailed model documentation
- data_processing.py: Data preprocessing pipeline
- predict_insiders.py: Prediction utility for new data
- combined_features.csv: Processed feature dataset
The model achieves:
- Precision: 85.71%
- Recall: 85.71%
- F1-Score: 85.71%
Confusion Matrix:

    [[184   2]    (True Negatives  | False Positives)
     [  2  12]]   (False Negatives | True Positives)
This demonstrates that unsupervised methods can achieve high-performance insider threat detection when combined with domain-specific feature engineering and careful calibration.
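
The reported precision, recall, and F1-score follow directly from the confusion matrix above. The short check below recomputes them from the four cell counts; the helper name `metrics_from_confusion` is made up for this example.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp)  # flagged users who are true insiders
    recall = tp / (tp + fn)     # true insiders who were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = metrics_from_confusion(tn=184, fp=2, fn=2, tp=12)
print(f"precision={p:.2%} recall={r:.2%} f1={f1:.2%}")
# precision=85.71% recall=85.71% f1=85.71%
```

All three values equal 12/14 because the matrix has the same number of false positives and false negatives.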