Unsupervised Deep Learning for Insider Threat Detection

This implementation provides a complete end-to-end system for detecting insider threats in security log streams using unsupervised deep learning, as described in my research paper: Unsupervised Deep Learning for Insider Threat Detection in Security Log Streams.

Overview

The system consists of two complementary neural network models:

LSTM Sequence Model: Detects anomalies in event sequences by predicting the next event
Autoencoder Profile Model: Detects anomalies in aggregated user behavior profiles

Architecture

┌─────────────────┐
│  Security Logs  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Normalization  │
└────────┬────────┘
         │
         ├─────────────────┬──────────────────┐
         ▼                 ▼                  ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Sequence   │  │   Profile    │  │   Storage    │
│  Features    │  │  Features    │  │              │
└──────┬───────┘  └──────┬───────┘  └──────────────┘
       │                 │
       ▼                 ▼
┌──────────────┐  ┌──────────────┐
│  LSTM Model  │  │ Autoencoder  │
└──────┬───────┘  └──────┬───────┘
       │                 │
       └────────┬────────┘
                ▼
         ┌──────────────┐
         │   Anomaly    │
         │   Detection  │
         └──────────────┘

Requirements

pip install torch numpy pandas scikit-learn

GPU Support (Recommended)

For faster training, use a GPU. The system will automatically detect and use CUDA if available.

Quick Start

1. Generate Synthetic Training Data

python generate_synthetic_logs.py \
    --output normalized_logs.json \
    --num-users 50 \
    --num-days 30 \
    --events-per-day 1000 \
    --anomaly-rate 0.02

This generates approximately 30,000 log events (1,000 per day for 30 days) that simulate normal user behavior, with about 2% of them anomalous.

Options:

--num-users: Number of simulated users (default: 50)
--num-days: Number of days to simulate (default: 30)
--events-per-day: Average events per day (default: 1000)
--anomaly-rate: Fraction of anomalous events (default: 0.02 = 2%)
--seed: Random seed for reproducibility (default: 42)

2. Train the Models

python train_insider_threat_detection.py

This will:

Load and normalize the log data
Create sequence and profile features
Train the LSTM model for sequence anomaly detection
Train the Autoencoder for profile anomaly detection
Compute anomaly thresholds from validation data
Save trained models to models/ directory
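Sequence creation can be sketched as a sliding window over each user's time-ordered events. This is an illustration only: the function name build_sequences, the stride parameter, and the choice to pair each window with the event that follows it are assumptions, not the repository's code. Only the window length of 50 events comes from this README.

```python
from collections import defaultdict


def build_sequences(events, seq_len=50, stride=1):
    """Group events by user, sort by time, and emit (history, next_event) pairs."""
    by_user = defaultdict(list)
    for event in events:
        by_user[event["user_id"]].append(event)

    pairs = []
    for user_events in by_user.values():
        # ISO 8601 strings in a single timezone sort chronologically as text.
        user_events.sort(key=lambda e: e["timestamp"])
        types = [e["event_type"] for e in user_events]
        # Each window of seq_len events is paired with the event that follows it.
        for start in range(0, len(types) - seq_len, stride):
            pairs.append((types[start:start + seq_len], types[start + seq_len]))
    return pairs
```

A user with fewer than seq_len + 1 events contributes no pairs in this sketch, so very inactive users would need separate handling.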

Expected Output:

Loading logs from normalized_logs.json
Loaded 30000 log events
Unique users: 50
Unique event types: 12
Creating event sequences per user...
Created 15000 sequences
Creating aggregated profile features...
Created 3600 profile feature vectors
LSTM Model: 1,234,567 parameters
Autoencoder Model: 234,567 parameters
Training LSTM sequence model...
Epoch 1/30 - Train Loss: 2.3456, Val Loss: 2.1234
...
Training complete!
Models saved to models/

Configuration: Edit the Config class in train_insider_threat_detection.py to adjust:

Model architecture (layer sizes, hidden units, embeddings)
Training parameters (batch size, learning rate, epochs)
Anomaly detection thresholds

3. Run Inference

Demo Mode (Quick Test)

python inference_insider_threat_detection.py --mode demo

Stream Mode (Real-time Processing)

python inference_insider_threat_detection.py \
    --mode stream \
    --stream-source normalized_logs.json \
    --output alerts.jsonl

This processes events from the stream and saves detected anomalies to alerts.jsonl.

Alert Format:

{
  "type": "sequence_anomaly",
  "user_id": "user_042",
  "score": 8.7654,
  "threshold": 4.5123,
  "severity": "medium",
  "timestamp": "2025-11-29T02:15:00Z",
  "event_type": "file_access"
}
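As a minimal sketch of consuming this output, assuming alerts.jsonl holds one alert object per line with the fields shown above, alerts can be tallied per user and severity. The helper name summarize_alerts and the sample values are hypothetical and not part of the repository.

```python
import json
from collections import Counter


def summarize_alerts(lines):
    """Count alerts per (user_id, severity) from an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        alert = json.loads(line)
        counts[(alert["user_id"], alert["severity"])] += 1
    return counts


# Hypothetical sample records; real input would come from alerts.jsonl.
sample = [
    '{"type": "sequence_anomaly", "user_id": "user_042", "score": 5.0, '
    '"threshold": 4.5, "severity": "medium"}',
    '{"type": "sequence_anomaly", "user_id": "user_042", "score": 10.0, '
    '"threshold": 4.5, "severity": "high"}',
]
summary = summarize_alerts(sample)
```

Because an open file is itself an iterable of lines, the same function can be called as summarize_alerts(open("alerts.jsonl")).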

Data Format

Input Log Format

Logs should be in JSONL (JSON Lines) format with the following schema:

{
  "timestamp": "2025-11-29T14:30:00Z",
  "user_id": "alice",
  "source_ip": "10.0.0.15",
  "dest_ip": "172.16.1.10",
  "event_type": "network_connection",
  "resource": "FileServer01",
  "data_volume": 5120,
  "result": "success"
}

Required Fields:

timestamp: ISO 8601 format
user_id: Username or user identifier
event_type: Type of event (logon, file_access, network_connection, etc.)
result: "success" or "failure"

Optional Fields:

source_ip, dest_ip: IP addresses
resource: Resource accessed (server, file, application)
data_volume: Bytes transferred
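The schema above can be checked before training or inference. The following is a sketch only, assuming records have already been parsed into dictionaries; validate_event is a hypothetical helper and is not taken from the repository.

```python
from datetime import datetime

REQUIRED_FIELDS = ("timestamp", "user_id", "event_type", "result")


def validate_event(event):
    """Return a list of problems found in one log record (empty if valid)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in event]
    if "timestamp" in event:
        try:
            # fromisoformat() rejects a trailing "Z" before Python 3.11,
            # so normalize it to an explicit UTC offset first.
            datetime.fromisoformat(str(event["timestamp"]).replace("Z", "+00:00"))
        except ValueError:
            problems.append("timestamp is not ISO 8601")
    if "result" in event and event["result"] not in ("success", "failure"):
        problems.append("result must be 'success' or 'failure'")
    return problems
```

Optional fields are deliberately not checked here, since records without them are still valid input.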

Supported Event Types

The system can handle any event type, but has been optimized for:

logon, logoff
file_access, file_write, file_delete
network_connection
process_execution
privilege_escalation
usb_activity
email_send
database_query
api_call

Model Details

LSTM Sequence Model

Architecture:

Embedding layers for categorical features (event type, IPs, resources)
2-layer LSTM with 128 hidden units per layer
Dropout (20%) for regularization
Prediction heads for next event classification

Input: Sequences of 50 events per user

Output: Anomaly score based on prediction error
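One plausible reading of "prediction error" is the negative log-likelihood the model assigns to the event that actually occurred next. Whether the repository scores events exactly this way is an assumption. The sketch below uses plain Python rather than PyTorch so it runs without dependencies, and the logits are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw model outputs into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_event_score(logits, actual_index):
    """Negative log-likelihood of the event that actually occurred."""
    probs = softmax(logits)
    return -math.log(probs[actual_index])


# Hypothetical logits over three event types; index 0 is strongly predicted.
logits = [4.0, 1.0, 0.5]
expected_score = next_event_score(logits, 0)    # event the model expected
surprising_score = next_event_score(logits, 2)  # event the model did not expect
```

An event the model considered unlikely receives the larger score, which is what makes this quantity usable as an anomaly score.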

Autoencoder Profile Model

Architecture:

Encoder: 100 → 128 → 64 → 32 (bottleneck)
Decoder: 32 → 64 → 128 → 100
ReLU activations, Dropout (15%)

Input: Aggregated features over 1-hour windows

Output: Reconstruction error as anomaly score
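Reconstruction error is commonly measured as the mean squared error between a profile vector and the autoencoder's output. The sketch below assumes that metric; the README does not state which one the code uses, and the three-element vectors are illustrative stand-ins for the 100-dimensional profiles.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between a profile vector and its reconstruction."""
    if len(original) != len(reconstructed):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


profile = [1.0, 0.0, 2.0]
faithful = reconstruction_error(profile, [1.0, 0.0, 2.0])  # reproduced exactly
poor = reconstruction_error(profile, [0.0, 0.0, 0.0])      # reproduced badly
```

A profile unlike those seen in training tends to be reconstructed poorly, so its error, and therefore its anomaly score, is higher.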

Profile Features:

Event counts by type
Data transfer volumes
Network behavior (unique IPs, destinations)
Success/failure rates
Temporal patterns (hour of day, weekend activity)
Event diversity (entropy)
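A few of these features can be computed as follows. This is a sketch under assumptions: the function name profile_features and the output keys are hypothetical, and the input is taken to be the list of one user's events from a single 1-hour window, in the input log format described earlier.

```python
import math
from collections import Counter


def profile_features(window_events):
    """Summarize one user's events from a single time window."""
    n = len(window_events)
    type_counts = Counter(e["event_type"] for e in window_events)
    failures = sum(1 for e in window_events if e["result"] == "failure")
    # Shannon entropy (in bits) of the event-type distribution.
    entropy = -sum((c / n) * math.log2(c / n) for c in type_counts.values()) if n else 0.0
    return {
        "event_count": n,
        "total_bytes": sum(e.get("data_volume", 0) for e in window_events),
        "unique_dest_ips": len({e["dest_ip"] for e in window_events if "dest_ip" in e}),
        "failure_rate": failures / n if n else 0.0,
        "event_type_entropy": entropy,
    }
```

Entropy is zero when a window contains a single event type and grows as activity spreads across more types, so it captures diversity independently of volume.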

Anomaly Detection

Threshold Setting

Anomaly thresholds are set at the 95th percentile of validation scores:

Scores above threshold → Flag as anomaly
Adjustable via Config.ANOMALY_PERCENTILE
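The threshold computation can be sketched with a nearest-rank percentile in plain Python. The function name percentile_threshold is hypothetical, and the repository may compute the percentile differently: numpy.percentile, for example, interpolates linearly by default and can return a slightly different value.

```python
import math


def percentile_threshold(scores, percentile=95.0):
    """Nearest-rank percentile of validation scores, used as the cut-off."""
    if not scores:
        raise ValueError("need at least one validation score")
    ordered = sorted(scores)
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1]


validation_scores = [float(i) for i in range(1, 101)]  # 1.0 .. 100.0
threshold = percentile_threshold(validation_scores)     # 95.0
flagged = [s for s in validation_scores if s > threshold]
```

With a 95th-percentile cut-off, roughly 5% of scores that resemble the validation data will be flagged, which sets a floor on the alert rate to expect.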

Severity Levels

High: Score > 2× threshold
Medium: Score > threshold
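These two rules translate directly into a small function. The name classify_severity is hypothetical, and returning None for scores at or below the threshold is an assumption about how non-anomalous events are represented.

```python
def classify_severity(score, threshold):
    """Map an anomaly score to a severity label, or None if it is not an anomaly."""
    if score > 2 * threshold:
        return "high"
    if score > threshold:
        return "medium"
    return None
```

Both comparisons are strict, so a score exactly equal to the threshold is not flagged and a score exactly twice the threshold is still "medium".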

Detection Capabilities

The system can detect:

Lateral Movement: Unusual access to new systems
Data Exfiltration: Abnormal data transfer volumes
Privilege Escalation: Unusual administrative activities
After-Hours Activity: Activity at atypical times
Living off the Land (LOTL): Unusual use of built-in tools

Performance Tuning

GPU vs CPU

Training: GPU recommended (5-10× faster)
Inference: CPU sufficient for most workloads
Automatic device selection in code

Memory Requirements

Training: ~4-8GB RAM (depends on dataset size)
Inference: ~1-2GB RAM
GPU: 4GB+ VRAM recommended

Scaling

For large deployments:

Horizontal Scaling: Partition users across multiple inference instances
Batch Processing: Process events in batches during training
Model Optimization: Quantization or distillation for faster inference
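Partitioning users across instances can be sketched with a stable hash, so that all of a user's events reach the same instance and its per-user sequence history stays in one place. The function name assign_instance and the choice of SHA-256 are assumptions for illustration. Python's built-in hash() is unsuitable here because string hashing is randomized per process by default.

```python
import hashlib


def assign_instance(user_id, num_instances):
    """Map a user to an inference instance with a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it to an instance index.
    return int.from_bytes(digest[:8], "big") % num_instances


instance = assign_instance("user_042", 4)
```

Simple modulo assignment reshuffles most users when num_instances changes; a consistent-hashing scheme would reduce that churn at the cost of more code.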

Cloud Deployment

AWS Example

Training:
- Use EC2 g4dn.xlarge (T4 GPU) for training
- Store data in S3
- Use SageMaker for managed training (optional)

Inference:
- Deploy on ECS Fargate or EC2
- Subscribe to Kinesis Data Stream for real-time logs
- Send alerts to SNS or SQS

Architecture:

Logs → Kinesis → Lambda (normalize) → Kinesis → ECS (inference) → SNS (alerts)
         ↓
        S3 (archive)

Monitoring and Maintenance

Model Retraining

Retrain models monthly or quarterly to adapt to:

New normal behavior patterns
Organizational changes
Evolving user roles

# Automated monthly retraining (example cron: 00:00 on the 1st of each month)
0 0 1 * * python train_insider_threat_detection.py

Alert Tuning

Monitor false positive rate and adjust:

Anomaly percentile threshold
Feature engineering
Model architecture

Performance Metrics

Track:

Alert rate (anomalies per day)
Model inference latency
Resource utilization
Detection accuracy (if labeled data available)
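When labels are available, detection accuracy can be summarized as precision and recall. The sketch below assumes two parallel lists of booleans, one saying whether each event was flagged and one saying whether it was truly anomalous. The function name precision_recall is hypothetical, and whether the synthetic generator writes such labels is not stated in this README.

```python
def precision_recall(flagged, labels):
    """Precision and recall of boolean alert flags against ground-truth labels."""
    tp = sum(1 for f, y in zip(flagged, labels) if f and y)
    fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y)
    # Fall back to 0.0 when a ratio is undefined (no alerts, or no true anomalies).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

With anomalies making up only a small fraction of events, plain accuracy is dominated by the normal class, which is why precision and recall are the more informative pair.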

Customization

Adding Custom Features

Edit create_profile_features() in train_insider_threat_detection.py:

features = {
    # ... existing features ...
    'custom_metric': compute_custom_metric(window_df),
}

Adjusting Model Architecture

Edit Config class:

class Config:
    # LSTM
    LSTM_HIDDEN_SIZE = 256  # Increase for more capacity
    LSTM_NUM_LAYERS = 3      # Add layers for deeper model
    
    # Autoencoder
    AUTOENCODER_LAYER_SIZES = [100, 150, 100, 50]  # Custom architecture

Custom Event Types

The system automatically handles new event types through embeddings. To improve detection for specific types, add them to the synthetic data generator or provide domain-specific features.

Troubleshooting

Common Issues

"No such file: normalized_logs.json"
- Run generate_synthetic_logs.py first

Out of Memory Error
- Reduce BATCH_SIZE in Config
- Use smaller sequences (SEQUENCE_LENGTH)
- Process data in chunks

Poor Detection Performance
- Increase training data
- Adjust anomaly threshold
- Add more features
- Retrain with recent data

Unknown Event Types in Inference
- Events not seen during training use default encoding
- Retrain model with updated vocabulary
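A default encoding for unseen event types is typically implemented by reserving one vocabulary index for unknowns. The sketch below illustrates the idea; the names UNKNOWN_TOKEN, build_vocab, and encode are hypothetical and the repository's actual encoding may differ.

```python
UNKNOWN_TOKEN = "<UNK>"


def build_vocab(event_types):
    """Assign an integer index to each event type, reserving 0 for unknowns."""
    vocab = {UNKNOWN_TOKEN: 0}
    for event_type in sorted(set(event_types)):
        vocab[event_type] = len(vocab)
    return vocab


def encode(event_type, vocab):
    """Look up an event type, falling back to the unknown index."""
    return vocab.get(event_type, vocab[UNKNOWN_TOKEN])


vocab = build_vocab(["logon", "logoff", "logon"])
known = encode("logon", vocab)
unseen = encode("usb_activity", vocab)  # not in the training vocabulary
```

Every unseen type shares the same index, so the model cannot tell new types apart until it is retrained with an updated vocabulary.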

References

Du et al. (2017). DeepLog: Anomaly Detection and Diagnosis from System Logs
Tuor et al. (2018). Deep Learning for Unsupervised Insider Threat Detection
Veeramachaneni et al. (2016). AI²: Training a Big Data Machine to Defend

License

This implementation is provided for educational and research purposes.

Author

Travis Lelle - Security Engineer and ML/AI Researcher

Support

For issues, questions, or contributions, please refer to the research paper for theoretical details and implementation guidance.
