Travis-ML/unsupervised-dl-insider-threat-detection

Unsupervised Deep Learning for Insider Threat Detection

This implementation provides a complete end-to-end system for detecting insider threats in security log streams using unsupervised deep learning, as described in my research paper: Unsupervised Deep Learning for Insider Threat Detection in Security Log Streams.

Overview

The system consists of two complementary neural network models:

  1. LSTM Sequence Model: Detects anomalies in event sequences by predicting the next event
  2. Autoencoder Profile Model: Detects anomalies in aggregated user behavior profiles

Architecture

┌─────────────────┐
│  Security Logs  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Normalization  │
└────────┬────────┘
         │
         ├─────────────────┬──────────────────┐
         ▼                 ▼                  ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Sequence   │  │   Profile    │  │   Storage    │
│  Features    │  │  Features    │  │              │
└──────┬───────┘  └──────┬───────┘  └──────────────┘
       │                 │
       ▼                 ▼
┌──────────────┐  ┌──────────────┐
│  LSTM Model  │  │ Autoencoder  │
└──────┬───────┘  └──────┬───────┘
       │                 │
       └────────┬────────┘
                ▼
         ┌──────────────┐
         │   Anomaly    │
         │   Detection  │
         └──────────────┘

Requirements

pip install torch numpy pandas scikit-learn

GPU Support (Recommended)

For faster training, use a GPU. The system will automatically detect and use CUDA if available.

Quick Start

1. Generate Synthetic Training Data

python generate_synthetic_logs.py \
    --output normalized_logs.json \
    --num-users 50 \
    --num-days 30 \
    --events-per-day 1000 \
    --anomaly-rate 0.02

This generates approximately 30,000 log events (30 days × 1,000 events per day) with realistic normal behavior, of which about 2% (roughly 600 events) are anomalous.

Options:

  • --num-users: Number of simulated users (default: 50)
  • --num-days: Number of days to simulate (default: 30)
  • --events-per-day: Average events per day (default: 1000)
  • --anomaly-rate: Fraction of anomalous events (default: 0.02 = 2%)
  • --seed: Random seed for reproducibility (default: 42)

2. Train the Models

python train_insider_threat_detection.py

This will:

  • Load and normalize the log data
  • Create sequence and profile features
  • Train the LSTM model for sequence anomaly detection
  • Train the Autoencoder for profile anomaly detection
  • Compute anomaly thresholds from validation data
  • Save trained models to models/ directory

Expected Output:

Loading logs from normalized_logs.json
Loaded 30000 log events
Unique users: 50
Unique event types: 12
Creating event sequences per user...
Created 15000 sequences
Creating aggregated profile features...
Created 3600 profile feature vectors
LSTM Model: 1,234,567 parameters
Autoencoder Model: 234,567 parameters
Training LSTM sequence model...
Epoch 1/30 - Train Loss: 2.3456, Val Loss: 2.1234
...
Training complete!
Models saved to models/

Configuration: Edit the Config class in train_insider_threat_detection.py to adjust:

  • Model architecture (layer sizes, hidden units, embeddings)
  • Training parameters (batch size, learning rate, epochs)
  • Anomaly detection thresholds

3. Run Inference

Demo Mode (Quick Test)

python inference_insider_threat_detection.py --mode demo

Stream Mode (Real-time Processing)

python inference_insider_threat_detection.py \
    --mode stream \
    --stream-source normalized_logs.json \
    --output alerts.jsonl

This processes events from the stream and saves detected anomalies to alerts.jsonl.

Alert Format:

{
  "type": "sequence_anomaly",
  "user_id": "user_042",
  "score": 8.7654,
  "threshold": 4.5123,
  "severity": "medium",
  "timestamp": "2025-11-29T02:15:00Z",
  "event_type": "file_access"
}
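
A downstream consumer of alerts.jsonl could be sketched as follows. This assumes one JSON object per line with the fields shown above. The helper name load_alerts and the "profile_anomaly" type in the sample data are illustrative assumptions, not names confirmed by the repository.

```python
import json
from typing import Iterable, Iterator


def load_alerts(lines: Iterable[str], min_severity: str = "medium") -> Iterator[dict]:
    """Yield alerts at or above min_severity from JSONL lines (hypothetical helper)."""
    rank = {"medium": 1, "high": 2}  # severity names from the Severity Levels section
    floor = rank[min_severity]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        alert = json.loads(line)
        if rank.get(alert.get("severity"), 0) >= floor:
            yield alert


# Sample records; "profile_anomaly" is an assumed type name used for illustration.
sample = [
    '{"type": "sequence_anomaly", "user_id": "user_042", "score": 9.5, "threshold": 4.5, "severity": "high"}',
    '{"type": "profile_anomaly", "user_id": "user_007", "score": 5.0, "threshold": 4.5, "severity": "medium"}',
]
high_only = list(load_alerts(sample, min_severity="high"))
```

Because the function accepts any iterable of lines, an open file handle for alerts.jsonl can be passed in place of the sample list.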

Data Format

Input Log Format

Logs should be in JSONL (JSON Lines) format, with one JSON object per line. Each record follows this schema (pretty-printed here for readability):

{
  "timestamp": "2025-11-29T14:30:00Z",
  "user_id": "alice",
  "source_ip": "10.0.0.15",
  "dest_ip": "172.16.1.10",
  "event_type": "network_connection",
  "resource": "FileServer01",
  "data_volume": 5120,
  "result": "success"
}

Required Fields:

  • timestamp: ISO 8601 format
  • user_id: Username or user identifier
  • event_type: Type of event (logon, file_access, network_connection, etc.)
  • result: "success" or "failure"

Optional Fields:

  • source_ip, dest_ip: IP addresses
  • resource: Resource accessed (server, file, application)
  • data_volume: Bytes transferred
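
Checking records against the required fields before training can catch malformed input early. The sketch below is one possible validator using only the standard library; validate_event is a hypothetical helper, and the repository's own normalization step may behave differently.

```python
import json
from datetime import datetime

REQUIRED_FIELDS = ("timestamp", "user_id", "event_type", "result")


def validate_event(line: str) -> dict:
    """Parse one JSONL line and check the required fields (hypothetical helper)."""
    event = json.loads(line)
    missing = [f for f in REQUIRED_FIELDS if f not in event]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if event["result"] not in ("success", "failure"):
        raise ValueError(f"unexpected result: {event['result']!r}")
    # fromisoformat() only accepts a trailing "Z" from Python 3.11 on, so normalize it.
    event["timestamp"] = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
    return event


event = validate_event(
    '{"timestamp": "2025-11-29T14:30:00Z", "user_id": "alice", '
    '"event_type": "logon", "result": "success"}'
)
```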

Supported Event Types

The system can handle any event type, but it has been optimized for:

  • logon, logoff
  • file_access, file_write, file_delete
  • network_connection
  • process_execution
  • privilege_escalation
  • usb_activity
  • email_send
  • database_query
  • api_call

Model Details

LSTM Sequence Model

Architecture:

  • Embedding layers for categorical features (event type, IPs, resources)
  • 2-layer LSTM with 128 hidden units per layer
  • Dropout (20%) for regularization
  • Prediction heads for next event classification

Input: Sequences of 50 events per user

Output: Anomaly score based on prediction error
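
To make the idea of prediction error as an anomaly score concrete, the sketch below pairs each window of events with the event that follows it. It scores the observed event by its negative log-likelihood under a predicted distribution. The scoring formula is an assumption in the style of next-event log models. This README does not state the exact formula used by the repository, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

SEQUENCE_LENGTH = 50  # window size stated in this README


def make_windows(events: Sequence[str], length: int = SEQUENCE_LENGTH) -> List[Tuple[List[str], str]]:
    """Slide over one user's events, pairing each window with the event that follows it."""
    return [
        (list(events[i : i + length]), events[i + length])
        for i in range(len(events) - length)
    ]


def nll_score(predicted: Dict[str, float], actual: str, eps: float = 1e-9) -> float:
    """Negative log-likelihood of the observed event (assumed scoring rule)."""
    return -math.log(max(predicted.get(actual, 0.0), eps))


# Tiny example with a window length of 3 instead of 50.
windows = make_windows(["logon", "file_access", "file_access", "logoff"], length=3)

# Stand-in for the model's softmax output over next-event types.
predicted = {"logoff": 0.9, "file_access": 0.09, "usb_activity": 0.01}
expected_score = nll_score(predicted, "logoff")        # low: the model expected this event
surprise_score = nll_score(predicted, "usb_activity")  # high: the model did not
```

An event the model considers likely yields a score near zero, while an unlikely one yields a large score, so a single threshold on the score separates the two.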

Autoencoder Profile Model

Architecture:

  • Encoder: 100 → 128 → 64 → 32 (bottleneck)
  • Decoder: 32 → 64 → 128 → 100
  • ReLU activations, Dropout (15%)

Input: Aggregated features over 1-hour windows

Output: Reconstruction error as anomaly score

Profile Features:

  • Event counts by type
  • Data transfer volumes
  • Network behavior (unique IPs, destinations)
  • Success/failure rates
  • Temporal patterns (hour of day, weekend activity)
  • Event diversity (entropy)
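
A few of these features can be computed from a single window of events as sketched below. The feature names and the profile_features helper are illustrative assumptions; the repository's create_profile_features() may use different names and many more features.

```python
import math
from collections import Counter
from typing import Dict, List


def profile_features(window: List[dict]) -> Dict[str, float]:
    """Aggregate one user's events from a single time window (illustrative subset)."""
    counts = Counter(e["event_type"] for e in window)
    total = sum(counts.values())
    # Shannon entropy in bits of the event-type distribution; 0 when only one type occurs.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values()) if total else 0.0
    failures = sum(1 for e in window if e.get("result") == "failure")
    return {
        "event_count": float(total),
        "total_bytes": float(sum(e.get("data_volume", 0) for e in window)),
        "unique_dest_ips": float(len({e["dest_ip"] for e in window if "dest_ip" in e})),
        "failure_rate": failures / total if total else 0.0,
        "event_entropy": entropy,
    }


window = [
    {"event_type": "logon", "result": "success"},
    {"event_type": "file_access", "result": "success", "data_volume": 5120, "dest_ip": "172.16.1.10"},
    {"event_type": "file_access", "result": "failure", "data_volume": 1024, "dest_ip": "172.16.1.10"},
    {"event_type": "logoff", "result": "success"},
]
features = profile_features(window)
```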

Anomaly Detection

Threshold Setting

Anomaly thresholds are set at the 95th percentile of validation scores:

  • Scores above threshold → Flag as anomaly
  • Adjustable via Config.ANOMALY_PERCENTILE
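
The percentile rule can be sketched as below. This version uses the nearest-rank definition of a percentile, which is an assumption; the repository may use an interpolating percentile instead, which would give slightly different thresholds. The function name is hypothetical.

```python
import math
from typing import Sequence


def percentile_threshold(scores: Sequence[float], percentile: float = 95.0) -> float:
    """Nearest-rank percentile: smallest score with at least `percentile`% of scores at or below it."""
    if not scores:
        raise ValueError("need at least one validation score")
    ordered = sorted(scores)
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1]


validation_scores = [float(i) for i in range(1, 101)]  # 1.0 .. 100.0
threshold = percentile_threshold(validation_scores, 95.0)
flagged = [s for s in validation_scores if s > threshold]
```

With a 95th-percentile threshold, about 5% of validation scores lie above the cutoff by construction, so raising the percentile trades detection sensitivity for a lower alert volume.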

Severity Levels

  • High: Score > 2× threshold
  • Medium: threshold < Score ≤ 2× threshold
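
The two levels translate directly into a small function. The name severity and the use of None for scores that do not trigger an alert are assumptions made for this sketch.

```python
from typing import Optional


def severity(score: float, threshold: float) -> Optional[str]:
    """Map an anomaly score to a severity level; None means no alert is raised."""
    if score > 2 * threshold:
        return "high"
    if score > threshold:
        return "medium"
    return None
```

For example, with a threshold of 4.5, a score of 10.0 maps to "high", a score of 6.0 maps to "medium", and a score of 4.0 raises no alert.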

Detection Capabilities

The system can detect:

  1. Lateral Movement: Unusual access to new systems
  2. Data Exfiltration: Abnormal data transfer volumes
  3. Privilege Escalation: Unusual administrative activities
  4. After-Hours Activity: Activity at atypical times
  5. Living off the Land (LOTL): Unusual use of built-in tools

Performance Tuning

GPU vs CPU

  • Training: GPU recommended (5-10× faster)
  • Inference: CPU sufficient for most workloads
  • Automatic device selection in code

Memory Requirements

  • Training: ~4-8GB RAM (depends on dataset size)
  • Inference: ~1-2GB RAM
  • GPU: 4GB+ VRAM recommended

Scaling

For large deployments:

  1. Horizontal Scaling: Partition users across multiple inference instances
  2. Batch Processing: Process events in batches during training
  3. Model Optimization: Quantization or distillation for faster inference
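
For horizontal scaling, each user's events should consistently reach the same inference instance, since sequence windows are built per user. One way to do this is to hash the user ID, as in the sketch below. The function name and the choice of SHA-256 are assumptions; Python's built-in hash() is avoided because it is salted per process for strings.

```python
import hashlib


def instance_for_user(user_id: str, num_instances: int) -> int:
    """Stable user-to-instance assignment via a hash of the user ID (hypothetical helper)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_instances


assignments = {user: instance_for_user(user, 4) for user in ("alice", "bob", "user_042")}
```

Note that changing num_instances reassigns most users, so any per-user state would need to be rebuilt or migrated after resizing.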

Cloud Deployment

AWS Example

  1. Training:

    • Use EC2 g4dn.xlarge (T4 GPU) for training
    • Store data in S3
    • Use SageMaker for managed training (optional)
  2. Inference:

    • Deploy on ECS Fargate or EC2
    • Subscribe to Kinesis Data Stream for real-time logs
    • Send alerts to SNS or SQS
  3. Architecture:

Logs → Kinesis → Lambda (normalize) → Kinesis → ECS (inference) → SNS (alerts)
         ↓
        S3 (archive)

Monitoring and Maintenance

Model Retraining

Retrain models monthly or quarterly to adapt to:

  • New normal behavior patterns
  • Organizational changes
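  • Evolving user roles

For example, the following cron entry retrains at midnight on the first day of each month. Cron does not start in the project directory, so in practice use absolute paths or change into the repository first.

# Automated retraining (example cron)
0 0 1 * * python train_insider_threat_detection.py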

Alert Tuning

Monitor false positive rate and adjust:

  • Anomaly percentile threshold
  • Feature engineering
  • Model architecture

Performance Metrics

Track:

  • Alert rate (anomalies per day)
  • Model inference latency
  • Resource utilization
  • Detection accuracy (if labeled data available)
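
Two of these metrics can be computed from the alert output with little code. The sketch below counts alerts per calendar day and, when labels exist, computes precision and recall over event identifiers. Both helper names are hypothetical, and the event identifiers are an assumption, since the alert format above carries no ID field.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def alerts_per_day(alerts: Iterable[dict]) -> Dict[str, int]:
    """Count alerts by calendar date, using the date part of the ISO 8601 timestamp."""
    return dict(Counter(a["timestamp"][:10] for a in alerts))


def precision_recall(flagged: Set[str], truly_anomalous: Set[str]) -> Tuple[float, float]:
    """Precision and recall over sets of event identifiers (requires labeled data)."""
    true_positives = len(flagged & truly_anomalous)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_anomalous) if truly_anomalous else 0.0
    return precision, recall


daily = alerts_per_day([
    {"timestamp": "2025-11-29T02:15:00Z"},
    {"timestamp": "2025-11-29T08:00:00Z"},
    {"timestamp": "2025-11-30T01:00:00Z"},
])
precision, recall = precision_recall({"e1", "e2", "e3", "e4"}, {"e1", "e2", "e5"})
```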

Customization

Adding Custom Features

Edit create_profile_features() in train_insider_threat_detection.py:

features = {
    # ... existing features ...
    'custom_metric': compute_custom_metric(window_df),
}

Adjusting Model Architecture

Edit the Config class:

class Config:
    # LSTM
    LSTM_HIDDEN_SIZE = 256  # Increase for more capacity
    LSTM_NUM_LAYERS = 3      # Add layers for deeper model
    
    # Autoencoder
    AUTOENCODER_LAYER_SIZES = [100, 150, 100, 50]  # Custom architecture

Custom Event Types

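The system represents event types through learned embeddings, so new types need no code changes. Types not seen during training fall back to a default encoding until the model is retrained with an updated vocabulary. To improve detection for specific types, add them to the synthetic data generator or provide domain-specific features.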

Troubleshooting

Common Issues

  1. "No such file: normalized_logs.json"

    • Run generate_synthetic_logs.py first
  2. Out of Memory Error

    • Reduce BATCH_SIZE in Config
    • Use smaller sequences (SEQUENCE_LENGTH)
    • Process data in chunks
  3. Poor Detection Performance

    • Increase training data
    • Adjust anomaly threshold
    • Add more features
    • Retrain with recent data
  4. Unknown Event Types in Inference

    • Events not seen during training use default encoding
    • Retrain model with updated vocabulary
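
A default encoding for unseen event types is commonly implemented by reserving one vocabulary index, as sketched below. Reserving index 0 and the names build_vocab and encode are assumptions for illustration; the repository's actual encoding is not shown in this README.

```python
from typing import Dict, Iterable

UNKNOWN_INDEX = 0  # assumed reserved slot for event types not seen during training


def build_vocab(event_types: Iterable[str]) -> Dict[str, int]:
    """Assign indices starting at 1, leaving 0 free for unknown types."""
    return {name: i for i, name in enumerate(sorted(set(event_types)), start=1)}


def encode(event_type: str, vocab: Dict[str, int]) -> int:
    """Look up an event type, falling back to the unknown index."""
    return vocab.get(event_type, UNKNOWN_INDEX)


vocab = build_vocab(["logon", "logoff", "logon"])
known = encode("logon", vocab)
unseen = encode("vpn_connect", vocab)  # hypothetical new event type
```

All unseen types share one embedding under this scheme, so the model cannot tell them apart until it is retrained with an updated vocabulary.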

References

  • Du et al. (2017). DeepLog: Anomaly Detection and Diagnosis from System Logs
  • Tuor et al. (2018). Deep Learning for Unsupervised Insider Threat Detection
  • Veeramachaneni et al. (2016). AI²: Training a Big Data Machine to Defend

License

This implementation is provided for educational and research purposes.

Author

Travis Lelle - Security Engineer and ML/AI Researcher

Support

For issues, questions, or contributions, refer to the research paper, which covers the theoretical details and implementation guidance.
