This implementation provides a complete end-to-end system for detecting insider threats in security log streams using unsupervised deep learning, as described in my research paper: Unsupervised Deep Learning for Insider Threat Detection in Security Log Streams.
The system consists of two complementary neural network models:
- LSTM Sequence Model: Detects anomalies in event sequences by predicting the next event
- Autoencoder Profile Model: Detects anomalies in aggregated user behavior profiles
┌─────────────────┐
│  Security Logs  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Normalization  │
└────────┬────────┘
         │
         ├──────────────────┬──────────────────┐
         ▼                  ▼                  ▼
  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
  │   Sequence   │   │   Profile    │   │   Storage    │
  │   Features   │   │   Features   │   │              │
  └──────┬───────┘   └──────┬───────┘   └──────────────┘
         │                  │
         ▼                  ▼
  ┌──────────────┐   ┌──────────────┐
  │  LSTM Model  │   │ Autoencoder  │
  └──────┬───────┘   └──────┬───────┘
         │                  │
         └────────┬─────────┘
                  ▼
           ┌──────────────┐
           │   Anomaly    │
           │  Detection   │
           └──────────────┘
pip install torch numpy pandas scikit-learn

For faster training, use a GPU. The system will automatically detect and use CUDA if available.
python generate_synthetic_logs.py \
--output normalized_logs.json \
--num-users 50 \
--num-days 30 \
--events-per-day 1000 \
--anomaly-rate 0.02

This generates approximately 30,000 log events (30 days × 1,000 events per day) with realistic normal behavior and 2% anomalous events.
Options:
- --num-users: Number of simulated users (default: 50)
- --num-days: Number of days to simulate (default: 30)
- --events-per-day: Average events per day (default: 1000)
- --anomaly-rate: Fraction of anomalous events (default: 0.02 = 2%)
- --seed: Random seed for reproducibility (default: 42)
python train_insider_threat_detection.py

This will:
- Load and normalize the log data
- Create sequence and profile features
- Train the LSTM model for sequence anomaly detection
- Train the Autoencoder for profile anomaly detection
- Compute anomaly thresholds from validation data
- Save trained models to the models/ directory
Expected Output:
Loading logs from normalized_logs.json
Loaded 30000 log events
Unique users: 50
Unique event types: 12
Creating event sequences per user...
Created 15000 sequences
Creating aggregated profile features...
Created 3600 profile feature vectors
LSTM Model: 1,234,567 parameters
Autoencoder Model: 234,567 parameters
Training LSTM sequence model...
Epoch 1/30 - Train Loss: 2.3456, Val Loss: 2.1234
...
Training complete!
Models saved to models/
Configuration:
Edit the Config class in train_insider_threat_detection.py to adjust:
- Model architecture (layer sizes, hidden units, embeddings)
- Training parameters (batch size, learning rate, epochs)
- Anomaly detection thresholds
python inference_insider_threat_detection.py --mode demo

python inference_insider_threat_detection.py \
--mode stream \
--stream-source normalized_logs.json \
--output alerts.jsonl

This processes events from the stream and saves detected anomalies to alerts.jsonl.
Alert Format:
{
  "type": "sequence_anomaly",
  "user_id": "user_042",
  "score": 8.7654,
  "threshold": 4.5123,
  "severity": "medium",
  "timestamp": "2025-11-29T02:15:00Z",
  "event_type": "file_access"
}

Logs should be in JSONL (JSON Lines) format, one JSON object per line, with the following schema (pretty-printed here for readability):
{
  "timestamp": "2025-11-29T14:30:00Z",
  "user_id": "alice",
  "source_ip": "10.0.0.15",
  "dest_ip": "172.16.1.10",
  "event_type": "network_connection",
  "resource": "FileServer01",
  "data_volume": 5120,
  "result": "success"
}

Required Fields:
- timestamp: ISO 8601 format
- user_id: Username or user identifier
- event_type: Type of event (logon, file_access, network_connection, etc.)
- result: "success" or "failure"
Optional Fields:
- source_ip, dest_ip: IP addresses
- resource: Resource accessed (server, file, application)
- data_volume: Bytes transferred
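The schema above can be checked before events reach the models. The following is a minimal sketch, not part of the shipped scripts: `validate_event` and its constants are hypothetical names, and it assumes that a missing optional field is acceptable.

```python
import json
from datetime import datetime

REQUIRED_FIELDS = ("timestamp", "user_id", "event_type", "result")
VALID_RESULTS = ("success", "failure")


def validate_event(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if valid)."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(event, dict):
        return ["event is not a JSON object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in event]

    if "timestamp" in event:
        try:
            # fromisoformat() accepts a trailing "Z" only on Python 3.11+,
            # so rewrite it as an explicit UTC offset first.
            datetime.fromisoformat(str(event["timestamp"]).replace("Z", "+00:00"))
        except ValueError:
            problems.append("timestamp is not ISO 8601")

    if "result" in event and event["result"] not in VALID_RESULTS:
        problems.append("result must be 'success' or 'failure'")

    return problems
```

Calling `validate_event` on each line of a log file and skipping (or logging) lines with a non-empty result keeps malformed events out of feature extraction.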
The system can handle any event type but has been optimized for:
- logon, logoff
- file_access, file_write, file_delete
- network_connection
- process_execution
- privilege_escalation
- usb_activity
- email_send
- database_query
- api_call
Architecture:
- Embedding layers for categorical features (event type, IPs, resources)
- 2-layer LSTM with 128 hidden units per layer
- Dropout (20%) for regularization
- Prediction heads for next event classification
Input: Sequences of 50 events per user
Output: Anomaly score based on prediction error
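To make the input and output concrete, here is a sketch of how 50-event windows and a prediction-error score might be formed. It is an illustration under assumptions: `make_windows` and `surprise` are hypothetical helpers, and negative log-probability is only one plausible form of "prediction error"; the training script may compute the score differently.

```python
import math

SEQUENCE_LENGTH = 50  # window size stated above


def make_windows(events, seq_len=SEQUENCE_LENGTH):
    """Yield (context, next_event) pairs from one user's ordered event types."""
    for start in range(len(events) - seq_len):
        yield events[start:start + seq_len], events[start + seq_len]


def surprise(predicted_probs, actual_event, floor=1e-9):
    """Negative log-probability the model assigned to the event that occurred.

    predicted_probs maps event type -> predicted probability. The floor
    avoids log(0) when the model gave the event no probability at all.
    """
    return -math.log(max(predicted_probs.get(actual_event, 0.0), floor))
```

An event the model considered likely yields a score near zero, while an event it considered very unlikely yields a large score.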
Architecture:
- Encoder: 100 → 128 → 64 → 32 (bottleneck)
- Decoder: 32 → 64 → 128 → 100
- ReLU activations, Dropout (15%)
Input: Aggregated features over 1-hour windows
Output: Reconstruction error as anomaly score
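As a small illustration of the output, reconstruction error can be computed as the mean squared difference between a profile vector and the autoencoder's reconstruction of it. This sketch assumes mean squared error; the actual loss used in the training script is not stated here, and `reconstruction_error` is a hypothetical helper.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between a profile vector and its reconstruction."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("vectors must be non-empty and the same length")
    return sum((x - y) ** 2 for x, y in zip(original, reconstructed)) / len(original)
```

A profile that resembles the training data reconstructs well and scores low; an unusual profile reconstructs poorly and scores high.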
Profile Features:
- Event counts by type
- Data transfer volumes
- Network behavior (unique IPs, destinations)
- Success/failure rates
- Temporal patterns (hour of day, weekend activity)
- Event diversity (entropy)
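A few of the features listed above could be computed per user and hour roughly as follows. This is a simplified sketch using only the standard library; it is not the project's create_profile_features(), and the function names and feature keys are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict
from datetime import datetime


def hour_bucket(timestamp: str) -> str:
    """Truncate an ISO 8601 timestamp to the start of its hour."""
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return parsed.strftime("%Y-%m-%dT%H:00")


def entropy(counts: Counter) -> float:
    """Shannon entropy (in bits) of the event-type distribution."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def profile_features(events):
    """Group events by (user, hour) and compute summary features per group."""
    windows = defaultdict(list)
    for event in events:
        windows[(event["user_id"], hour_bucket(event["timestamp"]))].append(event)

    profiles = {}
    for key, group in windows.items():
        type_counts = Counter(e["event_type"] for e in group)
        profiles[key] = {
            "event_count": len(group),
            "total_bytes": sum(e.get("data_volume", 0) for e in group),
            "unique_dest_ips": len({e["dest_ip"] for e in group if "dest_ip" in e}),
            "failure_rate": sum(e["result"] == "failure" for e in group) / len(group),
            "hour_of_day": int(key[1][11:13]),  # hour digits of the bucket label
            "event_entropy": entropy(type_counts),
        }
    return profiles
```

Entropy is 0 when a user performs only one type of event in the window and grows as activity spreads across more event types.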
Anomaly thresholds are set at the 95th percentile of validation scores:
- Scores above threshold → Flag as anomaly
- Adjustable via Config.ANOMALY_PERCENTILE
- High: Score > 2× threshold
- Medium: Score > threshold
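The thresholding and severity rules above can be sketched in a few lines. This is an illustration rather than the project's code: `percentile_threshold` uses the nearest-rank percentile definition, which may differ slightly from an interpolating implementation such as NumPy's, and both function names are hypothetical.

```python
import math


def percentile_threshold(scores, percentile=95.0):
    """Nearest-rank percentile of the validation scores."""
    if not scores:
        raise ValueError("need at least one validation score")
    ordered = sorted(scores)
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1]


def severity(score, threshold):
    """Map a score to 'high', 'medium', or None (not an anomaly)."""
    if score > 2 * threshold:
        return "high"
    if score > threshold:
        return "medium"
    return None
```

Raising the percentile produces a higher threshold and fewer alerts; lowering it does the opposite.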
The system can detect:
- Lateral Movement: Unusual access to new systems
- Data Exfiltration: Abnormal data transfer volumes
- Privilege Escalation: Unusual administrative activities
- After-Hours Activity: Activity at atypical times
- Living off the Land (LOTL): Unusual use of built-in tools
- Training: GPU recommended (5-10× faster)
- Inference: CPU sufficient for most workloads
- Automatic device selection in code
- Training: ~4-8GB RAM (depends on dataset size)
- Inference: ~1-2GB RAM
- GPU: 4GB+ VRAM recommended
For large deployments:
- Horizontal Scaling: Partition users across multiple inference instances
- Batch Processing: Process events in batches during training
- Model Optimization: Quantization or distillation for faster inference
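For horizontal scaling, one way to partition users is a stable hash of the user ID, so that all of a user's events reach the same inference instance and per-user sequence state stays in one place. The sketch below is an assumption about how this could be done; `assign_instance` is a hypothetical helper, not part of the provided scripts.

```python
import hashlib


def assign_instance(user_id: str, num_instances: int) -> int:
    """Map a user to an inference instance index with a stable hash.

    Python's built-in hash() is salted per process for strings, so a
    cryptographic digest is used to get the same answer on every host.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_instances
```

Note that changing `num_instances` reassigns most users under this simple modulo scheme; consistent hashing would reduce that churn if instances are added or removed often.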
- Training:
  - Use EC2 g4dn.xlarge (T4 GPU) for training
  - Store data in S3
  - Use SageMaker for managed training (optional)
- Inference:
  - Deploy on ECS Fargate or EC2
  - Subscribe to Kinesis Data Stream for real-time logs
  - Send alerts to SNS or SQS
- Architecture:
Logs → Kinesis → Lambda (normalize) → Kinesis → ECS (inference) → SNS (alerts)
↓
S3 (archive)
Retrain models monthly or quarterly to adapt to:
- New normal behavior patterns
- Organizational changes
- Evolving user roles
# Automated retraining (example cron)
0 0 1 * * python train_insider_threat_detection.py

Monitor false positive rate and adjust:
- Anomaly percentile threshold
- Feature engineering
- Model architecture
Track:
- Alert rate (anomalies per day)
- Model inference latency
- Resource utilization
- Detection accuracy (if labeled data available)
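The alert rate can be tracked directly from the alerts file. The following sketch assumes each line of alerts.jsonl is a JSON object with an ISO 8601 `timestamp` field, as in the alert format shown earlier; `alerts_per_day` is a hypothetical helper.

```python
import json
from collections import Counter


def alerts_per_day(lines):
    """Count alerts per calendar day from an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        alert = json.loads(line)
        counts[alert["timestamp"][:10]] += 1  # "YYYY-MM-DD" prefix
    return dict(counts)
```

For example, `with open("alerts.jsonl") as f: print(alerts_per_day(f))` prints one count per day. A sudden jump in the daily count is a reasonable trigger for reviewing the threshold or retraining.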
Edit create_profile_features() in train_insider_threat_detection.py:
features = {
    # ... existing features ...
    'custom_metric': compute_custom_metric(window_df),
}

Edit the Config class:
class Config:
    # LSTM
    LSTM_HIDDEN_SIZE = 256  # Increase for more capacity
    LSTM_NUM_LAYERS = 3     # Add layers for deeper model
    # Autoencoder
    AUTOENCODER_LAYER_SIZES = [100, 150, 100, 50]  # Custom architecture

The system automatically handles new event types through embeddings. To improve detection for specific types, add them to the synthetic data generator or provide domain-specific features.
- "No such file: normalized_logs.json"
  - Run generate_synthetic_logs.py first
- Out of Memory Error
  - Reduce BATCH_SIZE in Config
  - Use smaller sequences (SEQUENCE_LENGTH)
  - Process data in chunks
- Poor Detection Performance
  - Increase training data
  - Adjust anomaly threshold
  - Add more features
  - Retrain with recent data
- Unknown Event Types in Inference
  - Events not seen during training use default encoding
  - Retrain model with updated vocabulary
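The "default encoding" for unseen event types can be pictured as a vocabulary with a reserved index. This is a sketch of the general technique, not the project's actual encoder: the `<unk>` token, index 0, and both function names are assumptions.

```python
UNKNOWN = "<unk>"  # hypothetical reserved token for unseen event types


def build_vocab(event_types):
    """Assign an integer index to each event type seen during training."""
    vocab = {UNKNOWN: 0}
    for event_type in event_types:
        vocab.setdefault(event_type, len(vocab))
    return vocab


def encode(event_type, vocab):
    """Look up an event type, falling back to the reserved unknown index."""
    return vocab.get(event_type, vocab[UNKNOWN])
```

Because every unseen type shares one index, the model cannot tell new types apart, which is why retraining with an updated vocabulary is recommended once new event types become common.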
This implementation is provided for educational and research purposes.
Travis Lelle - Security Engineer and ML/AI Researcher
For issues, questions, or contributions, please refer to the research paper for theoretical details and implementation guidance.