6 changes: 6 additions & 0 deletions README.md
@@ -1,5 +1,11 @@
# DeepLogBot

[![PyPI version](https://img.shields.io/pypi/v/deeplogbot)](https://pypi.org/project/deeplogbot/)
[![Python](https://img.shields.io/pypi/pyversions/deeplogbot)](https://pypi.org/project/deeplogbot/)
[![Tests](https://github.com/ypriverol/deeplogbot/actions/workflows/tests.yml/badge.svg)](https://github.com/ypriverol/deeplogbot/actions/workflows/tests.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![llms.txt](https://img.shields.io/badge/llms.txt-available-blue)](https://github.com/ypriverol/deeplogbot/blob/main/llms.txt)

Bot detection and traffic classification for scientific data repository logs.

## Overview
Expand Down
216 changes: 216 additions & 0 deletions llms.txt
@@ -0,0 +1,216 @@
# DeepLogBot

> Bot detection and traffic classification for scientific data repository download logs.

DeepLogBot is an open-source framework for detecting and removing automated bot traffic from download logs of scientific data repositories. It classifies each geographic location into three categories: **bot** (scrapers, crawlers, coordinated bot farms), **hub** (legitimate automation such as institutional mirrors, CI/CD pipelines, reanalysis centers), and **organic** (human researchers). The tool was developed for the PRIDE proteomics archive but is designed to be applicable to any scientific data repository (ENA/SRA, PDB, MetaboLights, etc.).

## Links

- Source code: https://github.com/ypriverol/deeplogbot
- PyPI package: https://pypi.org/project/deeplogbot/
- PRIDE Archive: https://www.ebi.ac.uk/pride/
- Log processing pipeline: https://github.com/PRIDE-Archive/nf-downloadstats
- Paper: Perez-Riverol et al., "Tracking Dataset Reuse in Proteomics" (2025)

## Quick Start

```bash
pip install deeplogbot # rules method (no torch required)
pip install "deeplogbot[deep]" # deep method (includes torch)

# Rule-based classification (fast, interpretable)
deeplogbot -i downloads.parquet -o output/ -m rules

# Deep architecture (best accuracy, F1=0.775)
deeplogbot -i downloads.parquet -o output/ -m deep

# Sample large datasets for faster processing
deeplogbot -i downloads.parquet -o output/ -m deep --sample-size 1000000
```

## Key Results on PRIDE Archive

- Dataset: 159.3 million download records (2020-2025), 4.7 GB Parquet file
- 71,133 unique geographic locations from 235 countries
- Classification (deep method on full dataset):
- Bot: 37,779 locations (53.1%), accounting for 88.0% of all downloads (140.2M)
- Hub: 664 locations (0.9%), accounting for 11.3% of downloads (18.0M) across 58 countries
- Organic: 32,690 locations (46.0%), accounting for 0.7% of downloads (1.1M)
- After bot removal: 19.1M clean downloads across 34,085 datasets and 213 countries
- Top countries by clean downloads: US (5.1M, 26.8%), UK (4.5M, 23.6%), Germany (4.3M, 22.5%)
- Dataset reuse concentration: Gini = 0.84, top 1% of datasets = 43.3% of downloads
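
The reuse-concentration figure above is a plain Gini coefficient over per-dataset download counts. A minimal sketch of that computation (the standard closed form over sorted values, not DeepLogBot code; the toy counts are invented):

```python
def gini(values):
    """Gini coefficient of a non-negative distribution: 0 = perfectly equal, ~1 = maximally concentrated."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    # closed form over sorted values: sum_i (2i - n - 1) * x_i / (n * sum(x))
    return sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(xs)) / (n * total)

downloads = [1, 1, 1, 1, 96]  # toy: one dataset dominates the tail
concentrated = gini(downloads)       # 0.76
even = gini([5, 5, 5, 5])            # 0.0, perfectly even reuse
```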

## Benchmark Results (1M-record sample, 1,411 ground truth locations)

| Method | Macro F1 | Bot Precision | Bot Recall | Hub F1 | Organic F1 |
|--------|----------|---------------|------------|--------|------------|
| Rules | 0.632 | 0.506 | 1.000 | 0.275 | 0.621 |
| Deep | 0.775 | 0.448 | 1.000 | 0.718 | 0.608 |

The deep method's main advantage is hub detection (F1=0.718 vs 0.275), distinguishing legitimate automation from harmful scraping.

## Classification Methods

### Rule-Based (`-m rules`)
YAML-configurable threshold patterns evaluated sequentially: first match determines classification. Requires no training, no torch dependency. Best for production use with known, stable patterns. Patterns defined in `deeplogbot/config.yaml` under `classification.rule_based`.
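
Illustrative shape of such a rule file (the key names and thresholds below are invented for illustration; consult the shipped `config.yaml` for the real schema):

```yaml
# hypothetical fragment -- not the actual shipped configuration
classification:
  rule_based:
    bots:
      - name: high_volume_single_user   # invented rule
        min_downloads: 100000
        max_unique_users: 1
    hubs:
      - name: institutional_mirror      # invented rule
        min_years_active: 3
        min_working_hours_ratio: 0.6
```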

### Deep Architecture (`-m deep`)
Multi-stage learned pipeline (all locations pass through the full pipeline, no hard pre-filter lockout):

1. **Seed Selection** (`seed_selection.py`): Identify high-confidence bot/organic/hub training seeds from feature distributions using conservative heuristic criteria.
2. **Organic VAE** (`organic_vae.py`): Variational Autoencoder trained on organic seeds learns the normal-behavior manifold. Reconstruction error scores how "non-organic" each location is. Includes a Deep Isolation Forest on the VAE latent space.
3. **Temporal Consistency** (`temporal_consistency.py`): Modified z-score spike detection across yearly download patterns. No fixed thresholds.
4. **Fusion Meta-Learner** (`fusion.py`): Gradient-boosted classifier (sklearn) combining all anomaly signals into final bot/hub/organic probabilities.
5. **Soft Priors** (`deep_architecture.py:compute_soft_priors`): Pre-filter signals encoded as continuous features fed into the pipeline, not hard binary decisions.
6. **Reconciliation** (`deep_architecture.py:reconcile_prefilter_and_pipeline`): Override thresholds resolve disagreements between pre-filter and pipeline output (override=0.7, strict=0.8).
7. **Hub Protection** (`post_classification.py`): Structural override prevents legitimate automation from being misclassified as bots.
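
Step 3 above can be sketched with the standard modified z-score (Iglewicz-Hoaglin: 0.6745 x deviation / MAD); this is an illustrative reimplementation, not the code in `temporal_consistency.py`:

```python
import statistics

def modified_z_scores(counts):
    """Robust outlier scores: median/MAD instead of mean/stddev, so one spike can't mask itself."""
    med = statistics.median(counts)
    mad = statistics.median([abs(c - med) for c in counts])
    if mad == 0:
        return [0.0] * len(counts)
    return [0.6745 * (c - med) / mad for c in counts]

# toy yearly download counts for one location; 2023 is a coordinated spike
yearly = {2020: 1200, 2021: 1350, 2022: 1100, 2023: 980_000, 2024: 1280}
scores = modified_z_scores(list(yearly.values()))
spikes = [y for y, z in zip(yearly, scores) if abs(z) > 3.5]  # 3.5 is a common cutoff
```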

## Feature Engineering

~117 behavioral and discriminative features extracted per location, organized into categories:

- **Activity features**: download counts, unique users, downloads per user, unique projects, active hours, years active
- **Temporal features**: hourly/yearly entropy, working hours ratio, night activity ratio, circadian rhythm deviation, year-over-year CV, spike ratio
- **Behavioral features**: burst patterns, user coordination scores, session regularity, concurrent user patterns, weekend/weekday imbalance
- **Discriminative features**: file exploration patterns, user authenticity scores, bot composite score, request velocity, access regularity, Benford deviation
- **Time series features**: weekly autocorrelation, periodicity strength, trend slope/acceleration, momentum score, detrended volatility
- **Country-level features**: locations per country, suspicious location ratio, new location ratio

Features are extracted by the provider system (`deeplogbot/features/providers/ebi/`), which aggregates individual download events into location-level profiles using DuckDB.
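
A toy version of that aggregation step (pure Python standing in for DuckDB, with invented events) to show the event-to-profile shape:

```python
import math
from collections import Counter, defaultdict

# toy download events: (geo_location, user_id, hour) -- values invented
events = [
    ("52.5,13.4", "u1", 3), ("52.5,13.4", "u1", 3), ("52.5,13.4", "u2", 3),
    ("40.7,-74.0", "u3", 9), ("40.7,-74.0", "u4", 14), ("40.7,-74.0", "u5", 20),
]

by_loc = defaultdict(list)
for loc, user, hour in events:
    by_loc[loc].append((user, hour))

profiles = {}
for loc, rows in by_loc.items():
    n = len(rows)
    users = {u for u, _ in rows}
    hour_counts = Counter(h for _, h in rows)
    # Shannon entropy of the hourly histogram: near 0 for bursty, bot-like traffic
    hourly_entropy = -sum((c / n) * math.log2(c / n) for c in hour_counts.values())
    profiles[loc] = {
        "downloads": n,
        "unique_users": len(users),
        "downloads_per_user": n / len(users),
        "hourly_entropy": round(hourly_entropy, 3),
    }
```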

## Input Format

Parquet file with one row per download event. Required columns:

| Column | Type | Description |
|--------|------|-------------|
| `accession` | string | Dataset accession (e.g., PXD000001) |
| `geo_location` | string | Geographic coordinates or location identifier |
| `country` | string | Country name |
| `year` | int | Download year |
| `date` | string/date | Download date |

Optional columns that improve classification: `filename`, `download_protocol`, `user_id` (anonymized hash), `hour`, `city`.
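
A hypothetical pre-flight check against this schema (not part of the package; the column sets are copied from the tables above):

```python
REQUIRED = {"accession", "geo_location", "country", "year", "date"}
OPTIONAL = {"filename", "download_protocol", "user_id", "hour", "city"}

def check_schema(columns):
    """Return (missing required columns, unrecognized columns) for an input file."""
    cols = set(columns)
    return sorted(REQUIRED - cols), sorted(cols - REQUIRED - OPTIONAL)

missing, unknown = check_schema(
    ["accession", "geo_location", "country", "year", "date", "user_id", "ip"]
)
# missing == [], unknown == ["ip"]
```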

## Output Format

### Annotated Parquet
Original download records with appended classification columns:

| Column | Type | Description |
|--------|------|-------------|
| `is_bot` | bool | True if location classified as bot |
| `is_hub` | bool | True if location classified as legitimate automation |
| `is_organic` | bool | True if location classified as organic user |
| `classification_confidence` | float | Confidence score (0-1) |
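
Downstream filtering then amounts to a row filter on these flags. A sketch over toy annotated rows (plain dicts standing in for the parquet; values invented):

```python
# toy annotated records; in practice these come from <input>_annotated.parquet
records = [
    {"accession": "PXD000001", "is_bot": True,  "is_hub": False, "is_organic": False, "classification_confidence": 0.97},
    {"accession": "PXD000001", "is_bot": False, "is_hub": True,  "is_organic": False, "classification_confidence": 0.81},
    {"accession": "PXD000002", "is_bot": False, "is_hub": False, "is_organic": True,  "classification_confidence": 0.64},
]

# "clean" traffic = everything not flagged as bot; whether hubs count as reuse
# depends on the analysis, so keep the flags around rather than dropping rows early
clean = [r for r in records if not r["is_bot"]]
organic_only = [r for r in clean if r["is_organic"]]
```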

### Reports
- `bot_detection_report.txt`: Summary with counts, breakdowns, top locations per category
- `location_analysis.csv`: Per-location features and classifications (~167 columns)
- `report.html`: Interactive HTML report with charts
- `plots/`: 10 PNG visualizations (geographic distribution, temporal patterns, feature distributions, anomaly analysis, etc.)

### Output Strategies
- `--output-strategy new_file` (default): Creates `<input>_annotated.parquet`
- `--output-strategy overwrite`: Rewrites the original parquet in place
- `--reports-only`: Only generates reports, no parquet annotation

## Project Structure

```
deeplogbot/
├── main.py                     # CLI entry point, orchestrates the pipeline
├── config.py                   # Configuration loading (config.yaml + taxonomy)
├── config.yaml                 # All classification rules, thresholds, taxonomy
├── features/
│   ├── base.py                 # Abstract base feature extractor
│   ├── schema.py               # Log schema definitions
│   ├── registry.py             # Feature documentation registry
│   └── providers/ebi/          # EBI/PRIDE-specific feature extraction
│       ├── ebi.py              # Main location feature aggregation (DuckDB)
│       ├── behavioral.py       # Burst, coordination, circadian features
│       ├── discriminative.py   # Bot scores, authenticity, file patterns
│       ├── timeseries.py       # Autocorrelation, periodicity, trend features
│       └── schema.py           # EBI log schema
├── models/
│   ├── isoforest/models.py     # Isolation Forest training (sklearn)
│   └── classification/
│       ├── rules.py                # Rule-based hierarchical classifier
│       ├── deep_architecture.py    # Deep pipeline orchestration, soft priors, reconciliation
│       ├── seed_selection.py       # High-confidence seed identification for training
│       ├── organic_vae.py          # VAE + Deep Isolation Forest (torch)
│       ├── temporal_consistency.py # Modified z-score spike detection
│       ├── fusion.py               # Gradient-boosted meta-learner (sklearn)
│       ├── post_classification.py  # Hub protection, logging, label finalization
│       └── feature_validation.py   # Feature importance validation
├── reports/
│   ├── annotation.py           # DuckDB-based parquet annotation
│   ├── reporting.py            # Text report generation
│   ├── statistics.py           # Summary statistics computation
│   ├── html_report.py          # Interactive HTML report
│   └── visualizations.py       # matplotlib chart generation
├── providers/
│   └── base_taxonomy.yaml      # Classification taxonomy definition
└── utils/geography.py          # Country/region geographic lookups
```

## Configuration

All classification parameters are in `deeplogbot/config.yaml`:

- `classification.rule_based`: Patterns for the rules method (bots, hubs, independent users)
- `classification.hub_protection`: Rules that prevent hubs from being classified as bots
- `classification.bot_detection`: Threshold-based bot detection criteria
- `classification.download_hub`: Hub identification thresholds
- `classification.stratified_prefiltering`: Pre-filter thresholds for the deep method
- `classification.taxonomy`: Two-stage hierarchy (behavior_type -> automation_category)
- `deep_reconciliation`: Override and strict thresholds for deep method reconciliation
- `isolation_forest`: Contamination rate, number of estimators
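
Illustrative fragment for the last two entries (top-level keys and the 0.7/0.8/0.15 values come from this document; the nested key names and `n_estimators` value are assumptions):

```yaml
# hypothetical fragment -- nested key names are guesses, not the shipped file
deep_reconciliation:
  override_threshold: 0.7   # pipeline overrides pre-filter above this confidence
  strict_threshold: 0.8
isolation_forest:
  contamination: 0.15       # matches the CLI default (-c)
  n_estimators: 100         # invented value
```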

## Provider System

DeepLogBot uses a provider abstraction to support different log formats. Currently implemented:

- `ebi`: PRIDE/EBI download logs (default)

Custom providers can be added by implementing the feature extraction interface in `deeplogbot/features/providers/`. Each provider defines its own log schema, feature extraction logic, and classification rules.
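
A hypothetical minimal provider, to show the shape of that contract (the real interface lives in `deeplogbot/features/base.py` and differs in detail; class and method names here are assumptions):

```python
from abc import ABC, abstractmethod
from collections import defaultdict

class FeatureProvider(ABC):
    """Sketch of the provider contract; method name and signature are assumptions."""

    @abstractmethod
    def extract_location_features(self, events):
        """Aggregate per-event rows into one feature dict per location."""

class ToyProvider(FeatureProvider):
    def extract_location_features(self, events):
        counts = defaultdict(int)
        for ev in events:
            counts[ev["geo_location"]] += 1
        return {loc: {"downloads": n} for loc, n in counts.items()}

features = ToyProvider().extract_location_features(
    [{"geo_location": "A"}, {"geo_location": "A"}, {"geo_location": "B"}]
)
```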

## Dependencies

- Core (always required): pandas, numpy, scikit-learn, scipy, duckdb, pyyaml
- Optional (deep method): torch >= 2.1.0
- Development: pytest, ruff

## Python API

```python
from deeplogbot import run_bot_annotator

results = run_bot_annotator(
    input_parquet='downloads.parquet',
    output_dir='output/',
    classification_method='deep',  # or 'rules'
    sample_size=1000000,           # optional sampling
    contamination=0.15,            # Isolation Forest parameter
)

# results contains: bot_count, hub_count, organic_count, total_locations
```

## CLI Options

| Option | Description | Default |
|--------|-------------|---------|
| `-i, --input` | Input parquet file | Required |
| `-o, --output-dir` | Output directory for reports | `output/bot_analysis` |
| `-m, --classification-method` | `rules` or `deep` | `rules` |
| `-c, --contamination` | Expected anomaly proportion | `0.15` |
| `-s, --sample-size` | Sample N records from input | None (use all) |
| `-p, --provider` | Log provider configuration | `ebi` |
| `--output-strategy` | `new_file`, `reports_only`, or `overwrite` | `new_file` |
| `--reports-only` | Only generate reports (no parquet) | False |
| `--compute-importances` | Compute feature importances | False |

## License

MIT