diff --git a/README.md b/README.md index 44aba08..ea4230c 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,11 @@ # DeepLogBot +[![PyPI version](https://img.shields.io/pypi/v/deeplogbot)](https://pypi.org/project/deeplogbot/) +[![Python](https://img.shields.io/pypi/pyversions/deeplogbot)](https://pypi.org/project/deeplogbot/) +[![Tests](https://github.com/ypriverol/deeplogbot/actions/workflows/tests.yml/badge.svg)](https://github.com/ypriverol/deeplogbot/actions/workflows/tests.yml) +[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) +[![llms.txt](https://img.shields.io/badge/llms.txt-available-blue)](https://github.com/ypriverol/deeplogbot/blob/main/llms.txt) + Bot detection and traffic classification for scientific data repository logs. ## Overview diff --git a/llms.txt b/llms.txt new file mode 100644 index 0000000..8707f1d --- /dev/null +++ b/llms.txt @@ -0,0 +1,216 @@ +# DeepLogBot + +> Bot detection and traffic classification for scientific data repository download logs. + +DeepLogBot is an open-source framework for detecting and removing automated bot traffic from download logs of scientific data repositories. It classifies each geographic location into three categories: **bot** (scrapers, crawlers, coordinated bot farms), **hub** (legitimate automation such as institutional mirrors, CI/CD pipelines, reanalysis centers), and **organic** (human researchers). The tool was developed for the PRIDE proteomics archive but is designed to be applicable to any scientific data repository (ENA/SRA, PDB, MetaboLights, etc.). 
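Once logs are annotated (see the output format below), removing bot traffic is a boolean filter over the appended label columns. A minimal sketch with pandas, using a small hand-built frame in place of a real annotated parquet:

```python
import pandas as pd

# Hypothetical annotated records: one row per download event, carrying the
# three boolean labels DeepLogBot appends (is_bot / is_hub / is_organic).
df = pd.DataFrame({
    "accession":  ["PXD000001", "PXD000001", "PXD000002", "PXD000003"],
    "is_bot":     [True, False, True, False],
    "is_hub":     [False, True, False, False],
    "is_organic": [False, False, False, True],
})

# "Clean" traffic = everything that is not bot (hub + organic combined),
# matching how the PRIDE numbers below count clean downloads.
clean = df[~df["is_bot"]]
print(len(clean))  # 2 of the 4 events survive bot removal
```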
+ +## Links + +- Source code: https://github.com/ypriverol/deeplogbot +- PyPI package: https://pypi.org/project/deeplogbot/ +- PRIDE Archive: https://www.ebi.ac.uk/pride/ +- Log processing pipeline: https://github.com/PRIDE-Archive/nf-downloadstats +- Paper: Perez-Riverol et al., "Tracking Dataset Reuse in Proteomics" (2025) + +## Quick Start + +```bash +pip install deeplogbot # rules method (no torch required) +pip install "deeplogbot[deep]" # deep method (includes torch) + +# Rule-based classification (fast, interpretable) +deeplogbot -i downloads.parquet -o output/ -m rules + +# Deep architecture (best accuracy, F1=0.775) +deeplogbot -i downloads.parquet -o output/ -m deep + +# Sample large datasets for faster processing +deeplogbot -i downloads.parquet -o output/ -m deep --sample-size 1000000 +``` + +## Key Results on PRIDE Archive + +- Dataset: 159.3 million download records (2020-2025), 4.7 GB Parquet file +- 71,133 unique geographic locations from 235 countries +- Classification (deep method on full dataset): + - Bot: 37,779 locations (53.1%), accounting for 88.0% of all downloads (140.2M) + - Hub: 664 locations (0.9%), accounting for 11.3% of downloads (18.0M) across 58 countries + - Organic: 32,690 locations (46.0%), accounting for 0.7% of downloads (1.1M) +- After bot removal: 19.1M clean downloads across 34,085 datasets and 213 countries +- Top countries by clean downloads: US (5.1M, 26.8%), UK (4.5M, 23.6%), Germany (4.3M, 22.5%) +- Dataset reuse concentration: Gini = 0.84, top 1% of datasets = 43.3% of downloads + +## Benchmark Results (1M-record sample, 1,411 ground truth locations) + +| Method | Macro F1 | Bot Precision | Bot Recall | Hub F1 | Organic F1 | +|--------|----------|---------------|------------|--------|------------| +| Rules | 0.632 | 0.506 | 1.000 | 0.275 | 0.621 | +| Deep | 0.775 | 0.448 | 1.000 | 0.718 | 0.608 | + +The deep method's main advantage is hub detection (F1=0.718 vs 0.275), distinguishing legitimate automation from harmful 
scraping. + +## Classification Methods + +### Rule-Based (`-m rules`) +YAML-configurable threshold patterns evaluated sequentially: first match determines classification. Requires no training, no torch dependency. Best for production use with known, stable patterns. Patterns defined in `deeplogbot/config.yaml` under `classification.rule_based`. + +### Deep Architecture (`-m deep`) +Multi-stage learned pipeline (all locations pass through the full pipeline, no hard pre-filter lockout): + +1. **Seed Selection** (`seed_selection.py`): Identify high-confidence bot/organic/hub training seeds from feature distributions using conservative heuristic criteria. +2. **Organic VAE** (`organic_vae.py`): Variational Autoencoder trained on organic seeds learns the normal-behavior manifold. Reconstruction error scores how "non-organic" each location is. Includes a Deep Isolation Forest on the VAE latent space. +3. **Temporal Consistency** (`temporal_consistency.py`): Modified z-score spike detection across yearly download patterns. No fixed thresholds. +4. **Fusion Meta-Learner** (`fusion.py`): Gradient-boosted classifier (sklearn) combining all anomaly signals into final bot/hub/organic probabilities. +5. **Soft Priors** (`deep_architecture.py:compute_soft_priors`): Pre-filter signals encoded as continuous features fed into the pipeline, not hard binary decisions. +6. **Reconciliation** (`deep_architecture.py:reconcile_prefilter_and_pipeline`): Override thresholds resolve disagreements between pre-filter and pipeline output (override=0.7, strict=0.8). +7. **Hub Protection** (`post_classification.py`): Structural override prevents legitimate automation from being misclassified as bots. 
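Step 3's spike detection can be sketched with the standard modified z-score, 0.6745 · (x − median) / MAD, which is far more robust than the ordinary z-score on short yearly series. This is an illustrative reimplementation, not the project's code, and the 3.5 cutoff shown is the conventional textbook value — the actual pipeline derives its thresholds from the data rather than fixing them:

```python
import numpy as np

def modified_zscores(counts):
    """Modified z-score (Iglewicz & Hoaglin): 0.6745 * (x - median) / MAD."""
    x = np.asarray(counts, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        return np.zeros_like(x)  # degenerate series: no dispersion, no spikes
    return 0.6745 * (x - med) / mad

# Yearly download counts for one location, with one suspicious spike.
yearly = [1200, 1350, 1280, 95000, 1310, 1400]
z = modified_zscores(yearly)
spikes = np.abs(z) > 3.5  # illustrative cutoff only
print(spikes.tolist())  # [False, False, False, True, False, False]
```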
+ +## Feature Engineering + +~117 behavioral and discriminative features extracted per location, organized into categories: + +- **Activity features**: download counts, unique users, downloads per user, unique projects, active hours, years active +- **Temporal features**: hourly/yearly entropy, working hours ratio, night activity ratio, circadian rhythm deviation, year-over-year CV, spike ratio +- **Behavioral features**: burst patterns, user coordination scores, session regularity, concurrent user patterns, weekend/weekday imbalance +- **Discriminative features**: file exploration patterns, user authenticity scores, bot composite score, request velocity, access regularity, Benford deviation +- **Time series features**: weekly autocorrelation, periodicity strength, trend slope/acceleration, momentum score, detrended volatility +- **Country-level features**: locations per country, suspicious location ratio, new location ratio + +Features are extracted by the provider system (`deeplogbot/features/providers/ebi/`), which aggregates individual download events into location-level profiles using DuckDB. + +## Input Format + +Parquet file with one row per download event. Required columns: + +| Column | Type | Description | +|--------|------|-------------| +| `accession` | string | Dataset accession (e.g., PXD000001) | +| `geo_location` | string | Geographic coordinates or location identifier | +| `country` | string | Country name | +| `year` | int | Download year | +| `date` | string/date | Download date | + +Optional columns that improve classification: `filename`, `download_protocol`, `user_id` (anonymized hash), `hour`, `city`. 
+ +## Output Format + +### Annotated Parquet +Original download records with appended classification columns: + +| Column | Type | Description | +|--------|------|-------------| +| `is_bot` | bool | True if location classified as bot | +| `is_hub` | bool | True if location classified as legitimate automation | +| `is_organic` | bool | True if location classified as organic user | +| `classification_confidence` | float | Confidence score (0-1) | + +### Reports +- `bot_detection_report.txt`: Summary with counts, breakdowns, top locations per category +- `location_analysis.csv`: Per-location features and classifications (~167 columns) +- `report.html`: Interactive HTML report with charts +- `plots/`: 10 PNG visualizations (geographic distribution, temporal patterns, feature distributions, anomaly analysis, etc.) + +### Output Strategies +- `--output-strategy new_file` (default): Creates `_annotated.parquet` +- `--output-strategy overwrite`: Rewrites the original parquet in place +- `--reports-only`: Only generates reports, no parquet annotation + +## Project Structure + +``` +deeplogbot/ +├── main.py # CLI entry point, orchestrates the pipeline +├── config.py # Configuration loading (config.yaml + taxonomy) +├── config.yaml # All classification rules, thresholds, taxonomy +├── features/ +│ ├── base.py # Abstract base feature extractor +│ ├── schema.py # Log schema definitions +│ ├── registry.py # Feature documentation registry +│ └── providers/ebi/ # EBI/PRIDE-specific feature extraction +│ ├── ebi.py # Main location feature aggregation (DuckDB) +│ ├── behavioral.py # Burst, coordination, circadian features +│ ├── discriminative.py # Bot scores, authenticity, file patterns +│ ├── timeseries.py # Autocorrelation, periodicity, trend features +│ └── schema.py # EBI log schema +├── models/ +│ ├── isoforest/models.py # Isolation Forest training (sklearn) +│ └── classification/ +│ ├── rules.py # Rule-based hierarchical classifier +│ ├── deep_architecture.py # Deep pipeline 
orchestration, soft priors, reconciliation +│ ├── seed_selection.py # High-confidence seed identification for training +│ ├── organic_vae.py # VAE + Deep Isolation Forest (torch) +│ ├── temporal_consistency.py # Modified z-score spike detection +│ ├── fusion.py # Gradient-boosted meta-learner (sklearn) +│ ├── post_classification.py # Hub protection, logging, label finalization +│ └── feature_validation.py # Feature importance validation +├── reports/ +│ ├── annotation.py # DuckDB-based parquet annotation +│ ├── reporting.py # Text report generation +│ ├── statistics.py # Summary statistics computation +│ ├── html_report.py # Interactive HTML report +│ └── visualizations.py # matplotlib chart generation +├── providers/ +│ └── base_taxonomy.yaml # Classification taxonomy definition +└── utils/geography.py # Country/region geographic lookups +``` + +## Configuration + +All classification parameters are in `deeplogbot/config.yaml`: + +- `classification.rule_based`: Patterns for the rules method (bots, hubs, independent users) +- `classification.hub_protection`: Rules that prevent hubs from being classified as bots +- `classification.bot_detection`: Threshold-based bot detection criteria +- `classification.download_hub`: Hub identification thresholds +- `classification.stratified_prefiltering`: Pre-filter thresholds for the deep method +- `classification.taxonomy`: Two-stage hierarchy (behavior_type -> automation_category) +- `deep_reconciliation`: Override and strict thresholds for deep method reconciliation +- `isolation_forest`: Contamination rate, number of estimators + +## Provider System + +DeepLogBot uses a provider abstraction to support different log formats. Currently implemented: + +- `ebi`: PRIDE/EBI download logs (default) + +Custom providers can be added by implementing the feature extraction interface in `deeplogbot/features/providers/`. Each provider defines its own log schema, feature extraction logic, and classification rules. 
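As a hypothetical illustration of the shape a rule-based pattern might take — the field names below are invented, not the shipped schema; consult `deeplogbot/config.yaml` for the real keys:

```yaml
classification:
  rule_based:
    bots:
      # Illustrative only: flag locations with extreme per-user volume
      # and near-constant round-the-clock activity.
      - name: high_velocity_scraper
        min_downloads_per_user: 10000
        max_hourly_entropy: 0.5
    hubs:
      # Illustrative only: long-lived, broad-coverage automation such as
      # an institutional mirror.
      - name: institutional_mirror
        min_years_active: 3
        min_unique_projects: 500
```

Because rules are evaluated sequentially with first-match-wins semantics, ordering within each list matters: put the most specific patterns first.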
+ +## Dependencies + +Core (always required): pandas, numpy, scikit-learn, scipy, duckdb, pyyaml +Optional (deep method): torch >= 2.1.0 +Development: pytest, ruff + +## Python API + +```python +from deeplogbot import run_bot_annotator + +results = run_bot_annotator( + input_parquet='downloads.parquet', + output_dir='output/', + classification_method='deep', # or 'rules' + sample_size=1000000, # optional sampling + contamination=0.15, # Isolation Forest parameter +) + +# results contains: bot_count, hub_count, organic_count, total_locations +``` + +## CLI Options + +| Option | Description | Default | +|--------|-------------|---------| +| `-i, --input` | Input parquet file | Required | +| `-o, --output-dir` | Output directory for reports | `output/bot_analysis` | +| `-m, --classification-method` | `rules` or `deep` | `rules` | +| `-c, --contamination` | Expected anomaly proportion | `0.15` | +| `-s, --sample-size` | Sample N records from input | None (use all) | +| `-p, --provider` | Log provider configuration | `ebi` | +| `--output-strategy` | `new_file`, `reports_only`, or `overwrite` | `new_file` | +| `--reports-only` | Only generate reports (no parquet) | False | +| `--compute-importances` | Compute feature importances | False | + +## License + +MIT