6 changes: 6 additions & 0 deletions README.md
@@ -1,5 +1,11 @@
# DeepLogBot

[![PyPI version](https://img.shields.io/pypi/v/deeplogbot)](https://pypi.org/project/deeplogbot/)
[![Python](https://img.shields.io/pypi/pyversions/deeplogbot)](https://pypi.org/project/deeplogbot/)
[![Tests](https://github.com/ypriverol/deeplogbot/actions/workflows/tests.yml/badge.svg)](https://github.com/ypriverol/deeplogbot/actions/workflows/tests.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![llms.txt](https://img.shields.io/badge/llms.txt-available-blue)](https://github.com/ypriverol/deeplogbot/blob/main/llms.txt)

Bot detection and traffic classification for scientific data repository logs.

## Overview
Expand Down
216 changes: 216 additions & 0 deletions llms.txt
@@ -0,0 +1,216 @@
# DeepLogBot

> Bot detection and traffic classification for scientific data repository download logs.

DeepLogBot is an open-source framework for detecting and removing automated bot traffic from download logs of scientific data repositories. It classifies each geographic location into three categories: **bot** (scrapers, crawlers, coordinated bot farms), **hub** (legitimate automation such as institutional mirrors, CI/CD pipelines, reanalysis centers), and **organic** (human researchers). The tool was developed for the PRIDE proteomics archive but is designed to be applicable to any scientific data repository (ENA/SRA, PDB, MetaboLights, etc.).

## Links

- Source code: https://github.com/ypriverol/deeplogbot
- PyPI package: https://pypi.org/project/deeplogbot/
- PRIDE Archive: https://www.ebi.ac.uk/pride/
- Log processing pipeline: https://github.com/PRIDE-Archive/nf-downloadstats
- Paper: Perez-Riverol et al., "Tracking Dataset Reuse in Proteomics" (2025)

## Quick Start

```bash
pip install deeplogbot # rules method (no torch required)
pip install "deeplogbot[deep]" # deep method (includes torch)

# Rule-based classification (fast, interpretable)
deeplogbot -i downloads.parquet -o output/ -m rules

# Deep architecture (best accuracy, F1=0.775)
deeplogbot -i downloads.parquet -o output/ -m deep

# Sample large datasets for faster processing
deeplogbot -i downloads.parquet -o output/ -m deep --sample-size 1000000
```

## Key Results on PRIDE Archive

- Dataset: 159.3 million download records (2020-2025), 4.7 GB Parquet file
- 71,133 unique geographic locations from 235 countries
- Classification (deep method on full dataset):
- Bot: 37,779 locations (53.1%), accounting for 88.0% of all downloads (140.2M)
- Hub: 664 locations (0.9%), accounting for 11.3% of downloads (18.0M) across 58 countries
- Organic: 32,690 locations (46.0%), accounting for 0.7% of downloads (1.1M)
- After bot removal: 19.1M clean downloads across 34,085 datasets and 213 countries
- Top countries by clean downloads: US (5.1M, 26.8%), UK (4.5M, 23.6%), Germany (4.3M, 22.5%)
- Dataset reuse concentration: Gini = 0.84, top 1% of datasets = 43.3% of downloads
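
The reuse-concentration figure above is a plain Gini coefficient over per-dataset download counts. A minimal sketch of that computation (the standard closed form over sorted values, not DeepLogBot code; the toy counts are invented):

```python
def gini(values):
    """Gini coefficient of a non-negative distribution: 0 = perfectly equal, ~1 = maximally concentrated."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    # closed form over sorted values: sum_i (2i - n - 1) * x_i / (n * sum(x))
    return sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(xs)) / (n * total)

downloads = [1, 1, 1, 1, 96]  # toy: one dataset dominates the tail
concentrated = gini(downloads)       # 0.76
even = gini([5, 5, 5, 5])            # 0.0, perfectly even reuse
```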

## Benchmark Results (1M-record sample, 1,411 ground truth locations)

| Method | Macro F1 | Bot Precision | Bot Recall | Hub F1 | Organic F1 |
|--------|----------|---------------|------------|--------|------------|
| Rules | 0.632 | 0.506 | 1.000 | 0.275 | 0.621 |
| Deep | 0.775 | 0.448 | 1.000 | 0.718 | 0.608 |

The deep method's main advantage is hub detection (F1=0.718 vs 0.275), distinguishing legitimate automation from harmful scraping.

## Classification Methods

### Rule-Based (`-m rules`)
YAML-configurable threshold patterns evaluated sequentially: first match determines classification. Requires no training, no torch dependency. Best for production use with known, stable patterns. Patterns defined in `deeplogbot/config.yaml` under `classification.rule_based`.
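
Illustrative shape of such a rule file (the key names and thresholds below are invented for illustration; consult the shipped `config.yaml` for the real schema):

```yaml
# hypothetical fragment -- not the actual shipped configuration
classification:
  rule_based:
    bots:
      - name: high_volume_single_user   # invented rule
        min_downloads: 100000
        max_unique_users: 1
    hubs:
      - name: institutional_mirror      # invented rule
        min_years_active: 3
        min_working_hours_ratio: 0.6
```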

### Deep Architecture (`-m deep`)
Multi-stage learned pipeline (all locations pass through the full pipeline, no hard pre-filter lockout):

1. **Seed Selection** (`seed_selection.py`): Identify high-confidence bot/organic/hub training seeds from feature distributions using conservative heuristic criteria.
2. **Organic VAE** (`organic_vae.py`): Variational Autoencoder trained on organic seeds learns the normal-behavior manifold. Reconstruction error scores how "non-organic" each location is. Includes a Deep Isolation Forest on the VAE latent space.
3. **Temporal Consistency** (`temporal_consistency.py`): Modified z-score spike detection across yearly download patterns. No fixed thresholds.
4. **Fusion Meta-Learner** (`fusion.py`): Gradient-boosted classifier (sklearn) combining all anomaly signals into final bot/hub/organic probabilities.
5. **Soft Priors** (`deep_architecture.py:compute_soft_priors`): Pre-filter signals encoded as continuous features fed into the pipeline, not hard binary decisions.
6. **Reconciliation** (`deep_architecture.py:reconcile_prefilter_and_pipeline`): Override thresholds resolve disagreements between pre-filter and pipeline output (override=0.7, strict=0.8).
7. **Hub Protection** (`post_classification.py`): Structural override prevents legitimate automation from being misclassified as bots.
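
Step 3 above can be sketched with the standard modified z-score (Iglewicz-Hoaglin: 0.6745 x deviation / MAD); this is an illustrative reimplementation, not the code in `temporal_consistency.py`:

```python
import statistics

def modified_z_scores(counts):
    """Robust outlier scores: median/MAD instead of mean/stddev, so one spike can't mask itself."""
    med = statistics.median(counts)
    mad = statistics.median([abs(c - med) for c in counts])
    if mad == 0:
        return [0.0] * len(counts)
    return [0.6745 * (c - med) / mad for c in counts]

# toy yearly download counts for one location; 2023 is a coordinated spike
yearly = {2020: 1200, 2021: 1350, 2022: 1100, 2023: 980_000, 2024: 1280}
scores = modified_z_scores(list(yearly.values()))
spikes = [y for y, z in zip(yearly, scores) if abs(z) > 3.5]  # 3.5 is a common cutoff
```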

## Feature Engineering

~117 behavioral and discriminative features extracted per location, organized into categories:

- **Activity features**: download counts, unique users, downloads per user, unique projects, active hours, years active
- **Temporal features**: hourly/yearly entropy, working hours ratio, night activity ratio, circadian rhythm deviation, year-over-year CV, spike ratio
- **Behavioral features**: burst patterns, user coordination scores, session regularity, concurrent user patterns, weekend/weekday imbalance
- **Discriminative features**: file exploration patterns, user authenticity scores, bot composite score, request velocity, access regularity, Benford deviation
- **Time series features**: weekly autocorrelation, periodicity strength, trend slope/acceleration, momentum score, detrended volatility
- **Country-level features**: locations per country, suspicious location ratio, new location ratio

Features are extracted by the provider system (`deeplogbot/features/providers/ebi/`), which aggregates individual download events into location-level profiles using DuckDB.
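
A toy version of that aggregation step (pure Python standing in for DuckDB, with invented events) to show the event-to-profile shape:

```python
import math
from collections import Counter, defaultdict

# toy download events: (geo_location, user_id, hour) -- values invented
events = [
    ("52.5,13.4", "u1", 3), ("52.5,13.4", "u1", 3), ("52.5,13.4", "u2", 3),
    ("40.7,-74.0", "u3", 9), ("40.7,-74.0", "u4", 14), ("40.7,-74.0", "u5", 20),
]

by_loc = defaultdict(list)
for loc, user, hour in events:
    by_loc[loc].append((user, hour))

profiles = {}
for loc, rows in by_loc.items():
    n = len(rows)
    users = {u for u, _ in rows}
    hour_counts = Counter(h for _, h in rows)
    # Shannon entropy of the hourly histogram: near 0 for bursty, bot-like traffic
    hourly_entropy = -sum((c / n) * math.log2(c / n) for c in hour_counts.values())
    profiles[loc] = {
        "downloads": n,
        "unique_users": len(users),
        "downloads_per_user": n / len(users),
        "hourly_entropy": round(hourly_entropy, 3),
    }
```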

## Input Format

Parquet file with one row per download event. Required columns:

| Column | Type | Description |
|--------|------|-------------|
| `accession` | string | Dataset accession (e.g., PXD000001) |
| `geo_location` | string | Geographic coordinates or location identifier |
| `country` | string | Country name |
| `year` | int | Download year |
| `date` | string/date | Download date |

Optional columns that improve classification: `filename`, `download_protocol`, `user_id` (anonymized hash), `hour`, `city`.
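
A hypothetical pre-flight check against this schema (not part of the package; the column sets are copied from the tables above):

```python
REQUIRED = {"accession", "geo_location", "country", "year", "date"}
OPTIONAL = {"filename", "download_protocol", "user_id", "hour", "city"}

def check_schema(columns):
    """Return (missing required columns, unrecognized columns) for an input file."""
    cols = set(columns)
    return sorted(REQUIRED - cols), sorted(cols - REQUIRED - OPTIONAL)

missing, unknown = check_schema(
    ["accession", "geo_location", "country", "year", "date", "user_id", "ip"]
)
# missing == [], unknown == ["ip"]
```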

## Output Format

### Annotated Parquet
Original download records with appended classification columns:

| Column | Type | Description |
|--------|------|-------------|
| `is_bot` | bool | True if location classified as bot |
| `is_hub` | bool | True if location classified as legitimate automation |
| `is_organic` | bool | True if location classified as organic user |
| `classification_confidence` | float | Confidence score (0-1) |
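
Downstream filtering then amounts to a row filter on these flags. A sketch over toy annotated rows (plain dicts standing in for the parquet; values invented):

```python
# toy annotated records; in practice these come from <input>_annotated.parquet
records = [
    {"accession": "PXD000001", "is_bot": True,  "is_hub": False, "is_organic": False, "classification_confidence": 0.97},
    {"accession": "PXD000001", "is_bot": False, "is_hub": True,  "is_organic": False, "classification_confidence": 0.81},
    {"accession": "PXD000002", "is_bot": False, "is_hub": False, "is_organic": True,  "classification_confidence": 0.64},
]

# "clean" traffic = everything not flagged as bot; whether hubs count as reuse
# depends on the analysis, so keep the flags around rather than dropping rows early
clean = [r for r in records if not r["is_bot"]]
organic_only = [r for r in clean if r["is_organic"]]
```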

### Reports
- `bot_detection_report.txt`: Summary with counts, breakdowns, top locations per category
- `location_analysis.csv`: Per-location features and classifications (~167 columns)
- `report.html`: Interactive HTML report with charts
- `plots/`: 10 PNG visualizations (geographic distribution, temporal patterns, feature distributions, anomaly analysis, etc.)

### Output Strategies
- `--output-strategy new_file` (default): Creates `<input>_annotated.parquet`
- `--output-strategy overwrite`: Rewrites the original parquet in place
- `--reports-only`: Only generates reports, no parquet annotation

## Project Structure

```
deeplogbot/
├── main.py                     # CLI entry point, orchestrates the pipeline
├── config.py                   # Configuration loading (config.yaml + taxonomy)
├── config.yaml                 # All classification rules, thresholds, taxonomy
├── features/
│   ├── base.py                 # Abstract base feature extractor
│   ├── schema.py               # Log schema definitions
│   ├── registry.py             # Feature documentation registry
│   └── providers/ebi/          # EBI/PRIDE-specific feature extraction
│       ├── ebi.py              # Main location feature aggregation (DuckDB)
│       ├── behavioral.py       # Burst, coordination, circadian features
│       ├── discriminative.py   # Bot scores, authenticity, file patterns
│       ├── timeseries.py       # Autocorrelation, periodicity, trend features
│       └── schema.py           # EBI log schema
├── models/
│   ├── isoforest/models.py     # Isolation Forest training (sklearn)
│   └── classification/
│       ├── rules.py                # Rule-based hierarchical classifier
│       ├── deep_architecture.py    # Deep pipeline orchestration, soft priors, reconciliation
│       ├── seed_selection.py       # High-confidence seed identification for training
│       ├── organic_vae.py          # VAE + Deep Isolation Forest (torch)
│       ├── temporal_consistency.py # Modified z-score spike detection
│       ├── fusion.py               # Gradient-boosted meta-learner (sklearn)
│       ├── post_classification.py  # Hub protection, logging, label finalization
│       └── feature_validation.py   # Feature importance validation
├── reports/
│   ├── annotation.py           # DuckDB-based parquet annotation
│   ├── reporting.py            # Text report generation
│   ├── statistics.py           # Summary statistics computation
│   ├── html_report.py          # Interactive HTML report
│   └── visualizations.py       # matplotlib chart generation
├── providers/
│   └── base_taxonomy.yaml      # Classification taxonomy definition
└── utils/geography.py          # Country/region geographic lookups
```

## Configuration

All classification parameters are in `deeplogbot/config.yaml`:

- `classification.rule_based`: Patterns for the rules method (bots, hubs, independent users)
- `classification.hub_protection`: Rules that prevent hubs from being classified as bots
- `classification.bot_detection`: Threshold-based bot detection criteria
- `classification.download_hub`: Hub identification thresholds
- `classification.stratified_prefiltering`: Pre-filter thresholds for the deep method
- `classification.taxonomy`: Two-stage hierarchy (behavior_type -> automation_category)
- `deep_reconciliation`: Override and strict thresholds for deep method reconciliation
- `isolation_forest`: Contamination rate, number of estimators
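
Illustrative fragment for the last two entries (top-level keys and the 0.7/0.8/0.15 values come from this document; the nested key names and `n_estimators` value are assumptions):

```yaml
# hypothetical fragment -- nested key names are guesses, not the shipped file
deep_reconciliation:
  override_threshold: 0.7   # pipeline overrides pre-filter above this confidence
  strict_threshold: 0.8
isolation_forest:
  contamination: 0.15       # matches the CLI default (-c)
  n_estimators: 100         # invented value
```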

## Provider System

DeepLogBot uses a provider abstraction to support different log formats. Currently implemented:

- `ebi`: PRIDE/EBI download logs (default)

Custom providers can be added by implementing the feature extraction interface in `deeplogbot/features/providers/`. Each provider defines its own log schema, feature extraction logic, and classification rules.
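
A hypothetical minimal provider, to show the shape of that contract (the real interface lives in `deeplogbot/features/base.py` and differs in detail; class and method names here are assumptions):

```python
from abc import ABC, abstractmethod
from collections import defaultdict

class FeatureProvider(ABC):
    """Sketch of the provider contract; method name and signature are assumptions."""

    @abstractmethod
    def extract_location_features(self, events):
        """Aggregate per-event rows into one feature dict per location."""

class ToyProvider(FeatureProvider):
    def extract_location_features(self, events):
        counts = defaultdict(int)
        for ev in events:
            counts[ev["geo_location"]] += 1
        return {loc: {"downloads": n} for loc, n in counts.items()}

features = ToyProvider().extract_location_features(
    [{"geo_location": "A"}, {"geo_location": "A"}, {"geo_location": "B"}]
)
```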

## Dependencies

- Core (always required): pandas, numpy, scikit-learn, scipy, duckdb, pyyaml
- Optional (deep method): torch >= 2.1.0
- Development: pytest, ruff

## Python API

```python
from deeplogbot import run_bot_annotator

results = run_bot_annotator(
    input_parquet='downloads.parquet',
    output_dir='output/',
    classification_method='deep',  # or 'rules'
    sample_size=1000000,           # optional sampling
    contamination=0.15,            # Isolation Forest parameter
)

# results contains: bot_count, hub_count, organic_count, total_locations
```

## CLI Options

| Option | Description | Default |
|--------|-------------|---------|
| `-i, --input` | Input parquet file | Required |
| `-o, --output-dir` | Output directory for reports | `output/bot_analysis` |
| `-m, --classification-method` | `rules` or `deep` | `rules` |
| `-c, --contamination` | Expected anomaly proportion | `0.15` |
| `-s, --sample-size` | Sample N records from input | None (use all) |
| `-p, --provider` | Log provider configuration | `ebi` |
| `--output-strategy` | `new_file`, `reports_only`, or `overwrite` | `new_file` |
| `--reports-only` | Only generate reports (no parquet) | False |
| `--compute-importances` | Compute feature importances | False |

## License

MIT