Status: Phase 1 IMPLEMENTED ✅
Architecture: Async Pipeline with Military-Grade Resilience
Target: Autonomous acquisition of vanguard global knowledge
- Vanguard Filtering: Retrieve high-level knowledge published in the last 24 months
- Global Scope: Access sources in English, Russian, and Mandarin Chinese
- Resilient Acquisition: Circuit breakers, exponential backoff, stealth operations
- Full Auditability: SQLite audit log with complete traceability
```
PROMETHEUS v3.0
│
├── Module A: Watchtower (Signal Detection) [TODO: Phase 2]
│   ├── Time Filter (<24 months)
│   ├── Language Filter (EN/RU/ZH, confidence>0.8)
│   └── Async Queue
│
├── Module B: Hunter (Retrieval Engine) ✅ IMPLEMENTED
│   ├── Source Prioritization (Open Access → Fallback)
│   ├── Circuit Breaker (auto-disable failing sources)
│   ├── Exponential Backoff (2s → 4s → 8s → 16s)
│   └── Stealth Ops (jitter, UA rotation)
│
├── Module C: Alchemist (Content Processor) [TODO: Phase 2]
│   ├── OCR Fallback (tesseract/nougat)
│   ├── Metadata Enrichment (Crossref API)
│   └── Audit Manifest Generation
│
└── Module D: Gatekeeper (Sandbox & Validation) [TODO: Phase 3]
    ├── Sandbox Validation (MIME + Hash)
    ├── Quotas (50/day)
    └── Sync to Reference/
```
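The Hunter's backoff schedule (2s → 4s → 8s → 16s) combined with jitter can be sketched as a simple retry wrapper. This is an illustrative sketch, not the actual Hunter API: `fetch_with_backoff` and its parameters are made-up names.

```python
import asyncio
import random

async def fetch_with_backoff(fetch, retries=4, base_delay=2.0):
    """Retry an async callable with exponential backoff plus jitter.

    With base_delay=2.0 and retries=4, the sleep schedule between
    attempts is 2s -> 4s -> 8s -> 16s (each plus up to 1s of jitter).
    """
    for attempt in range(retries + 1):
        try:
            return await fetch()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: propagate the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1.0)
            await asyncio.sleep(delay)
```

The jitter term spreads requests out in time so repeated retries from multiple workers do not hit a source in lockstep.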
```bash
# Git Bash (MANDATORY per LEY 9)
cd d:/01_Capital_Workstation/02_Code/05_Deep_Researcher_Prometheus

python -m venv .venv
source .venv/Scripts/activate  # Git Bash on Windows

# Using pip-tools (LEY 10: Immutability)
pip install pip-tools
pip-compile requirements.in --output-file=requirements.txt --generate-hashes
pip-sync requirements.txt

# OR using pyproject.toml
pip install -e .

playwright install chromium

# Copy template
cp .env.template .env

# Edit with your settings
nano .env  # or code .env
```

CRITICAL: Set `KNOWLEDGE_BASE_ROOT`, `HUNTING_LOG_DB`, and adjust quotas.
Single download:

```python
import asyncio

from prometheus import (
    setup_logging,
    db_manager,
    hunter,
    DownloadTarget,
    SourceType
)

async def main():
    # Setup
    setup_logging()
    await db_manager.initialize()

    # Create target
    target = DownloadTarget(
        url="https://arxiv.org/pdf/2401.12345.pdf",
        doi="10.48550/arXiv.2401.12345",
        title="Example Paper on Quantum ML",
        source_type=SourceType.ARXIV
    )

    # Hunt!
    file_path = await hunter.hunt(target)
    if file_path:
        print(f"✅ Downloaded: {file_path}")
    else:
        print("❌ Download failed")

    # Cleanup
    await db_manager.close()

if __name__ == "__main__":
    asyncio.run(main())
```

Batch download:

```python
import asyncio

from prometheus import setup_logging, db_manager, hunter, DownloadTarget

async def batch_download(targets: list[DownloadTarget]):
    setup_logging()
    await db_manager.initialize()

    tasks = [hunter.hunt(target) for target in targets]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    for target, result in zip(targets, results):
        if isinstance(result, Exception):
            print(f"❌ {target.url}: {result}")
        elif result:
            print(f"✅ {target.url}: {result}")
        else:
            print(f"⚠️ {target.url}: Failed")

    await db_manager.close()

# Usage
targets = [
    DownloadTarget(url="https://arxiv.org/pdf/2401.00001.pdf"),
    DownloadTarget(url="https://arxiv.org/pdf/2401.00002.pdf"),
    # ... more targets
]
asyncio.run(batch_download(targets))
```

Database statistics:

```python
import asyncio

from prometheus import setup_logging, db_manager

async def show_stats():
    setup_logging()
    await db_manager.initialize()

    stats = await db_manager.get_stats()

    print("📊 PROMETHEUS Stats:")
    print(f"  Total Downloads: {stats['total_downloads']}")
    print(f"  Success Rate: {stats['success_rate']}%")
    print(f"  Today Quota: {stats['today_quota_used']}/{stats['today_quota_limit']}")
    print("\n  Status Breakdown:")
    for status, count in stats['status_breakdown'].items():
        print(f"    {status}: {count}")

    await db_manager.close()

asyncio.run(show_stats())
```

| Variable | Default | Description |
|---|---|---|
| `PROMETHEUS_ENV` | `development` | `development` \| `production` |
| `PROMETHEUS_LOG_LEVEL` | `INFO` | `DEBUG` \| `INFO` \| `WARNING` \| `ERROR` |
| `MAX_DOWNLOADS_PER_DAY` | `50` | Daily download quota |
| `MAX_CONCURRENT_DOWNLOADS` | `5` | Parallel downloads |
| `MAX_RETRIES` | `3` | Retry attempts per source |
| `CIRCUIT_BREAKER_FAILURE_THRESHOLD` | `3` | Failures before circuit opens |
| `CIRCUIT_BREAKER_TIMEOUT_SECONDS` | `86400` | Time before circuit reset (24h) |
| `MIN_DELAY_SECONDS` | `2.0` | Minimum jitter delay |
| `MAX_DELAY_SECONDS` | `8.0` | Maximum jitter delay |
| `SCIHUB_ENABLED` | `false` | Enable Sci-Hub fallback |
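For reference, a filled-in `.env` might look like the sketch below. The values mirror the defaults above; the two path variables are placeholders that you must point at your own workstation layout.

```ini
PROMETHEUS_ENV=development
PROMETHEUS_LOG_LEVEL=INFO

# Placeholder paths - adjust to your setup
KNOWLEDGE_BASE_ROOT=/d/path/to/knowledge_base
HUNTING_LOG_DB=01_Data/03_Processed/hunting_log.sqlite

MAX_DOWNLOADS_PER_DAY=50
MAX_CONCURRENT_DOWNLOADS=5
MAX_RETRIES=3
CIRCUIT_BREAKER_FAILURE_THRESHOLD=3
CIRCUIT_BREAKER_TIMEOUT_SECONDS=86400
MIN_DELAY_SECONDS=2.0
MAX_DELAY_SECONDS=8.0
SCIHUB_ENABLED=false
```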
```
05_Deep_Researcher_Prometheus/
├── 01_Data/
│   ├── 01_Raw/              # (unused for now)
│   ├── 02_Intermediate/     # (unused for now)
│   └── 03_Processed/
│       └── hunting_log.sqlite   # Audit database
│
├── 02_Scripts/              # Operational scripts (future)
├── 03_Output/               # Reports, manifests
├── 04_Tests/                # Unit tests
│
├── prometheus/              # Core package
│   ├── __init__.py
│   ├── config.py            # Configuration management
│   ├── logging_config.py    # Structured logging
│   ├── models.py            # SQLAlchemy models
│   ├── database.py          # Async DB manager
│   └── hunter.py            # Module B: Download engine
│
├── pyproject.toml           # Project metadata
├── requirements.in          # Dependency declarations
├── .env.template            # Configuration template
├── .gitignore
└── README.md                # This file
```
- ✅ Uses `requirements.in` → `requirements.txt` with hashes
- ✅ Virtual environment isolation (`.venv`)
- ✅ No global installations
- ✅ No files in workspace root
- ✅ Fractal structure (`01_Data/`, `02_Scripts/`, etc.)
- ✅ `.gitignore` for data/logs
- ✅ All commands documented for Git Bash
- ✅ POSIX-compatible paths
- ✅ Type hints everywhere
- ✅ Docstrings (Google style)
- ✅ Structured logging (structlog)
- ✅ Full error handling
Audit trail of all download attempts.

| Field | Type | Description |
|---|---|---|
| `id` | INTEGER | Primary key |
| `doi` | TEXT | DOI identifier |
| `url` | TEXT | Source URL |
| `status` | ENUM | `QUEUED` \| `IN_PROGRESS` \| `SUCCESS` \| `FAILED` \| `CIRCUIT_OPEN` |
| `source_type` | ENUM | `ARXIV` \| `UNPAYWALL` \| `CORE` \| `SCIHUB` \| etc. |
| `file_hash` | TEXT | SHA256 hash |
| `attempts` | INTEGER | Retry count |
| `created_at` | DATETIME | UTC timestamp |
| `completed_at` | DATETIME | UTC timestamp |
Circuit breaker state per source.

| Field | Type | Description |
|---|---|---|
| `source_name` | TEXT | Source identifier |
| `is_open` | BOOLEAN | Circuit breaker state |
| `consecutive_failures` | INTEGER | Failure streak |
| `last_failure_at` | DATETIME | Last failure timestamp |
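The open/close logic behind these fields can be illustrated with an in-memory sketch. The real breaker in `hunter.py` persists its state to the table above; the class and method names here are illustrative.

```python
import time

class CircuitBreaker:
    """Sketch of per-source circuit-breaker state (defaults match .env)."""

    def __init__(self, failure_threshold=3, timeout_seconds=86400):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.consecutive_failures = 0
        self.last_failure_at = None

    @property
    def is_open(self):
        if self.consecutive_failures < self.failure_threshold:
            return False
        # After the timeout (24h by default), allow a fresh probe attempt
        return (time.time() - self.last_failure_at) < self.timeout_seconds

    def record_failure(self):
        self.consecutive_failures += 1
        self.last_failure_at = time.time()

    def record_success(self):
        # Any success fully resets the breaker
        self.consecutive_failures = 0
        self.last_failure_at = None
```

Three consecutive failures open the circuit, which auto-disables that source until the timeout elapses or a success is recorded.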
Daily download quotas.

| Field | Type | Description |
|---|---|---|
| `date` | DATE | Day |
| `downloads_count` | INTEGER | Downloads performed |
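Since the quota resets at 00:00 UTC, rows in this table are keyed on the UTC date. A minimal sketch of the check (helper names are illustrative, not the `db_manager` API):

```python
from datetime import datetime, timezone

def utc_today() -> str:
    # Key quota rows on the UTC date so the counter resets at 00:00 UTC
    return datetime.now(timezone.utc).date().isoformat()

def quota_allows(downloads_count: int, limit: int = 50) -> bool:
    """True if another download still fits within today's quota."""
    return downloads_count < limit
```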
```bash
# Run tests
pytest 04_Tests/ -v

# With coverage
pytest 04_Tests/ --cov=prometheus --cov-report=html
```

- Async engine with asyncio
- SQLite audit log with SQLAlchemy
- Hunter module with circuit breaker
- Stealth session (UA rotation, jitter)
- Database manager with quotas
- Module A: Watchtower (signal detection)
  - Time filter (<24 months)
  - Language detection (lingua-py)
  - Async queue management
- Module C: Alchemist (content processor)
  - OCR fallback (tesseract/nougat)
  - Metadata enrichment (Crossref API)
  - YAML frontmatter generation
- Module D: Gatekeeper (sandbox validation)
- Launcher scripts (Bash/PowerShell)
- Systemd service
- Integration with `98_Maintenance/update_brain.bat`
- Pre-commit hooks
- Full test suite
Solution: Always call `await db_manager.initialize()` before using the database.

```python
await db_manager.initialize()
```

Solution: Check the `circuit_breaker_state` table for failing sources:

```python
state = await db_manager.get_circuit_state("arxiv")
print(state.consecutive_failures, state.is_open)
```

To manually reset:

```python
await db_manager.record_circuit_success("arxiv")
```

Solution: Quota resets daily at 00:00 UTC. To override, increase `MAX_DOWNLOADS_PER_DAY` in `.env`.
Proprietary - Capital Workstation Internal Use Only
Project Lead: Capital Workstation
Codename: PROMETHEUS v3.0
Mission: Knowledge Singularity through Resilient Automation
"From the shadows of ignorance, we bring the fire of knowledge." 🔥