
🔥 PROMETHEUS v3.0 - Deep Knowledge Singularity System

Status: Phase 1 IMPLEMENTED ✅
Architecture: Async Pipeline with Military-Grade Resilience
Target: Autonomous acquisition of vanguard global knowledge


🎯 Strategic Objectives

  1. Vanguard Filtering: Retrieve high-level knowledge published in the last 24 months
  2. Global Scope: Access sources in English, Russian, and Mandarin Chinese
  3. Resilient Acquisition: Circuit breakers, exponential backoff, stealth operations
  4. Full Auditability: SQLite audit log with complete traceability

πŸ›οΈ Architecture Overview

PROMETHEUS v3.0
β”‚
β”œβ”€β”€ Module A: Watchtower (Signal Detection) [TODO: Phase 2]
β”‚   β”œβ”€β”€ Time Filter (<24 months)
β”‚   β”œβ”€β”€ Language Filter (EN/RU/ZH, confidence>0.8)
β”‚   └── Async Queue
β”‚
β”œβ”€β”€ Module B: Hunter (Retrieval Engine) βœ… IMPLEMENTED
β”‚   β”œβ”€β”€ Source Prioritization (Open Access β†’ Fallback)
β”‚   β”œβ”€β”€ Circuit Breaker (auto-disable failing sources)
β”‚   β”œβ”€β”€ Exponential Backoff (2s β†’ 4s β†’ 8s β†’ 16s)
β”‚   └── Stealth Ops (jitter, UA rotation)
β”‚
β”œβ”€β”€ Module C: Alchemist (Content Processor) [TODO: Phase 2]
β”‚   β”œβ”€β”€ OCR Fallback (tesseract/nougat)
β”‚   β”œβ”€β”€ Metadata Enrichment (Crossref API)
β”‚   └── Audit Manifest Generation
β”‚
└── Module D: Gatekeeper (Sandbox & Validation) [TODO: Phase 3]
    β”œβ”€β”€ Sandbox Validation (MIME + Hash)
    β”œβ”€β”€ Quotas (50/day)
    └── Sync to Reference/

📦 Installation

1. Clone or Navigate to Project

cd d:/01_Capital_Workstation/02_Code/05_Deep_Researcher_Prometheus

2. Create Virtual Environment

# Git Bash (MANDATORY per LEY 9)
python -m venv .venv
source .venv/Scripts/activate  # Git Bash on Windows

3. Install Dependencies

# Using pip-tools (LEY 10: Immutability)
pip install pip-tools
pip-compile requirements.in --output-file=requirements.txt --generate-hashes
pip-sync requirements.txt

# OR using pyproject.toml
pip install -e .

4. Install Playwright (for web automation)

playwright install chromium

5. Configure Environment

# Copy template
cp .env.template .env

# Edit with your settings
nano .env  # or code .env

CRITICAL: Set KNOWLEDGE_BASE_ROOT, HUNTING_LOG_DB, and adjust quotas.


🚀 Quick Start

Example 1: Basic Download

import asyncio
from prometheus import (
    setup_logging,
    db_manager,
    hunter,
    DownloadTarget,
    SourceType
)

async def main():
    # Setup
    setup_logging()
    await db_manager.initialize()
    
    # Create target
    target = DownloadTarget(
        url="https://arxiv.org/pdf/2401.12345.pdf",
        doi="10.48550/arXiv.2401.12345",
        title="Example Paper on Quantum ML",
        source_type=SourceType.ARXIV
    )
    
    # Hunt!
    file_path = await hunter.hunt(target)
    
    if file_path:
        print(f"✅ Downloaded: {file_path}")
    else:
        print("❌ Download failed")
    
    # Cleanup
    await db_manager.close()

if __name__ == "__main__":
    asyncio.run(main())

Example 2: Batch Download with Queue

import asyncio
from prometheus import setup_logging, db_manager, hunter, DownloadTarget

async def batch_download(targets: list[DownloadTarget]):
    setup_logging()
    await db_manager.initialize()
    
    tasks = [hunter.hunt(target) for target in targets]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    
    for target, result in zip(targets, results):
        if isinstance(result, Exception):
            print(f"❌ {target.url}: {result}")
        elif result:
            print(f"✅ {target.url}: {result}")
        else:
            print(f"⚠️ {target.url}: Failed")
    
    await db_manager.close()

# Usage
targets = [
    DownloadTarget(url="https://arxiv.org/pdf/2401.00001.pdf"),
    DownloadTarget(url="https://arxiv.org/pdf/2401.00002.pdf"),
    # ... more targets
]

asyncio.run(batch_download(targets))
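Note that `asyncio.gather` launches every hunt at once. To honor the MAX_CONCURRENT_DOWNLOADS setting (default 5), a semaphore can cap the number of in-flight downloads. A minimal sketch; the `bounded_gather` helper below is illustrative and not part of the prometheus package:

```python
import asyncio

async def bounded_gather(coros, limit: int = 5):
    """Run coroutines concurrently, but never more than `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(coro):
        async with semaphore:  # waits while `limit` tasks are already running
            return await coro

    return await asyncio.gather(*(run_one(c) for c in coros),
                                return_exceptions=True)
```

In `batch_download` above, the `asyncio.gather` call would become `results = await bounded_gather(tasks, limit=5)`.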

Example 3: Check Stats

import asyncio
from prometheus import setup_logging, db_manager

async def show_stats():
    setup_logging()
    await db_manager.initialize()
    
    stats = await db_manager.get_stats()
    
    print("📊 PROMETHEUS Stats:")
    print(f"  Total Downloads: {stats['total_downloads']}")
    print(f"  Success Rate: {stats['success_rate']}%")
    print(f"  Today Quota: {stats['today_quota_used']}/{stats['today_quota_limit']}")
    print(f"\n  Status Breakdown:")
    for status, count in stats['status_breakdown'].items():
        print(f"    {status}: {count}")
    
    await db_manager.close()

asyncio.run(show_stats())

πŸ› οΈ Configuration Reference

Environment Variables (.env)

| Variable | Default | Description |
|----------|---------|-------------|
| PROMETHEUS_ENV | development | development \| production |
| PROMETHEUS_LOG_LEVEL | INFO | DEBUG \| INFO \| WARNING \| ERROR |
| MAX_DOWNLOADS_PER_DAY | 50 | Daily download quota |
| MAX_CONCURRENT_DOWNLOADS | 5 | Parallel downloads |
| MAX_RETRIES | 3 | Retry attempts per source |
| CIRCUIT_BREAKER_FAILURE_THRESHOLD | 3 | Failures before circuit opens |
| CIRCUIT_BREAKER_TIMEOUT_SECONDS | 86400 | Time before circuit reset (24h) |
| MIN_DELAY_SECONDS | 2.0 | Minimum jitter delay |
| MAX_DELAY_SECONDS | 8.0 | Maximum jitter delay |
| SCIHUB_ENABLED | false | Enable Sci-Hub fallback |
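Since these are plain environment variables, they can be loaded with stdlib tools alone. A minimal sketch of reading a subset of them with the documented defaults; the actual config.py may use a richer mechanism (e.g. pydantic settings), so treat the `Settings` class and `load_settings` helper as hypothetical:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Subset of PROMETHEUS settings, read from the environment."""
    env: str
    max_downloads_per_day: int
    max_concurrent_downloads: int
    min_delay_seconds: float

def load_settings() -> Settings:
    # os.environ.get falls back to the documented default when a variable is unset
    return Settings(
        env=os.environ.get("PROMETHEUS_ENV", "development"),
        max_downloads_per_day=int(os.environ.get("MAX_DOWNLOADS_PER_DAY", "50")),
        max_concurrent_downloads=int(os.environ.get("MAX_CONCURRENT_DOWNLOADS", "5")),
        min_delay_seconds=float(os.environ.get("MIN_DELAY_SECONDS", "2.0")),
    )
```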

πŸ“ Project Structure

05_Deep_Researcher_Prometheus/
├── 01_Data/
│   ├── 01_Raw/              # (unused for now)
│   ├── 02_Intermediate/     # (unused for now)
│   └── 03_Processed/
│       └── hunting_log.sqlite  # Audit database
│
├── 02_Scripts/              # Operational scripts (future)
├── 03_Output/               # Reports, manifests
├── 04_Tests/                # Unit tests
│
├── prometheus/              # Core package
│   ├── __init__.py
│   ├── config.py            # Configuration management
│   ├── logging_config.py    # Structured logging
│   ├── models.py            # SQLAlchemy models
│   ├── database.py          # Async DB manager
│   └── hunter.py            # Module B: Download engine
│
├── pyproject.toml           # Project metadata
├── requirements.in          # Dependency declarations
├── .env.template            # Configuration template
├── .gitignore
└── README.md                # This file

πŸ” Security & Compliance

LEY 10: Immutability of Dependencies

  • βœ… Uses requirements.in β†’ requirements.txt with hashes
  • βœ… Virtual environment isolation (.venv)
  • βœ… No global installations

LEY 1: Radical Hygiene

  • βœ… No files in workspace root
  • βœ… Fractal structure (01_Data/, 02_Scripts/, etc.)
  • βœ… .gitignore for data/logs

LEY 9: Sovereign Terminal

  • βœ… All commands documented for Git Bash
  • βœ… POSIX-compatible paths

LEY 4: Quantitative Standard

  • βœ… Type hints everywhere
  • βœ… Docstrings (Google style)
  • βœ… Structured logging (structlog)
  • βœ… Full error handling

📊 Database Schema

hunting_log

Audit trail of all download attempts.

| Field | Type | Description |
|-------|------|-------------|
| id | INTEGER | Primary key |
| doi | TEXT | DOI identifier |
| url | TEXT | Source URL |
| status | ENUM | QUEUED \| IN_PROGRESS \| SUCCESS \| FAILED \| CIRCUIT_OPEN |
| source_type | ENUM | ARXIV \| UNPAYWALL \| CORE \| SCIHUB \| etc. |
| file_hash | TEXT | SHA256 hash |
| attempts | INTEGER | Retry count |
| created_at | DATETIME | UTC timestamp |
| completed_at | DATETIME | UTC timestamp |

circuit_breaker_state

Circuit breaker state per source.

| Field | Type | Description |
|-------|------|-------------|
| source_name | TEXT | Source identifier |
| is_open | BOOLEAN | Circuit breaker state |
| consecutive_failures | INTEGER | Failure streak |
| last_failure_at | DATETIME | Last failure timestamp |
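This table maps onto a simple state machine: after CIRCUIT_BREAKER_FAILURE_THRESHOLD consecutive failures (default 3) the circuit opens and the source is skipped until CIRCUIT_BREAKER_TIMEOUT_SECONDS elapses. An illustrative in-memory sketch, assuming this behavior; the real implementation persists its state in the table above rather than in a Python object:

```python
import time

class CircuitBreaker:
    """Skip a source after too many consecutive failures."""

    def __init__(self, failure_threshold: int = 3, timeout_seconds: float = 86400.0):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.consecutive_failures = 0
        self.last_failure_at: float | None = None

    @property
    def is_open(self) -> bool:
        if self.consecutive_failures < self.failure_threshold:
            return False
        # Open, but allow a retry once the timeout window has elapsed
        return (time.time() - self.last_failure_at) < self.timeout_seconds

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        self.last_failure_at = time.time()

    def record_success(self) -> None:
        # A single success closes the circuit and resets the streak
        self.consecutive_failures = 0
        self.last_failure_at = None
```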

daily_quota

Daily download quotas.

| Field | Type | Description |
|-------|------|-------------|
| date | DATE | Day |
| downloads_count | INTEGER | Downloads performed |
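The quota check then reduces to one comparison against today's row, with the reset happening implicitly when the UTC date rolls over. An illustrative sketch using an in-memory dict as a stand-in for the daily_quota table; the `DailyQuota` class is hypothetical, not the db_manager API:

```python
from datetime import datetime, timezone

class DailyQuota:
    """Track downloads per UTC day against MAX_DOWNLOADS_PER_DAY."""

    def __init__(self, limit: int = 50):
        self.limit = limit
        self.counts: dict[str, int] = {}  # stand-in for the daily_quota table

    def _today(self) -> str:
        # Quota resets at 00:00 UTC because the key is the UTC date
        return datetime.now(timezone.utc).date().isoformat()

    def can_download(self) -> bool:
        return self.counts.get(self._today(), 0) < self.limit

    def record_download(self) -> None:
        today = self._today()
        self.counts[today] = self.counts.get(today, 0) + 1
```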

🧪 Testing

# Run tests
pytest 04_Tests/ -v

# With coverage
pytest 04_Tests/ --cov=prometheus --cov-report=html

📋 Roadmap

✅ Phase 1: Core Infrastructure (DONE)

  • Async engine with asyncio
  • SQLite audit log with SQLAlchemy
  • Hunter module with circuit breaker
  • Stealth session (UA rotation, jitter)
  • Database manager with quotas

🔄 Phase 2: Intelligence Layers (IN PROGRESS)

  • Module A: Watchtower (signal detection)
    • Time filter (<24 months)
    • Language detection (lingua-py)
    • Async queue management
  • Module C: Alchemist (content processor)
    • OCR fallback (tesseract/nougat)
    • Metadata enrichment (Crossref API)
    • YAML frontmatter generation

🔜 Phase 3: Deployment (TODO)

  • Module D: Gatekeeper (sandbox validation)
  • Launcher scripts (Bash/PowerShell)
  • Systemd service
  • Integration with 98_Maintenance/update_brain.bat
  • Pre-commit hooks
  • Full test suite

πŸ› Troubleshooting

Issue: "Database no inicializada"

Solution: Always call await db_manager.initialize() before using the database.

await db_manager.initialize()

Issue: "Circuit breaker constantly open"

Solution: Check circuit_breaker_state table for failing sources:

state = await db_manager.get_circuit_state("arxiv")
print(state.consecutive_failures, state.is_open)

To manually reset:

await db_manager.record_circuit_success("arxiv")

Issue: "Daily quota exceeded"

Solution: Quota resets daily at 00:00 UTC. To override, increase MAX_DOWNLOADS_PER_DAY in .env.


📚 References


📄 License

Proprietary - Capital Workstation Internal Use Only


πŸ™ Credits

Project Lead: Capital Workstation
Codename: PROMETHEUS v3.0
Mission: Knowledge Singularity through Resilient Automation


"From the shadows of ignorance, we bring the fire of knowledge." 🔥
