GoldDIGR: Search, Download, Analyze Open-Source Chemistry Data

Systematic, resumable mining of supplementary information (SI) files from chemistry journal articles. Extracts XYZ coordinates, computational-chemistry details, and BibLaTeX metadata — fully containerized, no manual setup.

What It Does

Given a CSV of article URLs, golddigr:

Scrapes article HTML (handles Cloudflare challenges)
Extracts SI download links (direct files + publisher download endpoints)
Downloads SI files (PDF, XYZ, CIF, MOL, DOCX, XLSX, ZIP, …)
Converts PDFs to text via StirlingPDF (sidecar container)
Cleans footers/headers, extracts XYZ coordinate blocks
Detects transition-state geometries
Extracts BibLaTeX metadata from HTML
Uses an LLM to pull structured computational-chemistry details

Every stage is tracked in SQLite. Re-running picks up exactly where it left off.

Computational Plugin

The plugin/ directory contains the post-extraction computational pipeline that processes XYZ structures identified by GoldDIGR through five stages:

Stage	Directory	Description
1	`01-charge-scan-tsopt-irc/`	Charge sampling (−1, 0, +1) → xTB frequency → TSOPT → IRC
2	`02-spin-wbo-scan/`	Two-pass spin-polarized energy scan + Wiberg bond order extraction
3	`03-irc-analysis/`	YARP bond-electron matrix analysis of IRC endpoints and trajectory
4	`04-sankey/`	Electron-flow Sankey diagrams from BEM time series
5	`05-reaction-classification/`	Classification into OA, RE, MI, β-atom elimination, C–H activation, TM

Each stage has its own subdirectory with scripts, templates, and a per-stage README. See plugin/README.md for full usage details and the pipeline flow diagram.

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                        docker compose up                            │
│                                                                     │
│  ┌──────────────┐         ┌───────────────────────────────────┐     │
│  │ StirlingPDF  │◄────────│       Python Pipeline             │     │
│  │  (sidecar)   │  HTTP   │                                   │     │
│  │              │         │  CSV → Scrape HTML                │     │
│  │  split-pages │         │        → Extract SI links         │     │
│  │  pdf-to-text │         │          → Download files         │     │
│  └──────────────┘         │            → Process PDFs ────────┤     │
│                           │              → Clean text         │     │
│                           │                → LLM extract      │     │
│                           │                  → Find TS XYZ    │     │
│                           └───────────────────────────────────┘     │
│                                        │                            │
│                              ┌─────────┴──────────┐                 │
│                              │   SQLite Ledger     │                 │
│                              │  (resumable state)  │                 │
│                              └────────────────────┘                 │
└──────────────────────────────────────────────────────────────────────┘

Pipeline Stages

PENDING → HTML_SCRAPED → LINKS_EXTRACTED → FILES_DOWNLOADED
        → PDF_PROCESSED → TEXT_EXTRACTED → DONE

Any stage can fail independently without affecting other articles.

Quick Start

Option A: Docker Compose (simplest)

Best for batch processing when Cloudflare isn't an issue.

git clone https://github.com/your-username/golddigr.git && cd golddigr

# Configure API keys
cp .env.example .env
# Edit .env with your LLM API key

# Place your article CSV
mkdir -p data/input
cp your-articles.csv data/input/articles.csv

# Run
docker compose up --build

# Or process a specific range
docker compose run pipeline python run.py --start 0 --end 100

# Check progress
docker compose run pipeline python run.py --status

# Retry failed jobs
docker compose run pipeline python run.py --retry-failed

Option B: Local Chrome (recommended for Cloudflare-heavy publishers)

Best for Wiley, Elsevier, and other publishers with aggressive bot detection. Uses your real Chrome browser with your existing cookies and login sessions.

# 1. Start StirlingPDF
docker run -p 8080:8080 stirlingtools/stirling-pdf:latest

# 2. Create a Python environment
conda create -n golddigr python=3.11 && conda activate golddigr
# or: python -m venv .venv && source .venv/bin/activate

# 3. Install Python dependencies
pip install -r requirements.txt

# 4. Set environment variables
export STIRLING_URL=http://localhost:8080
export OPENAI_API_KEY=sk-proj-...

# 5. Use the Chrome config
cp config.chrome.yaml config.yaml

# 6. Run (Chrome will open with your profile)
python run.py --start 0 --end 100

Chrome launches via CDP (Chrome DevTools Protocol) using your real profile — Cloudflare sees a legitimate browser with real cookies. If a CAPTCHA still appears, the pipeline will prompt you to solve it manually and retry.

Note: If you have multiple Chrome profiles, set chrome_profile_directory in config.yaml to match yours (e.g., Default, Profile 1). Check with: ls ~/.config/google-chrome/

Important: Chrome opens PDFs in its built-in viewer by default instead of downloading them. You need to change this once: Chrome → Settings → Privacy and security → Site settings → Additional content settings → PDF documents → select "Download PDFs"

Option C: Local Firefox

For publishers without aggressive bot detection (ACS, RSC, etc.).

# 1. Start StirlingPDF
docker run -p 8080:8080 stirlingtools/stirling-pdf:latest

# 2. Create a Python environment
conda create -n golddigr python=3.11 && conda activate golddigr
# or: python -m venv .venv && source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Set environment variables
export STIRLING_URL=http://localhost:8080
export OPENAI_API_KEY=sk-proj-...

# 5. Use the Firefox config
cp config.firefox-local.yaml config.yaml

# 6. Run
python run.py --start 0 --end 100

Option D: Singularity/Apptainer (HPC)

# Build the SIF (one-time)
docker build -t golddigr .
apptainer build golddigr.sif docker-daemon://golddigr:latest

# Run
export OPENAI_API_KEY=sk-proj-...
./golddigr --start 0 --end 100
./golddigr --status

Output

data/
├── input/
│   └── articles.csv              # your input
├── output/
│   ├── text/                     # extracted text from PDFs
│   │   └── {doi_path}/{stem}_xyz_clean.txt
│   ├── xyz/                      # transition-state structures
│   ├── biblatex/                 # BibLaTeX entries per article
│   └── comp_details/             # LLM-extracted comp-chem JSON
├── downloads/
│   ├── html/                     # scraped article pages
│   └── files/{doi_path}/         # raw SI files (PDF, XYZ, CIF, …)
└── db/
    ├── pipeline.db               # SQLite job ledger
    └── pipeline.log              # full run log

Sample Data

A sample-data/ directory is included with a complete input/output example from one article (Guérard et al., Chem. Eur. J. 2016, 22, 12332):

sample-data/
├── input/articles.csv                          # one-article CSV
├── output/
│   ├── xyz/10.1002/chem.201600922/…/01.xyz     # 12 extracted XYZ structures
│   ├── biblatex/0.bib                          # BibLaTeX entry
│   ├── comp_details/…/…_comp.json              # LLM-extracted comp-chem details
│   ├── text/…/…_full.txt                       # raw extracted text
│   └── figures/…/manifest.json                 # figure metadata (images removed)
├── db/pipeline.db                              # SQLite ledger showing DONE status
└── sample.log                                  # pipeline log

Use this to understand the expected output format, or as a test:

# Copy sample input and run
cp sample-data/input/articles.csv data/input/articles.csv
python run.py --start 0 --end 1

Configuration

Three example configs are provided:

File	Browser	Profile	Headless	Use case
`config.yaml`	Firefox	Clean	Yes	Docker / Singularity
`config.chrome.yaml`	Chrome	Your profile	No	Cloudflare-heavy publishers
`config.firefox-local.yaml`	Firefox	Clean	Yes	Local, no Cloudflare issues

Copy whichever fits your use case to config.yaml.

Key Settings

scraper:
  browser: chrome          # "firefox" or "chrome"
  browser_profile: auto    # "none", "auto", or explicit path
  chrome_profile_directory: Default  # Chrome profile subdirectory
  headless: false          # true for containers, false to see the browser
  interactive: true        # prompt on Cloudflare challenges

Cloudflare Handling

Many publishers (especially Wiley) use Cloudflare bot detection. golddigr handles this with a multi-tier approach:

Automatic wait — If a challenge page is detected, waits for it to auto-clear (often works for IP-based challenges)
Chrome + user profile — Using browser: chrome with browser_profile: auto launches Chrome with your real cookies and browser fingerprint. Cloudflare sees a legitimate browser and usually skips the challenge entirely.

Agent auto-click (optional) — A local vision model (Qwen2.5-VL) takes a screenshot, finds the Cloudflare checkbox, and clicks it via xdotool. Requires a GPU and extra dependencies:

pip install -r requirements-agent.txt
sudo apt install xdotool   # Linux

Enable in config.yaml:

agent:
  enabled: true
  providers:
    - qwen-vl-local        # Qwen2.5-VL-3B (~7GB VRAM)
  model_size: 3b           # or "7b" for more accuracy (~16GB VRAM)
  device: auto

Interactive fallback — If all automated methods fail and interactive: true, the pipeline prompts you to solve the CAPTCHA manually. Type Enter to retry or s to skip the article.
Cached challenge detection — If a previously saved HTML file contains a Cloudflare challenge page, it's automatically deleted and re-downloaded.

Project Structure

golddigr/
├── docker-compose.yml          # Two-service setup (pipeline + StirlingPDF)
├── Dockerfile                  # Python + Firefox + geckodriver
├── config.yaml                 # Default config (Docker)
├── config.chrome.yaml          # Local Chrome config
├── config.firefox-local.yaml   # Local Firefox config
├── .env.example                # API key template
├── run.py                      # CLI entry point
├── golddigr                    # Singularity/Apptainer wrapper
├── requirements.txt            # Core dependencies
├── requirements-agent.txt      # Optional: agent auto-click dependencies
├── sample-data/                # Example input/output from one article
├── pipeline/
│   ├── orchestrator.py         # State-machine driver
│   ├── job_db.py               # SQLite job ledger
│   ├── scraper.py              # Browser automation (Firefox + Chrome CDP)
│   ├── link_extractor.py       # SI link detection (direct + query-param URLs)
│   ├── stirling_client.py      # StirlingPDF Python wrapper
│   ├── pdf_processor.py        # PDF → text → cleanup → XYZ
│   ├── pdf_txt_processing.py   # Footer detection, header removal
│   ├── separate_xyz.py         # XYZ line detection, Gaussian conversion
│   ├── file_processors.py      # Route files by type (docx/xlsx/zip → text)
│   ├── metadata.py             # HTML meta → BibLaTeX
│   ├── figure_extractor.py     # PDF figure extraction
│   ├── cc_detector.py          # Comp-chem keyword detection + LLM
│   └── agent/                  # Optional: automated CAPTCHA solving
│       ├── solver.py           # Orchestration: screenshot → model → click
│       ├── clicker.py          # OS cursor control (xdotool / cliclick)
│       └── vision/             # Vision model providers
│           ├── qwen_vl.py      # Qwen2.5-VL (local, default)
│           ├── florence.py     # Florence-2 (local, experimental)
│           └── api_provider.py # Claude / OpenAI API fallback
└── data/                       # Mounted volumes (gitignored)

Key Design Decisions

Problem	Approach
Tracking progress	SQLite ledger with state machine
PDF processing	StirlingPDF sidecar, zero config
Cloudflare bypass	Chrome CDP with real user profile
Download retries	Watchdog thread + cookie strategy loop
Footer removal	NLTK-based detection in `pdf_txt_processing.py`
File format routing	Extension + query-param detection in `link_extractor.py`
Configuration	Single `config.yaml` + env var overrides
Resumability	Automatic — SQLite tracks every stage

License

MIT — see LICENSE.

Also, please give our repository a ⭐ if our code helps!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

GoldDIGR: Search, Download, Analyze Open-Source Chemistry Data

What It Does

Computational Plugin

Architecture

Pipeline Stages

Quick Start

Option A: Docker Compose (simplest)

Option B: Local Chrome (recommended for Cloudflare-heavy publishers)

Option C: Local Firefox

Option D: Singularity/Apptainer (HPC)

Output

Sample Data

Configuration

Key Settings

Cloudflare Handling

Project Structure

Key Design Decisions

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.github/workflows		.github/workflows
data		data
data_deposit		data_deposit
pipeline		pipeline
plugin		plugin
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
Logo.png		Logo.png
README.md		README.md
config.chrome.yaml		config.chrome.yaml
config.firefox-local.yaml		config.firefox-local.yaml
config.yaml		config.yaml
docker-compose.yml		docker-compose.yml
golddigr		golddigr
requirements-agent.txt		requirements-agent.txt
requirements.txt		requirements.txt
run.py		run.py

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

GoldDIGR: Search, Download, Analyze Open-Source Chemistry Data

What It Does

Computational Plugin

Architecture

Pipeline Stages

Quick Start

Option A: Docker Compose (simplest)

Option B: Local Chrome (recommended for Cloudflare-heavy publishers)

Option C: Local Firefox

Option D: Singularity/Apptainer (HPC)

Output

Sample Data

Configuration

Key Settings

Cloudflare Handling

Project Structure

Key Design Decisions

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages