Systematic, resumable mining of supplementary information (SI) files from chemistry journal articles. Extracts XYZ coordinates, computational-chemistry details, and BibLaTeX metadata — fully containerized, no manual setup.
Given a CSV of article URLs, golddigr:
- Scrapes article HTML (handles Cloudflare challenges)
- Extracts SI download links (direct files + publisher download endpoints)
- Downloads SI files (PDF, XYZ, CIF, MOL, DOCX, XLSX, ZIP, …)
- Converts PDFs to text via StirlingPDF (sidecar container)
- Cleans footers/headers, extracts XYZ coordinate blocks
- Detects transition-state geometries
- Extracts BibLaTeX metadata from HTML
- Uses an LLM to pull structured computational-chemistry details
Every stage is tracked in SQLite. Re-running picks up exactly where it left off.
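As a rough illustration of the XYZ extraction step listed above, the sketch below scans cleaned text for consecutive lines shaped like `element x y z` and groups them into blocks. The regular expression, the `min_atoms` threshold, and the name `find_xyz_blocks` are assumptions made for this example; they are not taken from golddigr's own code, which may use different heuristics.

```python
import re

# Assumed line shape: an element symbol followed by three decimal numbers.
XYZ_LINE = re.compile(
    r"^\s*([A-Z][a-z]?)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s*$"
)

def find_xyz_blocks(text, min_atoms=3):
    """Return a list of blocks; each block is a list of (symbol, x, y, z)."""
    blocks, current = [], []
    for line in text.splitlines():
        match = XYZ_LINE.match(line)
        if match:
            sym, x, y, z = match.groups()
            current.append((sym, float(x), float(y), float(z)))
            continue
        # A non-coordinate line ends the current run of coordinate lines.
        if len(current) >= min_atoms:
            blocks.append(current)
        current = []
    if len(current) >= min_atoms:
        blocks.append(current)
    return blocks

sample = """Optimized geometry of TS1
C   0.000000   0.000000   0.000000
H   0.000000   0.000000   1.089000
H   1.026719   0.000000  -0.363000
Page 12
"""
blocks = find_xyz_blocks(sample)
```

Short runs are discarded via `min_atoms` so that stray table rows that happen to resemble coordinates are less likely to be reported as structures.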
The plugin/ directory contains the post-extraction computational pipeline
that processes XYZ structures identified by GoldDIGR through five stages:
| Stage | Directory | Description |
|---|---|---|
| 1 | 01-charge-scan-tsopt-irc/ | Charge sampling (−1, 0, +1) → xTB frequency → TSOPT → IRC |
| 2 | 02-spin-wbo-scan/ | Two-pass spin-polarized energy scan + Wiberg bond order extraction |
| 3 | 03-irc-analysis/ | YARP bond-electron matrix analysis of IRC endpoints and trajectory |
| 4 | 04-sankey/ | Electron-flow Sankey diagrams from BEM time series |
| 5 | 05-reaction-classification/ | Classification into OA, RE, MI, β-atom elimination, C–H activation, TM |
Each stage has its own subdirectory with scripts, templates, and a per-stage
README. See plugin/README.md for full usage details and
the pipeline flow diagram.
┌──────────────────────────────────────────────────────────────────────┐
│                          docker compose up                           │
│                                                                      │
│  ┌──────────────┐         ┌───────────────────────────────────┐      │
│  │ StirlingPDF  │◄────────│  Python Pipeline                  │      │
│  │  (sidecar)   │  HTTP   │                                   │      │
│  │              │         │  CSV → Scrape HTML                │      │
│  │ split-pages  │         │      → Extract SI links           │      │
│  │ pdf-to-text  │         │      → Download files             │      │
│  └──────────────┘         │      → Process PDFs ──────────────┤      │
│                           │      → Clean text                 │      │
│                           │      → LLM extract                │      │
│                           │      → Find TS XYZ                │      │
│                           └───────────────────────────────────┘      │
│                                             │                        │
│                                   ┌─────────┴──────────┐             │
│                                   │   SQLite Ledger    │             │
│                                   │ (resumable state)  │             │
│                                   └────────────────────┘             │
└──────────────────────────────────────────────────────────────────────┘
PENDING → HTML_SCRAPED → LINKS_EXTRACTED → FILES_DOWNLOADED
→ PDF_PROCESSED → TEXT_EXTRACTED → DONE
Any stage can fail independently without affecting other articles.
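The state flow above can be pictured as a small SQLite-backed ledger in which each article advances one state at a time and a failure is recorded against that article only. This is a minimal sketch under stated assumptions: the `jobs` table layout and the helpers `open_ledger`, `add_job`, and `advance` are hypothetical and do not describe the actual schema in pipeline/job_db.py.

```python
import sqlite3

# State names copied from the flow above.
STATES = ["PENDING", "HTML_SCRAPED", "LINKS_EXTRACTED", "FILES_DOWNLOADED",
          "PDF_PROCESSED", "TEXT_EXTRACTED", "DONE"]

def open_ledger(path=":memory:"):
    """Open (or create) the ledger. The schema here is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " url TEXT PRIMARY KEY,"
        " state TEXT NOT NULL DEFAULT 'PENDING',"
        " error TEXT)"
    )
    return conn

def add_job(conn, url):
    # INSERT OR IGNORE keeps existing progress when the same CSV is re-run.
    conn.execute("INSERT OR IGNORE INTO jobs (url) VALUES (?)", (url,))
    conn.commit()

def advance(conn, url, stage_fn):
    """Run one stage; move forward on success, record the error on failure."""
    (state,) = conn.execute(
        "SELECT state FROM jobs WHERE url = ?", (url,)).fetchone()
    if state == "DONE":
        return state
    target = STATES[STATES.index(state) + 1]
    try:
        stage_fn(url, target)
        conn.execute("UPDATE jobs SET state = ?, error = NULL WHERE url = ?",
                     (target, url))
        state = target
    except Exception as exc:  # the failure stays local to this article
        conn.execute("UPDATE jobs SET error = ? WHERE url = ?",
                     (str(exc), url))
    conn.commit()
    return state

def flaky_stage(url, target):
    # Stand-in for real work; fails for one article only.
    if url.endswith("/b"):
        raise RuntimeError("download failed")

conn = open_ledger()
for u in ("https://example.org/a", "https://example.org/b"):
    add_job(conn, u)
state_a = advance(conn, "https://example.org/a", flaky_stage)
state_b = advance(conn, "https://example.org/b", flaky_stage)
```

Because the state is committed after every stage, a later run can select every row whose state is not `DONE` and continue from there, which is the behaviour the ledger is meant to provide.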
Best for batch processing when Cloudflare isn't an issue.
git clone https://github.com/your-username/golddigr.git && cd golddigr
# Configure API keys
cp .env.example .env
# Edit .env with your LLM API key
# Place your article CSV
mkdir -p data/input
cp your-articles.csv data/input/articles.csv
# Run
docker compose up --build
# Or process a specific range
docker compose run pipeline python run.py --start 0 --end 100
# Check progress
docker compose run pipeline python run.py --status
# Retry failed jobs
docker compose run pipeline python run.py --retry-failed
Best for Wiley, Elsevier, and other publishers with aggressive bot detection. Uses your real Chrome browser with your existing cookies and login sessions.
# 1. Start StirlingPDF
docker run -p 8080:8080 stirlingtools/stirling-pdf:latest
# 2. Create a Python environment
conda create -n golddigr python=3.11 && conda activate golddigr
# or: python -m venv .venv && source .venv/bin/activate
# 3. Install Python dependencies
pip install -r requirements.txt
# 4. Set environment variables
export STIRLING_URL=http://localhost:8080
export OPENAI_API_KEY=sk-proj-...
# 5. Use the Chrome config
cp config.chrome.yaml config.yaml
# 6. Run (Chrome will open with your profile)
python run.py --start 0 --end 100
Chrome launches via CDP (Chrome DevTools Protocol) using your real profile, so Cloudflare sees a legitimate browser with real cookies. If a CAPTCHA still appears, the pipeline will prompt you to solve it manually and retry.
Note: If you have multiple Chrome profiles, set chrome_profile_directory in config.yaml to match yours (e.g., Default, Profile 1). Check with:
ls ~/.config/google-chrome/
Important: Chrome opens PDFs in its built-in viewer by default instead of downloading them. You need to change this once: Chrome → Settings → Privacy and security → Site settings → Additional content settings → PDF documents → select "Download PDFs"
For publishers without aggressive bot detection (ACS, RSC, etc.).
# 1. Start StirlingPDF
docker run -p 8080:8080 stirlingtools/stirling-pdf:latest
# 2. Create a Python environment
conda create -n golddigr python=3.11 && conda activate golddigr
# or: python -m venv .venv && source .venv/bin/activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Set environment variables
export STIRLING_URL=http://localhost:8080
export OPENAI_API_KEY=sk-proj-...
# 5. Use the Firefox config
cp config.firefox-local.yaml config.yaml
# 6. Run
python run.py --start 0 --end 100
# Build the SIF (one-time)
docker build -t golddigr .
apptainer build golddigr.sif docker-daemon://golddigr:latest
# Run
export OPENAI_API_KEY=sk-proj-...
./golddigr --start 0 --end 100
./golddigr --status
data/
├── input/
│   └── articles.csv              # your input
├── output/
│   ├── text/                     # extracted text from PDFs
│   │   └── {doi_path}/{stem}_xyz_clean.txt
│   ├── xyz/                      # transition-state structures
│   ├── biblatex/                 # BibLaTeX entries per article
│   └── comp_details/             # LLM-extracted comp-chem JSON
├── downloads/
│   ├── html/                     # scraped article pages
│   └── files/{doi_path}/         # raw SI files (PDF, XYZ, CIF, …)
└── db/
    ├── pipeline.db               # SQLite job ledger
    └── pipeline.log              # full run log
A sample-data/ directory is included with a complete input/output example
from one article (Guérard et al., Chem. Eur. J. 2016, 22, 12332):
sample-data/
├── input/articles.csv # one-article CSV
├── output/
│   ├── xyz/10.1002/chem.201600922/…/01.xyz   # 12 extracted XYZ structures
│   ├── biblatex/0.bib                        # BibLaTeX entry
│   ├── comp_details/…/…_comp.json            # LLM-extracted comp-chem details
│   ├── text/…/…_full.txt                     # raw extracted text
│   └── figures/…/manifest.json               # figure metadata (images removed)
├── db/pipeline.db # SQLite ledger showing DONE status
└── sample.log # pipeline log
Use this to understand the expected output format, or as a test:
# Copy sample input and run
cp sample-data/input/articles.csv data/input/articles.csv
python run.py --start 0 --end 1
Three example configs are provided:
| File | Browser | Profile | Headless | Use case |
|---|---|---|---|---|
| config.yaml | Firefox | Clean | Yes | Docker / Singularity |
| config.chrome.yaml | Chrome | Your profile | No | Cloudflare-heavy publishers |
| config.firefox-local.yaml | Firefox | Clean | Yes | Local, no Cloudflare issues |
Copy whichever fits your use case to config.yaml.
scraper:
  browser: chrome                    # "firefox" or "chrome"
  browser_profile: auto              # "none", "auto", or explicit path
  chrome_profile_directory: Default  # Chrome profile subdirectory
  headless: false                    # true for containers, false to see the browser
  interactive: true                  # prompt on Cloudflare challenges
Many publishers (especially Wiley) use Cloudflare bot detection. golddigr handles this with a multi-tier approach:
- Automatic wait: If a challenge page is detected, the pipeline waits for it to clear on its own (this often works for IP-based challenges).
- Chrome + user profile: Setting browser: chrome with browser_profile: auto launches Chrome with your real cookies and browser fingerprint. Cloudflare sees a legitimate browser and usually skips the challenge entirely.
- Agent auto-click (optional): A local vision model (Qwen2.5-VL) takes a screenshot, finds the Cloudflare checkbox, and clicks it via xdotool. Requires a GPU and extra dependencies:

      pip install -r requirements-agent.txt
      sudo apt install xdotool  # Linux

  Enable in config.yaml:

      agent:
        enabled: true
        providers:
          - qwen-vl-local   # Qwen2.5-VL-3B (~7GB VRAM)
        model_size: 3b      # or "7b" for more accuracy (~16GB VRAM)
        device: auto

- Interactive fallback: If all automated methods fail and interactive: true is set, the pipeline prompts you to solve the CAPTCHA manually. Press Enter to retry, or type s to skip the article.
- Cached challenge detection: If a previously saved HTML file contains a Cloudflare challenge page, it is automatically deleted and re-downloaded.
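The cached challenge detection tier can be sketched as a simple content check on saved HTML files. The marker strings below are assumed heuristics (Cloudflare's challenge markup varies and changes over time), and `is_challenge_page` and `purge_if_challenge` are hypothetical names used only for this example, not golddigr's actual functions.

```python
import tempfile
from pathlib import Path

# Assumed markers; real challenge pages may use different text or markup.
CHALLENGE_MARKERS = (
    "just a moment...",
    "challenge-platform",
    "checking your browser",
)

def is_challenge_page(html):
    """Heuristically decide whether HTML is a challenge page, not an article."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)

def purge_if_challenge(path):
    """Delete a cached HTML file that holds a challenge page.

    Returns True if the file was deleted, so the caller knows to re-download.
    """
    p = Path(path)
    if not p.exists():
        return False
    if is_challenge_page(p.read_text(encoding="utf-8", errors="ignore")):
        p.unlink()
        return True
    return False

# Small demonstration with two cached pages in a temporary directory.
cache_dir = Path(tempfile.mkdtemp())
blocked = cache_dir / "blocked.html"
article = cache_dir / "article.html"
blocked.write_text("<title>Just a moment...</title>", encoding="utf-8")
article.write_text("<title>Supporting Information</title>", encoding="utf-8")

deleted_blocked = purge_if_challenge(blocked)
deleted_article = purge_if_challenge(article)
```

Deleting the cached file rather than flagging it keeps the scraper logic simple: a missing file is treated the same as an article that was never fetched.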
golddigr/
├── docker-compose.yml # Two-service setup (pipeline + StirlingPDF)
├── Dockerfile # Python + Firefox + geckodriver
├── config.yaml # Default config (Docker)
├── config.chrome.yaml # Local Chrome config
├── config.firefox-local.yaml # Local Firefox config
├── .env.example # API key template
├── run.py # CLI entry point
├── golddigr # Singularity/Apptainer wrapper
├── requirements.txt # Core dependencies
├── requirements-agent.txt # Optional: agent auto-click dependencies
├── sample-data/ # Example input/output from one article
├── pipeline/
│   ├── orchestrator.py           # State-machine driver
│   ├── job_db.py                 # SQLite job ledger
│   ├── scraper.py                # Browser automation (Firefox + Chrome CDP)
│   ├── link_extractor.py         # SI link detection (direct + query-param URLs)
│   ├── stirling_client.py        # StirlingPDF Python wrapper
│   ├── pdf_processor.py          # PDF → text → cleanup → XYZ
│   ├── pdf_txt_processing.py     # Footer detection, header removal
│   ├── separate_xyz.py           # XYZ line detection, Gaussian conversion
│   ├── file_processors.py        # Route files by type (docx/xlsx/zip → text)
│   ├── metadata.py               # HTML meta → BibLaTeX
│   ├── figure_extractor.py       # PDF figure extraction
│   ├── cc_detector.py            # Comp-chem keyword detection + LLM
│   └── agent/                    # Optional: automated CAPTCHA solving
│       ├── solver.py             # Orchestration: screenshot → model → click
│       ├── clicker.py            # OS cursor control (xdotool / cliclick)
│       └── vision/               # Vision model providers
│           ├── qwen_vl.py        # Qwen2.5-VL (local, default)
│           ├── florence.py       # Florence-2 (local, experimental)
│           └── api_provider.py   # Claude / OpenAI API fallback
└── data/ # Mounted volumes (gitignored)
| Problem | Approach |
|---|---|
| Tracking progress | SQLite ledger with state machine |
| PDF processing | StirlingPDF sidecar, zero config |
| Cloudflare bypass | Chrome CDP with real user profile |
| Download retries | Watchdog thread + cookie strategy loop |
| Footer removal | NLTK-based detection in pdf_txt_processing.py |
| File format routing | Extension + query-param detection in link_extractor.py |
| Configuration | Single config.yaml + env var overrides |
| Resumability | Automatic — SQLite tracks every stage |
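To make the "Extension + query-param detection" row concrete, here is a minimal sketch of classifying a candidate SI link: first by the extension in the URL path, then by any filename carried in the query string of a download endpoint. The query keys in `FILENAME_KEYS`, the function name `si_extension`, and the example URLs are assumptions for illustration; the real link_extractor.py may recognize other patterns.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlparse

# File types named in the feature list above.
SI_EXTENSIONS = {".pdf", ".xyz", ".cif", ".mol", ".docx", ".xlsx", ".zip"}

# Hypothetical query keys that a publisher download endpoint might use.
FILENAME_KEYS = ("file", "filename", "name")

def si_extension(url):
    """Return the SI file extension implied by a URL, or None."""
    parsed = urlparse(url)

    # Case 1: direct file link, extension is in the path itself.
    ext = PurePosixPath(parsed.path).suffix.lower()
    if ext in SI_EXTENSIONS:
        return ext

    # Case 2: download endpoint, filename is passed as a query parameter.
    query = parse_qs(parsed.query)
    for key in FILENAME_KEYS:
        for value in query.get(key, []):
            ext = PurePosixPath(value).suffix.lower()
            if ext in SI_EXTENSIONS:
                return ext
    return None

direct = si_extension("https://example.org/si/compound_001.XYZ")
endpoint = si_extension(
    "https://example.org/action/downloadSupplement?doi=10.1000/x&file=si_001.pdf"
)
not_si = si_extension("https://example.org/article/abstract")
```

Checking the path before the query string means a direct link is classified without parsing parameters, while endpoint-style links still resolve to a file type.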
MIT — see LICENSE.
Also, please give our repository a ⭐ if our code helps!
