openfetcher / material-fetch

Fetch open-access scientific content from 17+ databases — in parallel.

material-fetch is a zero-dependency Python CLI that retrieves PDFs, chemical compound records, and materials data from legal open-access sources. Unlike sequential fallback tools, it fans out to all relevant resolvers simultaneously via ThreadPoolExecutor, then picks the highest-priority result.

It never accesses Sci-Hub or any other paywall-bypass service.


Why material-fetch?

| | Classic sequential fetch | material-fetch |
|---|---|---|
| Strategy | Try source A → if miss, try B → … | All sources queried in parallel |
| Typical latency | Sum of all attempted sources | ≈ Latency of the winning source |
| Input types | DOI only | DOI, arXiv ID, PMC ID, title, PubChem CID |
| Source count | 4–5 | 13 paper sources + PubChem + Materials Project |
| Dependencies | stdlib | stdlib (zero external deps) |
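
The parallel strategy in the comparison above can be illustrated with a minimal sketch. It is not the tool's actual code: the three resolvers, their names, and their priority order are hypothetical stand-ins, and the only detail taken from this README is the use of ThreadPoolExecutor. All resolvers are submitted at once, and the function returns as soon as the highest-priority hit is known.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical resolvers: each takes a query and returns a PDF URL or None.
def slow_hit(query):
    time.sleep(0.2)
    return f"https://example.org/slow/{query}.pdf"

def fast_miss(query):
    time.sleep(0.05)
    return None

def fast_hit(query):
    time.sleep(0.05)
    return f"https://example.org/fast/{query}.pdf"

# Earlier in the list means higher priority (an assumed ordering).
RESOLVERS = [("slow_hit", slow_hit), ("fast_miss", fast_miss), ("fast_hit", fast_hit)]

def resolve_parallel(query, resolvers=RESOLVERS, workers=8):
    """Start every resolver at once; return (name, url) of the best-priority hit."""
    pool = ThreadPoolExecutor(max_workers=workers)
    try:
        # Submitting everything up front is what makes the lookup parallel.
        futures = [(name, pool.submit(fn, query)) for name, fn in resolvers]
        for name, future in futures:  # inspect results in priority order
            try:
                url = future.result()
            except Exception:
                continue  # one failing source must not sink the whole query
            if url:
                return name, url
        return None
    finally:
        # Do not block on lower-priority resolvers that are still running.
        pool.shutdown(wait=False)
```

With these stand-ins, `resolve_parallel("x")` takes about 0.2 s (the slowest resolver that had to be awaited) rather than the 0.3 s a sequential pass would need.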

Covered platforms

Papers & literature

| Source | Domain | API Key Required |
|---|---|---|
| Unpaywall | General OA detection | Email (UNPAYWALL_EMAIL) |
| OpenAlex | Fully open academic graph | No |
| Semantic Scholar | AI-enriched paper graph | Optional |
| Crossref | Authoritative DOI metadata | No |
| Europe PMC | Life-science full text | No |
| NCBI PubMed Central | Biomedical OA | No |
| arXiv | Physics / CS / Math / QBio preprints | No |
| bioRxiv / medRxiv | Life-science & clinical preprints | No |
| CORE | Largest OA full-text aggregator | Optional |
| INSPIRE-HEP | High-energy physics | No |
| DBLP | Computer science venues | No |
| NASA ADS | Astrophysics & physics | Required |
| zbMATH | Mathematics reviews | No |

Specialty (non-paper)

| Source | Domain | API Key Required |
|---|---|---|
| PubChem | Chemical compounds (SMILES, IUPAC, formula) | No |
| Materials Project | Crystal structures, band gaps, thermodynamics | Required |

Installation

# Clone
git clone https://github.com/HUSRCF/openfetcher.git
cd openfetcher

# No pip install needed — pure Python stdlib
python scripts/fetch.py --version

Requirements: Python 3.8+, no external packages.


Quick start

# DOI → download PDF
python scripts/fetch.py 10.1038/s41586-021-03819-2

# Dry-run — resolve sources without downloading
python scripts/fetch.py 10.1038/s41586-021-03819-2 --dry-run

# arXiv paper
python scripts/fetch.py arxiv:1706.03762

# Plain-text title (resolved to DOI via OpenAlex / S2)
python scripts/fetch.py "attention is all you need"

# PMC ID
python scripts/fetch.py PMC7026016

# Chemical compound
python scripts/fetch.py CID2244

# Batch from file
python scripts/fetch.py --batch dois.txt --out ./papers

# Batch from stdin
echo "10.1038/s41586-021-03819-2" | python scripts/fetch.py --batch -

# Machine-readable schema
python scripts/fetch.py schema --pretty

Auto-detected input types

| Type | Example |
|---|---|
| DOI | 10.1038/s41586-020-2649-2 or https://doi.org/10.… |
| arXiv ID | 2301.07041 / arxiv:2301.07041 / hep-th/9711200 |
| PMC ID | PMC7026016 |
| PubMed ID | PMID12345678 or bare 7–9 digit number |
| PubChem CID | CID2244 |
| Title | "attention is all you need" (parallel title search via OpenAlex + S2) |
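
As an illustration of how the input types listed above could be told apart, here is a small regex-based classifier. The patterns and the `classify` helper are assumptions written for this example; the detector inside fetch.py may use different rules or a different order.

```python
import re

# Assumed patterns, tried in order; anything that matches none is a title.
_PATTERNS = [
    ("doi", re.compile(r"^(?:https?://(?:dx\.)?doi\.org/)?(10\.\d{4,9}/\S+)$", re.I)),
    # New-style (2301.07041) and old-style (hep-th/9711200) arXiv identifiers.
    ("arxiv", re.compile(
        r"^(?:arxiv:)?(\d{4}\.\d{4,5}(?:v\d+)?|[a-z-]+(?:\.[A-Z]{2})?/\d{7})$", re.I)),
    ("pmc", re.compile(r"^(PMC\d+)$", re.I)),
    ("pmid", re.compile(r"^PMID(\d+)$", re.I)),
    ("pmid", re.compile(r"^(\d{7,9})$")),  # bare 7-9 digit number
    ("cid", re.compile(r"^CID(\d+)$", re.I)),
]

def classify(query):
    """Return (kind, normalized_value) for a raw query string."""
    q = query.strip()
    for kind, pattern in _PATTERNS:
        match = pattern.match(q)
        if match:
            return kind, match.group(1)
    return "title", q
```

For example, `classify("arxiv:2301.07041")` yields `("arxiv", "2301.07041")`, and a free-text string falls through to `"title"`.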

CLI reference

python scripts/fetch.py <QUERY> [options]
python scripts/fetch.py --batch <FILE|-> [options]
python scripts/fetch.py schema

| Flag | Default | Description |
|---|---|---|
| query | | DOI, arXiv ID, PMC ID, PubChem CID, or title |
| --batch FILE | | File with one query per line; - reads stdin |
| --out DIR | downloads | Output directory |
| --dry-run | off | Resolve without downloading |
| --format json\|text | auto | json when piped, text in terminal |
| --pretty | off | Pretty-print JSON output |
| --stream | off | Emit one NDJSON line per resolved query |
| --overwrite | off | Re-download files that already exist |
| --idempotency-key KEY | | Replay cached envelope without network I/O |
| --timeout SEC | 20 | Per-request HTTP timeout |
| --workers N | 8 | Parallel resolver threads per query |
| --version | | Print version and exit |
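
The "auto" default for --format (json when piped, text in a terminal) is commonly implemented by checking whether stdout is a TTY. The sketch below shows that idea with argparse for a subset of the flags above. It is an assumption about how such a CLI might be wired, not a copy of fetch.py; `build_parser` and `resolve_format` are hypothetical names.

```python
import argparse
import sys

def build_parser():
    """Parser covering a subset of the documented flags and their defaults."""
    parser = argparse.ArgumentParser(prog="fetch.py")
    parser.add_argument("query", nargs="?")
    parser.add_argument("--batch", metavar="FILE")
    parser.add_argument("--out", metavar="DIR", default="downloads")
    parser.add_argument("--dry-run", action="store_true")
    # None stands for "auto": decided later from the output stream.
    parser.add_argument("--format", choices=["json", "text"], default=None)
    parser.add_argument("--timeout", type=float, default=20, metavar="SEC")
    parser.add_argument("--workers", type=int, default=8, metavar="N")
    return parser

def resolve_format(requested, stream=sys.stdout):
    """An explicit choice wins; otherwise text on a terminal, json when piped."""
    if requested:
        return requested
    return "text" if stream.isatty() else "json"
```

Keeping the TTY check in its own function makes the behavior easy to test with an in-memory stream.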

Output format

stdout: one JSON envelope (or NDJSON with --stream).
stderr: NDJSON progress events (JSON mode) or human-readable prose (text mode).

// Success
{
  "ok": true,
  "data": {
    "results": [{
      "query":        "10.1038/s41586-021-03819-2",
      "success":      true,
      "source":       "europe_pmc",
      "pdf_url":      "https://europepmc.org/backend/ptpmcrender.fcgi?accid=PMC8371605&blobtype=pdf",
      "file":         "downloads/Jumper_2021_Highly_accurate_protein_structure_predic.pdf",
      "meta":         {"title": "Highly accurate protein structure prediction with AlphaFold", "year": 2021, "author": "Jumper"},
      "sources_tried": ["unpaywall", "openalex", "semantic_scholar", "crossref", "europe_pmc", "..."]
    }],
    "summary": {"total": 1, "succeeded": 1, "failed": 0},
    "next": []
  },
  "meta": {
    "request_id": "req_46a95974efc9",
    "latency_ms": 1840,
    "schema_version": "1.0.0",
    "cli_version": "1.0.0"
  }
}
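
A downstream script might consume the envelope as follows. This sketch relies only on the field names shown in the success example above. The shape of a failure envelope is not documented here, so anything with "ok" not set to true is treated as empty. `downloaded_files` is a hypothetical helper, and the "// Success" line in the example is taken to be an annotation rather than part of the real output.

```python
import json

def downloaded_files(envelope_text):
    """Return (query, file) pairs for the successful results in one envelope."""
    envelope = json.loads(envelope_text)
    if not envelope.get("ok"):
        return []  # failure envelope: shape not documented, so return nothing
    results = envelope.get("data", {}).get("results", [])
    # "file" is read with .get() in case it is absent, e.g. in a dry run (assumption).
    return [(r["query"], r.get("file")) for r in results if r.get("success")]

# Trimmed copy of the success example from this README.
sample = json.dumps({
    "ok": True,
    "data": {
        "results": [{
            "query": "10.1038/s41586-021-03819-2",
            "success": True,
            "source": "europe_pmc",
            "file": "downloads/Jumper_2021_Highly_accurate_protein_structure_predic.pdf",
        }],
        "summary": {"total": 1, "succeeded": 1, "failed": 0},
    },
})
```

Calling `downloaded_files(sample)` returns a single pair containing the DOI and the downloaded file path.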

Exit codes

| Code | Meaning |
|---|---|
| 0 | All queries resolved |
| 1 | Not found (no OA copy; no transport failure) |
| 3 | Validation error (bad arguments) |
| 4 | Transport error (network / IO; retryable) |
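
Because only code 4 is marked retryable, a wrapper script can retry on that code alone. The following is a sketch under that reading of the table; `run_fetch`, the attempt count, and the exponential backoff are illustrative choices, not behavior built into the tool.

```python
import subprocess
import sys
import time

EXIT_MEANING = {0: "ok", 1: "not_found", 3: "validation_error", 4: "transport_error"}

def run_fetch(args, attempts=3, backoff=1.0, run=subprocess.run, sleep=time.sleep):
    """Run the CLI and retry only on exit code 4; return (code, meaning)."""
    cmd = [sys.executable, "scripts/fetch.py", *args]
    code = None
    for attempt in range(1, attempts + 1):
        code = run(cmd).returncode
        if code != 4 or attempt == attempts:
            break  # success, a non-retryable error, or out of attempts
        sleep(backoff * 2 ** (attempt - 1))  # wait 1x, 2x, 4x ... of backoff
    return code, EXIT_MEANING.get(code, "unknown")
```

The `run` and `sleep` parameters exist so the retry logic can be exercised without spawning a process or waiting.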

Environment variables

| Variable | Purpose |
|---|---|
| UNPAYWALL_EMAIL | Contact email for Unpaywall API (highly recommended) |
| SEMANTIC_SCHOLAR_API_KEY | Higher rate limits for Semantic Scholar |
| NASA_ADS_API_KEY | Required to enable NASA ADS source |
| MATERIALS_PROJECT_API_KEY | Required to enable Materials Project source |
| CORE_API_KEY | Higher rate limits for CORE API |
| MATERIAL_FETCH_ALLOWED_HOSTS | Comma-separated extra download hostnames |
| MATERIAL_FETCH_NO_AUTO_UPDATE | Disable silent background git pull |
| MATERIAL_FETCH_UPDATE_INTERVAL | Auto-update cooldown in seconds (default 86400) |
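
A preflight check along these lines can show which key-gated sources a given environment would enable before a large batch is started. The variable names come from the table above. The helper names, the source identifiers "nasa_ads" and "materials_project", and the fallback on an unparsable interval are assumptions made for this sketch.

```python
import os

def optional_sources(env=os.environ):
    """Report which sources that require an API key would be enabled."""
    return {
        "nasa_ads": bool(env.get("NASA_ADS_API_KEY")),
        "materials_project": bool(env.get("MATERIALS_PROJECT_API_KEY")),
    }

def update_interval(env=os.environ, default=86400):
    """Read the auto-update cooldown in seconds, falling back to the default."""
    raw = env.get("MATERIAL_FETCH_UPDATE_INTERVAL", "")
    try:
        return int(raw)
    except ValueError:
        return default  # unset or not an integer (assumed fallback behavior)
```

Passing a plain dict instead of `os.environ` makes it possible to inspect a configuration without touching the real environment.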

Running tests

# Unit + CLI smoke tests (no network required; pytest must be installed)
python -m pytest tests/ -v

# Also run live integration tests (requires network)
LIVE=1 python -m pytest tests/ -v

Notes

  • Parallel by design. For a DOI query, all 12 resolvers start simultaneously. Total latency ≈ the winning source, not the sum.
  • Idempotent by default. Re-running against the same --out skips existing files. Use --idempotency-key for full envelope replay without any network I/O.
  • Auth is delegated. The script never prompts for credentials. Set environment variables in your shell; the script inherits them.
  • Host allowlist enforced. Downloads only from a curated list of trusted domains. Extend it with MATERIAL_FETCH_ALLOWED_HOSTS — never with a script flag.
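
The host allowlist note can be sketched as below. The two built-in hosts are placeholders, since the curated list itself is not reproduced in this README. Exact hostname matching and the http/https scheme check are also assumptions; the real tool may match differently.

```python
import os
from urllib.parse import urlsplit

# Placeholder built-in list; the tool's real curated list is not shown here.
DEFAULT_HOSTS = {"europepmc.org", "arxiv.org"}

def allowed_hosts(env=os.environ):
    """Merge the built-in hosts with MATERIAL_FETCH_ALLOWED_HOSTS (comma-separated)."""
    extra = env.get("MATERIAL_FETCH_ALLOWED_HOSTS", "")
    return DEFAULT_HOSTS | {h.strip().lower() for h in extra.split(",") if h.strip()}

def is_allowed(url, hosts):
    """Accept only http(s) URLs whose hostname is exactly on the allowlist."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return parts.scheme in ("http", "https") and host in hosts
```

Since the extension point is an environment variable rather than a flag, a caller who controls only the command line cannot widen the set of download hosts.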

License

MIT. See LICENSE.


Also see: README_TW.md for Traditional Chinese documentation.

About

A one-stop tool that uses OpenAlex and other open-access sources to search for and fetch papers and materials data. It is especially suited to scientific and academic work.
