Fetch open-access scientific content from 17+ databases — in parallel.
material-fetch is a zero-dependency Python CLI that retrieves PDFs, chemical compound records, and materials data from legal open-access sources. Unlike sequential fallback tools, it fans out to all relevant resolvers simultaneously via ThreadPoolExecutor, then picks the highest-priority result.
It never accesses Sci-Hub or any other paywall-bypass service.
| | Classic sequential fetch | material-fetch |
|---|---|---|
| Strategy | Try source A → if miss, try B → … | All sources queried in parallel |
| Typical latency | Sum of all attempted sources | ≈ Latency of the winning source |
| Input types | DOI only | DOI, arXiv ID, PMC ID, title, PubChem CID |
| Source count | 4–5 | 13 paper sources + PubChem + Materials Project |
| Dependencies | stdlib | stdlib (zero external deps) |
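
The parallel strategy in the comparison above can be sketched with the standard library. This is a minimal illustration under stated assumptions, not the project's actual code: the resolver functions, their names, and the priority numbers are hypothetical stand-ins, and the sketch waits for every resolver before choosing a winner, whereas a real implementation may return earlier.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

# Hypothetical resolvers: each returns a URL string on a hit, or None on a miss.
def slow_hit(query):
    time.sleep(0.05)
    return f"https://example.org/slow/{query}.pdf"

def fast_hit(query):
    return f"https://example.org/fast/{query}.pdf"

def miss(query):
    return None

# Lower number = higher priority (an assumption made for this sketch).
RESOLVERS = [(1, "slow_hit", slow_hit), (2, "fast_hit", fast_hit), (3, "miss", miss)]

def resolve_parallel(query, resolvers=RESOLVERS, workers=8):
    """Query every resolver at once and return the highest-priority hit."""
    hits = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fn, query): (prio, name) for prio, name, fn in resolvers}
        for future in as_completed(futures):
            prio, name = futures[future]
            try:
                url = future.result()
            except Exception:
                continue  # one failing source should not sink the whole query
            if url is not None:
                hits.append((prio, name, url))
    # Tuples compare by priority first, so min() picks the best-ranked hit.
    return min(hits) if hits else None

best = resolve_parallel("10.1000/example")
print(best)
```

The slower resolver still wins here because selection is by priority, not by arrival order; arrival order only determines how long the caller waits.
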

| Source | Domain | API Key Required |
|---|---|---|
| Unpaywall | General OA detection | Email (UNPAYWALL_EMAIL) |
| OpenAlex | Fully open academic graph | No |
| Semantic Scholar | AI-enriched paper graph | Optional |
| Crossref | Authoritative DOI metadata | No |
| Europe PMC | Life-science full text | No |
| NCBI PubMed Central | Biomedical OA | No |
| arXiv | Physics / CS / Math / QBio preprints | No |
| bioRxiv / medRxiv | Life-science & clinical preprints | No |
| CORE | Largest OA full-text aggregator | Optional |
| INSPIRE-HEP | High-energy physics | No |
| DBLP | Computer science venues | No |
| NASA ADS | Astrophysics & physics | Required |
| zbMATH | Mathematics reviews | No |

| Source | Domain | API Key Required |
|---|---|---|
| PubChem | Chemical compounds (SMILES, IUPAC, formula) | No |
| Materials Project | Crystal structures, band gaps, thermodynamics | Required |
# Clone
git clone https://github.com/HUSRCF/openfetcher.git
cd openfetcher
# No pip install needed — pure Python stdlib
python scripts/fetch.py --version

Requirements: Python 3.8+, no external packages.
# DOI → download PDF
python scripts/fetch.py 10.1038/s41586-021-03819-2
# Dry-run — resolve sources without downloading
python scripts/fetch.py 10.1038/s41586-021-03819-2 --dry-run
# arXiv paper
python scripts/fetch.py arxiv:1706.03762
# Plain-text title (resolved to DOI via OpenAlex / S2)
python scripts/fetch.py "attention is all you need"
# PMC ID
python scripts/fetch.py PMC7026016
# Chemical compound
python scripts/fetch.py CID2244
# Batch from file
python scripts/fetch.py --batch dois.txt --out ./papers
# Batch from stdin
echo "10.1038/s41586-021-03819-2" | python scripts/fetch.py --batch -
# Machine-readable schema
python scripts/fetch.py schema --pretty

| Type | Example |
|---|---|
| DOI | 10.1038/s41586-020-2649-2 or https://doi.org/10.… |
| arXiv ID | 2301.07041 / arxiv:2301.07041 / hep-th/9711200 |
| PMC ID | PMC7026016 |
| PubMed ID | PMID12345678 or bare 7–9 digit number |
| PubChem CID | CID2244 |
| Title | "attention is all you need" (parallel title search via OpenAlex + S2) |
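
To make the input types concrete, here is a rough sketch of how a query string could be classified. The regular expressions and the `classify_query` name are illustrative guesses derived only from the examples in the table; the detection rules inside `scripts/fetch.py` may differ.

```python
import re

# Illustrative patterns, checked in order; the first match wins.
PATTERNS = [
    ("doi", re.compile(r"^(?:https?://(?:dx\.)?doi\.org/)?10\.\d{4,9}/\S+$", re.I)),
    ("arxiv", re.compile(
        r"^(?:arxiv:)?(?:\d{4}\.\d{4,5}(?:v\d+)?|[a-z\-]+(?:\.[a-z]{2})?/\d{7})$", re.I)),
    ("pmc", re.compile(r"^PMC\d+$", re.I)),
    ("pmid", re.compile(r"^(?:PMID\d+|\d{7,9})$", re.I)),
    ("cid", re.compile(r"^CID\d+$", re.I)),
]

def classify_query(query):
    """Return the guessed input type, falling back to a title search."""
    text = query.strip()
    for kind, pattern in PATTERNS:
        if pattern.match(text):
            return kind
    return "title"

for example in ["10.1038/s41586-020-2649-2", "hep-th/9711200", "PMC7026016",
                "CID2244", "attention is all you need"]:
    print(example, "->", classify_query(example))
```

Anything that matches no identifier pattern is treated as a title, which mirrors the fallback row in the table.
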
python scripts/fetch.py <QUERY> [options]
python scripts/fetch.py --batch <FILE|-> [options]
python scripts/fetch.py schema

| Flag | Default | Description |
|---|---|---|
| `query` | — | DOI, arXiv ID, PMC ID, PubChem CID, or title |
| `--batch FILE` | — | File with one query per line; `-` reads stdin |
| `--out DIR` | `downloads` | Output directory |
| `--dry-run` | off | Resolve without downloading |
| `--format json\|text` | auto | json when piped, text in terminal |
| `--pretty` | off | Pretty-print JSON output |
| `--stream` | off | Emit one NDJSON line per resolved query |
| `--overwrite` | off | Re-download files that already exist |
| `--idempotency-key KEY` | — | Replay cached envelope without network I/O |
| `--timeout SEC` | `20` | Per-request HTTP timeout |
| `--workers N` | `8` | Parallel resolver threads per query |
| `--version` | — | Print version and exit |
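
The `auto` default for `--format` ("json when piped, text in terminal") is commonly implemented with a TTY check. The sketch below shows that general technique; the `pick_format` name is hypothetical, and it is an assumption that the script decides the same way.

```python
import sys

def pick_format(requested="auto", stream=None):
    """Return 'json' or 'text', resolving 'auto' from whether the stream is a terminal."""
    if requested in ("json", "text"):
        return requested  # an explicit choice always wins
    stream = sys.stdout if stream is None else stream
    return "text" if stream.isatty() else "json"

print(pick_format())
```

Passing `--format json` explicitly avoids depending on this detection when output is captured by another program.
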
- stdout: one JSON envelope (or NDJSON with --stream).
- stderr: NDJSON progress events (JSON mode) or human prose (text mode).
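
A caller can consume both streams from Python roughly as follows. This is a sketch under assumptions: `run_json_cli` is a hypothetical helper, the child process is a stand-in so the example runs without the real CLI, and the field names it prints (`ok`, `event`) are made up because the envelope schema is not described here.

```python
import json
import subprocess
import sys

def run_json_cli(argv, timeout=60):
    """Run a command; parse stdout as one JSON document and stderr as NDJSON."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    envelope = json.loads(proc.stdout) if proc.stdout.strip() else None
    events = [json.loads(line) for line in proc.stderr.splitlines() if line.strip()]
    return proc.returncode, envelope, events

# Stand-in child process that mimics "one JSON envelope on stdout,
# NDJSON events on stderr". The fields are invented for the demo.
CHILD = (
    "import sys\n"
    "print('{\"ok\": true}')\n"
    "print('{\"event\": \"done\"}', file=sys.stderr)\n"
)
code, envelope, events = run_json_cli([sys.executable, "-c", CHILD])
print(code, envelope, events)
```

With the real tool, `argv` would be something like `["python", "scripts/fetch.py", "<QUERY>", "--format", "json"]`. With `--stream`, stdout carries several NDJSON lines, so it would need the same line-by-line parsing as stderr.
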

| Code | Meaning |
|---|---|
| `0` | All queries resolved |
| `1` | Not found (no OA copy; no transport failure) |
| `3` | Validation error (bad arguments) |
| `4` | Transport error (network / IO; retryable) |
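
Because only code 4 is marked retryable, a wrapper script can retry on that code alone. The following is one possible sketch: the `run_with_retry` helper, the attempt count, and the backoff schedule are assumptions for illustration, not behavior built into material-fetch.

```python
import time

EXIT_OK, EXIT_NOT_FOUND, EXIT_VALIDATION, EXIT_TRANSPORT = 0, 1, 3, 4

def run_with_retry(run_once, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call run_once() until it returns something other than a transport error.

    run_once is any callable returning an exit code. Codes 0, 1, and 3 are
    returned immediately, since retrying cannot change them.
    """
    code = EXIT_TRANSPORT
    for attempt in range(max_attempts):
        code = run_once()
        if code != EXIT_TRANSPORT:
            return code
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff: 1s, 2s, ...
    return code

# Demo with a scripted sequence of exit codes instead of real network calls.
codes = iter([4, 4, 0])
result = run_with_retry(lambda: next(codes), sleep=lambda seconds: None)
print(result)
```

In practice `run_once` could wrap `subprocess.run([...]).returncode` for a single `scripts/fetch.py` invocation.
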

| Variable | Purpose |
|---|---|
| `UNPAYWALL_EMAIL` | Contact email for Unpaywall API (highly recommended) |
| `SEMANTIC_SCHOLAR_API_KEY` | Higher rate limits for Semantic Scholar |
| `NASA_ADS_API_KEY` | Required to enable NASA ADS source |
| `MATERIALS_PROJECT_API_KEY` | Required to enable Materials Project source |
| `CORE_API_KEY` | Higher rate limits for CORE API |
| `MATERIAL_FETCH_ALLOWED_HOSTS` | Comma-separated extra download hostnames |
| `MATERIAL_FETCH_NO_AUTO_UPDATE` | Disable silent background git pull |
| `MATERIAL_FETCH_UPDATE_INTERVAL` | Auto-update cooldown in seconds (default 86400) |
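
Before a long batch run it can help to check which key-gated sources will be active. The sketch below only reads the variables listed above; the `preflight` helper is hypothetical, and how material-fetch itself reacts to a missing key is not specified here.

```python
import os

# Sources marked "Required" in the source tables, mapped to their enabling variable.
REQUIRED_KEYS = {
    "NASA ADS": "NASA_ADS_API_KEY",
    "Materials Project": "MATERIALS_PROJECT_API_KEY",
}

def preflight(env=None):
    """Report which key-gated sources are enabled and which hints apply."""
    env = os.environ if env is None else env
    enabled = [name for name, var in REQUIRED_KEYS.items() if env.get(var)]
    disabled = [name for name in REQUIRED_KEYS if name not in enabled]
    warnings = []
    if not env.get("UNPAYWALL_EMAIL"):
        warnings.append("UNPAYWALL_EMAIL is not set (highly recommended)")
    return {"enabled": enabled, "disabled": disabled, "warnings": warnings}

print(preflight())
```
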
# Unit + CLI smoke tests (no network required)
python -m pytest tests/ -v
# Also run live integration tests (requires network)
LIVE=1 python -m pytest tests/ -v

- Parallel by design. For a DOI query, all 12 resolvers start simultaneously. Total latency ≈ the winning source, not the sum.
- Idempotent by default. Re-running against the same `--out` skips existing files. Use `--idempotency-key` for full envelope replay without any network I/O.
- Auth is delegated. The script never prompts for credentials. Set environment variables in your shell; the script inherits them.
- Host allowlist enforced. Downloads only from a curated list of trusted domains. Extend it with `MATERIAL_FETCH_ALLOWED_HOSTS`, never with a script flag.
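
The host allowlist idea can be sketched in a few lines. Several things here are assumptions: the two base hosts are placeholders (the project's curated list is not shown in this README), the helper names are hypothetical, and exact hostname matching over http(s) is one reasonable policy rather than the confirmed one.

```python
import os
from urllib.parse import urlsplit

# Placeholder base list, for illustration only.
BASE_ALLOWED_HOSTS = {"arxiv.org", "europepmc.org"}

def allowed_hosts(env=None):
    """Base list plus comma-separated extras from MATERIAL_FETCH_ALLOWED_HOSTS."""
    env = os.environ if env is None else env
    extra = env.get("MATERIAL_FETCH_ALLOWED_HOSTS", "")
    extras = {host.strip().lower() for host in extra.split(",") if host.strip()}
    return BASE_ALLOWED_HOSTS | extras

def is_download_allowed(url, env=None):
    """Allow only http(s) URLs whose hostname is exactly on the allowlist."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return parts.scheme in ("http", "https") and host in allowed_hosts(env)

print(is_download_allowed("https://arxiv.org/pdf/1706.03762", env={}))
```

Reading the extension from the environment rather than from a flag keeps the trust decision with whoever configures the shell, which matches the design note above.
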
MIT. See LICENSE.
Also see: README_TW.md for Traditional Chinese documentation.