openfetcher / material-fetch

Fetch open-access scientific content from 17+ databases — in parallel.

material-fetch is a zero-dependency Python CLI that retrieves PDFs, chemical compound records, and materials data from legal open-access sources. Unlike sequential fallback tools, it fans out to all relevant resolvers simultaneously via ThreadPoolExecutor, then picks the highest-priority result.

It never accesses Sci-Hub or any other paywall-bypass service.

Why material-fetch?

	Classic sequential fetch	material-fetch
Strategy	Try source A → if miss, try B → …	All sources queried in parallel
Typical latency	Sum of all attempted sources	≈ Latency of the winning source
Input types	DOI only	DOI, arXiv ID, PMC ID, PubMed ID, title, PubChem CID
Source count	4–5	13 paper sources + PubChem + Materials Project
Dependencies	stdlib	stdlib (zero external deps)
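
The parallel strategy in the table can be illustrated with a minimal sketch. The resolver stubs, their delays, the `example.org` URLs, and the `PRIORITY` order below are all assumptions made for illustration; they are not the tool's real resolvers or ranking.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical priority order (first entry wins); not the tool's real ranking.
PRIORITY = ["unpaywall", "openalex", "europe_pmc"]

def stub(delay, url):
    """Build a stand-in resolver that sleeps instead of calling a real API."""
    def resolve(query):
        time.sleep(delay)
        return url  # None means "no open-access copy found"
    return resolve

RESOLVERS = {
    "unpaywall": stub(0.05, None),
    "openalex": stub(0.10, "https://example.org/openalex.pdf"),
    "europe_pmc": stub(0.02, "https://example.org/europe_pmc.pdf"),
}

def fetch_parallel(query, workers=8):
    """Submit every resolver at once, then keep the highest-priority hit."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in RESOLVERS.items()}
        hits = {name: future.result() for name, future in futures.items()}
    for name in PRIORITY:
        if hits[name]:
            return name, hits[name]
    return None, None

start = time.perf_counter()
print(fetch_parallel("10.1038/s41586-021-03819-2"))
print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

The stubs sleep for 0.17 s in total, but because they run concurrently the call should take roughly as long as the slowest stub. This sketch waits for every resolver before choosing; an implementation could return earlier, once no higher-priority resolver is still pending.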

Covered platforms

Papers & literature

Source	Domain	API Key Required
Unpaywall	General OA detection	Email (`UNPAYWALL_EMAIL`)
OpenAlex	Fully open academic graph	No
Semantic Scholar	AI-enriched paper graph	Optional
Crossref	Authoritative DOI metadata	No
Europe PMC	Life-science full text	No
NCBI PubMed Central	Biomedical OA	No
arXiv	Physics / CS / Math / QBio preprints	No
bioRxiv / medRxiv	Life-science & clinical preprints	No
CORE	Largest OA full-text aggregator	Optional
INSPIRE-HEP	High-energy physics	No
DBLP	Computer science venues	No
NASA ADS	Astrophysics & physics	Required
zbMATH	Mathematics reviews	No

Specialty (non-paper)

Source	Domain	API Key Required
PubChem	Chemical compounds (SMILES, IUPAC, formula)	No
Materials Project	Crystal structures, band gaps, thermodynamics	Required

Installation

# Clone
git clone https://github.com/HUSRCF/openfetcher.git
cd openfetcher

# No pip install needed — pure Python stdlib
python scripts/fetch.py --version

Requirements: Python 3.8+, no external packages.

Quick start

# DOI → download PDF
python scripts/fetch.py 10.1038/s41586-021-03819-2

# Dry-run — resolve sources without downloading
python scripts/fetch.py 10.1038/s41586-021-03819-2 --dry-run

# arXiv paper
python scripts/fetch.py arxiv:1706.03762

# Plain-text title (resolved to DOI via OpenAlex / S2)
python scripts/fetch.py "attention is all you need"

# PMC ID
python scripts/fetch.py PMC7026016

# Chemical compound
python scripts/fetch.py CID2244

# Batch from file
python scripts/fetch.py --batch dois.txt --out ./papers

# Batch from stdin
echo "10.1038/s41586-021-03819-2" | python scripts/fetch.py --batch -

# Machine-readable schema
python scripts/fetch.py schema --pretty

Auto-detected input types

Type	Example
DOI	`10.1038/s41586-020-2649-2` or `https://doi.org/10.…`
arXiv ID	`2301.07041` / `arxiv:2301.07041` / `hep-th/9711200`
PMC ID	`PMC7026016`
PubMed ID	`PMID12345678` or bare 7–9 digit number
PubChem CID	`CID2244`
Title	`"attention is all you need"` (parallel title search via OpenAlex + Semantic Scholar)
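
Detection of this kind is usually done with ordered regular expressions. The sketch below is one possible classifier for the examples in the table; the patterns and the `classify` helper are assumptions and may not match the rules the CLI actually applies.

```python
import re

# Assumed patterns, checked from most to least specific.
PATTERNS = [
    ("doi", re.compile(r"^(?:https?://(?:dx\.)?doi\.org/)?10\.\d{4,9}/\S+$", re.I)),
    ("arxiv", re.compile(
        r"^(?:arxiv:)?(?:\d{4}\.\d{4,5}(?:v\d+)?|[a-z\-]+(?:\.[a-z]{2})?/\d{7})$", re.I)),
    ("pmc", re.compile(r"^PMC\d+$", re.I)),
    ("pmid", re.compile(r"^(?:PMID\d+|\d{7,9})$", re.I)),
    ("pubchem_cid", re.compile(r"^CID\d+$", re.I)),
]

def classify(query):
    """Return the first matching input type, falling back to a title search."""
    text = query.strip()
    for kind, pattern in PATTERNS:
        if pattern.match(text):
            return kind
    return "title"

for example in ["10.1038/s41586-020-2649-2", "hep-th/9711200", "PMC7026016",
                "CID2244", "attention is all you need"]:
    print(example, "->", classify(example))
```

Order matters here: the DOI pattern is tried before the arXiv pattern, and the bare-digit PMID pattern only applies to input that is 7 to 9 digits with nothing else around it.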

CLI reference

python scripts/fetch.py <QUERY> [options]
python scripts/fetch.py --batch <FILE|-> [options]
python scripts/fetch.py schema

Flag	Default	Description
`query`	—	DOI, arXiv ID, PMC ID, PubMed ID, PubChem CID, or title
`--batch FILE`	—	File with one query per line; `-` reads stdin
`--out DIR`	`downloads`	Output directory
`--dry-run`	off	Resolve without downloading
`--format json|text`	auto	`json` when piped, `text` in terminal
`--pretty`	off	Pretty-print JSON output
`--stream`	off	Emit one NDJSON line per resolved query
`--overwrite`	off	Re-download files that already exist
`--idempotency-key KEY`	—	Replay cached envelope without network I/O
`--timeout SEC`	`20`	Per-request HTTP timeout
`--workers N`	`8`	Parallel resolver threads per query
`--version`	—	Print version and exit
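
The `auto` default for `--format` can be understood as a TTY check on stdout. A minimal sketch, assuming a hypothetical helper `pick_format` (the script's own logic is not shown in this README):

```python
import io
import sys

def pick_format(explicit=None, stream=None):
    """Return 'json' or 'text'; an explicit choice always wins."""
    if explicit in ("json", "text"):
        return explicit
    stream = sys.stdout if stream is None else stream
    # A terminal gets human-readable text; a pipe or file gets JSON.
    return "text" if stream.isatty() else "json"

# A StringIO stands in for a pipe: it is never a TTY.
print(pick_format(stream=io.StringIO()))         # json
print(pick_format("text", stream=io.StringIO())) # text
```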

Output format

stdout: one JSON envelope (or NDJSON with `--stream`).
stderr: NDJSON progress events (JSON mode) or human-readable prose (text mode).

// Success
{
  "ok": true,
  "data": {
    "results": [{
      "query":        "10.1038/s41586-021-03819-2",
      "success":      true,
      "source":       "europe_pmc",
      "pdf_url":      "https://europepmc.org/backend/ptpmcrender.fcgi?accid=PMC8371605&blobtype=pdf",
      "file":         "downloads/Jumper_2021_Highly_accurate_protein_structure_predic.pdf",
      "meta":         {"title": "Highly accurate protein structure prediction with AlphaFold", "year": 2021, "author": "Jumper"},
      "sources_tried": ["unpaywall", "openalex", "semantic_scholar", "crossref", "europe_pmc", "..."]
    }],
    "summary": {"total": 1, "succeeded": 1, "failed": 0},
    "next": []
  },
  "meta": {
    "request_id": "req_46a95974efc9",
    "latency_ms": 1840,
    "schema_version": "1.0.0",
    "cli_version": "1.0.0"
  }
}
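
A consumer can read the envelope with the standard `json` module. The sketch below relies only on fields that appear in the success example above; the helper name `summarize` is hypothetical, and the shape of a failure envelope is not documented here, so the code simply returns an empty list when `ok` is false.

```python
import json

def summarize(envelope_text):
    """Return (query, source, file) for each successful result in an envelope."""
    envelope = json.loads(envelope_text)
    if not envelope.get("ok"):
        return []
    results = envelope.get("data", {}).get("results", [])
    return [(r["query"], r.get("source"), r.get("file"))
            for r in results if r.get("success")]

# A trimmed copy of the documented success envelope.
sample = json.dumps({
    "ok": True,
    "data": {
        "results": [{
            "query": "10.1038/s41586-021-03819-2",
            "success": True,
            "source": "europe_pmc",
            "file": "downloads/Jumper_2021_Highly_accurate_protein_structure_predic.pdf",
        }],
        "summary": {"total": 1, "succeeded": 1, "failed": 0},
    },
})

for query, source, path in summarize(sample):
    print(f"{query} resolved via {source}, saved to {path}")
```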

Exit codes

Code	Meaning
`0`	All queries resolved
`1`	Not found (no OA copy; no transport failure)
`3`	Validation error (bad arguments)
`4`	Transport error (network / IO; retryable)
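
Because only code `4` is described as retryable, a wrapper script can retry on that code alone. The following is a sketch under stated assumptions: `run_with_retry`, its attempt count, and its linear backoff are illustrative choices, not part of the tool.

```python
import subprocess
import time

RETRYABLE = {4}  # transport error, per the table above

def run_with_retry(cmd, attempts=3, backoff=1.0, runner=subprocess.run):
    """Run cmd, retrying only on a retryable exit code (attempts must be >= 1)."""
    for attempt in range(1, attempts + 1):
        code = runner(cmd).returncode
        if code not in RETRYABLE or attempt == attempts:
            return code
        time.sleep(backoff * attempt)  # wait a little longer each round

# Example call (not executed here):
# code = run_with_retry(["python", "scripts/fetch.py", "10.1038/s41586-021-03819-2"])
```

Codes `1` and `3` are returned immediately, since repeating the same request would not change a missing open-access copy or a bad argument.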

Environment variables

Variable	Purpose
`UNPAYWALL_EMAIL`	Contact email for Unpaywall API (highly recommended)
`SEMANTIC_SCHOLAR_API_KEY`	Higher rate limits for Semantic Scholar
`NASA_ADS_API_KEY`	Required to enable NASA ADS source
`MATERIALS_PROJECT_API_KEY`	Required to enable Materials Project source
`CORE_API_KEY`	Higher rate limits for CORE API
`MATERIAL_FETCH_ALLOWED_HOSTS`	Comma-separated extra download hostnames
`MATERIAL_FETCH_NO_AUTO_UPDATE`	Disable silent background git pull
`MATERIAL_FETCH_UPDATE_INTERVAL`	Auto-update cooldown in seconds (default `86400`)
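
Before a long batch run it can help to check which key-gated sources the current shell would enable. This sketch uses only the variable names from the table; the source labels, status strings, and the `source_status` helper are assumptions made for illustration.

```python
import os

# Sources the table marks as requiring a key.
REQUIRED_KEYS = {
    "nasa_ads": "NASA_ADS_API_KEY",
    "materials_project": "MATERIALS_PROJECT_API_KEY",
}
# Sources that work without a key but get higher rate limits with one.
OPTIONAL_KEYS = {
    "semantic_scholar": "SEMANTIC_SCHOLAR_API_KEY",
    "core": "CORE_API_KEY",
}

def source_status(env=None):
    """Map each key-gated source to a short status string."""
    env = os.environ if env is None else env
    status = {}
    for source, var in REQUIRED_KEYS.items():
        status[source] = "enabled" if env.get(var) else f"disabled (set {var})"
    for source, var in OPTIONAL_KEYS.items():
        status[source] = "keyed" if env.get(var) else "anonymous"
    return status

for source, state in source_status().items():
    print(f"{source}: {state}")
```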

Running tests

# Unit + CLI smoke tests (no network required)
python -m pytest tests/ -v

# Also run live integration tests (requires network)
LIVE=1 python -m pytest tests/ -v

Notes

Parallel by design. For a DOI query, all 12 resolvers start simultaneously. Total latency ≈ the winning source, not the sum.
Idempotent by default. Re-running against the same `--out` skips existing files. Use `--idempotency-key` for full envelope replay without any network I/O.
Auth is delegated. The script never prompts for credentials. Set environment variables in your shell; the script inherits them.
Host allowlist enforced. Downloads only from a curated list of trusted domains. Extend it with `MATERIAL_FETCH_ALLOWED_HOSTS`, never with a script flag.
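
The host allowlist note can be sketched as a check performed before any download. The built-in host set below is an illustrative subset, and exact hostname matching and the HTTPS-only rule are assumptions; only the `MATERIAL_FETCH_ALLOWED_HOSTS` variable and its comma-separated format come from this README.

```python
import os
from urllib.parse import urlparse

# Illustrative subset; the tool's curated list is not reproduced here.
BASE_ALLOWED = {"europepmc.org", "arxiv.org"}

def allowed_hosts(env=None):
    """Combine the built-in hosts with any extras from the environment."""
    env = os.environ if env is None else env
    extra = env.get("MATERIAL_FETCH_ALLOWED_HOSTS", "")
    return BASE_ALLOWED | {h.strip().lower() for h in extra.split(",") if h.strip()}

def is_allowed(url, env=None):
    """Accept only HTTPS URLs whose hostname is exactly on the allowlist."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    return parts.scheme == "https" and host in allowed_hosts(env)

print(is_allowed("https://europepmc.org/backend/ptpmcrender.fcgi", env={}))
print(is_allowed("https://papers.example.org/a.pdf",
                 env={"MATERIAL_FETCH_ALLOWED_HOSTS": "papers.example.org"}))
```

Matching the parsed hostname exactly, rather than searching the URL string, keeps look-alike hosts such as `europepmc.org.evil.example` from passing the check.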

License

MIT. See LICENSE.

Also see: README_CN.md for Chinese documentation.
