v1.0.0 — Phase 1: XPath + LLM Fallback Scraping
scrape-gpt v1.0.0
Phase 1 of scrape-gpt is complete. This release ships a production-ready web scraping system that combines deterministic XPath extraction with an LLM-powered fallback, cutting LLM costs by roughly 85% compared with page-level extraction.
Architecture
The core insight: only send missing fields to the LLM, not the full page.
XPath extraction (fast, free)
|
v
All fields found? ──yes──> Done
|
no
v
Send ONLY missing fields to LLM
|
v
Merge + update coverage stats
This field-level fallback strategy achieves ~85% cost reduction vs calling the LLM for every page.
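The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual `ExtractWithFallback` code: the function name `extract_with_fallback`, the two extractor callables, and their signatures are all assumptions made for the example. How much page content accompanies the LLM request is left to the hypothetical `llm_extract` callable.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical signature for both extractors: (page, field names) -> values.
Extractor = Callable[[str, List[str]], Dict[str, Optional[str]]]


def extract_with_fallback(
    page: str,
    fields: Iterable[str],
    xpath_extract: Extractor,
    llm_extract: Extractor,
) -> Tuple[Dict[str, Optional[str]], Dict[str, bool]]:
    """Run the cheap extractor first, then ask the LLM only for the gaps."""
    fields = list(fields)
    result = dict(xpath_extract(page, fields))

    # A field counts as missing if the deterministic pass returned nothing.
    missing = [f for f in fields if not result.get(f)]

    if missing:
        # Only the missing field names are requested from the LLM.
        llm_values = llm_extract(page, missing)
        for f in missing:
            if llm_values.get(f):
                result[f] = llm_values[f]

    # Coverage: True where the deterministic extractor succeeded on its own.
    coverage = {f: f not in missing for f in fields}
    return result, coverage


# Stub extractors standing in for the real XPath and LLM providers.
def fake_xpath(page: str, fields: List[str]) -> Dict[str, Optional[str]]:
    return {"title": "Widget", "price": None}  # pretend the price selector broke


llm_calls: List[List[str]] = []


def fake_llm(page: str, fields: List[str]) -> Dict[str, Optional[str]]:
    llm_calls.append(list(fields))
    return {"price": "9.99"}


result, coverage = extract_with_fallback(
    "<html>...</html>", ["title", "price"], fake_xpath, fake_llm
)
```

With these stubs the LLM is called once and asked only for `price`. The coverage map records that `title` was handled by the deterministic pass and `price` was not.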
What's Included
Clean Architecture with dependency injection — swap any provider via .env, no code changes:
- `httpx` async fetcher
- `lxml` XPath + CSS selector extractor
- Direct LLM provider (Anthropic API, `extract_fields` only)
- Local filesystem storage with coverage tracking
- Console alerting
- `ExtractWithFallback` use case: field-level LLM fallback
- `ScrapePage` use case: end-to-end orchestration
- CLI (`scraper scrape URL`, `scraper info`)
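One way to picture the provider swap described above is a small registry keyed by an environment value. The sketch below is illustrative only: the variable name `FETCHER_PROVIDER`, the class names, and the registry are assumptions and not the project's real configuration keys. It also assumes the `.env` file has already been loaded into the environment mapping.

```python
import os
from typing import Callable, Dict, Mapping, Optional, Protocol


class Fetcher(Protocol):
    """Interface the use cases depend on; concrete providers are injected."""

    def fetch(self, url: str) -> str: ...


class HttpFetcher:
    """Placeholder for a network-backed provider (real fetch omitted)."""

    def fetch(self, url: str) -> str:
        raise NotImplementedError("network fetch is omitted from this sketch")


class FixtureFetcher:
    """Returns canned HTML, for example for offline runs."""

    def __init__(self, pages: Optional[Dict[str, str]] = None) -> None:
        self.pages = pages or {}

    def fetch(self, url: str) -> str:
        return self.pages.get(url, "<html></html>")


# Hypothetical registry: configuration value -> provider factory.
FETCHERS: Dict[str, Callable[[], Fetcher]] = {
    "http": HttpFetcher,
    "fixture": FixtureFetcher,
}


def build_fetcher(env: Mapping[str, str] = os.environ) -> Fetcher:
    """Pick a provider from configuration instead of hard-coding it."""
    name = env.get("FETCHER_PROVIDER", "http")  # hypothetical variable name
    try:
        return FETCHERS[name]()
    except KeyError:
        raise ValueError(f"unknown fetcher provider: {name!r}") from None


fetcher = build_fetcher({"FETCHER_PROVIDER": "fixture"})
html = fetcher.fetch("https://example.com/")
```

Because calling code depends only on the `Fetcher` protocol, changing the configured value changes the provider without touching that code.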
Demo — runs offline against bundled HTML fixtures:
make demo
The demo shows three scenarios: multi-item XPath extraction, single-item extraction, and live LLM fallback when a selector breaks.
Test coverage — 33 unit tests + 9 integration tests, 42 total:
make test
CI runs on GitHub Actions with Python 3.11 and 3.12 (lint → type check → unit tests).
Cost at Scale
At 100 domains, 10k pages/month each:
| Approach | Annual Cost |
|---|---|
| Page-level LLM | $18,000 |
| Field-level fallback (this) | $2,820 |
| Savings | $15,180/yr |
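The arithmetic behind the table can be checked directly. The annual totals and the volume (100 domains, 10k pages per month each) come from the release notes. The per-page costs are not stated there; they are derived here by dividing those totals by the yearly page count.

```python
DOMAINS = 100
PAGES_PER_DOMAIN_PER_MONTH = 10_000
pages_per_year = DOMAINS * PAGES_PER_DOMAIN_PER_MONTH * 12

page_level_annual = 18_000   # USD per year, from the table
field_level_annual = 2_820   # USD per year, from the table

savings = page_level_annual - field_level_annual
reduction_pct = 100 * savings / page_level_annual

# Implied per-page costs (derived, not stated in the release notes).
page_level_per_page = page_level_annual / pages_per_year
field_level_per_page = field_level_annual / pages_per_year

print(f"pages/year:           {pages_per_year:,}")            # 12,000,000
print(f"savings:              ${savings:,}/yr")               # $15,180/yr
print(f"reduction:            {reduction_pct:.1f}%")          # 84.3%
print(f"page-level per page:  ${page_level_per_page:.6f}")    # $0.001500
print(f"field-level per page: ${field_level_per_page:.6f}")   # $0.000235
```

The computed reduction of about 84% is in line with the roughly 85% figure quoted for the field-level fallback strategy.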
Quick Start
git clone https://github.com/dccakes/scrape-gpt
cd scrape-gpt
pip install -e ".[dev]"
make demo # no API key needed
What's Next (Phase 2)
- Multi-sample config generation (3–5 pages for robust selectors)
- `propose_selectors()` / `repair_selectors()` in LLM provider
- Validation + repair loops
- `GenerateConfig` use case
License
Apache 2.0