Releases · dccakes/scrape-gpt

scrape-gpt v1.0.0

Phase 1 of scrape-gpt is complete. This release ships a production-ready web scraping system that combines deterministic XPath extraction with an LLM-powered fallback, cutting LLM costs by roughly 85% compared to page-level extraction.

Architecture

The core insight: only send missing fields to the LLM, not the full page.

XPath extraction (fast, free)
        |
        v
  All fields found? ──yes──> Done
        |
       no
        v
Send ONLY missing fields to LLM
        |
        v
  Merge + update coverage stats

This field-level fallback strategy cuts LLM cost by about 85% compared with calling the LLM for every page (see Cost at Scale).
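The flow above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it uses the standard library's ElementTree (which supports only a subset of XPath and needs well-formed markup) as a stand-in for lxml, and the names SELECTORS, extract_with_xpath, extract_with_fallback, and the fake LLM callable are all hypothetical. The signature of the LLM call is an assumption as well.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical selector config: field name -> XPath (ElementTree subset).
SELECTORS: Dict[str, str] = {
    "title": ".//h1",
    "price": ".//span[@class='price']",
    "sku": ".//span[@class='sku']",  # stale on purpose: the page uses class='code'
}

PAGE = (
    "<html><body><h1>Widget</h1>"
    "<span class='price'>9.99</span><span class='code'>W-1</span>"
    "</body></html>"
)


def extract_with_xpath(html: str, selectors: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Deterministic pass: None marks a field whose selector found nothing."""
    root = ET.fromstring(html)
    result: Dict[str, Optional[str]] = {}
    for field, xpath in selectors.items():
        text = root.findtext(xpath)
        result[field] = text.strip() if text and text.strip() else None
    return result


def extract_with_fallback(
    html: str,
    selectors: Dict[str, str],
    llm_extract_fields: Callable[[str, List[str]], Dict[str, str]],
) -> Tuple[Dict[str, Optional[str]], Dict[str, bool]]:
    """Run XPath first, then ask the LLM only for the fields still missing."""
    fields = extract_with_xpath(html, selectors)
    missing = [name for name, value in fields.items() if value is None]
    if missing:  # all fields found means no LLM call at all
        fields.update(llm_extract_fields(html, missing))
    # Coverage: True where the selector alone was enough.
    coverage = {name: name not in missing for name in selectors}
    return fields, coverage


# A fake LLM that records what it was asked for (assumed signature).
llm_calls: List[List[str]] = []


def fake_llm(html: str, missing: List[str]) -> Dict[str, str]:
    llm_calls.append(list(missing))
    return {name: "from-llm" for name in missing}


fields, coverage = extract_with_fallback(PAGE, SELECTORS, fake_llm)
print(fields)
print(coverage)
print(llm_calls)
```

Only the sku field reaches the fake LLM here; title and price never leave the deterministic path, which is where the cost saving comes from.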

What's Included

Clean Architecture with dependency injection — swap any provider via .env, no code changes:

httpx async fetcher
lxml XPath + CSS selector extractor
Direct LLM provider (Anthropic API, extract_fields only)
Local filesystem storage with coverage tracking
Console alerting
ExtractWithFallback use case — field-level LLM fallback
ScrapePage use case — end-to-end orchestration
CLI (scraper scrape URL, scraper info)
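Swapping a provider through configuration usually comes down to an interface plus a small factory keyed on a setting. The sketch below shows that pattern for alerting only. The ALERT_PROVIDER variable, the Alerter protocol, and both classes are hypothetical names chosen for illustration, and loading the .env file into the environment is left out.

```python
import os
from typing import Dict, Mapping, Protocol, Type


class Alerter(Protocol):
    """Interface the use cases depend on (hypothetical)."""

    def send(self, message: str) -> str: ...


class ConsoleAlerter:
    def send(self, message: str) -> str:
        line = f"[console] {message}"
        print(line)
        return line


class NullAlerter:
    """Drops every alert; handy in tests."""

    def send(self, message: str) -> str:
        return ""


# Registry of known implementations, keyed by the configured name.
ALERTERS: Dict[str, Type] = {"console": ConsoleAlerter, "null": NullAlerter}


def build_alerter(env: Mapping[str, str] = os.environ) -> Alerter:
    """Pick the implementation named in the environment; default to console."""
    name = env.get("ALERT_PROVIDER", "console")
    try:
        return ALERTERS[name]()
    except KeyError:
        raise ValueError(f"unknown ALERT_PROVIDER: {name!r}") from None


alerter = build_alerter({"ALERT_PROVIDER": "console"})
alerter.send("selector for 'sku' stopped matching")
```

Because callers only see the Alerter interface, changing the setting changes the behavior without touching the use-case code.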

Demo — runs offline against bundled HTML fixtures:

make demo

The demo shows three scenarios: multi-item XPath extraction, single-item extraction, and live LLM fallback when a selector breaks.

Test coverage — 33 unit tests + 9 integration tests, 42 total:

make test

CI — GitHub Actions on Python 3.11 and 3.12 (lint → type check → unit tests).

Cost at Scale

At 100 domains with 10,000 pages per month each (12 million pages per year):

Approach	Annual Cost
Page-level LLM	$18,000
Field-level fallback (this)	$2,820
Savings	$15,180/yr
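The table's figures can be checked with a few lines of arithmetic. The per-page costs below are averages implied by the table, not numbers the release states directly.

```python
# Inputs taken from the table and the scale stated above it.
DOMAINS = 100
PAGES_PER_DOMAIN_PER_MONTH = 10_000
PAGE_LEVEL_ANNUAL_USD = 18_000
FIELD_LEVEL_ANNUAL_USD = 2_820

pages_per_year = DOMAINS * PAGES_PER_DOMAIN_PER_MONTH * 12
savings_usd = PAGE_LEVEL_ANNUAL_USD - FIELD_LEVEL_ANNUAL_USD
savings_pct = savings_usd / PAGE_LEVEL_ANNUAL_USD * 100

# Implied average cost per page under each approach.
page_level_per_page = PAGE_LEVEL_ANNUAL_USD / pages_per_year
field_level_per_page = FIELD_LEVEL_ANNUAL_USD / pages_per_year

print(f"pages/year:           {pages_per_year:,}")           # 12,000,000
print(f"savings:              ${savings_usd:,}/yr")          # $15,180/yr
print(f"savings share:        {savings_pct:.1f}%")           # 84.3%
print(f"page-level per page:  ${page_level_per_page:.6f}")   # $0.001500
print(f"field-level per page: ${field_level_per_page:.6f}")  # $0.000235
```

The savings share works out to 84.3%, which matches the roughly 85% reduction quoted in the Architecture section.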

Quick Start

git clone https://github.com/dccakes/scrape-gpt
cd scrape-gpt
pip install -e ".[dev]"
make demo          # no API key needed

What's Next (Phase 2)

Multi-sample config generation (3–5 pages for robust selectors)
propose_selectors() / repair_selectors() in LLM provider
Validation + repair loops
GenerateConfig use case

License

Apache 2.0

v1.0.0 — Phase 1: XPath + LLM Fallback Scraping
