Releases: dccakes/scrape-gpt

v1.0.0 — Phase 1: XPath + LLM Fallback Scraping

13 Apr 15:19

scrape-gpt v1.0.0

Phase 1 of scrape-gpt is complete. This release ships a production-ready web scraping system that runs deterministic XPath extraction first and falls back to an LLM only for the fields the selectors miss, which cuts LLM costs by roughly 85% compared to page-level extraction.

Architecture

The core insight: only send missing fields to the LLM, not the full page.

XPath extraction (fast, free)
        |
        v
  All fields found? ──yes──> Done
        |
       no
        v
Send ONLY missing fields to LLM
        |
        v
  Merge + update coverage stats

This field-level fallback strategy achieves ~85% cost reduction vs calling the LLM for every page.
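
The flow above can be sketched in a few lines of Python. This is an illustrative outline, not the project's actual ExtractWithFallback code: the function name, the two extractor callables, and the shape of the coverage dictionary are all assumptions. The release notes do not say what page context accompanies an LLM request, so the sketch leaves that to the `llm_extract` callable.

```python
from typing import Callable, Dict, List, Optional

Fields = Dict[str, Optional[str]]


def extract_with_fallback(
    html: str,
    wanted: List[str],
    xpath_extract: Callable[[str, List[str]], Fields],  # hypothetical interface
    llm_extract: Callable[[List[str]], Fields],         # hypothetical interface
    coverage: Dict[str, Dict[str, int]],
) -> Fields:
    """Run the deterministic extractor first; ask the LLM only for the gaps."""
    result = xpath_extract(html, wanted)
    missing = [name for name in wanted if not result.get(name)]

    if missing:
        # Only the missing field names are requested from the LLM.
        llm_result = llm_extract(missing)
        for name in missing:
            result[name] = llm_result.get(name)

    # Record which path produced each field so failing selectors become visible.
    for name in wanted:
        stats = coverage.setdefault(name, {"xpath": 0, "llm": 0, "missing": 0})
        if name in missing:
            stats["llm" if result.get(name) else "missing"] += 1
        else:
            stats["xpath"] += 1
    return result


# Stub extractors standing in for the real XPath and LLM providers.
def fake_xpath(html: str, names: List[str]) -> Fields:
    found = {"title": "Widget", "price": None}  # the price selector found nothing
    return {n: found.get(n) for n in names}


def fake_llm(names: List[str]) -> Fields:
    return {n: "$9.99" for n in names if n == "price"}


coverage: Dict[str, Dict[str, int]] = {}
record = extract_with_fallback(
    "<html></html>", ["title", "price"], fake_xpath, fake_llm, coverage
)
print(record)    # {'title': 'Widget', 'price': '$9.99'}
print(coverage)
```

When every field is found by XPath, `missing` is empty and the LLM callable is never invoked, which is where the cost saving comes from.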

What's Included

Clean Architecture with dependency injection — swap any provider via .env, no code changes:

  • httpx async fetcher
  • lxml XPath + CSS selector extractor
  • Direct LLM provider (Anthropic API, extract_fields only)
  • Local filesystem storage with coverage tracking
  • Console alerting
  • ExtractWithFallback use case — field-level LLM fallback
  • ScrapePage use case — end-to-end orchestration
  • CLI (scraper scrape URL, scraper info)
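
As a rough picture of what "swap any provider via .env" can look like, the sketch below selects an alerting implementation from an environment variable. Everything named here is hypothetical: the `ALERT_PROVIDER` variable, the `Alerter` protocol, and the `SilentAlerter` class are not taken from the project, and the sketch assumes the .env file has already been loaded into the process environment.

```python
import os
from typing import Callable, Dict, Mapping, Protocol


class Alerter(Protocol):
    """Hypothetical provider interface; callers depend only on this."""

    def alert(self, message: str) -> str: ...


class ConsoleAlerter:
    def alert(self, message: str) -> str:
        line = f"[alert] {message}"
        print(line)
        return line


class SilentAlerter:
    """Illustrative alternative provider that discards alerts."""

    def alert(self, message: str) -> str:
        return ""


# Registry mapping a config value to a provider factory.
ALERTERS: Dict[str, Callable[[], Alerter]] = {
    "console": ConsoleAlerter,
    "silent": SilentAlerter,
}


def build_alerter(env: Mapping[str, str] = os.environ) -> Alerter:
    """Pick the provider named in the environment (variable name is assumed)."""
    name = env.get("ALERT_PROVIDER", "console")
    try:
        return ALERTERS[name]()
    except KeyError:
        raise ValueError(f"unknown ALERT_PROVIDER: {name!r}") from None


alerter = build_alerter({"ALERT_PROVIDER": "console"})
alerter.alert("selector for 'price' returned no match")
```

Because use cases receive an `Alerter` rather than constructing one, changing the environment value is enough to change behavior without touching the calling code.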

Demo — runs offline against bundled HTML fixtures:

make demo

The demo walks through three scenarios: multi-item XPath extraction, single-item extraction, and live LLM fallback when a selector breaks.
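
To make the "selector breaks" scenario concrete, here is a small stand-alone illustration of how a stale selector turns into a missing field. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so it runs without extra dependencies. The fixture markup, field names, and selectors are invented for the example and are not the bundled demo fixtures.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Invented, well-formed fixture; real pages would go through an HTML parser.
FIXTURE = """
<html><body>
  <div class="product">
    <h1 class="title">Blue Widget</h1>
    <span class="cost">$9.99</span>
  </div>
</body></html>
"""

SELECTORS = {
    "title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",  # stale: the markup uses class="cost"
}


def run_selectors(markup: str, selectors: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return the text of the first match per field, or None if nothing matched."""
    root = ET.fromstring(markup)
    out: Dict[str, Optional[str]] = {}
    for name, path in selectors.items():
        node = root.find(path)
        out[name] = node.text.strip() if node is not None and node.text else None
    return out


fields = run_selectors(FIXTURE, SELECTORS)
missing = [name for name, value in fields.items() if value is None]
print(fields)   # {'title': 'Blue Widget', 'price': None}
print(missing)  # ['price']
```

Only the `price` field ends up in `missing`, so it is the only field a fallback step would need to resolve.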

Test coverage — 33 unit tests + 9 integration tests, 42 total:

make test

CI — GitHub Actions on Python 3.11 and 3.12 (lint → type check → unit tests).

Cost at Scale

At 100 domains, 10k pages/month each:

Approach                       Annual cost
Page-level LLM                 $18,000
Field-level fallback (this)    $2,820
Savings                        $15,180/yr
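
The figures above can be checked with a little arithmetic. The snippet uses only the numbers stated in this section. The per-page costs are derived from them and are not quoted anywhere in the release.

```python
# Inputs taken from the table and the stated volume.
page_level_annual = 18_000    # USD per year, LLM on every page
field_level_annual = 2_820    # USD per year, field-level fallback
domains = 100
pages_per_domain_per_month = 10_000

pages_per_year = domains * pages_per_domain_per_month * 12
savings = page_level_annual - field_level_annual
reduction = savings / page_level_annual

print(pages_per_year)                                    # 12000000
print(savings)                                           # 15180
print(f"{reduction:.1%}")                                # 84.3%
print(f"{page_level_annual / pages_per_year:.6f}")       # 0.001500 USD per page
print(f"{field_level_annual / pages_per_year:.6f}")      # 0.000235 USD per page
```

The table works out to a reduction of about 84%, in line with the ~85% figure quoted under Architecture.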

Quick Start

git clone https://github.com/dccakes/scrape-gpt
cd scrape-gpt
pip install -e ".[dev]"
make demo          # no API key needed

What's Next (Phase 2)

  • Multi-sample config generation (3–5 pages for robust selectors)
  • propose_selectors() / repair_selectors() in LLM provider
  • Validation + repair loops
  • GenerateConfig use case

License

Apache 2.0