Web Scraper

Turn a vague machine-learning data request into a usable dataset.

This project takes goals like:

"predict NBA player salary from performance stats"
"forecast U.S. state population growth from GDP"
"estimate startup valuation from funding"

and tries to produce:

a row schema
a set of likely public sources
a cleaned dataset
csv, parquet, and profile artifacts

It is built for list-heavy public web data, not arbitrary full-site crawling.
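As a rough illustration of the csv and profile outputs listed above, here is a minimal sketch using only the Python standard library. It is not this repo's exporter or profiler: the function names `profile_rows` and `rows_to_csv`, the profile keys, and the sample rows are all hypothetical placeholders, and parquet export is left out because it needs a third-party library.

```python
import csv
import io
import json


def profile_rows(rows, target):
    """Summarize dict rows: row count, column names, and missing-value counts."""
    columns = sorted({key for row in rows for key in row})
    missing = {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in columns
    }
    return {
        "row_count": len(rows),
        "columns": columns,
        "target": target,
        "missing_by_column": missing,
    }


def rows_to_csv(rows, columns):
    """Serialize rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up placeholder rows, not real player data.
rows = [
    {"player": "Player A", "points_per_game": 21.4, "salary_usd": 12000000},
    {"player": "Player B", "points_per_game": None, "salary_usd": 8500000},
]

profile = profile_rows(rows, target="salary_usd")
csv_text = rows_to_csv(rows, profile["columns"])
print(json.dumps(profile, indent=2))
print(csv_text)
```

The real profile artifact may record different fields; the point is only that a profile describes the dataset (size, columns, gaps) alongside the data file itself.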

How it works

The pipeline has four stages:

architect: Turns a user goal into a dataset blueprint (target field, feature fields, and starting URLs).
predictive builder: Tries the fast path first by merging public HTML tables directly for supported goal families.
swarm: Falls back to routed extraction when the direct table path is not enough.
synthesizer + formatter: Cleans records, exports files, and writes a dataset profile.
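The control flow of the four stages can be sketched as a fast path with a fallback. This is a simplified illustration, not the code in pipeline_service.py: the `Blueprint` class, `run_pipeline`, and the stub stage functions are hypothetical, and "not enough" is assumed here to mean that the fast path returned no records.

```python
from dataclasses import dataclass, field


@dataclass
class Blueprint:
    """Hypothetical blueprint shape: one target, some features, seed URLs."""
    target_field: str
    feature_fields: list[str]
    start_urls: list[str] = field(default_factory=list)


def run_pipeline(goal, architect, fast_path, swarm, finalize):
    """Run the stages in order, falling back to the swarm if needed."""
    blueprint = architect(goal)
    records = fast_path(blueprint)
    path_used = "predictive builder"
    if not records:
        # ASSUMPTION: "not enough" is modeled as "returned no records".
        records = swarm(blueprint)
        path_used = "swarm"
    return finalize(records), path_used


# Stub stages so the sketch runs without network access.
def stub_architect(goal):
    return Blueprint("salary", ["points", "assists"], ["https://example.com/stats"])


def stub_fast_path(blueprint):
    return []  # pretend no compatible tables were found


def stub_swarm(blueprint):
    return [{"player": " Player A ", "points": "21.4", "salary": "12000000"}]


def stub_finalize(records):
    # Trim stray whitespace from every string value.
    return [{k: v.strip() for k, v in row.items()} for row in records]


dataset, path_used = run_pipeline(
    "predict NBA player salary",
    stub_architect, stub_fast_path, stub_swarm, stub_finalize,
)
print(path_used, dataset)
```

Passing the stages in as callables keeps the fallback decision in one place, so each stage can be swapped or tested on its own.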

The project is strongest when the target data exists on public rankings, directories, stat tables, or "List of..." pages.

Current strengths

Predictive datasets from public stat tables and list pages
Deterministic handling for several common goal families
API mode and local CLI mode
Regression coverage for routing, heuristics, and dataset assembly

Current limits

It is not a general-purpose crawler for any website
Sources with heavy anti-bot protection can still produce inconsistent results
Some goals depend on LLM-backed schema design and recovery
Browser fallback requires agent-browser

If a goal can be satisfied from public tables, this repo performs much better than if it has to infer data from scattered detail pages.
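To see why the table path is easier, consider what merging public tables amounts to: a keyed join that produces one wide row per entity. The sketch below shows that idea in plain Python. It is not the logic in predictive_dataset_builder.py; `normalize_key`, `merge_tables`, and the sample rows are hypothetical, and the key matching is deliberately naive.

```python
def normalize_key(value):
    """Lowercase and collapse whitespace so 'Player  A' matches 'player a'."""
    return " ".join(str(value).lower().split())


def merge_tables(tables, key):
    """Outer-join lists of dict rows on `key`, one output row per entity."""
    merged = {}
    for table in tables:
        for row in table:
            entity = normalize_key(row[key])
            target = merged.setdefault(entity, {key: row[key]})
            for column, value in row.items():
                # Earlier tables win when two tables share a column name.
                target.setdefault(column, value)
    return list(merged.values())


# Made-up placeholder rows, not real statistics.
stats = [
    {"player": "Player A", "points": 21.4},
    {"player": "Player B", "points": 9.8},
]
salaries = [{"player": "player  a", "salary_usd": 12000000}]

wide = merge_tables([stats, salaries], key="player")
print(wide)
```

When the same fields are scattered across detail pages instead, there is no shared table key to join on, which is where extraction and entity matching become much harder.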

Quick start

Install dependencies:

python3 -m pip install -r requirements.txt

Set one API key (OPENAI_API_KEY, GEMINI_API_KEY, or GROQ_API_KEY) if you want LLM-backed planning and synthesis:

export OPENAI_API_KEY=...

Run a build:

python3 main.py --goal "I want to predict NBA player salary" --max-agents 3

Output is written under the working directory or configured artifact directory.

Good example goals

These are the kinds of requests the system handles best:

I want to predict NCAA men's basketball team strength
I want to predict U.S. state population growth
I want to predict NBA player salary
Put together a machine-learning table for startup companies where valuation is the label and funding is a key predictor
Give me a dataset for the biggest U.S. banks so I can estimate market value from asset size and capital strength
I need laptop pricing data where the thing to predict is price and the inputs are hardware specs

CLI

Main entrypoint:

python3 main.py --goal "..." --max-agents 3

Useful commands:

python3 main.py --setup
python3 main.py --api-key-status
python3 main.py --model-status
python3 main.py --doctor

--doctor is the fastest preflight check before local use or deployment.

API

Run the API:

docker compose up --build

Then open:

http://localhost:8000/docs

Main endpoints:

POST /jobs
GET /jobs/{job_id}
GET /jobs/{job_id}/profile
GET /jobs/{job_id}/preview
GET /jobs/{job_id}/download/csv
GET /jobs/{job_id}/download/parquet
GET /jobs/{job_id}/download/profile
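A client for these endpoints might look like the sketch below, which uses only the standard library. The endpoint paths come from the list above, but everything about the payloads is an assumption: the `goal` request field, the `job_id` and `status` response fields, and the terminal status values `completed` and `failed` are hypothetical. Check the generated docs at /docs for the real schema.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"


def job_urls(job_id, base_url=BASE_URL):
    """Build the per-job endpoint URLs from the documented paths."""
    root = f"{base_url}/jobs/{job_id}"
    return {
        "status": root,
        "profile": f"{root}/profile",
        "preview": f"{root}/preview",
        "csv": f"{root}/download/csv",
        "parquet": f"{root}/download/parquet",
        "profile_download": f"{root}/download/profile",
    }


def request_json(url, payload=None):
    """GET when payload is None, otherwise POST the payload as JSON."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data else {}
    request = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def submit_and_wait(goal, base_url=BASE_URL, poll_seconds=2.0, max_polls=150):
    """Submit a job, then poll its status until it reaches a terminal state."""
    # ASSUMPTION: request field "goal", response fields "job_id" and "status".
    job = request_json(f"{base_url}/jobs", {"goal": goal})
    urls = job_urls(job["job_id"], base_url)
    for _ in range(max_polls):
        status = request_json(urls["status"])
        # ASSUMPTION: these terminal status names are placeholders.
        if status.get("status") in {"completed", "failed"}:
            return status, urls
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job['job_id']} did not finish in time")
```

With the API running, `submit_and_wait("I want to predict NBA player salary")` would return the final status together with the URLs for the profile, preview, and downloads.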

Local frontend demo

This repo includes a lightweight React frontend under frontend/ for local demos.

Run the API in one terminal:

uvicorn api:app --reload

Run the frontend in another:

cd frontend
npm install
npm run dev

Then open:

http://localhost:5173

The frontend submits dataset jobs, polls job status, shows pipeline stages, fetches the dataset profile, renders a small row preview, and links to artifact downloads.

Jobs run in-process by default, so you do not need Redis for local demos or the first deploy.

In local development, Vite proxies /jobs and /healthz to the FastAPI server on http://127.0.0.1:8000, so you should not need to type an API base URL manually.

Important files

main.py: CLI entrypoint and setup flow.
api.py: FastAPI service for queued jobs.
pipeline_service.py: Shared orchestration for CLI and API paths.
architect.py: Goal interpretation, source selection, and schema planning.
predictive_dataset_builder.py: Deterministic wide-table assembly from compatible public tables.
smoke_tests.py: Deterministic structural smoke tests.
edge_case_tests.py: Regression tests for heuristics and weird phrasing.
DEPLOYMENT.md: Deployment notes and readiness guidance.

Environment

Minimum practical setup:

Python 3.12+
One of the following API keys:
- OPENAI_API_KEY
- GEMINI_API_KEY
- GROQ_API_KEY

For browser fallback:

agent-browser installed and available on PATH

Optional:

BROWSERBASE_API_KEY
BROWSERBASE_PROJECT_ID
tracing config for LangSmith or Phoenix
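The requirements above can be checked by hand with a few lines of Python. This is an illustrative sketch only and does not reproduce what `--doctor` or `--api-key-status` actually do; the function name `check_environment` and the report keys are hypothetical.

```python
import os
import shutil
import sys

# The three provider keys named in this section.
API_KEY_VARS = ("OPENAI_API_KEY", "GEMINI_API_KEY", "GROQ_API_KEY")


def check_environment(env=None, which=shutil.which, version=None):
    """Report Python version, configured API keys, and agent-browser presence."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    return {
        "python_ok": version >= (3, 12),
        "api_keys_set": [name for name in API_KEY_VARS if env.get(name)],
        "agent_browser_on_path": which("agent-browser") is not None,
    }


report = check_environment()
print(report)
```

The `env`, `which`, and `version` parameters exist only so the check can be exercised without changing the real environment.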

Testing

Run the local regression set:

python3 config_tests.py
python3 edge_case_tests.py
python3 routing_smoke_tests.py
python3 smoke_tests.py

Or run the broader preflight:

python3 main.py --doctor
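If you prefer a single command for the four regression scripts, a small runner that stops at the first failure is one option. This is a hypothetical helper, not a file in the repo; it assumes each script signals failure through a non-zero exit code and that it is run from the repository root.

```python
import subprocess
import sys

# The four scripts from the regression set above.
TEST_SCRIPTS = [
    "config_tests.py",
    "edge_case_tests.py",
    "routing_smoke_tests.py",
    "smoke_tests.py",
]


def run_scripts(scripts, runner=subprocess.run):
    """Run each script in order; return the first failing script, or None."""
    for script in scripts:
        result = runner([sys.executable, script])
        # ASSUMPTION: a non-zero exit code means the script's tests failed.
        if result.returncode != 0:
            return script
    return None
```

Calling `run_scripts(TEST_SCRIPTS)` returns `None` when every script passes, or the name of the first script that failed.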

Deployment

For service-oriented usage, see DEPLOYMENT.md.

Positioning

This repo sits between a brittle scraper script and a full data platform.

It is best viewed as a dataset-generation engine for public, semi-structured web data:

more automated than hand-written scrapers
more opinionated than a generic crawling framework
most effective when the requested dataset can be assembled from a small number of public tables or list pages
