# camplinks

Scrape and enrich US political election data from Wikipedia and Ballotpedia into a normalized SQLite database.

## Data Model

```mermaid
---
config:
  theme: neutral
---
erDiagram
    elections ||--o{ candidates : has
    candidates ||--o{ contact_links : has
    candidates ||--o{ content : has

    elections {
        int election_id PK
        text state
        text race_type
        int year
        text district
        text election_stage
        text wikipedia_url
    }

    candidates {
        int candidate_id PK
        int election_id FK
        text party
        text candidate_name
        text wikipedia_url
        text ballotpedia_url
        real vote_pct
        int is_winner
    }

    contact_links {
        int contact_link_id PK
        int candidate_id FK
        text link_type
        text url
        text source
    }

    content {
        int content_id PK
        int candidate_id FK
        text candidate_name
        text page_url
        text page_type
        text link_type
        text unprocessed_text
        text cleaned_text
        text sampled_text
    }
```

`link_type` values: `campaign_site`, `campaign_site_archived`, `campaign_facebook`, `campaign_x`, `campaign_instagram`, `personal_website`, `personal_facebook`, `personal_linkedin`

`source` values: `wikipedia`, `ballotpedia`, `web_search`, `wayback`, `csv_import`
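
The diagram above can be read as SQLite DDL. Here is a minimal sketch inferred from the ER diagram (column names and types come from the diagram; the project's actual schema may add constraints, defaults, or indexes):

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS elections (
    election_id    INTEGER PRIMARY KEY,
    state          TEXT,
    race_type      TEXT,
    year           INTEGER,
    district       TEXT,
    election_stage TEXT,
    wikipedia_url  TEXT
);
CREATE TABLE IF NOT EXISTS candidates (
    candidate_id    INTEGER PRIMARY KEY,
    election_id     INTEGER REFERENCES elections(election_id),
    party           TEXT,
    candidate_name  TEXT,
    wikipedia_url   TEXT,
    ballotpedia_url TEXT,
    vote_pct        REAL,
    is_winner       INTEGER
);
CREATE TABLE IF NOT EXISTS contact_links (
    contact_link_id INTEGER PRIMARY KEY,
    candidate_id    INTEGER REFERENCES candidates(candidate_id),
    link_type       TEXT,
    url             TEXT,
    source          TEXT
);
CREATE TABLE IF NOT EXISTS content (
    content_id       INTEGER PRIMARY KEY,
    candidate_id     INTEGER REFERENCES candidates(candidate_id),
    candidate_name   TEXT,
    page_url         TEXT,
    page_type        TEXT,
    link_type        TEXT,
    unprocessed_text TEXT,
    cleaned_text     TEXT,
    sampled_text     TEXT
);
"""

# Build the schema in an in-memory database for illustration
conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```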

## Quickstart

```shell
# Install
uv sync

# Scrape 2024 House races and enrich with contact info
python -m camplinks --year 2024 --race house

# Scrape 2024 Senate races
python -m camplinks --year 2024 --race senate

# Scrape 2025 gubernatorial races
python -m camplinks --year 2025 --race governor

# Scrape 2025 mayoral elections (Wikipedia, 62+ cities)
python -m camplinks --year 2025 --race municipal

# Scrape 2023-2026 mayoral elections (Ballotpedia, top-100 cities)
python -m camplinks --year 2023 --race bp_municipal --stage scrape

# Scrape gubernatorial elections from Ballotpedia (all 50 states)
python -m camplinks --year 2026 --race bp_governor --stage scrape

# Run all registered race types
python -m camplinks --year 2024 --race all
```

### Available `--race` keys

| Key | Description |
| --- | --- |
| `house` | US House of Representatives |
| `senate` | US Senate |
| `governor` | Governor (statewide) |
| `attorney_general` | Attorney General (statewide) |
| `special_house` | House special elections |
| `state_leg` | State legislature (regular sessions) |
| `state_leg_special` | State legislature special elections |
| `municipal` | Mayoral elections (Wikipedia) |
| `bp_municipal` | Mayoral elections (Ballotpedia, top-100 cities) |
| `bp_governor` | Gubernatorial elections (Ballotpedia, all states) |
| `judicial` | State Supreme Court elections |
| `all` | Run all of the above |

The database is written to `camplinks.db` by default. Override with `--db path/to/db`.

## Pipeline Stages

The pipeline runs five stages in order. Each stage is idempotent (safe to re-run).

| Stage | What it does | Data source |
| --- | --- | --- |
| `scrape` | Fetch election results from Wikipedia | Wikipedia state election pages |
| `enrich` | Extract campaign websites from candidate Wikipedia pages | Wikipedia candidate infoboxes |
| `search` | Find missing contact info via Ballotpedia and web search | Ballotpedia + DuckDuckGo |
| `validate` | Check campaign site accessibility, archive dead links | Wayback Machine API |
| `get_text_content` | Scrape visible text from campaign sites (home, policy, about pages) | Candidate campaign websites |
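
Idempotency for stages like these is typically achieved by making writes no-ops when the row already exists. A minimal sketch of that pattern (the `UNIQUE` constraint and the `record_link` helper are hypothetical illustrations, not the project's actual code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contact_links (
        contact_link_id INTEGER PRIMARY KEY,
        candidate_id    INTEGER,
        link_type       TEXT,
        url             TEXT,
        source          TEXT,
        UNIQUE (candidate_id, link_type)
    )
""")

def record_link(candidate_id: int, link_type: str, url: str, source: str) -> None:
    # INSERT OR IGNORE makes re-running a stage a no-op for rows already present
    conn.execute(
        "INSERT OR IGNORE INTO contact_links (candidate_id, link_type, url, source) "
        "VALUES (?, ?, ?, ?)",
        (candidate_id, link_type, url, source),
    )

record_link(1, "campaign_site", "https://example.com", "wikipedia")
record_link(1, "campaign_site", "https://example.com", "wikipedia")  # re-run: ignored
n = conn.execute("SELECT COUNT(*) FROM contact_links").fetchone()[0]
```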

Run individual stages with `--stage`:

```shell
python -m camplinks --year 2024 --race house --stage scrape
python -m camplinks --year 2024 --race house --stage enrich
python -m camplinks --year 2024 --race house --stage search
python -m camplinks --year 2024 --race house --stage validate
python -m camplinks --year 2024 --race house --stage get_text_content
```

The `get_text_content` stage writes scraped text to the `content` table. It processes candidates in batches of 150, skipping any already scraped. For each candidate it fetches the home page plus any policy and about subpages (found via `<a>` tags and buttons), then stores the raw text, cleaned text, and a random 40% sentence sample.

You can also run the scraper standalone:

```shell
python scraping-campaign-sites.py
```

## Querying the Database

```python
import sqlite3

conn = sqlite3.connect("camplinks.db")
conn.row_factory = sqlite3.Row

# All 2024 House winners with their campaign sites
rows = conn.execute("""
    SELECT c.candidate_name, c.party, e.state, e.district, cl.url
    FROM candidates c
    JOIN elections e ON c.election_id = e.election_id
    LEFT JOIN contact_links cl ON c.candidate_id = cl.candidate_id
        AND cl.link_type = 'campaign_site'
    WHERE c.is_winner = 1 AND e.year = 2024 AND e.race_type = 'US House'
    ORDER BY e.state, e.district
""").fetchall()

for r in rows:
    print(f"{r['state']}-{r['district']}: {r['candidate_name']} ({r['party']}) - {r['url']}")
```

Or with Polars:

```python
import sqlite3

import polars as pl

conn = sqlite3.connect("camplinks.db")
df = pl.read_database(
    # Select explicit columns so the join doesn't produce duplicate column names
    "SELECT c.candidate_name, c.party, c.vote_pct, c.is_winner, "
    "       e.state, e.year, e.race_type "
    "FROM candidates c JOIN elections e ON c.election_id = e.election_id",
    conn,
)
```

## Adding a New Race Type

See USAGE.md for a walkthrough with examples.

## Migrating from Legacy CSV

If you have an existing `house_races_2024.csv` from the old wide-format pipeline:

```shell
python convert_to_tidy.py --csv house_races_2024.csv --db camplinks.db
```

## Development

```shell
uv sync
uv run pytest tests/
uv run mypy camplinks/
uv run ruff check .
```

## Contributing

See CONTRIBUTING.md for setup instructions and guidelines.

## License

MIT