# camplinks

Scrape and enrich US political election data from Wikipedia and Ballotpedia into a normalized SQLite database.

## Data Model

```mermaid
---
config:
  theme: neutral
---
erDiagram
    elections ||--o{ candidates : has
    candidates ||--o{ contact_links : has
    candidates ||--o{ content : has

    elections {
        int election_id PK
        text state
        text race_type
        int year
        text district
        text election_stage
        text wikipedia_url
    }

    candidates {
        int candidate_id PK
        int election_id FK
        text party
        text candidate_name
        text wikipedia_url
        text ballotpedia_url
        real vote_pct
        int is_winner
    }

    contact_links {
        int contact_link_id PK
        int candidate_id FK
        text link_type
        text url
        text source
    }

    content {
        int content_id PK
        int candidate_id FK
        text candidate_name
        text page_url
        text page_type
        text link_type
        text unprocessed_text
        text cleaned_text
        text sampled_text
    }
```

`link_type` values: `campaign_site`, `campaign_site_archived`, `campaign_facebook`, `campaign_x`, `campaign_instagram`, `personal_website`, `personal_facebook`, `personal_linkedin`

`source` values: `wikipedia`, `ballotpedia`, `web_search`, `wayback`, `csv_import`
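
The diagram above can be read as SQLite DDL. Here is a minimal sketch inferred from the ER diagram (column names and types come from the diagram; the project's actual schema may add constraints, defaults, or indexes):

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS elections (
    election_id    INTEGER PRIMARY KEY,
    state          TEXT,
    race_type      TEXT,
    year           INTEGER,
    district       TEXT,
    election_stage TEXT,
    wikipedia_url  TEXT
);
CREATE TABLE IF NOT EXISTS candidates (
    candidate_id    INTEGER PRIMARY KEY,
    election_id     INTEGER REFERENCES elections(election_id),
    party           TEXT,
    candidate_name  TEXT,
    wikipedia_url   TEXT,
    ballotpedia_url TEXT,
    vote_pct        REAL,
    is_winner       INTEGER
);
CREATE TABLE IF NOT EXISTS contact_links (
    contact_link_id INTEGER PRIMARY KEY,
    candidate_id    INTEGER REFERENCES candidates(candidate_id),
    link_type       TEXT,
    url             TEXT,
    source          TEXT
);
CREATE TABLE IF NOT EXISTS content (
    content_id       INTEGER PRIMARY KEY,
    candidate_id     INTEGER REFERENCES candidates(candidate_id),
    candidate_name   TEXT,
    page_url         TEXT,
    page_type        TEXT,
    link_type        TEXT,
    unprocessed_text TEXT,
    cleaned_text     TEXT,
    sampled_text     TEXT
);
"""

# Build the schema in an in-memory database for illustration
conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```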

## Quickstart

```shell
# Install
uv sync

# Scrape 2024 House races and enrich with contact info
python -m camplinks --year 2024 --race house

# Scrape 2024 Senate races
python -m camplinks --year 2024 --race senate

# Scrape 2025 gubernatorial races
python -m camplinks --year 2025 --race governor

# Scrape 2025 mayoral elections (Wikipedia, 62+ cities)
python -m camplinks --year 2025 --race municipal

# Scrape 2023-2026 mayoral elections (Ballotpedia, top-100 cities)
python -m camplinks --year 2023 --race bp_municipal --stage scrape

# Scrape gubernatorial elections from Ballotpedia (all 50 states)
python -m camplinks --year 2026 --race bp_governor --stage scrape

# Run all registered race types
python -m camplinks --year 2024 --race all
```

### Available `--race` keys

| Key | Description |
| --- | --- |
| `house` | US House of Representatives |
| `senate` | US Senate |
| `governor` | Governor (statewide) |
| `attorney_general` | Attorney General (statewide) |
| `special_house` | House special elections |
| `state_leg` | State legislature (regular sessions) |
| `state_leg_special` | State legislature special elections |
| `municipal` | Mayoral elections (Wikipedia) |
| `bp_municipal` | Mayoral elections (Ballotpedia, top-100 cities) |
| `bp_governor` | Gubernatorial elections (Ballotpedia, all states) |
| `judicial` | State Supreme Court elections |
| `all` | Run all of the above |

The database is written to `camplinks.db` by default. Override with `--db path/to/db`.

## Pipeline Stages

The pipeline runs five stages in order. Each stage is idempotent (safe to re-run).

| Stage | What it does | Data source |
| --- | --- | --- |
| `scrape` | Fetch election results from Wikipedia | Wikipedia state election pages |
| `enrich` | Extract campaign websites from candidate Wikipedia pages | Wikipedia candidate infoboxes |
| `search` | Find missing contact info via Ballotpedia and web search | Ballotpedia + DuckDuckGo |
| `validate` | Check campaign site accessibility, archive dead links | Wayback Machine API |
| `get_text_content` | Scrape visible text from campaign sites (home, policy, about pages) | Candidate campaign websites |
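
Idempotency for stages like these is typically achieved by making writes no-ops when the row already exists. A minimal sketch of that pattern (the `UNIQUE` constraint and the `record_link` helper are hypothetical illustrations, not the project's actual code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contact_links (
        contact_link_id INTEGER PRIMARY KEY,
        candidate_id    INTEGER,
        link_type       TEXT,
        url             TEXT,
        source          TEXT,
        UNIQUE (candidate_id, link_type)
    )
""")

def record_link(candidate_id: int, link_type: str, url: str, source: str) -> None:
    # INSERT OR IGNORE makes re-running a stage a no-op for rows already present
    conn.execute(
        "INSERT OR IGNORE INTO contact_links (candidate_id, link_type, url, source) "
        "VALUES (?, ?, ?, ?)",
        (candidate_id, link_type, url, source),
    )

record_link(1, "campaign_site", "https://example.com", "wikipedia")
record_link(1, "campaign_site", "https://example.com", "wikipedia")  # re-run: ignored
n = conn.execute("SELECT COUNT(*) FROM contact_links").fetchone()[0]
```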

Run individual stages with `--stage`:

```shell
python -m camplinks --year 2024 --race house --stage scrape
python -m camplinks --year 2024 --race house --stage enrich
python -m camplinks --year 2024 --race house --stage search
python -m camplinks --year 2024 --race house --stage validate
python -m camplinks --year 2024 --race house --stage get_text_content
```

The `get_text_content` stage writes scraped text to the `content` table. It processes candidates in batches of 150, skipping any already scraped. For each candidate it fetches the home page plus any policy and about subpages (found via `<a>` tags and buttons), then stores the raw text, cleaned text, and a random 40% sentence sample.

You can also run the scraper standalone:

```shell
python scraping-campaign-sites.py
```

## Querying the Database

```python
import sqlite3

conn = sqlite3.connect("camplinks.db")
conn.row_factory = sqlite3.Row

# All 2024 House winners with their campaign sites
rows = conn.execute("""
    SELECT c.candidate_name, c.party, e.state, e.district, cl.url
    FROM candidates c
    JOIN elections e ON c.election_id = e.election_id
    LEFT JOIN contact_links cl ON c.candidate_id = cl.candidate_id
        AND cl.link_type = 'campaign_site'
    WHERE c.is_winner = 1 AND e.year = 2024 AND e.race_type = 'US House'
    ORDER BY e.state, e.district
""").fetchall()

for r in rows:
    print(f"{r['state']}-{r['district']}: {r['candidate_name']} ({r['party']}) - {r['url']}")
```

Or with Polars:

```python
import sqlite3

import polars as pl

conn = sqlite3.connect("camplinks.db")
df = pl.read_database(
    # Select explicit columns so the join doesn't produce duplicate column names
    "SELECT c.candidate_name, c.party, c.vote_pct, c.is_winner, "
    "       e.state, e.year, e.race_type "
    "FROM candidates c JOIN elections e ON c.election_id = e.election_id",
    conn,
)
```

## Adding a New Race Type

See USAGE.md for a walkthrough with examples.

## Migrating from Legacy CSV

If you have an existing `house_races_2024.csv` from the old wide-format pipeline:

```shell
python convert_to_tidy.py --csv house_races_2024.csv --db camplinks.db
```

## Development

```shell
uv sync
uv run pytest tests/
uv run mypy camplinks/
uv run ruff check .
```

## Contributing

See CONTRIBUTING.md for setup instructions and guidelines.

## License

MIT