Scrape and enrich US political election data from Wikipedia and Ballotpedia into a normalized SQLite database.
```mermaid
---
config:
  theme: neutral
---
erDiagram
    elections ||--o{ candidates : has
    candidates ||--o{ contact_links : has
    candidates ||--o{ content : has
    elections {
        int election_id PK
        text state
        text race_type
        int year
        text district
        text election_stage
        text wikipedia_url
    }
    candidates {
        int candidate_id PK
        int election_id FK
        text party
        text candidate_name
        text wikipedia_url
        text ballotpedia_url
        real vote_pct
        int is_winner
    }
    contact_links {
        int contact_link_id PK
        int candidate_id FK
        text link_type
        text url
        text source
    }
    content {
        int content_id PK
        int candidate_id FK
        text candidate_name
        text page_url
        text page_type
        text link_type
        text unprocessed_text
        text cleaned_text
        text sampled_text
    }
```
`link_type` values: `campaign_site`, `campaign_site_archived`, `campaign_facebook`, `campaign_x`, `campaign_instagram`, `personal_website`, `personal_facebook`, `personal_linkedin`

`source` values: `wikipedia`, `ballotpedia`, `web_search`, `wayback`, `csv_import`
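The diagram above maps to four tables. As a minimal sketch, the implied SQLite DDL might look like the following — column names and types come from the diagram, but the primary-key and foreign-key constraints are assumptions; the schema the pipeline actually creates may differ:

```python
import sqlite3

# Schema sketch derived from the ER diagram above. The REFERENCES
# clauses are assumptions, not the project's actual DDL.
DDL = """
CREATE TABLE IF NOT EXISTS elections (
    election_id     INTEGER PRIMARY KEY,
    state           TEXT,
    race_type       TEXT,
    year            INTEGER,
    district        TEXT,
    election_stage  TEXT,
    wikipedia_url   TEXT
);
CREATE TABLE IF NOT EXISTS candidates (
    candidate_id    INTEGER PRIMARY KEY,
    election_id     INTEGER REFERENCES elections(election_id),
    party           TEXT,
    candidate_name  TEXT,
    wikipedia_url   TEXT,
    ballotpedia_url TEXT,
    vote_pct        REAL,
    is_winner       INTEGER
);
CREATE TABLE IF NOT EXISTS contact_links (
    contact_link_id INTEGER PRIMARY KEY,
    candidate_id    INTEGER REFERENCES candidates(candidate_id),
    link_type       TEXT,
    url             TEXT,
    source          TEXT
);
CREATE TABLE IF NOT EXISTS content (
    content_id       INTEGER PRIMARY KEY,
    candidate_id     INTEGER REFERENCES candidates(candidate_id),
    candidate_name   TEXT,
    page_url         TEXT,
    page_type        TEXT,
    link_type        TEXT,
    unprocessed_text TEXT,
    cleaned_text     TEXT,
    sampled_text     TEXT
);
"""

conn = sqlite3.connect(":memory:")  # use "camplinks.db" for a real database
conn.executescript(DDL)
```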
# Install

```bash
uv sync
```
```bash
# Scrape 2024 House races and enrich with contact info
python -m camplinks --year 2024 --race house

# Scrape 2024 Senate races
python -m camplinks --year 2024 --race senate

# Scrape 2025 gubernatorial races
python -m camplinks --year 2025 --race governor

# Scrape 2025 mayoral elections (Wikipedia, 62+ cities)
python -m camplinks --year 2025 --race municipal

# Scrape 2023-2026 mayoral elections (Ballotpedia, top-100 cities)
python -m camplinks --year 2023 --race bp_municipal --stage scrape

# Scrape gubernatorial elections from Ballotpedia (all 50 states)
python -m camplinks --year 2026 --race bp_governor --stage scrape

# Run all registered race types
python -m camplinks --year 2024 --race all
```

| Key | Description |
|---|---|
| `house` | US House of Representatives |
| `senate` | US Senate |
| `governor` | Governor (statewide) |
| `attorney_general` | Attorney General (statewide) |
| `special_house` | House special elections |
| `state_leg` | State legislature (regular sessions) |
| `state_leg_special` | State legislature special elections |
| `municipal` | Mayoral elections (Wikipedia) |
| `bp_municipal` | Mayoral elections (Ballotpedia, top-100 cities) |
| `bp_governor` | Gubernatorial elections (Ballotpedia, all states) |
| `judicial` | State Supreme Court elections |
| `all` | Run all of the above |
The database is written to `camplinks.db` by default. Override with `--db path/to/db`.
The pipeline runs five stages in order. Each stage is idempotent (safe to re-run).
| Stage | What it does | Data source |
|---|---|---|
| scrape | Fetch election results from Wikipedia | Wikipedia state election pages |
| enrich | Extract campaign websites from candidate Wikipedia pages | Wikipedia candidate infoboxes |
| search | Find missing contact info via Ballotpedia and web search | Ballotpedia + DuckDuckGo |
| validate | Check campaign site accessibility, archive dead links | Wayback Machine API |
| get_text_content | Scrape visible text from campaign sites (home, policy, about pages) | Candidate campaign websites |
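Because every stage is idempotent, the five stages can also be chained from a small driver script. A hypothetical sketch (the CLI flags are the ones documented above; the `run_pipeline` helper itself is not part of the project):

```python
import subprocess

# Pipeline order from the stage table above.
STAGES = ["scrape", "enrich", "search", "validate", "get_text_content"]

def run_pipeline(year: int, race: str) -> None:
    """Run all five stages in order; safe to rerun after a failure."""
    for stage in STAGES:
        subprocess.run(
            ["python", "-m", "camplinks",
             "--year", str(year), "--race", race, "--stage", stage],
            check=True,  # stop the chain if a stage exits non-zero
        )
```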
Run individual stages with `--stage`:

```bash
python -m camplinks --year 2024 --race house --stage scrape
python -m camplinks --year 2024 --race house --stage enrich
python -m camplinks --year 2024 --race house --stage search
python -m camplinks --year 2024 --race house --stage validate
python -m camplinks --year 2024 --race house --stage get_text_content
```

The `get_text_content` stage writes scraped text to the `content` table. It processes candidates in batches of 150, skipping any already scraped. For each candidate it fetches the home page plus any policy and about subpages (found via `<a>` tags and buttons), then stores the raw text, cleaned text, and a random 40% sentence sample.
You can also run the scraper standalone:

```bash
python scraping-campaign-sites.py
```

```python
import sqlite3

conn = sqlite3.connect("camplinks.db")
conn.row_factory = sqlite3.Row

# All 2024 House winners with their campaign sites
rows = conn.execute("""
    SELECT c.candidate_name, c.party, e.state, e.district, cl.url
    FROM candidates c
    JOIN elections e ON c.election_id = e.election_id
    LEFT JOIN contact_links cl ON c.candidate_id = cl.candidate_id
        AND cl.link_type = 'campaign_site'
    WHERE c.is_winner = 1 AND e.year = 2024 AND e.race_type = 'US House'
    ORDER BY e.state, e.district
""").fetchall()

for r in rows:
    print(f"{r['state']}-{r['district']}: {r['candidate_name']} ({r['party']}) - {r['url']}")
```

Or with Polars:
```python
import polars as pl

# read_database_uri takes a connection URI string;
# read_database expects an already-open connection object.
df = pl.read_database_uri(
    "SELECT * FROM candidates c JOIN elections e ON c.election_id = e.election_id",
    "sqlite:///camplinks.db",
)
```

See USAGE.md for a walkthrough with examples.
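Another query in the same vein: contact-link coverage by `link_type`. This sketch builds a tiny in-memory fixture so it runs standalone — point `connect()` at `camplinks.db` to run it against real data:

```python
import sqlite3

# In-memory fixture so the example is self-contained; the rows below
# are invented sample data, not real scrape output.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contact_links (
        contact_link_id INTEGER PRIMARY KEY,
        candidate_id INTEGER, link_type TEXT, url TEXT, source TEXT
    )
""")
conn.executemany(
    "INSERT INTO contact_links (candidate_id, link_type, url, source) VALUES (?, ?, ?, ?)",
    [
        (1, "campaign_site", "https://example.com", "wikipedia"),
        (1, "campaign_facebook", "https://facebook.com/example", "ballotpedia"),
        (2, "campaign_site", "https://example.org", "web_search"),
    ],
)

# How many links of each type were collected, most common first.
coverage = conn.execute("""
    SELECT link_type, COUNT(*) AS n
    FROM contact_links
    GROUP BY link_type
    ORDER BY n DESC, link_type
""").fetchall()

for link_type, n in coverage:
    print(f"{link_type}: {n}")
```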
If you have an existing `house_races_2024.csv` from the old wide-format pipeline:

```bash
python convert_to_tidy.py --csv house_races_2024.csv --db camplinks.db
```

# Development

```bash
uv sync
uv run pytest tests/
uv run mypy camplinks/
uv run ruff check .
```

See CONTRIBUTING.md for setup instructions and guidelines.