# OSH Datasets

Collect, clean, classify, and unify Open Source Hardware (OSH) project metadata from multiple sources into a single normalized SQLite database.
OSH Datasets aggregates metadata from the 13 sources listed below: open-source hardware platforms, academic repositories, and code-hosting services. Raw data is scraped from each source, cleaned into standardized formats, and loaded into a unified SQLite database with a normalized schema. The pipeline supports cross-source deduplication, license normalization to SPDX identifiers, BOM component name normalization, and DOI enrichment.
## Data Sources
| Source | Platform | Description | Auth Required |
|---|---|---|---|
| Hackaday | hackaday.io | Community hardware projects with component lists | API keys |
| OSHWA | oshwa.org | OSHWA-certified hardware projects | JWT token |
| HardwareX (OHX) | HardwareX journal | Peer-reviewed hardware papers with BOM extraction | No |
| Hardware.io | openhardware.io | Open hardware project registry with BOM data | No |
| OHR | ohwr.org | CERN Open Hardware Repository (GitLab) | No |
| OSF | osf.io | Open Science Framework project metadata | No |
| Kitspace | kitspace.org | PCB project sharing platform with BOM data | No (optional Selenium) |
| Mendeley Data | data.mendeley.com | Dataset metadata via OAI-PMH | No |
| JOH | Journal of Open Hardware | Journal article metadata | No |
| PLOS | plos.org | Data availability statements and repo links | No |
| OpenAlex | openalex.org | Academic paper metadata by DOI | No |
| GitHub | github.com | Repository metrics for linked projects | Token |
| GitLab | gitlab.com | Repository metrics for linked projects | Optional token |
## Database Schema

```mermaid
erDiagram
    projects ||--o{ licenses : has
    projects ||--o{ tags : has
    projects ||--o{ contributors : has
    projects ||--o{ metrics : has
    projects ||--o{ bom_components : contains
    projects ||--o{ publications : references
    projects ||--o{ cross_references : "linked via a"
    projects ||--o{ cross_references : "linked via b"
    projects ||--o| repo_metrics : has
    projects ||--o{ bom_file_paths : has

    projects {
        int id PK
        text source
        text source_id
        text name
        text description
        text url
        text repo_url
        text documentation_url
        text author
        text country
        text category
        text created_at
        text updated_at
    }
    licenses {
        int id PK
        int project_id FK
        text license_type
        text license_name
        text license_normalized
    }
    tags {
        int id PK
        int project_id FK
        text tag
    }
    contributors {
        int id PK
        int project_id FK
        text name
        text role
        text permission
    }
    metrics {
        int id PK
        int project_id FK
        text metric_name
        int metric_value
    }
    bom_components {
        int id PK
        int project_id FK
        text reference
        text component_name
        int quantity
        real unit_cost
        text manufacturer
        text part_number
        text component_normalized
    }
    publications {
        int id PK
        int project_id FK
        text doi
        text title
        int publication_year
        text journal
        int cited_by_count
        int open_access
    }
    cross_references {
        int id PK
        int project_id_a FK
        int project_id_b FK
        text match_type
        real confidence
    }
    repo_metrics {
        int id PK
        int project_id FK
        int stars
        int forks
        int watchers
        int open_issues
        int total_issues
        int open_prs
        int closed_prs
        int total_prs
        int releases_count
        int contributors_count
        int community_health
        text primary_language
        int has_bom
        int has_readme
        int repo_size_kb
        int total_files
        int archived
        text pushed_at
    }
    bom_file_paths {
        int id PK
        int project_id FK
        text file_path
    }
```
## Dataset Statistics

| Table | Records |
|---|---|
| projects | 10,698 |
| bom_components | 40,416 |
Projects by source: Hackaday (5,697), OSHWA (3,052), HardwareX (567), Hardware.io (515), OHR (247), OSF (208), Kitspace (186), Mendeley (178), JOH (29), PLOS (19)
BOM components by source: Hackaday (22,918), HardwareX (9,498), Hardware.io (4,715), Kitspace (3,285)
## Requirements

- Python >= 3.11
- uv for package management

## Installation

```bash
# Clone the repository
git clone https://github.com/WeberLab-UW/OSH_Datasets.git
cd OSH_Datasets

# Create virtual environment and install
uv venv
uv pip install -e ".[dev]"

# For Selenium-based scraping (Kitspace full mode)
uv pip install -e ".[scrape]"
```

## Configuration

Create a `.env` file in the project root with your API credentials:
| Variable | Default | Description |
|---|---|---|
| `OSHWA_API_TOKEN` | -- | JWT token for OSHWA Certification API |
| `HACKADAY_API_KEYS` | -- | Comma-separated Hackaday.io API keys |
| `GITHUB_TOKEN` | -- | GitHub personal access token |
| `GITLAB_TOKEN` | -- | GitLab personal access token (optional) |
| `OPENALEX_EMAIL` | -- | Email for polite pool access (optional) |
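`HACKADAY_API_KEYS` accepts several comma-separated keys so the scraper can rotate between them and spread out rate limits. A minimal sketch of that round-robin idea; this is illustrative only, not the actual `token_manager` API:

```python
import itertools
import os

# Comma-separated keys from the environment, per the table above.
keys = [k.strip() for k in os.environ.get("HACKADAY_API_KEYS", "").split(",") if k.strip()]
key_cycle = itertools.cycle(keys)

def next_hackaday_key() -> str:
    """Return the next key in round-robin order (hypothetical helper)."""
    return next(key_cycle)
```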
```bash
# .env
OSHWA_API_TOKEN=your_jwt_token
HACKADAY_API_KEYS=key1,key2,key3
GITHUB_TOKEN=ghp_your_token
GITLAB_TOKEN=glpat_your_token
OPENALEX_EMAIL=you@example.com
```

## Quick Start

```bash
# Load all cleaned data into SQLite (no API keys needed)
uv run python -m osh_datasets.load_all
# Query the database
uv run python -c "
from osh_datasets.db import open_connection
from osh_datasets.config import DB_PATH
conn = open_connection(DB_PATH)
for r in conn.execute('SELECT source, COUNT(*) FROM projects GROUP BY source').fetchall():
    print(f'{r[0]}: {r[1]}')
conn.close()
"# Run all scrapers
uv run python -m osh_datasets.scrape_all
# Run specific sources
uv run python -m osh_datasets.scrape_all oshwa ohr hackaday
```

```bash
# Initialize the database and load all cleaned data
uv run python -m osh_datasets.load_all
```

Post-processing runs automatically after loading: license normalization, component name normalization, DOI backfilling, cross-source deduplication, and GitHub enrichment.
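As one illustration of those steps, license normalization maps free-text license strings to SPDX identifiers. A minimal sketch of the idea with a tiny, hypothetical alias table; the real mapping lives in `license_normalizer.py`:

```python
# Hypothetical subset of a free-text -> SPDX alias table.
SPDX_ALIASES = {
    "mit": "MIT",
    "mit license": "MIT",
    "gpl v3": "GPL-3.0-only",
    "gplv3": "GPL-3.0-only",
    "cern ohl v1.2": "CERN-OHL-1.2",
    "cern-ohl-s-2.0": "CERN-OHL-S-2.0",
    "cc by-sa 4.0": "CC-BY-SA-4.0",
}

def normalize_license(raw: str | None) -> str | None:
    """Map a free-text license name to an SPDX identifier, if known."""
    if not raw:
        return None
    return SPDX_ALIASES.get(raw.strip().lower())
```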
### GitHub Enrichment

The GitHub enrichment pipeline fetches repository metrics and community health data, and detects BOM files, for projects with GitHub repo URLs. It requires `GITHUB_TOKEN` in `.env`.
```bash
# Scrape GitHub metadata (auto-generates repos.txt from DB)
uv run python -m osh_datasets.scrape_all github

# Enrich: update existing projects with scraped GitHub data
uv run python -m osh_datasets.enrichment.github
```

The full pipeline: scrape (raw JSON) -> clean (standardized CSV) -> load (SQLite) -> enrich (normalization, dedup, GitHub metadata).
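For a sense of what the enrichment step collects, here is a minimal sketch that fetches a few of the `repo_metrics` fields from the GitHub REST API with `requests`. The endpoint and response fields are standard GitHub API; the function itself is illustrative, not the module's actual interface:

```python
import os

import requests

def fetch_repo_metrics(owner: str, repo: str) -> dict:
    """Fetch a subset of the repo_metrics fields for one repository."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        "stars": data["stargazers_count"],
        "forks": data["forks_count"],
        "open_issues": data["open_issues_count"],
        "primary_language": data["language"],
        "archived": int(data["archived"]),
        "pushed_at": data["pushed_at"],
        "repo_size_kb": data["size"],  # GitHub reports size in KB
    }
```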
## Project Structure

```
data/
  raw/<source>/                 # Scraper output (JSON per source)
  cleaned/<source>/             # Standardized CSV files
  osh_datasets.db               # Unified SQLite database
src/osh_datasets/
  config.py                     # Paths, logging, environment helpers
  db.py                         # SQLite schema, connection management, upsert helpers
  http.py                       # Shared HTTP session with retry and rate limiting
  token_manager.py              # API token rotation (GitHub, GitLab, Hackaday)
  scrape_all.py                 # Scraper orchestrator
  load_all.py                   # Loader orchestrator + post-processing pipeline
  dedup.py                      # Cross-source deduplication via repo URL matching
  license_normalizer.py         # Map free-text licenses to SPDX identifiers
  component_normalizer.py       # Normalize BOM component names (units, case, unicode)
  enrich_ohx_dois.py            # DOI backfilling for HardwareX publications
  scrapers/                     # 13 scraper modules (one per source)
    base.py                     # BaseScraper ABC
    oshwa.py ohr.py hackaday.py kitspace.py hardwareio.py
    ohx.py osf.py plos.py openalex.py mendeley.py
    github.py gitlab.py
  loaders/                      # 11 loader modules (one per source)
    base.py                     # BaseLoader ABC
    hackaday.py oshwa.py ohr.py kitspace.py hardwareio.py
    ohx.py osf.py plos.py mendeley.py joh.py
  enrichment/                   # Post-scrape enrichment modules
    github.py                   # GitHub metadata -> repo_metrics, BOM detection
tests/
  test_db.py                    # Database schema and helper tests
  test_loaders.py               # Loader integration tests
  test_scrapers.py              # Scraper unit tests (mocked HTTP)
  test_enrichment.py            # GitHub enrichment pipeline tests
  test_component_normalizer.py  # Component name normalizer tests
```
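`dedup.py` matches projects across sources by repository URL, which only works once formatting variants of the same URL compare equal. A minimal sketch of that canonicalization idea; illustrative only, not the module's actual logic:

```python
from urllib.parse import urlparse

def canonical_repo_url(url: str) -> str:
    """Reduce a repo URL to host/owner/repo so variants compare equal."""
    parsed = urlparse(url.strip().lower())
    host = parsed.netloc.removeprefix("www.")
    # Keep only owner/repo, dropping ".git" suffixes and deep paths.
    parts = parsed.path.strip("/").removesuffix(".git").split("/")[:2]
    return f"{host}/{'/'.join(parts)}"

# Both variants collapse to "github.com/weberlab-uw/osh_datasets":
print(canonical_repo_url("https://www.GitHub.com/WeberLab-UW/OSH_Datasets.git"))
print(canonical_repo_url("https://github.com/WeberLab-UW/OSH_Datasets/tree/main"))
```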
## Development

```bash
# Run all tests
uv run pytest tests/ -v

# Run a single test file
uv run pytest tests/test_component_normalizer.py -v

# Type checking
uv run mypy src/

# Lint and format
uv run ruff check src/ tests/
uv run ruff format src/ tests/
```

## Adding a New Source

- Create `scrapers/<source>.py` subclassing `BaseScraper` with `source_name` and `scrape() -> Path` (see the skeleton after this list)
- Register it in `scrapers/__init__.py:ALL_SCRAPERS`
- Create `loaders/<source>.py` subclassing `BaseLoader` with `source_name` and `load(db_path) -> int`
- Register it in `load_all.py:ALL_LOADERS`
- Add mocked tests in `tests/test_scrapers.py` and `tests/test_loaders.py`
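A hypothetical skeleton for the first step, assuming `BaseScraper` lives at `osh_datasets.scrapers.base` and exposes the attributes named above (the real ABC may differ):

```python
from pathlib import Path

from osh_datasets.scrapers.base import BaseScraper


class ExampleScraper(BaseScraper):
    """Skeleton scraper for a hypothetical 'example' source."""

    source_name = "example"

    def scrape(self) -> Path:
        # Fetch the source's listing, write raw JSON under data/raw/example/,
        # and return the output path for the loader to pick up.
        out_dir = Path("data/raw") / self.source_name
        out_dir.mkdir(parents=True, exist_ok=True)
        out_path = out_dir / "projects.json"
        out_path.write_text("[]")  # placeholder payload
        return out_path
```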
## Contributing

See CONTRIBUTING.md for details on reporting bugs, suggesting features, and submitting pull requests.

## License

This project is licensed under the MIT License. See LICENSE for details.

## Acknowledgments

- Data sourced from OSHWA, Hackaday.io, CERN OHR, Kitspace, OpenHardware.io, HardwareX, OSF, PLOS, OpenAlex, Mendeley Data, and JOH
- Built with polars, orjson, Beautiful Soup, and requests