Website: https://lasisilab.github.io/PODFRIDGE-Databases/
This repository is part of the PODFRIDGE project — Population Differences in Forensic Relative Identification via Genetic Genealogy. The first phase focuses on consolidating authoritative information about where and how DNA profiles are stored in U.S. databases so later modeling can plug in realistic availability assumptions.
This dataset is described in:
Pryor, Y.; Donadio, J. P.; Muller, S. C.; Wilson, J.; Lasisi, T. (2025). National and state-level datasets of United States forensic DNA databases 2001–2025. arXiv preprint. 10.5281/zenodo.17215677
Dataset DOI: 10.5281/zenodo.17215677
If you just want to get a quick overview of the data and analyses, you can view the rendered website hosted on GitHub Pages: https://lasisilab.github.io/PODFRIDGE-Databases/
If you want to run the code locally or contribute, follow the setup instructions below.
# Clone the repository
git clone https://github.com/lasisilab/PODFRIDGE-Databases.git
cd PODFRIDGE-Databases
# Run automated setup (installs all dependencies)
bash setup/setup.sh
# Preview the website locally
quarto preview

If you prefer to install dependencies separately:
# Install Python packages (for web scraping and analyses)
pip install -r setup/requirements.txt
# Install R packages (for Quarto analyses)
Rscript setup/install.R
# Preview the website
quarto preview

CRAN mirror error: The install.R script automatically sets the CRAN mirror. If you encounter issues, manually set it in R:

options(repos = c(CRAN = "https://cloud.r-project.org/"))

Package conflicts: Ensure you're using compatible versions:
R --version # Should be >= 4.0
python3 --version # Should be >= 3.13
quarto --version # Should be >= 1.3

Reconstructs the growth of the FBI's National DNA Index System using archived snapshots from the Internet Archive's Wayback Machine.
- Data Source: FBI CODIS-NDIS Statistics pages
- Coverage: Monthly snapshots from 2001-2025
- Metrics: Offender profiles, arrestee profiles, forensic profiles, participating laboratories, investigations aided
- Methods: Web scraping, HTML parsing, temporal validation, outlier detection
View NDIS Scraping Methodology →
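For a concrete sense of this approach, here is a minimal sketch that queries the Internet Archive's CDX API for archived captures of the FBI CODIS-NDIS statistics page and parses one snapshot. The target URL, capture selection, and parsing shown here are illustrative assumptions, not the repository's actual scraping pipeline (see the methodology notebook for that).

```python
import requests
from bs4 import BeautifulSoup

# Illustrative target; the exact FBI CODIS-NDIS statistics URL used by the pipeline may differ.
TARGET = "fbi.gov/services/laboratory/biometric-analysis/codis/ndis-statistics"

# Ask the Wayback Machine CDX API for a few successful captures of the page.
resp = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={"url": TARGET, "output": "json", "filter": "statuscode:200", "limit": 5},
    timeout=30,
)
rows = resp.json()

if len(rows) > 1:
    header, first_capture = rows[0], rows[1]
    timestamp = first_capture[header.index("timestamp")]
    original = first_capture[header.index("original")]

    # Fetch the archived snapshot and parse its HTML.
    snapshot = requests.get(f"http://web.archive.org/web/{timestamp}/{original}", timeout=60)
    soup = BeautifulSoup(snapshot.text, "lxml")
    print(timestamp, soup.title.string if soup.title else "(no title)")
```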
Compiles current state-level DNA database statistics and policy information across all 50 states and Washington D.C.
- Data Source: State government websites, legislative databases
- Coverage: Current snapshot (August 2025)
- Content: Profile counts by type (where available), arrestee collection policies, familial search authorization, statutory citations
- Methods: Systematic web searches, policy documentation, legal statute review
Standardizes demographic composition data from state DNA databases obtained through public records requests documented in Murphy & Tong (2020).
- Data Source: FOIA responses from 7 states (Murphy & Tong, 2020, Appendix A)
- Coverage: ~2018
- Content: Racial and gender composition by profile type (offender/arrestee/forensic)
- Methods: OCR processing, data standardization, quality validation
Documents the methodology and data sources used in Murphy & Tong (2020) for calculating annual DNA collection rates by race.
- Data Source: Murphy & Tong (2020, Appendix B)
- Coverage: All 50 states
- Content: Annual collection estimates, Census demographics, calculated collection rates by race
- Methods: Data provenance tracking, methodology documentation
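As a toy illustration of the underlying arithmetic (annual collections divided by the corresponding Census population), the snippet below computes a per-1,000 collection rate; the numbers, group labels, and column names are hypothetical and do not reproduce Murphy & Tong's estimates.

```python
import pandas as pd

# Hypothetical annual DNA collections and Census population counts by group.
df = pd.DataFrame({
    "group": ["A", "B", "C"],
    "annual_collections": [12_000, 25_000, 9_000],
    "census_population": [1_500_000, 6_000_000, 2_000_000],
})

# Collection rate per 1,000 residents of each group.
df["rate_per_1000"] = 1_000 * df["annual_collections"] / df["census_population"]
print(df)
```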
- analysis/ – Quarto notebooks (NDIS, SDIS, FOIA, methodology, version freeze)
- data/{annual_dna_collection, foia, ndis, ndis_crossref, sdis}/ – one directory per dataset type, each containing:
  - raw/ – Raw inputs (Wayback HTML snapshots, FOIA PDFs, etc.)
  - intermediate/ – Staging outputs produced during processing
  - final/ – Latest cleaned datasets for each component
- data/versioned_data/ – Snapshots of final processed datasets created via analysis/version_freeze.qmd
- setup/ – Setup scripts (setup.sh, install.R, Python requirements)
- docs/ – Rendered website for GitHub Pages
- _freeze/ – Quarto freeze cache of expensive computations, used to speed up rendering
- podfridge-db-env/ – Python virtual environment information for running the scraping scripts
- Yemko Pryor (@ypryor)
- João Pedro Donadio (@DonadioJP)
- Sam Muller
- Jenna Wilson
- Tina Lasisi (@lasisilab)
- Python (≥ 3.13)
- R (≥ 4.0)
- Quarto (≥ 1.3)
- Python packages: requests, beautifulsoup4, lxml, pandas, tqdm (plus the standard-library modules hashlib, collections, pathlib, datetime, and os)
- R packages: tidyverse, rvest, httr, lubridate, jsonlite, knitr, plotly
- Web Scraping: Internet Archive Wayback Machine API
- Data Validation: Monotonicity testing and median absolute deviation (MAD) outlier detection (see the sketch after this list)
- External Validation: Comparison with peer-reviewed publications and FBI press releases
- Reproducibility: All processing code available; versioned datasets archived on Zenodo
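A minimal sketch of the two validation checks named above, monotonicity of cumulative counts and MAD-based outlier flagging, assuming pandas Series inputs; the threshold and example values are illustrative, not the project's configured settings.

```python
import pandas as pd

def nonmonotonic_flags(counts: pd.Series) -> pd.Series:
    """Flag snapshots where a cumulative count drops below the previous value."""
    return counts.diff() < 0

def mad_outlier_flags(values: pd.Series, threshold: float = 3.5) -> pd.Series:
    """Flag values whose modified z-score (based on the MAD) exceeds the threshold."""
    median = values.median()
    mad = (values - median).abs().median()
    if mad == 0:
        return pd.Series(False, index=values.index)
    modified_z = 0.6745 * (values - median) / mad
    return modified_z.abs() > threshold

# Illustrative usage on a small series of cumulative offender-profile counts.
profiles = pd.Series([100, 120, 118, 140, 900, 160])
print(nonmonotonic_flags(profiles))
print(mad_outlier_flags(profiles.diff().dropna()))
```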
All final datasets are archived and publicly available on Zenodo:
Zenodo Repository: 10.5281/zenodo.17215677
The repository includes:
- NDIS_time_series.csv – Monthly NDIS statistics (2001–2025)
- SDIS_cross_section.csv – State-level profiles and policies
- FOIA_Demographics.csv – Demographic composition from FOIA responses (Murphy & Tong, 2020)
- Annual_DNA_Collection.csv – Annual collection rates (Murphy & Tong, 2020)
- Raw HTML files, intermediate processing outputs, and complete documentation
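As a quick usage sketch, the final CSVs from the Zenodo release can be loaded directly with pandas; the path below assumes the files sit in the working directory, and column names should be checked against the release rather than assumed.

```python
import pandas as pd

# Load the monthly NDIS time series downloaded from the Zenodo release.
ndis = pd.read_csv("NDIS_time_series.csv")

# Inspect the columns and the first few rows before analysis.
print(ndis.columns.tolist())
print(ndis.head())
```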
Code: MIT License
Data: CC BY 4.0 (see Zenodo release; FOIA-derived data subject to original authors' permissions)
Last update: November 09, 2025