Website: https://lasisilab.github.io/PODFRIDGE-Databases/
This repository is part of the PODFRIDGE project — Population Differences in Forensic Relative Identification via Genetic Genealogy. The first phase focuses on consolidating authoritative information about where and how DNA profiles are stored in U.S. databases so later modeling can plug in realistic availability assumptions.
This dataset is described in:
Pryor, Y.; Donadio, J. P.; Muller, S. C.; Wilson, J.; Lasisi, T. (2025). National and state-level datasets of United States forensic DNA databases 2001–2025. arXiv preprint. 10.5281/zenodo.17215677
Dataset DOI: 10.5281/zenodo.17215677
If you just want to get a quick overview of the data and analyses, you can view the rendered website hosted on GitHub Pages: https://lasisilab.github.io/PODFRIDGE-Databases/
If you want to run the code locally or contribute, follow the setup instructions below.
# Clone the repository
git clone https://github.com/lasisilab/PODFRIDGE-Databases.git
cd PODFRIDGE-Databases
# Run automated setup (installs all dependencies)
bash setup/setup.sh
# Preview the website locally
quarto preview

If you prefer to install dependencies separately:
# Install Python packages (for web scraping and analyses)
pip install -r setup/requirements.txt
# Install R packages (for Quarto analyses)
Rscript setup/install.R
# Preview the website
quarto preview

CRAN mirror error: The install.R script automatically sets the CRAN mirror. If you encounter issues, manually set it in R:

options(repos = c(CRAN = "https://cloud.r-project.org/"))

Package conflicts: Ensure you're using compatible versions:
R --version # Should be >= 4.0
python3 --version # Should be >= 3.13
quarto --version # Should be >= 1.3

Reconstructs the growth of the FBI's National DNA Index System using archived snapshots from the Internet Archive's Wayback Machine.
- Data Source: FBI CODIS-NDIS Statistics pages
- Coverage: Monthly snapshots from 2001-2025
- Metrics: Offender profiles, arrestee profiles, forensic profiles, participating laboratories, investigations aided
- Methods: Web scraping, HTML parsing, temporal validation, outlier detection
View NDIS Scraping Methodology →
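For a concrete sense of this approach, here is a minimal sketch that queries the Internet Archive's CDX API for archived captures of the FBI CODIS-NDIS statistics page and parses one snapshot. The target URL, capture selection, and parsing shown here are illustrative assumptions, not the repository's actual scraping pipeline (see the methodology notebook for that).

```python
import requests
from bs4 import BeautifulSoup

# Illustrative target; the exact FBI CODIS-NDIS statistics URL used by the pipeline may differ.
TARGET = "fbi.gov/services/laboratory/biometric-analysis/codis/ndis-statistics"

# Ask the Wayback Machine CDX API for a few successful captures of the page.
resp = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={"url": TARGET, "output": "json", "filter": "statuscode:200", "limit": 5},
    timeout=30,
)
rows = resp.json()

if len(rows) > 1:
    header, first_capture = rows[0], rows[1]
    timestamp = first_capture[header.index("timestamp")]
    original = first_capture[header.index("original")]

    # Fetch the archived snapshot and parse its HTML.
    snapshot = requests.get(f"http://web.archive.org/web/{timestamp}/{original}", timeout=60)
    soup = BeautifulSoup(snapshot.text, "lxml")
    print(timestamp, soup.title.string if soup.title else "(no title)")
```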
Compiles current state-level DNA database statistics and policy information across all 50 states and Washington D.C.
- Data Source: State government websites, legislative databases
- Coverage: Current snapshot (August 2025)
- Content: Profile counts by type (where available), arrestee collection policies, familial search authorization, statutory citations
- Methods: Systematic web searches, policy documentation, legal statute review
Standardizes demographic composition data from state DNA databases obtained through public records requests documented in Murphy & Tong (2020).
- Data Source: FOIA responses from 7 states (Murphy & Tong, 2020, Appendix A)
- Coverage: ~2018
- Content: Racial and gender composition by profile type (offender/arrestee/forensic)
- Methods: OCR processing, data standardization, quality validation
Documents the methodology and data sources used in Murphy & Tong (2020) for calculating annual DNA collection rates by race.
- Data Source: Murphy & Tong (2020, Appendix B)
- Coverage: All 50 states
- Content: Annual collection estimates, Census demographics, calculated collection rates by race
- Methods: Data provenance tracking, methodology documentation
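As a toy illustration of the underlying arithmetic (annual collections divided by the corresponding Census population), the snippet below computes a per-1,000 collection rate; the numbers, group labels, and column names are hypothetical and do not reproduce Murphy & Tong's estimates.

```python
import pandas as pd

# Hypothetical annual DNA collections and Census population counts by group.
df = pd.DataFrame({
    "group": ["A", "B", "C"],
    "annual_collections": [12_000, 25_000, 9_000],
    "census_population": [1_500_000, 6_000_000, 2_000_000],
})

# Collection rate per 1,000 residents of each group.
df["rate_per_1000"] = 1_000 * df["annual_collections"] / df["census_population"]
print(df)
```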
- analysis/ – Quarto notebooks (NDIS, SDIS, FOIA, methodology, version freeze)
- data/{annual_dna_collection, foia, ndis, ndis_crossref, sdis}/ – one directory per dataset type, each containing:
  - raw/ – Raw inputs (Wayback HTML snapshots, FOIA PDFs, etc.)
  - intermediate/ – Staging outputs produced during processing
  - final/ – Latest cleaned datasets for each component
- data/versioned_data/ – Snapshots of final processed datasets created via analysis/version_freeze.qmd
- setup/ – Setup scripts (setup.sh, install.R, Python requirements)
- docs/ – Rendered website for GitHub Pages
- _freeze/ – Quarto freeze cache of expensive computations, used to speed up rendering
- podfridge-db-env/ – Python virtual environment information for running the scraping scripts
- Yemko Pryor (@ypryor)
- João Pedro Donadio (@DonadioJP)
- Sam Muller
- Jenna Wilson
- Tina Lasisi (@lasisilab)
- Python (≥ 3.13)
- R (≥ 4.0)
- Quarto (≥ 1.3)
- Python packages: requests, beautifulsoup4, lxml, pandas, tqdm (plus the standard-library modules hashlib, collections, pathlib, datetime, and os)
- R packages: tidyverse, rvest, httr, lubridate, jsonlite, knitr, plotly
- Web Scraping: Internet Archive Wayback Machine API
- Data Validation: Monotonicity testing and median absolute deviation (MAD) outlier detection (see the sketch after this list)
- External Validation: Comparison with peer-reviewed publications and FBI press releases
- Reproducibility: All processing code available; versioned datasets archived on Zenodo
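A minimal sketch of the two validation checks named above, monotonicity of cumulative counts and MAD-based outlier flagging, assuming pandas Series inputs; the threshold and example values are illustrative, not the project's configured settings.

```python
import pandas as pd

def nonmonotonic_flags(counts: pd.Series) -> pd.Series:
    """Flag snapshots where a cumulative count drops below the previous value."""
    return counts.diff() < 0

def mad_outlier_flags(values: pd.Series, threshold: float = 3.5) -> pd.Series:
    """Flag values whose modified z-score (based on the MAD) exceeds the threshold."""
    median = values.median()
    mad = (values - median).abs().median()
    if mad == 0:
        return pd.Series(False, index=values.index)
    modified_z = 0.6745 * (values - median) / mad
    return modified_z.abs() > threshold

# Illustrative usage on a small series of cumulative offender-profile counts.
profiles = pd.Series([100, 120, 118, 140, 900, 160])
print(nonmonotonic_flags(profiles))
print(mad_outlier_flags(profiles.diff().dropna()))
```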
All final datasets are archived and publicly available on Zenodo:
Zenodo Repository: 10.5281/zenodo.17215677
The repository includes:
- NDIS_time_series.csv – Monthly NDIS statistics (2001–2025)
- SDIS_cross_section.csv – State-level profiles and policies
- FOIA_Demographics.csv – Demographic composition from FOIA responses (Murphy & Tong, 2020)
- Annual_DNA_Collection.csv – Annual collection rates (Murphy & Tong, 2020)
- Raw HTML files, intermediate processing outputs, and complete documentation
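As a quick usage sketch, the final CSVs from the Zenodo release can be loaded directly with pandas; the path below assumes the files sit in the working directory, and column names should be checked against the release rather than assumed.

```python
import pandas as pd

# Load the monthly NDIS time series downloaded from the Zenodo release.
ndis = pd.read_csv("NDIS_time_series.csv")

# Inspect the columns and the first few rows before analysis.
print(ndis.columns.tolist())
print(ndis.head())
```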
Code: MIT License
Data: CC BY 4.0 (see Zenodo release; FOIA-derived data subject to original authors' permissions)
Last update: November 09, 2025