Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
.Rproj.user
.Rhistory
.RData
.rds
.Ruserdata
*.csv
shap_files/
Expand Down
11 changes: 11 additions & 0 deletions bibliography/packages.bib
Original file line number Diff line number Diff line change
@@ -1,3 +1,14 @@
@article{r-amadeus,
author = {Manware, Mitchell and Song, Insang and Marques, Eva S. and Kassien, Mariana Alifa and Clark, Lara P. and Messier, Kyle P.},
title = {Amadeus: Accessing and analyzing large scale environmental data in R},
year = 2025,
journal = {Environ. Model. Softw.},
volume = 186,
number = C,
doi = {10.1016/j.envsoft.2025.106352},
url = {https://doi.org/10.1016/j.envsoft.2025.106352}
}

@Manual{R-base,
title = {R: A Language and Environment for Statistical Computing},
author = {{R Core Team}},
Expand Down
267 changes: 267 additions & 0 deletions chapters/06-01-hcup-individual-usecase.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,267 @@
---
bibliography: ["bibliography/book.bib", "bibliography/packages.bib"]
biblio-style: apalike
csl: "bibliography/chicago-author-date-with-note.csl"
link-citations: true
link-bibliography: true
output:
html_document:
css: style.css
---

# Linking to Individuals

[![Profile-CDM](images/user_profiles/profilecdm.svg)](#profilecdm)
[![Profile-CRE](images/user_profiles/profilecre.svg)](#profilecre)
[![Profile-EDU](images/user_profiles/profileedu.svg)](#profileedu)
[![Profile-EPI](images/user_profiles/profileepi.svg)](#profileepi)
[![Profile-GEO](images/user_profiles/profilegeo.svg)](#profilegeo)
[![Profile-STU](images/user_profiles/profilestu.svg)](#profilestu)

### "Linking Individual-level Health Data to Environmental Variables: a Case Study of Rheumatoid Arthritis in Utah"

**Date Modified**: September 29, 2025

**Author(s)**: Austin Rau [![author-lpc](images/orcid.png){width=10}](https://orcid.org/0000-0002-5818-4864)

**Programming Language(s)**: R

## Introduction

The purpose of this tutorial is to demonstrate joining individual/payer level health information to environmental data. In this tutorial, individual-level health data from the Healthcare Utilization Project (HCUP) state emergency department database (SEDD) will be joined to environmental data from the `amadeus` [@r-amadeus] package. The SEDD provides individual-level emergency department encounter data with variables including the month of the encounter and the zip code of the patient.

As an illustrative example, emergency department (ED) vists for rheumatoid arthritis (RA) were extracted from the SEDD for the state of Utah from 2016 - 2020. RA is a chronic autoimmune disease in which the immune system attacks the joints. This can cause symptoms including swelling, pain, and joint stiffness. ICD-10 codes (M05.X, M06.X) for the primary discharge diagnosis variable (I10_DX1) in the SEDD database were used to extract RA ED encounters.

### Outline

This tutorial provides example code in `R`:

1. To prepare environmental data for joining with individual-level health data
2. To join environmental to individual-level health data
3. To visualize trends in environmental data across time for individuals

::: {.important}
The purpose of this tutorial is not to provide access to HCUP state database data.
:::

::: {.warning}
The exploratory analyses in this tutorial are for educational purposes only.
:::


## Set-up

### Load R Packages

```{r, echo = TRUE, warning = FALSE, message = FALSE}
# load libraries
library(tidyverse)
library(slider)
```

### Set working environment

```{r, eval = FALSE}
# set a working directory to the data
wd <- "your working directory"

```

```{r, echo = FALSE}

wd <- "/ddn/gs1/home/rauat/PCOR_bookdown_tools/"

```

## Reading data

In this step, the analyst will read in individual-level health data that is spatially referenced (i.e., contains information on the location of the patient such as county or zip code of residence). For the purposes of this tutorial, the individual-level health data are RA ED encounters for the state of Utah from 2016 - 2020 that are spatially referenced at the zip code level.

```{r, echo = TRUE}

ra_dat <- readRDS(paste0(wd, "dataset/ra_utah_px.rds"))

```

Next, we will load pre-processed environmental data from the `amadeus` package at the zip code level.

```{r, echo = TRUE, message = FALSE}
# read in entire folder of environmental data
# at zip code level as single dataframe
env_dat <- list.files(paste0(wd, "dataset/env_data"),
pattern = ".*csv", full.names = TRUE) %>%
map_dfr(~ read_csv(.x) %>%
# add file name as column
mutate(source_file = basename(.x)))

# filter to zip codes in patient data
env_dat <- env_dat %>%
dplyr::filter(geoid %in% ra_dat$ZIP)
```


Data cleaning needs to be completed before joining the environmental and health data. In the code chunk below, we create variables for month and year from the environmental data.

```{r, echo = TRUE}
# create a column for month and year based on the source_file variable
env_dat$year <- str_sub(env_dat$source_file, start = 5, end = 8)

env_dat$month <- str_sub(env_dat$source_file, start = 10, end = 11)

# create a pseudo-date column
env_dat$pseudo_date <- ymd(paste0(env_dat$year, "-", env_dat$month, "-", "01"))

```

In this example, we will focus on monthly mean, daily maximum temperature data from GridMet (measured in Kelvin) and monthly surface pressure data from MERRA-2 (measured in Pascals). Temperature data will be convered to degrees Celsius prior to joining with the health data.

We will also explore the notion of delayed effects - environmental exposures from the recent past may be associated with health outcomes. To reflect this, we will calculate a 2-month rolling mean for our environmental variables.

```{r, echo = TRUE}

env_cleaned <- env_dat %>%
select(geoid, tmmx, ps, pseudo_date, month, year) %>%
# Convert max temperature from Kelvin to Celsius
mutate(tmmx = tmmx - 273.15) %>%
group_by(geoid) %>%
arrange(pseudo_date, .by_group = TRUE) %>%
# Calculate 2-month rolling mean
mutate(roll_tmmx = slide_dbl(tmmx, mean, .before = 1, .complete = TRUE),
roll_ps = slide_dbl(ps, mean, .before = 1, .complete = TRUE))

```


## Joining data

```{r, echo = FALSE}
# add a leading zero to month column in patient data
ra_dat$AMONTH <- str_pad(ra_dat$AMONTH, width = 2, side = "left", pad = "0")

```

We will join the environmental data to the health data using both temporal and spatial information that is common between the two datasets. For our health data, we have information on the month, year and zip code of the ED visits. For our environmental data, we have information on the month, year and zip code for our temperature and surface pressure variables. Our join, therefore, will be based on year, month, and zip code to successfully merge the two datasets.

```{r echo=TRUE}
# Join to environmental data based on month, year and zip
res_df <- ra_dat %>%
left_join(env_cleaned,
by = c("AYEAR" = "year", "AMONTH" = "month", "ZIP" = "geoid")) %>%
# drop rows that did not match on zip
filter(!is.na(tmmx))
```


## Visualzing data

What do the first few rows of our combined dataset look like?

```{r, message = FALSE}
res_df %>%
select(person_id, AMONTH, AYEAR, contains("tmmx"), contains("ps")) %>%
head(n = 5)

```

We can see that each person in our dataset has associated environmental data.

### Histograms

Using histograms, we will explore the distribution of the environmental data for all RA ED visits in the dataset.


#### Histogram (temperature)

```{r, echo = TRUE, warning = FALSE, message = FALSE}
ggplot(data = res_df, aes(x = roll_tmmx)) +
geom_histogram() +
theme_minimal() +
labs(x = "2-month rolling mean\nmax monthly temperature (C)",
y = "Frequency") +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 11))
```


#### Histogram (surface pressure)

```{r, echo = TRUE, message = FALSE, warning = FALSE}
ggplot(data = res_df, aes(x = ps)) +
geom_histogram() +
theme_minimal() +
labs(x = "Surface pressure (Pa)", y = "Frequency") +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 11))
```


### Spaghetti plots

Now lets' zoom in on visualizing changes in environmental variables over time for a select group of RA patients who had muliple ED encounters over the study period.
Let's calculate the number of visits each person had and restrict the data to only patients who had 10+ RA ED visits from 2016 - 2020.

```{r, echo = TRUE}
# Count number of encounters over time
res_df <- res_df %>%
group_by(person_id) %>%
mutate(n_visits = n())
```


```{r, echo = TRUE}
res_long <- res_df %>%
# Filter to only individuals who had 10+ encounters for visualization purposes
dplyr::filter(n_visits > 9) %>%
mutate(person_id = as.factor(person_id)) %>%
group_by(AYEAR, AMONTH)
```


#### Spaghetti plot (temperature)

```{r}

# 2-month rolling tmax
ggplot(data = res_long, aes(x = pseudo_date, y = roll_tmmx,
color = person_id, group = person_id)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
labs(x = "Year", y = "2-month rolling mean of\nmax monthly temperature (C)",
color = "Person ID",
caption = "Each point represents an ED visit for a given patient") +
theme_minimal() +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 11),
legend.text = element_text(size = 11),
legend.title = element_text(size = 12))
```

Here we can observe the change in the 2-month rolling mean monthly max temperature across RA ED visits for 3 patients. Each encounter is marked by a point on the plot.

#### Spaghetti plot (surface pressure)

This same plot can be created to examine trends in surface pressure for this subset of patients.

```{r, warning = FALSE}
ggplot(data = res_long, aes(x = pseudo_date, y = ps,
color = person_id, group = person_id)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
labs(x = "Year", y = "Surface pressure (Pa)", color = "Person ID",
caption = "Each point represents an ED visit for a given patient.
\nPerson ID 228 had missing surface pressure data for 1
encounter-month leading to discontinuity in line chart.") +
theme_minimal() +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 11),
legend.text = element_text(size = 11))
```


## Concluding Remarks

These exploratory plots can be used as a mechanism for hypothesis generation for larger epidemiological studies investigating associations between environmental exposures and healthcare utilization for at-risk populations such as individuals with autoimmune diseases.




Binary file added dataset/ra_utah_px.rds
Binary file not shown.
Loading