Merge pull request #86 from NIEHS/ar-indvdata-0929

raucoder10 · web-flow · commit 4b1f176239ba · 2025-09-29T14:18:50.000-04:00
New chapter on linking individual level data to environmental data
diff --git a/.gitignore b/.gitignore
@@ -1,6 +1,7 @@
 .Rproj.user
 .Rhistory
 .RData
+.rds
 .Ruserdata
 *.csv
 shap_files/
diff --git a/bibliography/packages.bib b/bibliography/packages.bib
@@ -1,3 +1,14 @@
+@article{r-amadeus,
+author = {Manware, Mitchell and Song, Insang and Marques, Eva S. and Kassien, Mariana Alifa and Clark, Lara P. and Messier, Kyle P.},
+title = {Amadeus: Accessing and analyzing large scale environmental data in R},
+year = 2025,
+journal = {Environ. Model. Softw.},
+volume = 186,
+number = C,
+doi = {10.1016/j.envsoft.2025.106352},
+url = {https://doi.org/10.1016/j.envsoft.2025.106352}
+}
+
 @Manual{R-base,
   title = {R: A Language and Environment for Statistical Computing},
   author = {{R Core Team}},
diff --git a/chapters/06-01-hcup-individual-usecase.Rmd b/chapters/06-01-hcup-individual-usecase.Rmd
@@ -0,0 +1,267 @@
+---
+bibliography: ["bibliography/book.bib", "bibliography/packages.bib"]
+biblio-style: apalike
+csl: "bibliography/chicago-author-date-with-note.csl"
+link-citations: true
+link-bibliography: true
+output: 
+  html_document:
+    css: style.css
+---
+
+# Linking to Individuals 
+
+[![Profile-CDM](images/user_profiles/profilecdm.svg)](#profilecdm)
+[![Profile-CRE](images/user_profiles/profilecre.svg)](#profilecre)
+[![Profile-EDU](images/user_profiles/profileedu.svg)](#profileedu)
+[![Profile-EPI](images/user_profiles/profileepi.svg)](#profileepi)
+[![Profile-GEO](images/user_profiles/profilegeo.svg)](#profilegeo)
+[![Profile-STU](images/user_profiles/profilestu.svg)](#profilestu)
+
+### "Linking Individual-level Health Data to Environmental Variables: a Case Study of Rheumatoid Arthritis in Utah"
+
+**Date Modified**: September 29, 2025
+
+**Author(s)**: Austin Rau [![author-lpc](images/orcid.png){width=10}](https://orcid.org/0000-0002-5818-4864)
+
+**Programming Language(s)**: R
+
+## Introduction
+
+The purpose of this tutorial is to demonstrate joining individual/payer level health information to environmental data. In this tutorial, individual-level health data from the Healthcare Utilization Project (HCUP) state emergency department database (SEDD) will be joined to environmental data from the `amadeus` [@r-amadeus] package. The SEDD provides individual-level emergency department encounter data with variables including the month of the encounter and the zip code of the patient. 
+
+As an illustrative example, emergency department (ED) vists for rheumatoid arthritis (RA) were extracted from the SEDD for the state of Utah from 2016 - 2020. RA is a chronic autoimmune disease in which the immune system attacks the joints. This can cause symptoms including swelling, pain, and joint stiffness. ICD-10 codes (M05.X, M06.X) for the primary discharge diagnosis variable (I10_DX1) in the SEDD database were used to extract RA ED encounters.
+
+### Outline
+
+This tutorial provides example code in `R`: 
+
+1. To prepare environmental data for joining with individual-level health data
+2. To join environmental to individual-level health data
+3. To visualize trends in environmental data across time for individuals
+
+::: {.important}
+The purpose of this tutorial is not to provide access to HCUP state database data.
+:::
+
+::: {.warning}
+The exploratory analyses in this tutorial are for educational purposes only. 
+:::
+
+ 
+## Set-up
+
+### Load R Packages
+
+```{r, echo = TRUE, warning = FALSE, message = FALSE}
+# load libraries
+library(tidyverse)
+library(slider)
+```
+
+### Set working environment
+
+```{r, eval = FALSE}
+# set a working directory to the data
+wd <- "your working directory"
+
+```
+
+```{r, echo = FALSE}
+
+wd <- "/ddn/gs1/home/rauat/PCOR_bookdown_tools/"
+
+```
+
+## Reading data
+
+In this step, the analyst will read in individual-level health data that is spatially referenced (i.e., contains information on the location of the patient such as county or zip code of residence). For the purposes of this tutorial, the individual-level health data are RA ED encounters for the state of Utah from 2016 - 2020 that are spatially referenced at the zip code level.
+
+```{r, echo = TRUE}
+
+ra_dat <- readRDS(paste0(wd, "dataset/ra_utah_px.rds"))
+
+```
+
+Next, we will load pre-processed environmental data from the `amadeus` package at the zip code level.
+
+```{r, echo = TRUE, message = FALSE}
+# read in entire folder of environmental data
+# at zip code level as single dataframe
+env_dat <- list.files(paste0(wd, "dataset/env_data"),
+                      pattern = ".*csv", full.names = TRUE) %>%
+  map_dfr(~ read_csv(.x) %>%
+            # add file name as column
+            mutate(source_file = basename(.x)))
+
+# filter to zip codes in patient data
+env_dat <- env_dat %>%
+  dplyr::filter(geoid %in% ra_dat$ZIP)
+```
+
+
+Data cleaning needs to be completed before joining the environmental and health data. In the code chunk below, we create variables for month and year from the environmental data.
+
+```{r, echo = TRUE}
+# create a column for month and year based on the source_file variable
+env_dat$year <- str_sub(env_dat$source_file, start = 5, end = 8)
+
+env_dat$month <- str_sub(env_dat$source_file, start = 10, end = 11)
+
+# create a pseudo-date column
+env_dat$pseudo_date <- ymd(paste0(env_dat$year, "-", env_dat$month, "-", "01"))
+
+```
+
+In this example, we will focus on monthly mean, daily maximum temperature data from GridMet (measured in Kelvin) and monthly surface pressure data from MERRA-2 (measured in Pascals). Temperature data will be convered to degrees Celsius prior to joining with the health data. 
+
+We will also explore the notion of delayed effects - environmental exposures from the recent past may be associated with health outcomes. To reflect this, we will calculate a 2-month rolling mean for our environmental variables.
+
+```{r, echo = TRUE}
+
+env_cleaned <- env_dat %>%
+  select(geoid, tmmx, ps, pseudo_date, month, year) %>%
+  # Convert max temperature from Kelvin to Celsius
+  mutate(tmmx = tmmx - 273.15) %>%
+  group_by(geoid) %>%
+  arrange(pseudo_date, .by_group = TRUE) %>%
+  # Calculate 2-month rolling mean
+  mutate(roll_tmmx = slide_dbl(tmmx, mean, .before = 1, .complete = TRUE),
+         roll_ps = slide_dbl(ps, mean, .before = 1, .complete = TRUE))
+
+```
+
+
+## Joining data
+
+```{r, echo = FALSE}
+# add a leading zero to month column in patient data
+ra_dat$AMONTH <- str_pad(ra_dat$AMONTH, width = 2, side = "left", pad = "0")
+
+```
+
+We will join the environmental data to the health data using both temporal and spatial information that is common between the two datasets. For our health data, we have information on the month, year and zip code of the ED visits. For our environmental data, we have information on the month, year and zip code for our temperature and surface pressure variables. Our join, therefore, will be based on year, month, and zip code to successfully merge the two datasets.
+
+```{r echo=TRUE}
+# Join to environmental data based on month, year and zip
+res_df <- ra_dat %>%
+  left_join(env_cleaned,
+            by = c("AYEAR" = "year", "AMONTH" = "month", "ZIP" = "geoid")) %>%
+  # drop rows that did not match on zip
+  filter(!is.na(tmmx))
+```
+
+
+## Visualzing data
+
+What do the first few rows of our combined dataset look like?
+
+```{r, message = FALSE}
+res_df %>%
+  select(person_id, AMONTH, AYEAR, contains("tmmx"), contains("ps")) %>%
+  head(n = 5)
+
+```
+
+We can see that each person in our dataset has associated environmental data. 
+
+### Histograms
+
+Using histograms, we will explore the distribution of the environmental data for all RA ED visits in the dataset.
+
+
+#### Histogram (temperature)
+
+```{r, echo = TRUE, warning = FALSE, message = FALSE}
+ggplot(data = res_df, aes(x = roll_tmmx)) +
+  geom_histogram() +
+  theme_minimal() +
+  labs(x = "2-month rolling mean\nmax monthly temperature (C)",
+       y = "Frequency") +
+  theme(axis.title = element_text(size = 12),
+        axis.text = element_text(size = 11))
+```
+
+
+#### Histogram (surface pressure)
+
+```{r, echo = TRUE, message = FALSE, warning = FALSE}
+ggplot(data = res_df, aes(x = ps)) +
+  geom_histogram() +
+  theme_minimal() +
+  labs(x = "Surface pressure (Pa)", y = "Frequency") +
+  theme(axis.title = element_text(size = 12),
+        axis.text = element_text(size = 11))
+```
+
+
+### Spaghetti plots
+
+Now lets' zoom in on visualizing changes in environmental variables over time for a select group of RA patients who had muliple ED encounters over the study period.
+Let's calculate the number of visits each person had and restrict the data to only patients who had 10+ RA ED visits from 2016 - 2020.
+
+```{r, echo = TRUE}
+# Count number of encounters over time
+res_df <- res_df %>%
+  group_by(person_id) %>%
+  mutate(n_visits = n())
+```
+
+
+```{r, echo = TRUE}
+res_long <- res_df %>%
+  # Filter to only individuals who had 10+ encounters for visualization purposes
+  dplyr::filter(n_visits > 9) %>%
+  mutate(person_id = as.factor(person_id)) %>%
+  group_by(AYEAR, AMONTH)
+```
+
+
+#### Spaghetti plot (temperature)
+
+```{r}
+
+# 2-month rolling tmax
+ggplot(data = res_long, aes(x = pseudo_date, y = roll_tmmx,
+                            color = person_id, group = person_id)) +
+  geom_line(linewidth = 1) +
+  geom_point(size = 2) +
+  labs(x = "Year", y = "2-month rolling mean of\nmax monthly temperature (C)",
+       color = "Person ID",
+       caption = "Each point represents an ED visit for a given patient") +
+  theme_minimal() +
+  theme(axis.title = element_text(size = 12),
+        axis.text = element_text(size = 11),
+        legend.text = element_text(size = 11),
+        legend.title = element_text(size = 12))
+```
+
+Here we can observe the change in the 2-month rolling mean monthly max temperature across RA ED visits for 3 patients. Each encounter is marked by a point on the plot.
+
+#### Spaghetti plot (surface pressure)
+
+This same plot can be created to examine trends in surface pressure for this subset of patients.
+
+```{r, warning = FALSE}
+ggplot(data = res_long, aes(x = pseudo_date, y = ps,
+                            color = person_id, group = person_id)) +
+  geom_line(linewidth = 1) +
+  geom_point(size = 2) +
+  labs(x = "Year", y = "Surface pressure (Pa)", color = "Person ID",
+       caption = "Each point represents an ED visit for a given patient.
+  \nPerson ID 228 had missing surface pressure data for 1 
+  encounter-month leading to discontinuity in line chart.") +
+  theme_minimal() +
+  theme(axis.title = element_text(size = 12),
+        axis.text = element_text(size = 11),
+        legend.text = element_text(size = 11))
+```
+
+
+## Concluding Remarks
+
+These exploratory plots can be used as a mechanism for hypothesis generation for larger epidemiological studies investigating associations between environmental exposures and healthcare utilization for at-risk populations such as individuals with autoimmune diseases.
+
+
+
+
diff --git a/dataset/ra_utah_px.rds b/dataset/ra_utah_px.rds