Skip to content

Commit 4b1f176

Browse files
authored
Merge pull request #86 from NIEHS/ar-indvdata-0929
New chapter on linking individual level data to environmental data
2 parents 4d75b70 + fd90f58 commit 4b1f176

File tree

4 files changed

+279
-0
lines changed

4 files changed

+279
-0
lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
.Rproj.user
22
.Rhistory
33
.RData
4+
.rds
45
.Ruserdata
56
*.csv
67
shap_files/

bibliography/packages.bib

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,14 @@
1+
@article{r-amadeus,
2+
author = {Manware, Mitchell and Song, Insang and Marques, Eva S. and Kassien, Mariana Alifa and Clark, Lara P. and Messier, Kyle P.},
3+
title = {Amadeus: Accessing and analyzing large scale environmental data in R},
4+
year = 2025,
5+
journal = {Environ. Model. Softw.},
6+
volume = 186,
7+
number = C,
8+
doi = {10.1016/j.envsoft.2025.106352},
9+
url = {https://doi.org/10.1016/j.envsoft.2025.106352}
10+
}
11+
112
@Manual{R-base,
213
title = {R: A Language and Environment for Statistical Computing},
314
author = {{R Core Team}},
Lines changed: 267 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,267 @@
1+
---
2+
bibliography: ["bibliography/book.bib", "bibliography/packages.bib"]
3+
biblio-style: apalike
4+
csl: "bibliography/chicago-author-date-with-note.csl"
5+
link-citations: true
6+
link-bibliography: true
7+
output:
8+
html_document:
9+
css: style.css
10+
---
11+
12+
# Linking to Individuals
13+
14+
[![Profile-CDM](images/user_profiles/profilecdm.svg)](#profilecdm)
15+
[![Profile-CRE](images/user_profiles/profilecre.svg)](#profilecre)
16+
[![Profile-EDU](images/user_profiles/profileedu.svg)](#profileedu)
17+
[![Profile-EPI](images/user_profiles/profileepi.svg)](#profileepi)
18+
[![Profile-GEO](images/user_profiles/profilegeo.svg)](#profilegeo)
19+
[![Profile-STU](images/user_profiles/profilestu.svg)](#profilestu)
20+
21+
### "Linking Individual-level Health Data to Environmental Variables: a Case Study of Rheumatoid Arthritis in Utah"
22+
23+
**Date Modified**: September 29, 2025
24+
25+
**Author(s)**: Austin Rau [![author-lpc](images/orcid.png){width=10}](https://orcid.org/0000-0002-5818-4864)
26+
27+
**Programming Language(s)**: R
28+
29+
## Introduction
30+
31+
The purpose of this tutorial is to demonstrate joining individual/payer level health information to environmental data. In this tutorial, individual-level health data from the Healthcare Utilization Project (HCUP) state emergency department database (SEDD) will be joined to environmental data from the `amadeus` [@r-amadeus] package. The SEDD provides individual-level emergency department encounter data with variables including the month of the encounter and the zip code of the patient.
32+
33+
As an illustrative example, emergency department (ED) vists for rheumatoid arthritis (RA) were extracted from the SEDD for the state of Utah from 2016 - 2020. RA is a chronic autoimmune disease in which the immune system attacks the joints. This can cause symptoms including swelling, pain, and joint stiffness. ICD-10 codes (M05.X, M06.X) for the primary discharge diagnosis variable (I10_DX1) in the SEDD database were used to extract RA ED encounters.
34+
35+
### Outline
36+
37+
This tutorial provides example code in `R`:
38+
39+
1. To prepare environmental data for joining with individual-level health data
40+
2. To join environmental to individual-level health data
41+
3. To visualize trends in environmental data across time for individuals
42+
43+
::: {.important}
44+
The purpose of this tutorial is not to provide access to HCUP state database data.
45+
:::
46+
47+
::: {.warning}
48+
The exploratory analyses in this tutorial are for educational purposes only.
49+
:::
50+
51+
52+
## Set-up
53+
54+
### Load R Packages
55+
56+
```{r, echo = TRUE, warning = FALSE, message = FALSE}
57+
# load libraries
58+
library(tidyverse)
59+
library(slider)
60+
```
61+
62+
### Set working environment
63+
64+
```{r, eval = FALSE}
65+
# set a working directory to the data
66+
wd <- "your working directory"
67+
68+
```
69+
70+
```{r, echo = FALSE}
71+
72+
wd <- "/ddn/gs1/home/rauat/PCOR_bookdown_tools/"
73+
74+
```
75+
76+
## Reading data
77+
78+
In this step, the analyst will read in individual-level health data that is spatially referenced (i.e., contains information on the location of the patient such as county or zip code of residence). For the purposes of this tutorial, the individual-level health data are RA ED encounters for the state of Utah from 2016 - 2020 that are spatially referenced at the zip code level.
79+
80+
```{r, echo = TRUE}
81+
82+
ra_dat <- readRDS(paste0(wd, "dataset/ra_utah_px.rds"))
83+
84+
```
85+
86+
Next, we will load pre-processed environmental data from the `amadeus` package at the zip code level.
87+
88+
```{r, echo = TRUE, message = FALSE}
89+
# read in entire folder of environmental data
90+
# at zip code level as single dataframe
91+
env_dat <- list.files(paste0(wd, "dataset/env_data"),
92+
pattern = ".*csv", full.names = TRUE) %>%
93+
map_dfr(~ read_csv(.x) %>%
94+
# add file name as column
95+
mutate(source_file = basename(.x)))
96+
97+
# filter to zip codes in patient data
98+
env_dat <- env_dat %>%
99+
dplyr::filter(geoid %in% ra_dat$ZIP)
100+
```
101+
102+
103+
Data cleaning needs to be completed before joining the environmental and health data. In the code chunk below, we create variables for month and year from the environmental data.
104+
105+
```{r, echo = TRUE}
106+
# create a column for month and year based on the source_file variable
107+
env_dat$year <- str_sub(env_dat$source_file, start = 5, end = 8)
108+
109+
env_dat$month <- str_sub(env_dat$source_file, start = 10, end = 11)
110+
111+
# create a pseudo-date column
112+
env_dat$pseudo_date <- ymd(paste0(env_dat$year, "-", env_dat$month, "-", "01"))
113+
114+
```
115+
116+
In this example, we will focus on monthly mean, daily maximum temperature data from GridMet (measured in Kelvin) and monthly surface pressure data from MERRA-2 (measured in Pascals). Temperature data will be convered to degrees Celsius prior to joining with the health data.
117+
118+
We will also explore the notion of delayed effects - environmental exposures from the recent past may be associated with health outcomes. To reflect this, we will calculate a 2-month rolling mean for our environmental variables.
119+
120+
```{r, echo = TRUE}
121+
122+
env_cleaned <- env_dat %>%
123+
select(geoid, tmmx, ps, pseudo_date, month, year) %>%
124+
# Convert max temperature from Kelvin to Celsius
125+
mutate(tmmx = tmmx - 273.15) %>%
126+
group_by(geoid) %>%
127+
arrange(pseudo_date, .by_group = TRUE) %>%
128+
# Calculate 2-month rolling mean
129+
mutate(roll_tmmx = slide_dbl(tmmx, mean, .before = 1, .complete = TRUE),
130+
roll_ps = slide_dbl(ps, mean, .before = 1, .complete = TRUE))
131+
132+
```
133+
134+
135+
## Joining data
136+
137+
```{r, echo = FALSE}
138+
# add a leading zero to month column in patient data
139+
ra_dat$AMONTH <- str_pad(ra_dat$AMONTH, width = 2, side = "left", pad = "0")
140+
141+
```
142+
143+
We will join the environmental data to the health data using both temporal and spatial information that is common between the two datasets. For our health data, we have information on the month, year and zip code of the ED visits. For our environmental data, we have information on the month, year and zip code for our temperature and surface pressure variables. Our join, therefore, will be based on year, month, and zip code to successfully merge the two datasets.
144+
145+
```{r echo=TRUE}
146+
# Join to environmental data based on month, year and zip
147+
res_df <- ra_dat %>%
148+
left_join(env_cleaned,
149+
by = c("AYEAR" = "year", "AMONTH" = "month", "ZIP" = "geoid")) %>%
150+
# drop rows that did not match on zip
151+
filter(!is.na(tmmx))
152+
```
153+
154+
155+
## Visualzing data
156+
157+
What do the first few rows of our combined dataset look like?
158+
159+
```{r, message = FALSE}
160+
res_df %>%
161+
select(person_id, AMONTH, AYEAR, contains("tmmx"), contains("ps")) %>%
162+
head(n = 5)
163+
164+
```
165+
166+
We can see that each person in our dataset has associated environmental data.
167+
168+
### Histograms
169+
170+
Using histograms, we will explore the distribution of the environmental data for all RA ED visits in the dataset.
171+
172+
173+
#### Histogram (temperature)
174+
175+
```{r, echo = TRUE, warning = FALSE, message = FALSE}
176+
ggplot(data = res_df, aes(x = roll_tmmx)) +
177+
geom_histogram() +
178+
theme_minimal() +
179+
labs(x = "2-month rolling mean\nmax monthly temperature (C)",
180+
y = "Frequency") +
181+
theme(axis.title = element_text(size = 12),
182+
axis.text = element_text(size = 11))
183+
```
184+
185+
186+
#### Histogram (surface pressure)
187+
188+
```{r, echo = TRUE, message = FALSE, warning = FALSE}
189+
ggplot(data = res_df, aes(x = ps)) +
190+
geom_histogram() +
191+
theme_minimal() +
192+
labs(x = "Surface pressure (Pa)", y = "Frequency") +
193+
theme(axis.title = element_text(size = 12),
194+
axis.text = element_text(size = 11))
195+
```
196+
197+
198+
### Spaghetti plots
199+
200+
Now lets' zoom in on visualizing changes in environmental variables over time for a select group of RA patients who had muliple ED encounters over the study period.
201+
Let's calculate the number of visits each person had and restrict the data to only patients who had 10+ RA ED visits from 2016 - 2020.
202+
203+
```{r, echo = TRUE}
204+
# Count number of encounters over time
205+
res_df <- res_df %>%
206+
group_by(person_id) %>%
207+
mutate(n_visits = n())
208+
```
209+
210+
211+
```{r, echo = TRUE}
212+
res_long <- res_df %>%
213+
# Filter to only individuals who had 10+ encounters for visualization purposes
214+
dplyr::filter(n_visits > 9) %>%
215+
mutate(person_id = as.factor(person_id)) %>%
216+
group_by(AYEAR, AMONTH)
217+
```
218+
219+
220+
#### Spaghetti plot (temperature)
221+
222+
```{r}
223+
224+
# 2-month rolling tmax
225+
ggplot(data = res_long, aes(x = pseudo_date, y = roll_tmmx,
226+
color = person_id, group = person_id)) +
227+
geom_line(linewidth = 1) +
228+
geom_point(size = 2) +
229+
labs(x = "Year", y = "2-month rolling mean of\nmax monthly temperature (C)",
230+
color = "Person ID",
231+
caption = "Each point represents an ED visit for a given patient") +
232+
theme_minimal() +
233+
theme(axis.title = element_text(size = 12),
234+
axis.text = element_text(size = 11),
235+
legend.text = element_text(size = 11),
236+
legend.title = element_text(size = 12))
237+
```
238+
239+
Here we can observe the change in the 2-month rolling mean monthly max temperature across RA ED visits for 3 patients. Each encounter is marked by a point on the plot.
240+
241+
#### Spaghetti plot (surface pressure)
242+
243+
This same plot can be created to examine trends in surface pressure for this subset of patients.
244+
245+
```{r, warning = FALSE}
246+
ggplot(data = res_long, aes(x = pseudo_date, y = ps,
247+
color = person_id, group = person_id)) +
248+
geom_line(linewidth = 1) +
249+
geom_point(size = 2) +
250+
labs(x = "Year", y = "Surface pressure (Pa)", color = "Person ID",
251+
caption = "Each point represents an ED visit for a given patient.
252+
\nPerson ID 228 had missing surface pressure data for 1
253+
encounter-month leading to discontinuity in line chart.") +
254+
theme_minimal() +
255+
theme(axis.title = element_text(size = 12),
256+
axis.text = element_text(size = 11),
257+
legend.text = element_text(size = 11))
258+
```
259+
260+
261+
## Concluding Remarks
262+
263+
These exploratory plots can be used as a mechanism for hypothesis generation for larger epidemiological studies investigating associations between environmental exposures and healthcare utilization for at-risk populations such as individuals with autoimmune diseases.
264+
265+
266+
267+

dataset/ra_utah_px.rds

2.93 KB
Binary file not shown.

0 commit comments

Comments
 (0)