-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathUnivariate.qmd
316 lines (219 loc) · 11.4 KB
/
Univariate.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
---
title: "Univariable analysis"
editor: visual
---
## 1. Learning outcomes
At the end of the session, participants will be able to:
- Perform hypothesis tests.
- Estimate risk ratios (also called relative risk) for categorical data.
- Interpret the univariable results.
- (Optional) Investigate dose-response relationships in categorical data.
## 2. Story/plot description
You have just estimated risk ratios (RR) of individual food items manually to better understand the concept. RR give you an idea of which food items could be the culprit(s) of the outbreak.
Now, you will practice performing hypothesis testing and calculating RR in R to investigate the associations of suspicious food items with the disease.
Note that, although this is not an exhaustive list of tests, the following can be useful for this exercise (relevant for comparisons between two groups):
- **For continuous variables:**
- `shapiro.test()` (for checking if normally distributed)
- `t.test()` (for normal distributions)
- `wilcox.test()` (for non-parametric testing. Used to determine if two numeric samples are from the same distribution, when their populations are not normally distributed or have unequal variance.)
- **For categorical variables:**
- `chisq.test()` (if all comparisons include at least 5 cases)
- `fisher.test()` (if there are \< 5 cases for any comparison)
## 3. Questions/Assignments
## 3.1. Install packages and load libraries
```{r}
# Load the required libraries into the current R session:
pacman::p_load(rio,
here,
tidyverse,
skimr,
plyr,
janitor,
lubridate,
gtsummary,
flextable,
officer,
epikit,
apyramid,
scales,
EpiStats,
broom)
```
## 3.2. Import your data
```{r, Import_data}
# Import the raw data set:
copdata <- rio::import(here::here("data", "Spetses_clean2_2024.rds"))
```
## 3.3. Hypothesis tests for other variables
Check if the following variables are associated with being a case: age, sex, class and group.
#### a) age
With the Shapiro-Wilk test we check if the variables are following the normal distribution. The null hypothesis is that the data follow a normal distribution, therefore, rejecting the null hypothesis means that the data do not follow the normal distribution. A p-value below the cutoff for rejecting the null hypothesis, e.g., a p-value\<0.05 means that we reject the null hypothesis that the data follow the normal distribution. For age, the p-value is \<0.05, therefore we reject the null hypothesis that the data are normally distributed. As we see in the graph most frequently reported age is \<20 years.
```{r}
# Check if age overall follows a normal distribution:
shapiro.test(copdata$age)
# Can simply have a look at
hist(copdata$age)
# Looking only at the students:
students <- copdata %>%
filter(group == "student")
hist(students$age)
```
Age overall (nor within the students' group) is not normally distributed.
We compare the age for cases and non-cases using the Wilcoxon test that is used when the data are not normally distributed. The null hypothesis is that there is no difference in the age between the two groups compared. Given that p-value\>0.05 we do not reject the null hypothesis.
```{r}
# Perform Wilcoxon rank sum test on age and sex:
wilcox.test(age ~ case,
data = copdata)
```
#### b) sex
```{r}
copdata %>%
select(sex, case) %>%
tbl_summary(by = case) %>%
add_p()
```
::: {.callout-warning title="Let's stop and... think!" collapse="true"}
What do these results tell you - is there an association between sex and being a case?
Do these results differ from what you expected when looking at the descriptive figures (case proportions stratified by sex, or the age sex pyramid)? If so, why do you think this is?
:::
#### c) class
```{r}
copdata %>%
select(class, case) %>%
tbl_summary(by = case) %>%
add_p()
```
In this case you have a 2x3 contingency table of class against disease status, we can use a chi-square here. P-value of the association between class and age is p-val = 0.09, suggesting that there is an association at the p-val = 0.20 level among at least one of the 3 different classes. Fellows may want to investigate this further, but in the remaining of the study this is not explored further.
A hypothesis could be that one of the classes sat together and, for whatever reason, they ate more of the contaminated food item at those tables. This could be further studied if we had the spatial distribution of where the people sat at the dinner. Another hypothesis is that one of the classes differ from other classes in some characteristics that made them more susceptible or more exposed to the infected food item.
#### d) group
```{r}
copdata %>%
select(group, case) %>%
tbl_summary(by = case) %>%
add_p()
```
P-value of the association between group and age is p-val = 0.2, suggesting that there may be an association at the p-val = 0.20 level. Explanations are like above, with variable class.
#### Let's do all together
```{r}
copdata %>%
select(sex, class, group, case) %>%
tbl_summary(by = case) %>%
add_p()
```
## 3.4. Risk Ratios
The risk ratios of each food item (including the 2x2 table) are reported below. The output of the `CS()` command is two tables: one with the 2x2 table and one with the risk difference, the risk ratio and the attributable fraction among exposed as well as the attributable fraction among the population (and the confidence intervals for all the estimates). The Chi-square and the p-value are also reported. In the second part, a table with all the food items is printed including attack rates for exposed and unexposed as well as risk ratios and the 95% confidence intervals (CI ll and CI ul, for the lower and upper interval) and p-values.
### a) Calculate 95% CI Risk Ratios for food
To see if food items (note these are categorical variables) are associated with being a case, calculate risk ratios and 95% confidence intervals for food items. You can calculate this individually for each food item (using `CSTable()`), or all at once (see hint below).
::: {.callout-tip title="Need a little bit of help?" collapse="true"}
Create a `food_vars` vector containing food variables of interest and use the function `CSTable()` of the EpiStats package.
*Using this hint, you will create a table with the name of exposure variables, the total number of exposed, the number of exposed cases, the attack rate among the exposed, the total number of unexposed, the number of unexposed cases, the attack rate among the unexposed, risk ratios, 95% percent confidence intervals, and p-values. Amazing, isn't it? Have a look at `??CSTable` to learn more.*
:::
```{r}
# You could use the EpiStats package for each food item
CS(copdata, "case", "feta")
CS(copdata, "case", "sardines")
CS(copdata, "case", "eggplant")
CS(copdata, "case", "moussaka")
```
```{r}
# You can save time (and probably typos!) by creating a vector for food variables...
food_vars <- c("feta", "sardines", "eggplant", "moussaka",
"orzo", "greeksal", "dessert", "bread",
"champagne", "beer", "redwine", "whitewine")
# ...and using EpiStats::CSTable() to run all variables together!
CSTable(copdata, "case", food_vars)
```
### b) Prepare the RR table for publication
::: {.callout-tip title="Need a little bit of help?" collapse="true"}
Use `flextable()` and `set_header_labels()`.
:::
```{r}
rr_tbl <- CSTable(copdata, "case", food_vars) %>%
as.data.frame() %>%
rownames_to_column() %>%
flextable() %>%
set_header_labels(
values = c("Food Item",
"Total exposed",
"Cases exposed",
"AR among exposed",
"Total unexposed",
"Cases unexposed",
"AR among unexposed",
"RR",
"95% lower CI",
"95% upper CI",
"p-value"))
```
::: {.callout-warning title="Let's stop and... think!" collapse="true"}
· What can you infer from this table?
· Considering the relative risks, which food or drink items do you think were most likely to be the vehicle(s) of infection in this outbreak?
· Do you think there are any confounders or effect modifiers? If so, how would you investigate these further?
:::
The interesting results here are that the food items that are most suspicious are orzo, moussaka and champagne. Orzo as such is unlikely to be contaminated, but as you can see in the picture (and from the dinner night), it was served with pesto! Maybe it was the pesto? Who-ho-ho!
Before one jumps into conclusions, consider that this result could be due to confounding! Maybe orzo was "clean" but eaten by all the people who ate the food item that actually was contaminated!(Optional)
## 3.5. (Optional) Dose Response
Check for a dose-response relationship between the food items with the highest RR values (the top 3) and being a case.
#### a) Orzo
Using `epitools::riskratio` function:
```{r}
epitools::riskratio(copdata$orzoD,
copdata$case,
conf.level = 0.95)
```
Using a binomial regression:
```{r}
# Binomial regression for RRs.
# The outcome needs to be exponentiated so we can interpret it properly!
binom_orzoD <- glm(case ~ orzoD, data = copdata,
family = binomial(link = "log"))
# To get exponentiated:
binom_orzoD_exp <- glm(case ~ orzoD, data = copdata,
family = binomial(link = "log")) %>%
tidy(exponentiate = TRUE,
conf.int = TRUE)
binom_orzoD_exp
```
::: {.callout-warning title="Let's stop and... think!" collapse="true"}
What do these results tell you?
:::
Results suggest a dose response relationship of having eaten orzo, pointing towards orzoas the potential vehicle. The higher the amount of orzothey ate, the stronger is the association (RR) with getting ill/being a case.
#### b) Moussaka
Using `epitools::riskratio` function:
```{r}
epitools::riskratio(copdata$moussakaD,
copdata$case,
conf.level = 0.95)
```
Using a binomial regression:
```{r}
# Let's get the results directly exponentiated
binom_moussakaD_exp <- glm(case ~ moussakaD, data = copdata,
family = binomial(link = "log")) %>%
tidy(exponentiate = TRUE,
conf.int = TRUE)
binom_moussakaD_exp
```
#### c) Champagne
Using `epitools::riskratio` function:
```{r}
epitools::riskratio(copdata$champagneD,
copdata$case,
conf.level = 0.95)
```
Using a binomial regression:
```{r}
# Let's get the results directly exponentiated
binom_champagneD_exp <- glm(case ~ champagneD, data = copdata,
family = binomial(link = "log")) %>%
tidy(exponentiate = TRUE,
conf.int = TRUE)
binom_champagneD_exp
```
## 3.6 Summary
As a summary of what you've done above, answer these questions:
- Is the respondents' sex associated with being a case?
- Is the school class associated with being a case?
- Which foods increase the risk of being a case?
- (Optional) Is there a dose-response relationship between the food items and being a case?
- What do you think is the most likely culprit(s) of this outbreak at this point? Are there any risk factors you would like to highlight?