Univariate.qmd

---
title: "Univariable analysis"
editor: visual
---

## 1. Learning outcomes

At the end of the session, participants will be able to:

-   Perform hypothesis tests.

-   Estimate risk ratios (also called relative risk) for categorical data.

-   Interpret the univariable results.

-   (Optional) Investigate dose-response relationships in categorical data.

## 2. Story/plot description

You have just estimated risk ratios (RR) of individual food items manually to better understand the concept. RR give you an idea of which food items could be the culprit(s) of the outbreak.

Now, you will practice performing hypothesis testing and calculating RR in R to investigate the associations of suspicious food items with the disease.

Note that, although this is not an exhaustive list of tests, the following can be useful for this exercise (relevant for comparisons between two groups):

-   **For continuous variables:**

    -   `shapiro.test()` (for checking if normally distributed)

    -   `t.test()` (for normal distributions)

    -   `wilcox.test()` (for non-parametric testing. Used to determine if two numeric samples are from the same distribution, when their populations are not normally distributed or have unequal variance.)

-   **For categorical variables:**

    -   `chisq.test()` (if all comparisons include at least 5 cases)

    -   `fisher.test()` (if there are \< 5 cases for any comparison)

## 3. Questions/Assignments

## 3.1. Install packages and load libraries

```{r}
# Load the required libraries into the current R session:
pacman::p_load(rio, 
               here, 
               tidyverse, 
               skimr,
               plyr,
               janitor,
               lubridate,
               gtsummary, 
               flextable,
               officer,
               epikit, 
               apyramid, 
               scales,
               EpiStats,
               broom)

```

## 3.2. Import your data

```{r, Import_data}
# Import the raw data set:
copdata <- rio::import(here::here("data", "Spetses_clean2_2024.rds"))
```

## 3.3. Hypothesis tests for other variables

Check if the following variables are associated with being a case: age, sex, class and group.

#### a) age

With the Shapiro-Wilk test we check if the variables are following the normal distribution. The null hypothesis is that the data follow a normal distribution, therefore, rejecting the null hypothesis means that the data do not follow the normal distribution. A p-value below the cutoff for rejecting the null hypothesis, e.g., a p-value\<0.05 means that we reject the null hypothesis that the data follow the normal distribution. For age, the p-value is \<0.05, therefore we reject the null hypothesis that the data are normally distributed. As we see in the graph most frequently reported age is \<20 years.

```{r}
# Check if age overall follows a normal distribution:
shapiro.test(copdata$age)
# Can simply have a look at
hist(copdata$age)

# Looking only at the students:
students <- copdata %>% 
  filter(group == "student")
hist(students$age)
```

Age overall (nor within the students' group) is not normally distributed.

We compare the age for cases and non-cases using the Wilcoxon test that is used when the data are not normally distributed. The null hypothesis is that there is no difference in the age between the two groups compared. Given that p-value\>0.05 we do not reject the null hypothesis.

```{r}
# Perform Wilcoxon rank sum test on age and sex:
wilcox.test(age ~ case, 
            data = copdata)
```

#### b) sex

```{r}
copdata %>% 
  select(sex, case) %>% 
  tbl_summary(by = case) %>% 
  add_p()

```

::: {.callout-warning title="Let's stop and... think!" collapse="true"}
What do these results tell you - is there an association between sex and being a case?

Do these results differ from what you expected when looking at the descriptive figures (case proportions stratified by sex, or the age sex pyramid)? If so, why do you think this is?
:::

#### c) class

```{r}
copdata %>% 
  select(class, case) %>% 
  tbl_summary(by = case) %>% 
  add_p()

```

In this case you have a 2x3 contingency table of class against disease status, we can use a chi-square here. P-value of the association between class and age is p-val = 0.09, suggesting that there is an association at the p-val = 0.20 level among at least one of the 3 different classes. Fellows may want to investigate this further, but in the remaining of the study this is not explored further.

A hypothesis could be that one of the classes sat together and, for whatever reason, they ate more of the contaminated food item at those tables. This could be further studied if we had the spatial distribution of where the people sat at the dinner. Another hypothesis is that one of the classes differ from other classes in some characteristics that made them more susceptible or more exposed to the infected food item.

#### d) group

```{r}
copdata %>% 
  select(group, case) %>% 
  tbl_summary(by = case) %>% 
  add_p()
```

P-value of the association between group and age is p-val = 0.2, suggesting that there may be an association at the p-val = 0.20 level. Explanations are like above, with variable class.

#### Let's do all together

```{r}

copdata %>% 
  select(sex, class, group, case) %>% 
  tbl_summary(by = case) %>% 
  add_p()

```

## 3.4. Risk Ratios

The risk ratios of each food item (including the 2x2 table) are reported below. The output of the `CS()` command is two tables: one with the 2x2 table and one with the risk difference, the risk ratio and the attributable fraction among exposed as well as the attributable fraction among the population (and the confidence intervals for all the estimates). The Chi-square and the p-value are also reported. In the second part, a table with all the food items is printed including attack rates for exposed and unexposed as well as risk ratios and the 95% confidence intervals (CI ll and CI ul, for the lower and upper interval) and p-values.

### a) Calculate 95% CI Risk Ratios for food

To see if food items (note these are categorical variables) are associated with being a case, calculate risk ratios and 95% confidence intervals for food items. You can calculate this individually for each food item (using `CSTable()`), or all at once (see hint below).

::: {.callout-tip title="Need a little bit of help?" collapse="true"}
Create a `food_vars` vector containing food variables of interest and use the function `CSTable()` of the EpiStats package.

*Using this hint, you will create a table with the name of exposure variables, the total number of exposed, the number of exposed cases, the attack rate among the exposed, the total number of unexposed, the number of unexposed cases, the attack rate among the unexposed, risk ratios, 95% percent confidence intervals, and p-values. Amazing, isn't it? Have a look at `??CSTable` to learn more.*
:::

```{r}
# You could use the EpiStats package for each food item
CS(copdata, "case", "feta")
CS(copdata, "case", "sardines")
CS(copdata, "case", "eggplant")
CS(copdata, "case", "moussaka")
```

```{r}
# You can save time (and probably typos!) by creating a vector for food variables...
food_vars <- c("feta", "sardines", "eggplant", "moussaka", 
               "orzo", "greeksal", "dessert", "bread",  
               "champagne", "beer", "redwine", "whitewine")

# ...and using EpiStats::CSTable() to run all variables together!
CSTable(copdata, "case", food_vars)
```

### b) Prepare the RR table for publication

::: {.callout-tip title="Need a little bit of help?" collapse="true"}
Use `flextable()` and `set_header_labels()`.
:::

```{r}
rr_tbl <- CSTable(copdata, "case", food_vars) %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  flextable() %>% 
   set_header_labels(
     values = c("Food Item",
                "Total exposed",     
               "Cases exposed", 
               "AR among exposed",    
               "Total unexposed",
               "Cases unexposed",
               "AR among unexposed",
               "RR",         
               "95% lower CI",             
               "95% upper CI",
               "p-value"))
```

::: {.callout-warning title="Let's stop and... think!" collapse="true"}
· What can you infer from this table?

· Considering the relative risks, which food or drink items do you think were most likely to be the vehicle(s) of infection in this outbreak?

· Do you think there are any confounders or effect modifiers? If so, how would you investigate these further?
:::

The interesting results here are that the food items that are most suspicious are orzo, moussaka and champagne. Orzo as such is unlikely to be contaminated, but as you can see in the picture (and from the dinner night), it was served with pesto! Maybe it was the pesto? Who-ho-ho!

Before one jumps into conclusions, consider that this result could be due to confounding! Maybe orzo was "clean" but eaten by all the people who ate the food item that actually was contaminated!(Optional)

## 3.5. (Optional) Dose Response

Check for a dose-response relationship between the food items with the highest RR values (the top 3) and being a case.

#### a) Orzo

Using `epitools::riskratio` function:

```{r}
epitools::riskratio(copdata$orzoD,
                    copdata$case,
                    conf.level = 0.95)
```

Using a binomial regression:

```{r}
# Binomial regression for RRs. 
# The outcome needs to be exponentiated so we can interpret it properly!
binom_orzoD <- glm(case ~ orzoD, data = copdata, 
             family = binomial(link = "log"))

# To get exponentiated:
binom_orzoD_exp <- glm(case ~ orzoD, data = copdata, 
                       family = binomial(link = "log")) %>% 
  tidy(exponentiate = TRUE, 
       conf.int = TRUE)

binom_orzoD_exp
```

::: {.callout-warning title="Let's stop and... think!" collapse="true"}
What do these results tell you?
:::

Results suggest a dose response relationship of having eaten orzo, pointing towards orzoas the potential vehicle. The higher the amount of orzothey ate, the stronger is the association (RR) with getting ill/being a case.

#### b) Moussaka

Using `epitools::riskratio` function:

```{r}
epitools::riskratio(copdata$moussakaD,                     
                    copdata$case,                     
                    conf.level = 0.95)
```

Using a binomial regression:

```{r}

# Let's get the results directly exponentiated
binom_moussakaD_exp <- glm(case ~ moussakaD, data = copdata, 
                       family = binomial(link = "log")) %>% 
  tidy(exponentiate = TRUE, 
       conf.int = TRUE)

binom_moussakaD_exp
```

#### c) Champagne

Using `epitools::riskratio` function:

```{r}
epitools::riskratio(copdata$champagneD,                     
                    copdata$case,                     
                    conf.level = 0.95)
```

Using a binomial regression:

```{r}
# Let's get the results directly exponentiated
binom_champagneD_exp <- glm(case ~ champagneD, data = copdata, 
                       family = binomial(link = "log")) %>% 
  tidy(exponentiate = TRUE, 
       conf.int = TRUE)

binom_champagneD_exp
```

## 3.6 Summary

As a summary of what you've done above, answer these questions:

-   Is the respondents' sex associated with being a case?

-   Is the school class associated with being a case?

-   Which foods increase the risk of being a case?

-   (Optional) Is there a dose-response relationship between the food items and being a case?

-   What do you think is the most likely culprit(s) of this outbreak at this point? Are there any risk factors you would like to highlight?