-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy path01-06-one-way-anvoa.Rmd
326 lines (253 loc) · 18 KB
/
01-06-one-way-anvoa.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
# One-Way ANOVA {#one-way-anova}
Next, let's go over the one-way ANOVA, which is used when we have two or more independent groups. Specifically, the one-way ANOVA determines if there is a difference in any of the group level means. Given that the analysis can also be used for two independent groups, this analysis could be used instead of the independent samples *t*-test.
For example, let's say that we were interested in determining if salary was significantly different for professors of different ranks (i.e., assistant professor, associate professor, and professor).
For this example, we will use the dataset: [`datasetSalaries`][Salaries Dataset].
As mentioned in the independent samples *t*-test, when dealing with categorical predictors, it is always good idea to check the coding scheme of the categorical variable.
```{r, include=F, echo=F, eval=T}
contrasts(datasetSalaries$rank) <- contr.treatment(3)
```
```{r}
contrasts(datasetSalaries$rank)
```
## Coding Categorical Variables {#coding-categorical-variables}
When coding categorical variables, the number of contrasts ($m$) is equal to $k-1$ where $k$ is the number of levels (or factors) of an IV. In our example, `rank` has 3 levels (i.e., `Assistant Professor`, `Associate Professor`, and `Professor`) and thus has 2 contrasts [^13].
[^3]: $m_{rank}=k_{rank}-1=3-1=2$
This is the reason why in a one-way ANOVA, we have $k-1$ between degrees of freedom. Thus, contrasts are always being performed, either explicitly if told or using a software package's default coding scheme (e.g., dummy coding for R) when testing a main effect of a categorical IV.
### Dummy Coding
Since `rank` has not yet been explicitly defined by us yet, `rank` is dummy coded (i.e., for each contrast, a level is coded as 1 and all other levels are coded as 0). In dummy coding, one level will always be coded as 0 in both contrasts. By default, the first level of the categorical IV in alphabetical order will be coded as 0. Thus, in our example, `Assistant Professor`, which is the first level of `rank` in alphabetical order is coded as `0`.
When interpreting dummy coding contrasts, for each contrast, it is the level that is coded as 1 compared to the level that is coded as 0 in both contrasts. In our dummy coding scheme, the first contrast is comparing `Associate Professor`, which is coded as 1 in the first contrast, to `Assistant Professor`, which is coded as 0 in both contrasts. The second contrast is comparing `Professor`, which is coded as 1 in the second contrast, to `Assistant Professor`, which is again coded as 0 in both contrasts.
### Orthogonal Coding {#othogon}
However, *a priori* orthogonal contrast coding is preferred because this coding scheme allows us to create our own *a priori* contrasts of interest to answer more specific questions.
For example, let's say that in addition to determining if there is a difference in salary of professors based on their rank that we were also interested in more specific questions of interest such as:
1. Determining if there is a difference in salaries of professors that are tenured (`Associate Professor` and `Professor`) compared to those that are untenured (`Assistant Professor`).
2. Determining if there is a difference in salaries of professors that are `Associate Professors` compared to `Professors`.
```{r}
contrasts(datasetSalaries$rank) <- cbind(
TenuredvAssistant = c(-2, 1, 1),
AssociatevProfessor = c(0, -1, 1)
)
```
To make a coding scheme orthogonal, we need to make sure that 1) the sum of each contrast equals zero and that 2) the sum of each contrast product is 0.
For example, to make the tenured vs. untenured professor contrast, we can code `Assistant Professor` as `-2` and both `Associate Professor` and `Professor` as `1` and assign it the name `TenuredvAssistant`. It's best to use signs that go in the direction of our prediction so that it's reflected in our estimates. Since, we expect that tenured professors have higher salaries, tenured professors are assigned the positive value and assistant professors are assigned the negative value. Notice that the sum of the `TenuredvAssistant` contrast also adds up to 0 [^14].
[^14]: $\Sigma Contrast\_Codes_{TenuredvAssistant} = -2 + 1 + 1 = 0$
To make the next contrast of `Associate Professor` compared to `Professor`, we can code `Assistant Professor` as `0`, `Associate Professor` as `-1`, and `Professor` as `1` and give it the name `AssociatevProfessor`. Notice again that the sum of the `AssociatevProfessor` contrast adds up to 0 [^15].
[^15]: $\Sigma Contrast\_Code_{AssociatevProfessor} = 0 -1 + 1 = 0$
Since both contrasts sum to zero, we can now check if the sum of products for each level across contrasts equals 0. In other words, we multiply the values across each contrast pair for each level and add them together. For example, the product across the contrasts of `Assistant Professor` is $2*0 = 0$, `Associate Professor` is $1*-1=-1$, and `Professor` is $1*1=1$. If we add them together, we get $0-1+1=0$ [^16].
[^16]: $\Sigma Contrast_{TenuredvAssistant}*Contrast_{AssociatevProfessor} = (2*0) + (1*-1) + (1*1) = 0$
To take orthogonal contrast coding a step further and make our estimates readily interpretable as the mean difference of that contrast, we can ensure that the difference of each contrast is 1. To do this, we can divide each contrast code by the number of values that are not coded as 0 for each contrast. For example, since all 3 codes of the `TenuredvAssistant` contrast are not equal to 0, we can divide by 3. Similarly, for the `AssociatevProfessor` contrast, we can divide each contrast code by 2 since 2 of the contrast codes are not equal to 0. Note that this is a rule of thumb and should be verified if contrasts are more complex.
```{r}
contrasts(datasetSalaries$rank) <- cbind(
TenuredvAssistant = c(-2 / 3, 1 / 3, 1 / 3),
AssociatevProfessor = c(0, -1 / 2, 1 / 2)
)
```
```{r}
contrasts(datasetSalaries$rank)
```
## Null and research hypothesis
### Traditional approach
<center>
$H_0: \mu_{Assistant\_Professors} = \mu_{Associate\_Professors} = \mu_{Professors}$
<BR>$H_1: \mu_{Assistant\_Professors} \ne \mu_{Associate\_Professors} \ne \mu_{Professors}$
<BR>or $\mu_{Assistant\_Professors} \ne \mu_{Associate\_Professors} = \mu_{Professors}$
<BR>or $\mu_{Assistant\_Professors} \ne \mu_{Professors} = \mu_{Associate\_Professors}$
<BR>or $\mu_{Associate\_Professors} \ne \mu_{Professors} = \mu_{Assistant\_Professors}$
</center>
<BR>The null hypothesis states that there is no difference in salaries between professors of different ranks. The research hypothesis is stating that the salary of least one rank of professors is different than the others. Thus, the multiple options for a research hypothesis.
### GLM approach
<CENTER>
$Model: Salary = \beta_0 + \beta_1*TenuredvAssistant + \beta_2*AssociatevProfessor + \varepsilon$
<BR>$H_0: \beta_1 = \beta_2 = 0$
<BR>$H_1: \beta_1 \ne 0$ and/or $\beta_2 \ne 0$
</CENTER>
In the model, we now have both contrasts as predictors. The main effect of rank is testing both predictors of `TenuredvAssistant` and `AssociatevProfessor`.
Given our orthogonal contrast coding scheme, the intercept ($\beta_0$) is now the mean 9-month academic salary of professors on average across rank.
The slope of `TenuredvAssistant` ($\beta_1$) represents the mean difference in the 9-month academic salary for professors that are tenured compared to assistant professors.
The slope of `AssociatevProfessor` ($\beta_2$) represents the mean difference in the 9-month academic salary for professors that are associate professors compared to professors.
Thus, the null hypothesis states that there is a difference in salary of tenured versus untenured professors and there is also no difference in salary of associate professors compared to professors. In other words, there is no difference in salary based on a professor's rank. The alternative hypothesis states that there is a difference in either the salary of tenured versus untenured professors or associate professor versus professor, or both. If only one contrast is significant, then that contrast must be strong enough to mask the non-significance of the other contrast.
Given the vagueness of the research hypothesis (i.e., the significance can be a single contrast or both contrasts), we can look at each individual slope to see which is actually significant since we explicitly defined its associated contrasts beforehand.
Thus, our null and research hypotheses for these specific questions would be the following for each slope:
$$H_0: \beta_1 = 0$$
$$H_1: \beta_1 \ne 0$$
$$H_0: \beta_2 = 0$$
$$H_1: \beta_2 \ne 0$$
Each of these null and research hypothesis pairs are also known as 1 degree of freedom tests since we are only testing 1 contrast at a time.
## Statistical analysis
### Traditional approach
To perform a traditional One-Way ANOVA, we can use the `aov()` function. The first argument in the `aov()` function is the formula and the second argument is the name of the dataset. Notice, that in the formula, we do not have to specify both contrasts because we have already applied the coding scheme directly to the categorical IV of `rank`.
```{r}
model <- aov(salary ~ rank, datasetSalaries)
```
```{r}
Anova(model, type = c("III"))
```
Traditionally, if the main of effect of a categorical IV was significant, we would perform a post-hoc test (e.g., Tukey's Honest Significant Difference [HSD]). In our case, `rank` is significant as the *p*-value of `2.2e-16` is less than our alpha of `0.05` and we would perform a post-hoc test to determine where the difference in salary actually comes from.
To perform Tukey's HSD, we could use the `TukeyHSD` function and provide the ANOVA analysis of the model as the input. The Tukey HSD essentially performs multiple independent samples *t*-tests of all possible pairs of the levels of a categorical IV but uses the error term from the ANOVA analysis in its calculation.
```{r}
TukeyHSD(model)
```
Given that this post-hoc test compares all pairs of levels without any *a priori* assumptions (similarly, for other post-hoc tests), is another reason to use the *a priori* orthogonal contrast coding scheme.
### GLM approach
Notice, that in the formula, we do not have to specify both contrasts because we have already applied the coding scheme directly to the categorical IV of `rank`.
```{r}
model <- lm(salary ~ 1 + rank, datasetSalaries)
```
```{r}
Anova(model, type = c("III"))
```
Note that in both analyses, the *F*-statistic of `128.22` with `2` between degrees of freedom (df), `394` within degrees of freedom (df), and *p*-value of `2.2e-16` are identical.
```{r}
summary(model)
```
The estimate for the `(Intercept)` represents the mean 9-month academic salary on average across `rank`.
The estimate for the `rankTenuredvAssistant` represents the mean difference in 9-month academic salary of tenured professors compared to assistant professors. Specifically, the 9-month academic salary of tenured professors was $29,548 more than assistant professors.
The estimate for the `rankAssociatevProfessor` represents the mean difference in 9-month academic salary of associate professors compared to assistant professors. Specifically, the 9-month academic salary of professors was $32,896 more than associate professors.
## Statistical decision
Given the *p*-value of `< 2.2e-16` for `rank` is less than the alpha level ($\alpha$) of 0.05, we will reject the null hypothesis.
Notice both slopes are also significant, and we will also reject the null hypothesis for each.
## APA statement
An Analysis of Variance (ANOVA) using *a priori* orthogonal contrast coding was performed to test if there was a difference in the 9-month academic salary between 1) tenured (i.e., associate professors and professors) compared to untenured professors (i.e., assistant professors) and 2) associate professors compared to professors. There was a significant main effect of rank on salary, *F*(2, 394) = 128.22, *p* < .001. Specific to our contrasts, tenured professors earned significantly more than untenured professors, *b* = $29,548, *t*(1, 394) = 8.892, *p* < .001. Professors also earned significantly more than associate professors, *b* = $32,896, *t*(1, 394) = 9.997, *p* < .001.
## Visualization
### Traditional
```{r fig-one-way-anova-traditional, warning = F, fig.cap="A dot plot of the 9-month academic salaries of professors by their rank within the university (i.e., assistant professor, associate professor, and professor). Respectively for each rank, the dot represents the mean salary and the bars represent the 95% CI. \nNote: The data points are only jittered (dispersed) for easier visualization of all data points."}
# calculate descriptive statistics along with the 95% CI
dataset_summary <- datasetSalaries %>%
group_by(rank) %>%
summarize(
mean = mean(salary),
sd = sd(salary),
n = n(),
sem = sd / sqrt(n),
tcrit = abs(qt(0.05 / 2, df = n - 1)),
ME = tcrit * sem,
LL95CI = mean - ME,
UL95CI = mean + ME
)
# plot
ggplot(datasetSalaries, aes(rank, salary)) +
geom_jitter(alpha = 0.1, width = 0.1) +
geom_errorbar(data = dataset_summary, aes(x = rank, y = mean, ymin = LL95CI, ymax = UL95CI), color = "#3182bd", width = 0.02) +
geom_point(dat = dataset_summary, aes(x = rank, y = mean), color = "#3182bd", size = 3) +
labs(
x = "Rank",
y = "9-Month Academic Salary (USD)"
) +
scale_y_continuous(labels = scales::dollar) +
theme_classic()
```
### Orthogonal Contrast Coding
```{r, warning = F, fig.cap="A dot plot of the 9-month academic salaries of professors by their rank within the university. Respectively for each rank, the dot represents the mean salary and the bars represent the 95% CI. Note: The data points are only jittered (dispersed) for easier visualization of all data points."}
datasetSalaries_long <- datasetSalaries %>%
mutate(professor = 1:nrow(.),
TenuredvAssistant = case_when(
rank == "Assistant Professor" ~ -2,
rank == "Associate Professor" ~ 1,
rank == "Professor" ~ 1),
AssociatevProfessor = case_when(
rank == "Assistant Professor" ~ 0,
rank == "Associate Professor" ~ -1,
rank == "Professor" ~ 1)
) %>%
select(professor, salary, TenuredvAssistant, AssociatevProfessor) %>%
reshape2::melt(., id.vars = c("professor", "salary")) %>%
filter(value != 0) %>%
rowwise() %>%
mutate(rank = case_when(
variable == "TenuredvAssistant" && value == -2 ~ "Assistant",
variable == "TenuredvAssistant" && value == 1 ~ "Tenured",
variable == "AssociatevProfessor" && value == 1 ~ "Professor",
variable == "AssociatevProfessor" && value == -1 ~ "Associate"
))
dataset_summary <- datasetSalaries_long %>%
group_by(variable, value) %>%
summarize(
mean = mean(salary),
sd = sd(salary),
n = n(),
sem = sd / sqrt(n),
tcrit = abs(qt(0.05 / 2, df = n - 1)),
ME = tcrit * sem,
LL95CI = mean - ME,
UL95CI = mean + ME
)
ggplot(datasetSalaries_long, aes(value, salary)) +
geom_jitter(alpha = 0.1, width = 0.3) +
geom_errorbar(data = dataset_summary, aes(x = value, y = mean, ymin = LL95CI, ymax = UL95CI), color = "#3182bd", width = 0.05) +
geom_line(data = dataset_summary, aes(x = value, y = mean)) +
geom_point(data = dataset_summary, aes(x = value, y = mean), color = "#3182bd", size = 3) +
geom_text(aes(label = rank, x = value, y = 0), size = 3) +
facet_wrap(~ variable) +
labs(
x = "Rank",
y = "9-Month Academic Salary (USD)"
) +
scale_y_continuous(labels = scales::dollar) +
scale_x_continuous(breaks = seq(-3, 3, 1), limits = c(-3,3)) +
theme_classic() +
theme(strip.background = element_blank())
```
```{r figure-one-way-anova-orthogonal-contrast, eval = F, echo = F}
dataset_summary_1 <- df_salaries %>%
group_by(TenuredvAssistant) %>%
summarize(
mean = mean(salary),
sd = sd(salary),
n = n(),
sem = sd / sqrt(n),
tcrit = abs(qt(0.05 / 2, df = n - 1)),
ME = tcrit * sem,
LL95CI = mean - ME,
UL95CI = mean + ME
)
dataset_summary_2 <- df_salaries %>%
filter(AssociatevProfessor != 0) %>%
group_by(AssociatevProfessor) %>%
summarize(
mean = mean(salary),
sd = sd(salary),
n = n(),
sem = sd / sqrt(n),
tcrit = abs(qt(0.05 / 2, df = n - 1)),
ME = tcrit * sem,
LL95CI = mean - ME,
UL95CI = mean + ME
)
mean_of_means_1 <- mean(dataset_summary_1$mean)
mean_of_means_2 <- mean(dataset_summary_2$mean)
ggplot(df_salaries, aes(TenuredvAssistant, salary)) +
geom_jitter(alpha = 0.1, width = 0.2) +
geom_errorbar(data = dataset_summary_1, aes(x = TenuredvAssistant, y = mean, ymin = LL95CI, ymax = UL95CI), color = "#3182bd", width = 0.05) +
geom_point(data = dataset_summary_1, aes(x = TenuredvAssistant, y = mean), color = "#3182bd", size = 3) +
geom_line(data = dataset_summary_1, aes(x = TenuredvAssistant, y = mean), color = "#3182bd") +
labs(
x = "Rank (Tenured vs Assistant)",
y = "9-Month Academic Salary (USD)"
) +
scale_y_continuous(labels = scales::dollar) +
scale_x_continuous(breaks = seq(-3, 3, 1), limits = c(-3,3)) +
theme_classic() +
annotate(geom = "text", x = -2, y = 0, label = "Assistant") +
annotate(geom = "text", x = 1, y = 0, label = "Tenured")
figure_associate_vs_prof <- datasetSalaries %>%
mutate(AssociatevProfessor = case_when(
rank == "Assistant Professor" ~ 0,
rank == "Associate Professor" ~ -1,
rank == "Professor" ~ 1)
) %>%
filter(AssociatevProfessor != 0) %>%
ggplot(., aes(AssociatevProfessor, salary)) +
geom_jitter(alpha = 0.1, width = 0.2) +
geom_errorbar(data = dataset_summary, aes(x = AssociatevProfessor, y = mean, ymin = LL95CI, ymax = UL95CI), color = "#3182bd", width = 0.05) +
geom_point(data = dataset_summary, aes(x = AssociatevProfessor, y = mean), color = "#3182bd", size = 3) +
geom_line(data = dataset_summary, aes(x = AssociatevProfessor, y = mean), color = "#3182bd") +
labs(
x = "Rank (Associate vs. Full)",
y = "9-Month Academic Salary (USD)"
) +
scale_y_continuous(labels = scales::dollar) +
scale_x_continuous(breaks = seq(-3, 3, 1), limits = c(-3,3)) +
theme_classic() +
annotate(geom = "text", x = -1, y = 0, label = "Associate") +
annotate(geom = "text", x = 1, y = 0, label = "Professor")
gridExtra::grid.arrange(figure_tenured_vs_assistant, figure_associate_vs_prof, ncol = 2)
```