# Datasets
Below are two real, open-access datasets that we will use for the analyses. For each dataset, you will find a description, some minor data wrangling (data manipulation), and descriptive statistics in both numeric and visual form.
## Salaries Dataset
```{r}
datasetSalaries <- carData::Salaries
```
One of the datasets that we'll use is the `Salaries` dataset from the `carData` package. The dataset contains the nine-month salaries of 397 college professors in the U.S., collected during 2008 to 2009. In addition to salary, each professor's rank, sex, discipline, years since Ph.D., and years of service were also recorded. Thus, there are six variables in total, which are described below.
```{r, echo=F}
kable(as.data.frame(rbind(
  c("Variable" = "rank", "Variable Type" = "Categorical", "Description" = "Professor's rank of either assistant professor, associate professor, or professor"),
  c("Variable" = "discipline", "Variable Type" = "Categorical", "Description" = "Type of department the professor works in, either applied or theoretical"),
  c("Variable" = "yrs.since.phd", "Variable Type" = "Continuous", "Description" = "Number of years since the professor has obtained their PhD"),
  c("Variable" = "yrs.service", "Variable Type" = "Continuous", "Description" = "Number of years the professor has served the department and/or university"),
  c("Variable" = "sex", "Variable Type" = "Categorical", "Description" = "Professor's sex of either male or female"),
  c("Variable" = "salary", "Variable Type" = "Continuous", "Description" = "Professor's nine-month salary (USD)")
)))
```
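Before wrangling, it can help to confirm that the data match the description above. The following is a minimal sketch using base R only; `salariesRaw` is a name introduced here for illustration and is not used elsewhere in the book.

```{r}
# load a fresh copy so this check does not depend on earlier chunks
salariesRaw <- carData::Salaries

# number of rows (professors) and columns (variables)
dim(salariesRaw)

# variable names and their storage types
str(salariesRaw)

# which variables are stored as factors (i.e., categorical)?
names(salariesRaw)[sapply(salariesRaw, is.factor)]
```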
### Data Wrangling
Before running any analyses within the GLM context, let's clean up the dataset so that its labels are easier for us, rather than computers, to read.
```{r}
# spell out the rank levels
# rename the discipline levels to their meaningful names
# ensure both rank and discipline are factors
datasetSalaries <- datasetSalaries %>%
  mutate(
    rank = case_when(
      rank == "AssocProf" ~ "Associate Professor",
      rank == "AsstProf" ~ "Assistant Professor",
      rank == "Prof" ~ "Professor"
    ),
    discipline = case_when(
      discipline == "A" ~ "Theoretical",
      discipline == "B" ~ "Applied"
    )
  ) %>%
  mutate(
    rank = as.factor(rank),
    discipline = as.factor(discipline)
  )
```
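Because `case_when()` returns `NA` for any value that matches none of its conditions, a cross-tabulation of the original codes against the new labels can confirm that the recoding behaved as intended. This is only a sketch, assuming `dplyr` is installed; `salariesCheck`, `rankLabel`, and `disciplineLabel` are hypothetical names used for this check alone.

```{r}
library(dplyr)

# recode into new columns so the original codes stay available for comparison
salariesCheck <- carData::Salaries %>%
  mutate(
    rankLabel = case_when(
      rank == "AssocProf" ~ "Associate Professor",
      rank == "AsstProf" ~ "Assistant Professor",
      rank == "Prof" ~ "Professor"
    ),
    disciplineLabel = case_when(
      discipline == "A" ~ "Theoretical",
      discipline == "B" ~ "Applied"
    )
  )

# each original code should map onto exactly one new label
table(salariesCheck$rank, salariesCheck$rankLabel, useNA = "ifany")
table(salariesCheck$discipline, salariesCheck$disciplineLabel, useNA = "ifany")

# no value should have fallen through to NA
anyNA(salariesCheck$rankLabel)
anyNA(salariesCheck$disciplineLabel)
```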
### Descriptive Statistics
It's always a good idea to examine the data both numerically and visually. Let's first look at the categorical variables, then the continuous variables.
#### Categorical Variables
```{r}
summary(datasetSalaries$rank)
```
To visualize our data, we will use the `ggplot()` function. We won't go into detail about plotting, since data visualization is outside the scope of this book. If you are interested, see Wickham and Grolemund's <a href="https://r4ds.had.co.nz/data-visualisation.html" target="_blank">data visualization chapter</a> to learn more about `ggplot()`.
```{r figure-rank}
ggplot(data = datasetSalaries, mapping = aes(x = rank)) +
  # after_stat() replaces stat(), which is deprecated as of ggplot2 3.4.0
  geom_bar(aes(y = after_stat(count / sum(count))), color = "black", fill = "#3182bd") +
  theme_classic() +
  labs(y = "Proportion", x = "", title = "rank")
```
In this dataset, there are about twice as many full professors (266) as assistant and associate professors combined (131).
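The proportions shown in the bar chart can also be computed numerically. The sketch below works from the original `carData::Salaries` level names (`AsstProf`, `AssocProf`, `Prof`) so that it runs on its own; `rankCounts` and `rankProportions` are illustrative names.

```{r}
# counts per rank, using the original level names from carData
rankCounts <- table(carData::Salaries$rank)
rankCounts

# convert counts to proportions that sum to 1
rankProportions <- prop.table(rankCounts)
round(rankProportions, 2)

# full professors relative to the two other ranks combined
rankCounts[["Prof"]] / (rankCounts[["AsstProf"]] + rankCounts[["AssocProf"]])
```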
```{r}
summary(datasetSalaries$discipline)
```
```{r figure-discipline}
ggplot(data = datasetSalaries, mapping = aes(x = discipline)) +
  geom_bar(aes(y = after_stat(count / sum(count))), color = "black", fill = "#3182bd") +
  theme_classic() +
  labs(y = "Proportion", x = "", title = "discipline")
```
There are slightly more professors in the applied discipline than in the theoretical discipline (35 more).
```{r}
summary(datasetSalaries$sex)
```
```{r figure-sex}
ggplot(data = datasetSalaries, mapping = aes(x = sex)) +
  geom_bar(aes(y = after_stat(count / sum(count))), color = "black", fill = "#3182bd") +
  theme_classic() +
  labs(y = "Proportion", x = "", title = "sex")
```
There are a little over nine times as many male professors as female professors.
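The comparisons quoted for discipline and sex can be reproduced directly from the counts. This is a small sketch on the untouched `carData::Salaries` data, where discipline is still coded `"A"` (theoretical) and `"B"` (applied); the object names are illustrative.

```{r}
disciplineCounts <- table(carData::Salaries$discipline)
sexCounts <- table(carData::Salaries$sex)

# how many more professors are in applied ("B") than theoretical ("A") departments
disciplineCounts[["B"]] - disciplineCounts[["A"]]

# ratio of male to female professors
sexCounts[["Male"]] / sexCounts[["Female"]]
```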
#### Continuous Variables
```{r}
datasetSalaries %>%
  select(yrs.since.phd, yrs.service, salary) %>%
  describe(.) %>%
  select(-n, -vars, -trimmed, -mad, -range) %>%
  round(., 2)
```
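If the package that provides `describe()` is not available (we assume here that it is `psych::describe()`), a comparable table can be put together with base R. This is a sketch rather than a drop-in replacement, and it reports only a handful of statistics; `salariesContinuous` and `descriptives` are illustrative names.

```{r}
salariesContinuous <- carData::Salaries[, c("yrs.since.phd", "yrs.service", "salary")]

# one row per variable, one column per statistic
descriptives <- t(sapply(salariesContinuous, function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), min = min(x), max = max(x))
}))
round(descriptives, 2)
```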
```{r figure-yrs-since-phd}
ggplot(data = datasetSalaries, mapping = aes(x = yrs.since.phd)) +
  geom_vline(
    xintercept = mean(datasetSalaries$yrs.since.phd),
    alpha = .5,
    linetype = "dashed"
  ) +
  geom_dotplot(binwidth = 1, fill = "#3182bd") +
  # annotate() draws the label once; geom_text(aes(...)) would redraw it for every row
  annotate(
    "text",
    x = mean(datasetSalaries$yrs.since.phd),
    y = -.05,
    label = paste("M =", round(mean(datasetSalaries$yrs.since.phd), 2))
  ) +
  theme_classic() +
  scale_y_continuous(NULL, breaks = NULL) +
  labs(x = "", y = "Frequency", title = "Years since PhD")
```
On average, professors have had their Ph.D. for about 22 years.
```{r figure-yrs-service}
ggplot(data = datasetSalaries, mapping = aes(x = yrs.service)) +
  geom_vline(
    xintercept = mean(datasetSalaries$yrs.service),
    alpha = .5,
    linetype = "dashed"
  ) +
  geom_dotplot(binwidth = 1, fill = "#3182bd") +
  # annotate() draws the label once rather than once per row
  annotate(
    "text",
    x = mean(datasetSalaries$yrs.service),
    y = -.05,
    label = paste("M =", round(mean(datasetSalaries$yrs.service), 2))
  ) +
  theme_classic() +
  scale_y_continuous(NULL, breaks = NULL) +
  labs(x = "", y = "Frequency", title = "Years of service")
```
On average, professors have served their department or university for about 17 years and 7 months.
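The conversion from a decimal number of years to years and months is simple arithmetic: keep the whole years and multiply the fractional part by 12. A minimal sketch, with illustrative variable names:

```{r}
meanService <- mean(carData::Salaries$yrs.service)

# whole years, then the fractional part expressed in months
wholeYears <- floor(meanService)
extraMonths <- round((meanService - wholeYears) * 12)

paste(wholeYears, "years and", extraMonths, "months")
```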
```{r figure-salary}
ggplot(data = datasetSalaries, mapping = aes(x = salary)) +
  geom_vline(
    xintercept = mean(datasetSalaries$salary),
    alpha = .5,
    linetype = "dashed"
  ) +
  geom_dotplot(binwidth = 3000, fill = "#3182bd") +
  # annotate() draws the label once rather than once per row
  annotate(
    "text",
    x = mean(datasetSalaries$salary),
    y = -.05,
    label = paste("M =", scales::dollar(round(mean(datasetSalaries$salary), 2)))
  ) +
  theme_classic() +
  scale_y_continuous(NULL, breaks = NULL) +
  scale_x_continuous(labels = scales::dollar) +
  labs(x = "", y = "Frequency", title = "salary")
```
On average, a professor's nine-month salary is $113,706.46.
## Anorexia Dataset
```{r}
datasetAnorexia <- MASS::anorexia
```
Another dataset that we'll use is the `anorexia` dataset from the `MASS` package. The dataset contains the weights (in lbs) of 72 female patients with anorexia before and after one of three conditions: cognitive behavioral therapy, family therapy, or no therapy (control condition).
```{r, echo=F}
kable(as.data.frame(rbind(
  c("Variable" = "Treatment", "Variable Type" = "Categorical", "Description" = "Treatment of female patient with anorexia, either cognitive behavioral therapy (CBT), family therapy (FT), or no therapy (Control)"),
  c("Variable" = "PreWeight", "Variable Type" = "Continuous", "Description" = "Weight of female patient with anorexia before treatment in lbs."),
  c("Variable" = "PostWeight", "Variable Type" = "Continuous", "Description" = "Weight of female patient with anorexia after treatment in lbs.")
)))
```
### Data Wrangling
Again, we can make the dataset slightly more readable for us.
```{r}
# spell out variable names (the original Treat, Prewt, and Postwt columns are kept)
# relabel "Cont" as "Control"
# re-order levels to be CBT, FT, then Control rather than alphabetical order
datasetAnorexia <- datasetAnorexia %>%
  mutate(
    Treatment = Treat,
    PreWeight = Prewt,
    PostWeight = Postwt,
    Treatment = case_when(
      Treatment == "Cont" ~ "Control",
      TRUE ~ as.character(Treatment)
    ),
    Treatment = factor(Treatment, levels = c("CBT", "FT", "Control"))
  )
```
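Passing `levels` to `factor()` sets the order in which the groups appear in tables and plots and, under R's default treatment coding, determines which level is first and so acts as the reference. The sketch below reproduces the relabelling in base R on a fresh copy of the `Treat` column to show the level order before and after; `treatRaw` and `treatRecoded` are illustrative names.

```{r}
treatRaw <- MASS::anorexia$Treat

# original levels are in alphabetical order
levels(treatRaw)

# relabel "Cont", then impose the desired level order
treatRecoded <- as.character(treatRaw)
treatRecoded[treatRecoded == "Cont"] <- "Control"
treatRecoded <- factor(treatRecoded, levels = c("CBT", "FT", "Control"))
levels(treatRecoded)

# group sizes should be unchanged by the relabelling
table(treatRaw)
table(treatRecoded)
```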
### Descriptive Statistics
#### Categorical Variables
```{r}
summary(datasetAnorexia$Treatment)
```
```{r figure-treatment}
ggplot(data = datasetAnorexia, mapping = aes(x = Treatment)) +
  geom_bar(aes(y = after_stat(count / sum(count))), color = "black", fill = "#3182bd") +
  theme_classic() +
  labs(y = "Proportion", x = "", title = "Treatment")
```
There are more participants in the CBT condition than in either the FT condition or the control group.
#### Continuous Variables
```{r}
datasetAnorexia %>%
  select(PreWeight, PostWeight) %>%
  describe(.) %>%
  select(-vars, -trimmed, -mad, -range) %>%
  round(., 2)
```
```{r figure-preweight}
ggplot(data = datasetAnorexia, mapping = aes(x = PreWeight)) +
  geom_vline(
    xintercept = mean(datasetAnorexia$PreWeight),
    alpha = .5,
    linetype = "dashed"
  ) +
  geom_dotplot(binwidth = 1, fill = "#3182bd") +
  # annotate() draws the label once rather than once per row
  annotate(
    "text",
    x = mean(datasetAnorexia$PreWeight),
    y = -.05,
    label = paste("M =", round(mean(datasetAnorexia$PreWeight), 2))
  ) +
  theme_classic() +
  scale_y_continuous(NULL, breaks = NULL) +
  labs(x = "", y = "Frequency", title = "Weight Before Study (lbs)")
```
Before treatment, the patients weighed 82.41 lbs on average.
```{r figure-postweight}
ggplot(data = datasetAnorexia, mapping = aes(x = PostWeight)) +
  geom_vline(
    xintercept = mean(datasetAnorexia$PostWeight),
    alpha = .5,
    linetype = "dashed"
  ) +
  geom_dotplot(binwidth = 1, fill = "#3182bd") +
  # annotate() draws the label once rather than once per row
  annotate(
    "text",
    x = mean(datasetAnorexia$PostWeight),
    y = -.05,
    label = paste("M =", round(mean(datasetAnorexia$PostWeight), 2))
  ) +
  theme_classic() +
  scale_y_continuous(NULL, breaks = NULL) +
  labs(x = "", y = "Frequency", title = "Weight After Study (lbs)")
```
After treatment, the patients weighed 85.17 lbs on average, a mean gain of about 2.76 lbs compared to before treatment. Note that this average pools all three conditions, including the control group.
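The overall gain is the difference between the two means, which equals the mean of each patient's individual change. The sketch below computes it from the original `MASS::anorexia` columns (`Prewt`, `Postwt`, `Treat`) and then breaks the change down by condition; `anorexiaRaw` and `weightChange` are illustrative names.

```{r}
anorexiaRaw <- MASS::anorexia

# per-patient change in weight (lbs); positive values are gains
weightChange <- anorexiaRaw$Postwt - anorexiaRaw$Prewt

# the mean change equals the difference between the two means
mean(weightChange)
mean(anorexiaRaw$Postwt) - mean(anorexiaRaw$Prewt)

# mean change within each condition
tapply(weightChange, anorexiaRaw$Treat, mean)
```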
## Use Your Own Dataset
If you have an interesting dataset of your own, we encourage you to try using it alongside ours.