-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path04-dplyr-tutorial.Rmd
387 lines (263 loc) · 12.8 KB
/
04-dplyr-tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
# Brief Introduction to `dplyr`
```{r, echo=FALSE}
library(knitr)
knitr::opts_chunk$set(tidy.opts = list(width.cutoff = 40), tidy = TRUE)
```
The learning curve for R syntax is slow. One of the more difficult aspects that requires some getting used to is subsetting data tables. The `dplyr` package brings these tasks closer to English. The Rmd file for this section is available [here](https://github.com/gurinina/omic_sciences/blob/main/03-dplyr-tutorial.Rmd).
## What is dplyr?
dplyr is a powerful R-package to transform and summarize tabular data with rows and columns. For another explanation of dplyr see the dplyr package vignette: [Introduction to dplyr](http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html)
## Why Is It Useful?
The package contains a set of functions (or "verbs") that perform common data manipulation operations such as filtering for rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data.
In addition, dplyr contains a useful function to perform another common task which is the "split-apply-combine" concept. We will discuss that in a little bit.
## How Does It Compare To Using Base Functions R?
If you are familiar with R, you are probably familiar with base R functions such as split(), subset(), apply(), sapply(), lapply(), tapply() and aggregate(). Compared to base functions in R, the functions in dplyr are easier to work with, are more consistent in the syntax and are targeted for data analysis around data frames, instead of just vectors.
## How Do I Get dplyr?
To install dplyr:
```{r, eval=FALSE}
install.packages("dplyr")
```
To load dplyr:
```{r, message=FALSE}
library(dplyr)
```
```{r,echo=FALSE}
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv"
filename <- "femaleMiceWeights.csv"
library(downloader)
if (!file.exists(filename)) download(url, filename)
```
Take a look at the dataset we read in:
```{r}
filename <- "femaleMiceWeights.csv"
dat <- read.csv(filename) ## previously downloaded
head(dat) #In R Studio use View(dat)
```
There are two types of diets, which are denoted in the first column. If we want just the weights, we only need the second column. So if we want the weights for mice on the `chow` diet, we subset and filter like this:
```{r,message=FALSE}
chow <- filter(dat, Diet=="chow") #keep only the ones with chow diet
head(chow)
```
And now we can select only the column with the values:
```{r}
chowVals <- select(chow,Bodyweight)
head(chowVals)
```
A nice feature of the `dplyr` package is that you can perform consecutive tasks by using what is called a "pipe". In `dplyr` we use `%>%` to denote a pipe. This symbol tells the program to first do one thing and then do something else to the result of the first.
## Pipe Operator: %>%
dplyr imports this operator from another package (magrittr).This operator allows you to pipe the output from one function to the input of another function. Instead of nesting functions (reading from the inside to the outside), the idea of piping is to read the functions from left to right. Hence, we can perform several data manipulations in one line. For example:
```{r}
chowVals <- filter(dat, Diet=="chow") %>% select(Bodyweight)
```
Or even more simply:
```{r}
chowVals <- dat %>% filter(Diet=="chow") %>% select(Bodyweight)
```
Now in this case, we will pipe the dat data frame to the function that will filter one column, then select one column and then pipe the new data frame to the function `head()`, which will return the head of the new data frame.
```{r}
dat %>% filter(Diet=="chow") %>% select(Bodyweight) %>% head
```
You will soon see how useful the pipe operator is when we start to combine many functions.
```{r}
chowVals <- dat %>% filter(Diet=="chow") %>% select(Bodyweight)
```
In the second and every task thereafter, we no longer have to specify the object we are editing since it is whatever comes from the previous call.
Also, note that if `dplyr` receives a `data.frame` it will return a `data.frame`.
```{r}
class(dat)
class(chowVals)
```
For pedagogical reasons, we will often want the final result to be a simple `numeric` vector. To obtain such a vector with `dplyr`, we can apply the `unlist` function which turns `lists`, such as `data.frames`, into `numeric` vectors:
```{r}
chowVals <- filter(dat, Diet=="chow") %>% select(Bodyweight) %>% unlist
chowVals
class( chowVals )
```
To do this in R without `dplyr` the code is the following:
```{r}
chowVals <- dat[ dat$Diet=="chow", colnames(dat)=="Bodyweight"]
```
## Data: Mammals Sleep
The msleep (mammals sleep) data set contains the sleep times and weights for a set of mammals and is available in the dagdata repository on github. This data set contains 83 rows and 11 variables.
Download the msleep data set in CSV format from [here](https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv), and then load into R:
```{r}
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv"
msleep <- read.csv(url)
head(msleep)
```
The columns (in order) correspond to the following:
column name | Description
--- | ---
name | common name
genus | taxonomic rank
vore | carnivore, omnivore or herbivore?
order | taxonomic order
conservation | the conservation status of the mammal
sleep\_total | total amount of sleep, in hours
sleep\_rem | rem sleep, in hours
sleep\_cycle | length of sleep cycle, in hours
awake | amount of time spent awake, in hours
brainwt | brain weight in kilograms
bodywt | body weight in kilograms
## Important dplyr Verbs To Remember
dplyr verbs | Description
--- | ---
`select()` | select columns
`filter()` | filter rows
`arrange()` | re-order or arrange rows
`mutate()` | create new columns
`summarise()` | summarize values
`group_by()` | allows for group operations in the "split-apply-combine" concept
## dplyr Verbs In Action
The two most basic functions are `select()` and `filter()`, which selects columns and filters rows respectively.
## Selecting Columns Using `select()`
Select a set of columns: the name and the sleep\_total columns.
```{r}
sleepData <- select(msleep, name, sleep_total)
sleepData <- msleep %>% select(name,sleep_total)
head(sleepData)
```
To select all the columns *except* a specific column, use the "-" (subtraction) operator (also known as negative indexing):
```{r}
head(select(msleep, -name))
```
To select a range of columns by name, use the ":" (colon) operator:
```{r}
head(select(msleep, name:order))
```
To select all columns that start with the character string "sl", use the function `starts_with()`:
```{r}
head(select(msleep, starts_with("sl")))
```
Some additional options to select columns based on a specific criteria include:
1. `ends_with()` = Select columns that end with a character string
2. `contains()` = Select columns that contain a character string
3. `matches()` = Select columns that match a regular expression
4. `one_of()` = Select column names that are from a group of names
## Selecting Rows Using `filter()`
Filter the rows for mammals that sleep a total of more than 16 hours.
```{r}
msleep %>% filter(sleep_total >= 16)%>% select(name,sleep_total,bodywt) %>% head
```
Filter the rows for mammals that sleep a total of more than 16 hours *and* have a body weight of greater than 1 kilogram.
```{r}
msleep %>% filter(sleep_total >= 16,bodywt >= 1)%>% select(name,sleep_total,bodywt) %>% head
```
You can also filter with AND/OR...
```{r}
msleep %>% filter(sleep_total >= 16 | bodywt >= 1)%>% select(name,sleep_total,bodywt) %>% head
```
Filter the rows for mammals in the Perissodactyla and Primates taxonomic order:
```{r}
msleep %>% filter(order %in% c("Perissodactyla", "Primates"))%>% select(name,genus,order) %>% head
```
**Between, And, Not equal**
Slightly more complicated
Filter the animals that sleep between 2 and 19 hours, and they are NOT domesticated
```{r}
msleep_2_19_notdom <- msleep %>%
filter(between(sleep_total, 2, 19) & conservation != "domesticated")
```
Check if your filter is filtering NA for the variable conservation. If not, filter them.
```{r}
msleep_2_19_notdom_NA <- msleep %>%
filter(between(sleep_total, 2, 19) & conservation != "domesticated" | is.na(conservation))
```
Can anyone think of another way of accomplishing the same thing as between?
You can use the boolean operators (e.g. >, <, >=, <=, !=, %in%) to create the logical tests.
## Arrange Or Re-order Rows Using `arrange()`
To arrange (or re-order) rows by a particular column, such as the taxonomic order, list the name of the column you want to arrange the rows by:
```{r}
msleep %>% arrange(order) %>% select(name,genus,order) %>% head
```
Now we will select three columns from msleep, arrange the rows by the taxonomic order and then arrange the rows by sleep\_total. Finally, show the head of the final data frame:
```{r}
msleep %>%
select(name, order, sleep_total) %>%
arrange(order, sleep_total) %>%
head
```
Same as above, except this time arrange by sleep_total first:
```{r}
msleep %>%
select(name, order, sleep_total) %>%
arrange(sleep_total,order) %>%
head
```
Same as above, except here we filter the rows for mammals that sleep for 16 or more hours, instead of showing the head of the final data frame:
```{r}
msleep %>%
select(name, order, sleep_total) %>%
arrange(order, sleep_total) %>%
filter(sleep_total >= 16)
```
Something slightly more complicated: same as above, except arrange the rows in the sleep\_total column in a descending order. For this, use the function `desc()`
```{r}
msleep %>%
select(name, order, sleep_total) %>%
arrange(order, desc(sleep_total)) %>%
filter(sleep_total >= 16)
```
## Create New Columns Using `mutate()`
The `mutate()` function will add new columns to the data frame. Create a new column called rem_proportion, which is the ratio of rem sleep to total amount of sleep.
```{r}
msleep %>%
mutate(rem_proportion = sleep_rem / sleep_total) %>% select(rem_proportion) %>% head
```
You can many new columns using mutate (separated by commas). Here we add a second column called bodywt_grams which is the bodywt column in grams.
```{r}
msleep %>%
mutate(rem_proportion = sleep_rem / sleep_total,
bodywt_grams = bodywt * 1000) %>%
select(rem_proportion,bodywt_grams) %>% head()
```
## Create summaries of the data frame using `summarise()`
The `summarise()` function will create summary statistics for a given column in the data frame such as finding the mean. For example, to compute the average number of hours of sleep, apply the `mean()` function to the column sleep\_total and call the summary value avg\_sleep.
```{r}
msleep %>%
summarise(avg_sleep = mean(sleep_total))
```
Summarise is different because it doesn't alter your data.frame.
There are many other summary statistics you could consider such `sd()`, `min()`, `max()`, `median()`, `sum()`, `n()` (returns the length of vector), `first()` (returns first value in vector), `last()` (returns last value in vector) and `n_distinct()` (number of distinct values in vector).
```{r}
msleep %>%
summarise(avg_sleep = mean(sleep_total),
min_sleep = min(sleep_total),
max_sleep = max(sleep_total),
total = n())
```
For the summaries, you also need to be wary of the NAs, here you can use na.rm = TRUE. e.g. if instead of sleep_total you were looking at sleep_rem, which has NA values...
```{r}
msleep %>%
summarise(avg_sleep = mean(sleep_rem),
min_sleep = min(sleep_rem),
max_sleep = max(sleep_rem),
total = n())
```
## Group operations using `group_by()`
The `group_by()` verb is an important function in dplyr. As we mentioned before it's related to concept of "split-apply-combine". We literally want to split the data frame by some variable (e.g. taxonomic order), apply a function to the individual data frames and then combine the output.
Let's do that: split the msleep data frame by the taxonomic order, then ask for the same summary statistics as above. We expect a set of summary statistics for each taxonomic order.
```{r}
m = msleep %>%
group_by(order) %>%
summarise(avg_sleep = mean(sleep_total),
min_sleep = min(sleep_total),
max_sleep = max(sleep_total),
total = n())
```
What are the dimensions of m? Why? What does the total column refer to?
```{r}
m
```
Note that if you wanted to keep all of the data intact, you would simply use mutate and add the `ungroup` function at the end:
```{r}
msleep %>%
group_by(order) %>%
mutate(avg_sleep = mean(sleep_total),
min_sleep = min(sleep_total),
max_sleep = max(sleep_total),
total = n()) %>% ungroup
```
Finally, there is glimpse for a quick summary:
```{r}
msleep %>% glimpse
```