-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathplyr-notes.Rmd
240 lines (172 loc) · 7.22 KB
/
plyr-notes.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
# plyr: Split-Apply-Combine for Mortals
An introduction to the plyr R package.
`plyr` is an `R` package that makes it simple to split data apart, do
stuff to it, and mash it back together. This is a common
data-manipulation step. Importantly, `plyr` makes it easy to control the
input and output data format with a consistent syntax.
Or, from the documentation:
"`plyr` is a set of tools that solves a common set of problems: you need
to break a big problem down into manageable pieces, operate on each
piece and then put all the pieces back together. It's already possible
to do this with split and the apply functions, but `plyr` just makes it
all a bit easier..."
This is a very quick introduction to `plyr`. For more details see Hadley
Wickham's introductory guide [The split-apply-combine strategy for data
analysis](http://www.jstatsoft.org/v40/i01). There's quite a bit of discussion
online in general, and especially on
[stackoverflow.com](http://stackoverflow.com/questions/tagged/plyr).
## Why use `apply` functions instead of `for` loops?
1. The code is cleaner (once you're familiar with the concept). The
code can be easier to code and read, and less error prone because you
don't have to deal with subsetting and you don't have to deal with saving
your results.
2. `apply` functions can be faster than `for` loops, sometimes dramatically.
## `plyr` basics
`plyr` builds on the built-in `apply` functions by giving you control
over the input and output formats and keeping the syntax consistent
across all variations. It also adds some niceties like error processing,
parallel processing, and progress bars.
The basic format is two letters followed by `ply()`. The first letter
refers to the format in and the second to the format out.
The three main letters are:
1. `d` = data frame
2. `a` = array (includes matrices)
3. `l` = list
So, `ddply` means: take a data frame, split it up, do something to it,
and return a data frame. I find I use this the majority of the time
since I often work with data frames. `ldply` means: take a list, split
it up, do something to it, and return a data frame. This extends to all
combinations. In the following table, the columns are the input formats
and the rows are the output format:
object type |data frame |list |array
------------|------------|---------|--------
data frame | `ddply` | `ldply` | `adply`
list | `dlply` | `llply` | `alply`
array | `daply` | `laply` | `aaply`
I've ignored some less common format options:
1. `m` = multi-argument function input
2. `r` = replicate a function `n` times.
3. `_` = throw away the output
For plotting, you might find the underscore (`_`) option useful. It will
do something with the data (say add line segments to a plot) and then
throw away the output (e.g., `d_ply()`).
## Base R `apply` functions and `plyr`
`plyr` provides a consistent and easy-to-work-with format for `apply`
functions with control over the input and output formats. Some of the
functionality can be duplicated with base `R` functions (but with less
consistent syntax). Also, few `R` `apply` functions work directly with
data frames as input and output and data frames are a common object
class to work with.
Base `R` `apply` functions (from a presentation given by Hadley Wickham):
object type | array | data frame | list | nothing
-------------------|-------------|-------------|-------------|--------
array | `apply` | . | . | .
data frame | . | `aggregate` | `by` | .
list | `sapply` | . | `lapply` | .
n replicates | `replicate` | . | `replicate` | .
function arguments | `mapply` | . | `mapply` | .
## A general example with `plyr`
Let's take a simple example. We'll take a data frame, split it up by
`year`, calculate the coefficient of variation of the `count`, and
return a data frame. This could easily be done on one line, but I'm
expanding it here to show the format a more complex function could take.
```{r}
set.seed(1)
d <- data.frame(year = rep(2000:2002, each = 3),
count = round(runif(9, 0, 20)))
print(d)
```
```{r}
library(plyr)
ddply(d, "year", function(x) {
mean.count <- mean(x$count)
sd.count <- sd(x$count)
cv <- sd.count/mean.count
data.frame(cv.count = cv)
})
```
## `transform` and `summarise`
It is often convenient to use these functions within one of the `**ply`
functions. `transform` acts as it would normally as the base `R` function and
modifies an existing data frame. `summarise` creates a new condensed data
frame.
```{r}
ddply(d, "year", summarise, mean.count = mean(count))
```
```{r}
ddply(d, "year", transform, total.count = sum(count))
```
Bonus function: `mutate`. `mutate` works like `transform` but lets you
build on columns.
```{r}
ddply(d, "year", mutate, mu = mean(count), sigma = sd(count),
cv = sigma/mu)
```
## Plotting with `plyr`
You can use `plyr` to plot data by throwing away the output with an
underscore (`_`). This is a bit cleaner than a for loop since you don't
have to subset the data manually.
```{r d_ply_plot, fig.width=6, fig.height=2.5, fig.cap=""}
par(mfrow = c(1, 3), mar = c(2, 2, 1, 1), oma = c(3, 3, 0, 0))
d_ply(d, "year", transform, plot(count, main = unique(year), type = "o"))
mtext("count", side = 1, outer = TRUE, line = 1)
mtext("frequency", side = 2, outer = TRUE, line = 1)
```
## Nested chunking of the data
The basic syntax can be easily extended to break apart the data based on
multiple columns:
```{r}
baseball.dat <- subset(baseball, year > 2000) # data from the plyr package
x <- ddply(baseball.dat, c("year", "team"), summarize,
homeruns = sum(hr))
head(x)
```
## Other useful options
### Dealing with errors
You can use the `failwith` function to control how errors are dealt
with.
```{r}
f <- function(x) if (x == 1) stop("Error!") else 1
safe.f <- failwith(NA, f, quiet = TRUE)
# llply(1:2, f)
llply(1:2, safe.f)
```
### Parallel processing
In conjunction with a package such as `doParallel` you can run your function
separately on each core of your computer. On a dual core machine this make
your code up to twice as fast. Simply register the cores and then set
`.parallel = TRUE`. Look at the `elapsed` time in these examples:
```{r}
x <- c(1:10)
wait <- function(i) Sys.sleep(0.1)
system.time(llply(x, wait))
```
```{r}
system.time(sapply(x, wait))
```
```{r}
library(doParallel)
registerDoParallel(cores = 2)
system.time(llply(x, wait, .parallel = TRUE))
```
## So, why would I *not* want to use `plyr`?
`plyr` can be slow --- particularly if you are working with very large
datasets that involve a lot of subsetting. Hadley is working on this and
an in-development version of `plyr`, `dplyr`, can run much faster (<https://github.com/hadley/dplyr>).
However, it's important to remember that typically the speed that you
can write code and understand it later is the rate-limiting step.
A couple faster options:
Use a base `R` `apply` function:
```{r}
system.time(ddply(baseball, "id", summarize, length(year)))
```
```{r}
system.time(tapply(baseball$year, baseball$id,
function(x) length(x)))
```
Use the `data.table` package:
```{r}
library(data.table)
dt <- data.table(baseball, key = "id")
system.time(dt[, length(year), by = list(id)])
```