
# plyr: Split-Apply-Combine for Mortals

An introduction to the plyr R package.

`plyr` is an `R` package that makes it simple to split data apart, do
stuff to it, and mash it back together. This is a common
data-manipulation step. Importantly, `plyr` makes it easy to control the
input and output data format with a consistent syntax.

Or, from the documentation:

"`plyr` is a set of tools that solves a common set of problems: you need
to break a big problem down into manageable pieces, operate on each
piece and then put all the pieces back together. It's already possible
to do this with split and the apply functions, but `plyr` just makes it
all a bit easier..."

This is a very quick introduction to `plyr`. For more details see Hadley
Wickham's introductory guide [The split-apply-combine strategy for data
analysis](http://www.jstatsoft.org/v40/i01). There's quite a bit of discussion
online in general, and especially on
[stackoverflow.com](http://stackoverflow.com/questions/tagged/plyr).

## Why use `apply` functions instead of `for` loops?

1.  The code is cleaner (once you're familiar with the concept). It can be
    easier to write and read, and less error prone, because you don't have to
    subset the data or store the results by hand.

2.  `apply` functions can be faster than `for` loops, sometimes dramatically,
    particularly compared with loops that grow their result one element at a
    time.
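
As a minimal sketch of the first point, the two snippets below compute the
same per-group means, once with a `for` loop and once with `sapply`. The data
frame `toy` and its values are made up for this illustration.

```{r}
# Hypothetical toy data, invented for this illustration
toy <- data.frame(year = rep(2000:2002, each = 3),
  count = c(5, 7, 11, 18, 4, 18, 19, 13, 13))

# for loop: subset by hand and pre-allocate somewhere to store the results
years <- unique(toy$year)
loop.means <- numeric(length(years))
for (i in seq_along(years)) {
  loop.means[i] <- mean(toy$count[toy$year == years[i]])
}

# apply version: split() does the subsetting, sapply() collects the results
apply.means <- sapply(split(toy$count, toy$year), mean)

loop.means
apply.means
```

The `sapply` version also keeps the group labels as names on the result,
which the loop version would have to add separately.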

## `plyr` basics

`plyr` builds on the built-in `apply` functions by giving you control
over the input and output formats and keeping the syntax consistent
across all variations. It also adds some niceties like error handling,
parallel processing, and progress bars.

The basic format is two letters followed by `ply()`. The first letter
refers to the format in and the second to the format out.

The three main letters are:

1.  `d` = data frame

2.  `a` = array (includes matrices)

3.  `l` = list

So, `ddply` means: take a data frame, split it up, do something to it,
and return a data frame. I find I use this the majority of the time
since I often work with data frames. `ldply` means: take a list, split
it up, do something to it, and return a data frame. This extends to all
combinations. In the following table, the columns are the input formats
and the rows are the output formats:

object type |data frame  |list     |array
------------|------------|---------|--------
data frame  | `ddply`    | `ldply` | `adply`
list        | `dlply`    | `llply` | `alply`
array       | `daply`    | `laply` | `aaply`
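
To make the naming scheme concrete, here is a small sketch that pushes the
same made-up data through several of the functions in the table. The data
frame `toy` and the example list are invented for this illustration; only the
first two letters of the function name change.

```{r}
library(plyr)

# Hypothetical toy data, invented for this illustration
toy <- data.frame(group = c("a", "a", "b", "b"), value = c(1, 3, 10, 20))

# data frame in, data frame out
ddply(toy, "group", function(x) data.frame(mean.value = mean(x$value)))

# data frame in, list out (one element per group)
dlply(toy, "group", function(x) mean(x$value))

# data frame in, array out (one entry per group)
daply(toy, "group", function(x) mean(x$value))

# list in, data frame out
ldply(list(a = 1:3, b = 4:6), function(x) data.frame(total = sum(x)))
```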

I've ignored some less common format options:

1.  `m` = multi-argument function input
2.  `r` = replicate a function `n` times.
3.  `_` = throw away the output

For plotting, you might find the underscore (`_`) option useful. It will
do something with the data (say add line segments to a plot) and then
throw away the output (e.g., `d_ply()`).
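
A short sketch of these three less common options, using `mdply`, `raply`,
and `l_ply`. The argument grid `arg.grid` and the functions applied are made
up for this illustration.

```{r}
library(plyr)

# m: each row of the data frame supplies the arguments for one function call
arg.grid <- data.frame(from = 1:3, to = 4:6)  # hypothetical argument grid
mdply(arg.grid, function(from, to) data.frame(total = sum(from:to)))

# r: evaluate an expression n times (here, 3 means of 10 random draws)
set.seed(1)
raply(3, mean(runif(10)))

# _: run a function for its side effect and discard the return value
l_ply(1:3, function(i) cat("processing element", i, "\n"))
```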

## Base R `apply` functions and `plyr`

`plyr` provides a consistent and easy-to-work-with format for `apply`
functions with control over the input and output formats. Some of the
functionality can be duplicated with base `R` functions (but with less
consistent syntax). Also, few base `R` `apply` functions take a data frame
as input and return one as output, even though data frames are one of the
most common object classes to work with.

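Base `R` `apply` functions (from a presentation given by Hadley Wickham).
Note that here the rows are the input formats and the columns are the output
formats, the reverse of the `plyr` table above: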

object type        | array       | data frame  | list        | nothing
-------------------|-------------|-------------|-------------|--------
array              | `apply`     | .           | .           | .
data frame         | .           | `aggregate` | `by`        | .
list               | `sapply`    | .           | `lapply`    | .
n replicates       | `replicate` | .           | `replicate` | .
function arguments | `mapply`    | .           | `mapply`    | .
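
For comparison, a sketch of a few of these base `R` functions on made-up
data. The data frame `toy` is invented for this illustration. Notice that the
position of the data, the grouping variable, and the function differs from
one call to the next, which is the inconsistency `plyr` smooths over.

```{r}
# Hypothetical toy data, invented for this illustration
toy <- data.frame(group = c("a", "a", "b", "b"), value = c(1, 3, 10, 20))

# data frame in, data frame out
aggregate(value ~ group, data = toy, FUN = mean)

# data frame in, list-like object of class "by" out
by(toy, toy$group, function(x) mean(x$value))

# list in, list out
lapply(list(a = 1:3, b = 4:6), sum)

# list in, vector out
sapply(list(a = 1:3, b = 4:6), sum)

# matrix (array) in, vector out: apply over rows (MARGIN = 1)
apply(matrix(1:6, nrow = 2), 1, sum)
```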

## A general example with `plyr`

Let's take a simple example. We'll take a data frame, split it up by
`year`, calculate the coefficient of variation of the `count`, and
return a data frame. This could easily be done on one line, but I'm
expanding it here to show the format a more complex function could take.

```{r}
set.seed(1)
d <- data.frame(year = rep(2000:2002, each = 3),
  count = round(runif(9, 0, 20)))
print(d)
```


```{r}
library(plyr)
ddply(d, "year", function(x) {
  mean.count <- mean(x$count)
  sd.count <- sd(x$count)
  cv <- sd.count/mean.count
  data.frame(cv.count = cv)
})
```
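
For reference, a compact sketch of the one-line version mentioned above. It
uses `summarise`, a `plyr` helper covered in the next section, and rebuilds
the same `d` so the chunk stands on its own.

```{r}
library(plyr)
set.seed(1)
d <- data.frame(year = rep(2000:2002, each = 3),
  count = round(runif(9, 0, 20)))

# Same calculation as the expanded version: one CV per year
ddply(d, "year", summarise, cv.count = sd(count) / mean(count))
```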


## `transform` and `summarise`

It is often convenient to use these functions within one of the `**ply`
functions. `transform` is the base `R` function and behaves as usual: it
adds to or modifies columns of an existing data frame, keeping every row.
`summarise` instead creates a new, condensed data frame, here with one row
per group.

```{r}
ddply(d, "year", summarise, mean.count = mean(count))
```


```{r}
ddply(d, "year", transform, total.count = sum(count))
```

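Bonus function: `mutate`. `mutate` works like `transform` but evaluates its
arguments in order, so later columns can build on ones created earlier in the
same call (here `cv` uses `mu` and `sigma`).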

```{r}
ddply(d, "year", mutate, mu = mean(count), sigma = sd(count),
  cv = sigma/mu)
```
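
A small sketch of the difference between the two, outside of any `**ply`
call. The data frame `toy` is made up for this illustration, and the
`transform` call is wrapped in `try()` because it is expected to fail.

```{r}
library(plyr)

# Hypothetical toy data, invented for this illustration
toy <- data.frame(count = c(2, 4, 6))

# mutate evaluates its arguments in order, so `cv` can use `mu` and `sigma`
mutate(toy, mu = mean(count), sigma = sd(count), cv = sigma / mu)

# transform evaluates every argument against the original columns only, so
# referring to the new column `mu` fails (unless an object named `mu`
# happens to exist in the calling environment)
res <- try(transform(toy, mu = mean(count), cv = sd(count) / mu),
  silent = TRUE)
inherits(res, "try-error")
```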


## Plotting with `plyr`

You can use `plyr` to plot data by throwing away the output with an
underscore (`_`). This is a bit cleaner than a `for` loop since you don't
have to subset the data manually.

```{r d_ply_plot, fig.width=6, fig.height=2.5, fig.cap=""}
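# One panel per year; the shared axis labels go in the outer margins
par(mfrow = c(1, 3), mar = c(2, 2, 1, 1), oma = c(3, 3, 0, 0))
# plot() is called for its side effect, so d_ply() discards the return value
d_ply(d, "year", function(x) plot(x$count, main = unique(x$year), type = "o"))
# plot() draws the values against their index, so count is on the y axis
mtext("observation index", side = 1, outer = TRUE, line = 1)
mtext("count", side = 2, outer = TRUE, line = 1)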
```

## Nested chunking of the data

The basic syntax can be easily extended to break apart the data based on
multiple columns:

```{r}
baseball.dat <- subset(baseball, year > 2000) # data from the plyr package
x <- ddply(baseball.dat, c("year", "team"), summarize,
  homeruns = sum(hr))
head(x)
```
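
The splitting variables do not have to be given as a character vector. As a
sketch, the same split can also be written with `plyr`'s `.()` quoting helper
or with a one-sided formula. The object names `by.chr`, `by.quoted`, and
`by.formula` are made up for this illustration.

```{r}
library(plyr)

baseball.dat <- subset(baseball, year > 2000) # data from the plyr package

# Three ways to name the same splitting variables
by.chr <- ddply(baseball.dat, c("year", "team"), summarise,
  homeruns = sum(hr))
by.quoted <- ddply(baseball.dat, .(year, team), summarise,
  homeruns = sum(hr))
by.formula <- ddply(baseball.dat, ~ year + team, summarise,
  homeruns = sum(hr))

# Check that the home run totals agree across the three forms
identical(by.chr$homeruns, by.quoted$homeruns)
identical(by.chr$homeruns, by.formula$homeruns)
```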

## Other useful options

### Dealing with errors

You can use the `failwith` function to control how errors are dealt
with.

```{r}
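# f fails for x == 1 and returns 1 otherwise
f <- function(x) if (x == 1) stop("Error!") else 1
# safe.f returns NA instead of stopping when f throws an error
safe.f <- failwith(NA, f, quiet = TRUE)
# llply(1:2, f) # would stop with "Error!"
llply(1:2, safe.f)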
```
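
Conceptually, `failwith` wraps a function so that an error is replaced by a
default value. A rough base `R` sketch of the same idea follows. The helper
`with.default` is hypothetical, written only for this illustration, and is
not how `plyr` itself is implemented.

```{r}
f <- function(x) if (x == 1) stop("Error!") else 1

# Hypothetical helper: return `default` whenever `fun` throws an error
with.default <- function(default, fun) {
  function(...) tryCatch(fun(...), error = function(e) default)
}

safe.f2 <- with.default(NA, f)
lapply(1:2, safe.f2)
```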

### Parallel processing

In conjunction with a package such as `doParallel` you can run your function
separately on each core of your computer. On a dual-core machine this can make
your code up to twice as fast. Simply register the cores and then set
`.parallel = TRUE`. Look at the `elapsed` time in these examples:


```{r}
x <- 1:10
# Stand-in for a slow function: each call just waits 0.1 seconds
wait <- function(i) Sys.sleep(0.1)
system.time(llply(x, wait))
```


```{r}
system.time(sapply(x, wait))
```


```{r}
library(doParallel)
registerDoParallel(cores = 2)
system.time(llply(x, wait, .parallel = TRUE))
```


## So, why would I *not* want to use `plyr`?

`plyr` can be slow, particularly if you are working with very large
datasets that involve a lot of subsetting. Hadley's successor to `plyr` for
data frames, `dplyr`, can run much faster (<https://github.com/hadley/dplyr>).
However, it's important to remember that the speed at which you can write
code and understand it later is typically the rate-limiting step.

A couple of faster options:

Use a base `R` `apply` function:

```{r}
system.time(ddply(baseball, "id", summarize, length(year)))
```


```{r}
system.time(tapply(baseball$year, baseball$id,
  function(x) length(x)))
```


Use the `data.table` package:

```{r}
library(data.table)
dt <- data.table(baseball, key = "id")
system.time(dt[, length(year), by = list(id)])
```