title | description | prev | next | type | id |
---|---|---|---|---|---|
Chapter 5: Data wrangling with R |
In this chapter, you will learn to use tidyr and dplyr from the tidyverse in R for tidying your data set and data maniplulation. |
/chapter4 |
/chapter6 |
chapter |
5 |
The tidyverse is a collection of R packages that transform and visualize data. They aim to be cohesive with each other and provide simple building block functions for you to use in complex applications. All packages of the tidyverse share very similar syntax and ideologies.
The tidyverse includes the following packages:
dplyr
: grammar of data manipulationtidyr
: reshape your dataggplot2
: grammar of graphicsreadr
: read rectangular datatibble
: re-imagining of data framestringr
: working with string data
and many more...
There are many other packages with more specialized functions in the tidyverse. In this chapter, we focuses on two packages: tidyr and dplyr.
dplyr makes manipulating data easy. tidyr helps you to create "tidy data" (data where each variable is in a column, each observation is a row and each value is a cell). It helps with adding, renaming, and removing columns, computing new variables, joining data sets together and summarising your data. You can use them to solve the most common data manipulation challenges.
You can install the core tidyverse packages with:
install.packages("tidyverse")
Then, to load the package:
library(tidyverse)
You can also load specific library separately, for example,
library(dplyr)
.
Pipes
%>%
is a "pipe"-like operator with which you may pipe a value forward
into the next function. Instead of writing f(x)
, it allows you to
instead write x %>% f()
.
Packages in the tidyverse are designed around this pipe, which semantically changes your code in a way that makes it more intuitive to both read and write.
We use pipes when we want to chain a number of operations to manipulate our data. For example,
# Read in data using readr
read_csv("market.csv") %>%
# Create a new column called revenue using dplyr
mutate(revenue = sales * price) %>%
# Plot the data using ggplot2
ggplot(aes(y = revenue, x = product)) +
geom_col()
You will be introduced to the functionality of these commands in this chapter and in data visualization. So don't worry too much if you don't know what does each of these command do.
The benefit of using pipes here is that we can add layers of operations on top of one another. This is very useful when we are performing complex operations on our data sets. The equivalent code for this without the pipe would be:
ggplot(
mutate(
read_csv("market.csv"),
revenue = sales * price
),
aes(y = revenue, x = product)
) +
geom_col()
You can see how the pipe simplifies both our writing, reading, and understanding of this sequence of code. It is much more natural to think of the code first reading in the data, computing a new column called revenue, and then plotting revenue over time with a column chart in ggplot. The alternative reading (without pipes) is to plot the computed revenue column of the read in data with a column chart.
All code in R can be written using a pipe, but in some circumstances it
doesn't make sense to do so. For example, mathematical notation is
typically written in a single line, rather than in a sequence of steps.
Suppose you had a numerical vector x <- c(100, 200, 300)
and wanted to
do some mathematical calculations on it: (log(x) + 3)^2
. The
equivalent code using pipes would be x %>% log() %>% {.+3} %>% {.^2}
-
you first log the vector x
, then add 3, and square the result. It
gives the same result, but is harder to read and write for this
circumstance.
x <- c(100,200,300)
(log(x) + 3)^2
## [1] 57.83861 68.86207 75.75583
x %>% log() %>% {.+3} %>% {.^2}
## [1] 57.83861 68.86207 75.75583
< >
Problem 1
In this problem, we will use the iris
data set. The iris
dataset is
a built-in dataset in R that contains measurements on 4 different
attributes (in centimeters) for 50 flowers from 3 different species:
setosa, versicolor,virginica.
Create a new dataframe from the iris
data set that meets the following
criteria: contains only the Sepal.Length
column and a new column
called petal_area
containing a value that approximates the area of the
petal. (Petal.Length
times Petal.Width
). Only the rows where
petal_area
is greater than 5 should be shown in the final dataframe.
The final dataframe should be called new_iris
.
You would want to perform the following tasks in this particular order:
(1) Add the new column called petal_area
(2) Filter the variable based on the condition stated above
(3) Select only the relevant variables in the final dataframe
Hint: You can use mutate()
, filter()
and select()
to help with the
task.
Problem 2
Consider the data sets size
and color
.
size <- tibble(
product_name = c('mini', 'pro', 'maxi'),
diameter = c(23, 27, 32)
)
color <- tibble(
product_name = c('mini','light', 'pro'),
hue = c('red','green', 'silver')
)
We would like to merge the tables so that the result looks like:
# A tibble: 4 × 3
product_name hue diameter
<chr> <chr> <dbl>
1 mini red 23
2 light green NA
3 pro silver 27
4 maxi NA 32
Which of the following commands does the operation?
Try again!
Try again!
Try again!
That's right!
< >
Problem 1
We have used table1
, table2
and table3
as example in the slides.
Which table is tidy?
table1
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
## # A tibble: 6 × 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
That's correct!
Try again! Each cell in a tidy data format should hold only a single
value. The column rate
contains data with two values (one divided by
the other) and is converted to characters.
Try again! Each cell in a tidy data format should hold only a single
value. The column rate
in Table 3 contains data with two values (one
divided by the other) and is converted to characters.
Try again! Each cell in a tidy data format should hold only a single
value. The column rate
in Table 3 contains data with two values (one
divided by the other) and is converted to characters. Table 2 contains
two variables under the same column 'type'.
Problem 2
Consider the following simple table smoke
, which command do I need to
transform smoke
to smoke2
?
smoke <- tribble(
~smoke, ~employed, ~unemployed,
"yes", 23, 10,
"no", 20, 12
)
smoke2
# A tibble: 4 × 3
smoke employment_status count
<chr> <chr> <dbl>
1 yes employed 23
2 yes unemployed 10
3 no employed 20
4 no unemployed 12
Try again!
That's correct!
Try again!
Try again!
Problem 3
Which of the following is an appropriate way to check if there are missing observations in your data?
That's not very effective nor efficient. Use R command to help you!
That's not very effective nor efficient. Use R command to help you!
That's correct!
That's incorrect.
Congratulations! This is the end of Data Wrangling!
Here is a useful cheat sheet from the RStudio with commands that we introduced and many more: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf