---
title: "README.md"
author: "Daniel Chen"
date: ""
output:
  md_document:
    variant: markdown_github
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Working with dataframes

```{r}

# load the ggplot2 library to get the diamonds dataset
library(ggplot2)

# look at the diamonds dataset
diamonds
```


You can see here that it prints out only a portion of the dataset.
This is because `diamonds` is a `tibble` object, which is an extension of the core
`data.frame` object in R.

If you work with any package that Hadley Wickham has written,
be prepared to see tibbles.

They will (for the most part) work exactly the same as `data.frame` objects.


## Subset columns
```{r, results='hide'}
# get 1 column from the dataframe
diamonds$carat
```

I'm not showing the output for this code because there are too many values to print to the screen.
The result of using the `$` notation for subsetting is a `vector` of values.
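
As a small sketch of the same idea on a base `data.frame`, the `$` notation behaves like the double-bracket `[[ ]]` subsetter when given the full column name, and both return a plain vector. The built-in `mtcars` dataset is used here only so the example does not depend on `ggplot2`; the variable names are made up for illustration.

```{r}
# `$` and `[[ ]]` both pull a single column out as a vector
mpg_dollar <- mtcars$mpg
mpg_double <- mtcars[["mpg"]]

# both notations give back exactly the same values
identical(mpg_dollar, mpg_double)

# the result is a vector with one value per row
is.vector(mpg_dollar)
length(mpg_dollar) == nrow(mtcars)
```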

```{r}
# get 1 column using bracket notation
# bracket notation: [rows, columns]
diamonds[, 'carat']
```

You can see the difference between the `[ ]` and `$` notations for subsetting:
`[ ]` returns a dataframe (or in this case, a tibble).

```{r}
# get multiple columns from the dataset
diamonds[, c('carat', 'price')]
```

## Subset rows
```{r}
# various ways of getting rows from data

diamonds[1, ]
diamonds[10, ]
diamonds[c(1, 2), ]
diamonds[c(11:20), ]
diamonds[11:20, ]
```

## Subset rows and columns

```{r}
# getting rows and columns from your data
diamonds[11:20, c('cut', 'color')]
```
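
Column names and column positions are interchangeable inside `[ , ]`. A minimal sketch, using the built-in `mtcars` data (an assumption made to keep the example self-contained), where `mpg` and `cyl` are the first two columns:

```{r}
# select the same rows and columns by name and by position
by_name <- mtcars[1:3, c("mpg", "cyl")]
by_position <- mtcars[1:3, c(1, 2)]

# both approaches return the same dataframe
identical(by_name, by_position)
```

Names are usually the safer choice, since positions silently point at a different column if the column order ever changes.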

## Getting a glimpse of your data

Tibbles are nice in that they don't print all the rows out to the screen by default.
If you end up working with other dataframe-like objects, they might print out more rows of data than you care to look at.

```{r}
# get the first 6 rows of the dataset
head(diamonds)

# get the last 6 rows of the dataset
tail(diamonds)
```
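
Both functions take an `n` argument for when six rows is not the number you want. A short sketch on the built-in `mtcars` data, with `str()` added as another quick way to inspect a dataframe:

```{r}
# first 3 rows instead of the default 6
head(mtcars, n = 3)

# last 2 rows
tail(mtcars, n = 2)

# a compact overview: one line per column with its type and first values
str(mtcars)
```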

```{r}
# get number of rows and columns
dim(diamonds)

nrow(diamonds)
ncol(diamonds)

# get the number of columns from dim
dim(diamonds)[2]
```

# cars dataset
```{r}
# look at the built-in R dataset, cars
head(cars)
```

```{r}
# use the class function to see what you have
class(diamonds)
class(cars)
```

# mtcars dataset

```{r}
# the mtcars dataset
head(mtcars)
```

```{r}
# get a specific value out of the data
mtcars[1, 'wt']
```

## Getting just the column names and row names
```{r}
names(mtcars)
row.names(mtcars)
```

```{r}
# specifying row and columns by name
mtcars["Mazda RX4", 'wt']
```
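
Row and column names can also be passed as character vectors to pull out several cells at once. A small sketch, using two row names that appear in `row.names(mtcars)`; the name `picked` is only illustrative:

```{r}
# two cars and two measurements, all selected by name
picked <- mtcars[c("Mazda RX4", "Datsun 710"), c("mpg", "wt")]
picked

# the result has 2 rows and 2 columns
dim(picked)
```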

## Creating new columns
```{r}
# create new column called cars
mtcars$cars <- row.names(mtcars)

# create a new column from two existing columns (cylinders divided by weight)
mtcars$cyl_wt <- mtcars$cyl / mtcars$wt

```
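
New columns are computed element by element from existing ones. The sketch below works on a copy (the hypothetical `mtcars_copy`) so the original data is left alone. It relies on `wt` being recorded in units of 1000 lbs, as described in `?mtcars`.

```{r}
mtcars_copy <- mtcars

# convert weight from 1000s of lbs to lbs
mtcars_copy$wt_lbs <- mtcars_copy$wt * 1000

# the new column has one value per row
length(mtcars_copy$wt_lbs) == nrow(mtcars_copy)
head(mtcars_copy[, c("wt", "wt_lbs")])

# assigning NULL removes the column again
mtcars_copy$wt_lbs <- NULL
"wt_lbs" %in% names(mtcars_copy)
```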

## Getting summary statistics on your data
```{r}
summary(mtcars)
```

## Using `attach`

You should *not* use `attach` in your code because it mucks up the variables
you have access to in your R code: every column of the dataset becomes available as if it were its own variable.

It can be very easy to forget what the column names are in your dataset, and therefore which variables `attach` has added.

```{r}
# don't do this
attach(mtcars)
mpg
detach(mtcars)

# do this instead, if you want
with(mtcars, mpg)
```
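
One concrete way `attach` causes trouble is that the attached columns are copies of the data, not the data itself. A minimal sketch, using a made-up dataframe `demo_df` with a single `score` column:

```{r}
demo_df <- data.frame(score = c(1, 2, 3))

attach(demo_df)
# this creates a new variable in your workspace; it does NOT change demo_df
score <- score * 10
detach(demo_df)

# the modified copy
score

# the dataframe itself is unchanged
demo_df$score
```

Writing `demo_df$score <- demo_df$score * 10` instead makes it explicit which object is being changed.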

# Conditional Subsetting

We can perform conditional subsetting with the same `[ , ]` notation we use to subset dataframes.

```{r}
summary(mtcars)
```

Let's subset our data and look for all the cars where the `mpg` is less than `15.43`.

```{r}
mtcars[mtcars$mpg < 15.43, ]
```

This works because the condition in the row position evaluates to a vector of `TRUE`/`FALSE` values, one per row.

```{r}
mtcars$mpg < 15.43
```

When you give a dataframe row or column slicer a vector of booleans (i.e., TRUE/FALSE),
it will return the rows and/or columns that have a TRUE.
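
To make this concrete, a logical vector can be counted with `sum()` (each `TRUE` counts as 1), and the same mechanism selects columns. A small sketch, where `is_low_mpg` and `keep_cols` are illustrative names:

```{r}
is_low_mpg <- mtcars$mpg < 15.43

# the number of TRUE values equals the number of rows kept
sum(is_low_mpg)
nrow(mtcars[is_low_mpg, ])

# a logical vector in the column position keeps the TRUE columns
keep_cols <- names(mtcars) %in% c("mpg", "cyl")
head(mtcars[, keep_cols])
```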

## Subsetting a single column with `[ ]`

Subsetting and slicing dataframes can be weird.
When you subset a `data.frame` and the result has only 1 column, it comes back as a `vector` and not a `data.frame`.

Note that `tibble` objects don't do this.

You can prevent this automatic conversion to a vector by passing `drop = FALSE` to the `[ ]` subsetter.

```{r}
mtcars[mtcars$mpg < 15.43, 'cyl', drop = FALSE]
```
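
The difference is easy to check with `is.data.frame()`. A short sketch comparing the two forms, with `dropped` and `kept` as illustrative names:

```{r}
dropped <- mtcars[mtcars$mpg < 15.43, "cyl"]
kept <- mtcars[mtcars$mpg < 15.43, "cyl", drop = FALSE]

is.data.frame(dropped)  # FALSE: simplified to a vector
is.data.frame(kept)     # TRUE: still a data.frame with one column

# the column name survives only when drop = FALSE
names(kept)
```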

Otherwise you will have to manually convert the vector back to a `data.frame` by using `as.data.frame`.

```{r}
as.data.frame(mtcars[mtcars$mpg < 15.43, 'cyl'])
```

But then you'll end up having to change the column name.

```{r}
bool_df <- as.data.frame(mtcars[mtcars$mpg < 15.43, 'cyl'])
names(bool_df) <- 'cyl'
bool_df
```
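
One way to skip the renaming step, sketched here with a hypothetical `cyl_df` name, is to build the dataframe with `data.frame()` and name the column in the same call. Note that, unlike `drop = FALSE`, both this and `as.data.frame()` lose the original row names, because the intermediate vector does not carry them.

```{r}
# name the column while creating the dataframe
cyl_df <- data.frame(cyl = mtcars[mtcars$mpg < 15.43, "cyl"])

names(cyl_df)
nrow(cyl_df)
```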

## Multiple boolean conditions

You can use `|` or `&` as an OR or AND operator in your boolean condition, respectively.

```{r}
mtcars[mtcars$mpg < 15.43 | mtcars$cyl == 4, ]
```

```{r}
mtcars[mtcars$mpg < 15.43 & mtcars$cyl == 4, ]
```
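
Counting the matches with `sum()` shows how the two operators differ. In this sketch the names `low_mpg` and `four_cyl` are only illustrative:

```{r}
low_mpg <- mtcars$mpg < 15.43
four_cyl <- mtcars$cyl == 4

# rows meeting at least one condition
sum(low_mpg | four_cyl)

# rows meeting both conditions
sum(low_mpg & four_cyl)

# `!` negates a condition: cars that are NOT 4-cylinder
sum(!four_cyl)
```

In `mtcars` no 4-cylinder car has an `mpg` below `15.43`, so the `&` version returns a dataframe with zero rows.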


# NA Missing Values

Missing values in R are coded as `NA`.

```{r}
NA
```


You can't use the `==` operator to check for missing values, because any comparison with `NA` returns `NA`.

```{r}
NA == NA
```

```{r}
NA == TRUE
```

```{r}
NA == FALSE
```

```{r}
NA == 3
```


You have to use a special function, `is.na()`, to check for missing values.

```{r}
is.na(NA)
```
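
`is.na()` works on whole vectors, so it can be combined with the boolean subsetting shown earlier. A minimal sketch on a made-up vector `x`:

```{r}
x <- c(1, NA, 3)

# one TRUE/FALSE per element
is.na(x)

# number of missing values
sum(is.na(x))

# keep only the non-missing values
x[!is.na(x)]

# missing values propagate through calculations...
mean(x)

# ...unless you ask for them to be removed first
mean(x, na.rm = TRUE)
```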