rwdir/explore.qmd at main · utdata/rwdir · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
# Exploring a new dataset {#explore}

For those unfamiliar with exploring data, starting the process can be paralyzing.

How do I explore when I don't know what I'm looking for? Where do I start?

Every situation is different, but there are some common techniques and some common sense that you can bring to every project.

## Start by listing questions

It's likely you've acquired data because you needed it to add context to a story or situation. Spend a little time at the beginning brainstorming as list of questions you want to answer. (You might ask a colleague to participate: the act of describing the data set will reveal questions for both of you.) I like to start my R notebooks with this list.

## Understand your data

Before you start working on your data, make sure you understand what all the columns and values mean. Look at your data dictionary, or talk to the data owner to make sure you understand what you are working with.

To get a quick summary of all the values, you can use a function called `summary()` to give you some basic stats for all your data. Here is an example from the Billboard Hot 100 data we used in a class assignment.

![Summary of billboard data](images/explore-summary.png)

A `summary()` will show you the data type for each column, and then for number values it will show you the min, max, median, mean and other stats.

## Learn more through cleaning

The act of inspecting your data, cleaning data types and poking around for problems can help you learn more about it. Some tips to make sure your data is ready to analyze:

- Column names should be short, descriptive and clear. Save typing by making them all lowercase with parts separated by underscores. (You can use the `clean_names()` functions for this). Use patterns for related variables like `_date` or `_address` parts.
- Fix all your data types. Make sure numbers are numbers (can use `parse_number()` to fix) and dates into real dates (see [lubridate](https://lubridate.tidyverse.org/)) and maybe create date parts for grouping. Make sure number-oriented IDs are text if they are supposed to start with zero. (ZIP codes are like this ... not really a number, but category.)
- Remove the cruft of columns you don't need using `select()`. Make sure you understand what you are removing, but feel secure that you can come back and add them back if you find you need them.

### Cleaning up categorical data

If you are going to count or summarize rows based on categorical data, you should make sure the values in that column are clean and free of typos and values that might better be combined.

Some strategies you might use:

-   Create a `count()` of the column to show all the different values and how often they show up.
-   You might want to use `mutate()` to create a new column and then update the values there, perhaps using ["case_" functions](https://dplyr.tidyverse.org/reference/case_match.html) like  `case_match()` or `case_when()` the rename specific values in your data.

If you find you have hundreds of values to clean, there are some other tools like [OpenRefine](http://openrefine.org/) you can learn fairly quickly to help.

## Pay attention to the shape of your data

Is your data long or wide?

Wide data adds new observations as columns, with the headers describing the observation. Official reports and Excel files from agencies are often in this format:

| Country       | 2018       | 2017       |
|---------------|------------|------------|
| United States | 20494050 | 19390604 |
| China         | 13407398 | 12237700 |

Long data is where each row in the data is a single observation, and each column is an attribute that describes that observation. Data-centric languages and applications like R and Tableau typically prefer this format.

| Country       | Year | GDP        |
|---------------|------|------------|
| United States | 2018 | 20494050 |
| United States | 2017 | 19390604 |
| China         | 2018 | 13407398 |
| China         | 2017 | 12237700 |

The shape of the data will determine how you go about analyzing it. They are both useful in different ways. Wide data allows you to calculate columns to show changes. Visualization programs will sometimes want a long format to more easily categorize values based on the attributes.

You can [pivot](https://tidyr.tidyverse.org/articles/pivot.html) your data with [`pivot_longer()`](https://tidyr.tidyverse.org/reference/pivot_longer.html) and [`pivot_wider`](https://tidyr.tidyverse.org/reference/pivot_wider.html) to change the shape of your data.

## Counting and aggregation

A large part of data analysis is counting and sorting, or filtering and then counting and sorting. Depending on the program you are using you may approach it differently but think of these concepts:

### Counting rows based on a column

If you are just counting the number of rows based on the values within a column (or columns), then `count()` is the key. When you use `count()` like this, a new column called `n` is created to hold the count of the rows. You can rename `n` with the `name = "new_name"` argument, and you can change the sorting to descending order using the `sort = TRUE` argument.

In this example, we are counting the number of rows for each princess in our survey data, the arranging them in descending order.

``` r
survey |>
  count(princess, name = "votes", sort = TRUE)
```

| princess                      | votes |
|-------------------------------|------:|
| Mulan                         |    14 |
| Rapunzel (Tangled)            |     7 |
| Jasmine (Aladdin)             |     6 |
| Ariel (Little Mermaid)        |     5 |
| Tiana (Princess and the Frog) |     2 |
| Aurora (Sleeping Beauty)      |     1 |
| Belle (Beauty and the Beast)  |     1 |
| Merida (Brave)                |     1 |
| Snow White                    |     1 |

### Sum, mean and other aggregations

If you want to aggregate values in a column, like adding together values, or to find a mean or median, then you will want to use the GSA combination: `group_by()` on your columns of interest, then use `summarize()` to aggregate the data in the manner you choose, like `sum()`, `mean()` or the number of rows `n()`. You can then use `arrange()` to order the result however you want.

Here is an example where we use `group_by` and `summarize()` to add together values in our mixed beverage data. In this case, we had multiple rows for each name/address group, but we wanted to add together `total_receipts()` for each group.

``` r
receipts |>
  group_by(location_name, location_address) |>
  summarize(
    total_sales = sum(total_receipts)
  ) |>
  arrange(desc(total_sales))
```

| location_name        | location_address                  | total_sales |
|----------------------|-----------------------------------|------------:|
| WLS BEVERAGE CO      | 110 E 2ND ST                      |    35878211 |
| RYAN SANDERS SPORTS  | 9201 CIRCUIT OF THE AMERICAS BLVD |    20714630 |
| W HOTEL AUSTIN       | 200 LAVACA ST                     |    15435458 |
| ROSE ROOM/ 77 DEGREE | 11500 ROCK ROSE AVE               |    14726420 |
| THE DOGWOOD DOMAIN   | 11420 ROCK ROSE AVE STE 700       |    14231072 |

The result will have all the columns you included in the group, plus the columns you create in your summarize statement. You can summarize more than one thing at a time, like the number of rows `numb_rows = n()` and average of the values `average = mean(column_name)`.

### Creating columns to show difference

Sometimes you need to perform math on two columns to show the difference between them. Use `mutate()` to create the column and do the math. Here's a pseudo-code example:

``` r
new_or_reassigned_df <- df |>
  mutate(
    new_col_name = (part_col / total_col) * 100
  )
```


## Time as a variable

If you have dates in your data, then you almost always want to see change over time for different variables.

-   Summarize records by year or month as appropriate and create a Bar or Column chart to show how the number of records for each time period.
-   Do you need to see how different categories of data have changed over time? Consider a line chart that shows those categories in different colors.
-   If you have the same value for different time periods, do might want to see the change or percent change in those values. You can create a new column using `mutate()` to do the math and show the difference.

## Explore the distributions in your data

We didn't talk about histograms in class, but sometimes you might want see the "distribution" of values in your data, i.e. how the values vary within the column. Are many of the values similar? A histogram can show this.

Here is an example of a histogram use wells data exploring the borehole_depth (how deep the well is). Each bar represents the number of wells broken down in 100ft depth increments (set with `binwidth=100`). So the first bar shows that most of the wells (more than 7000) are less than 100 feet deep.

``` r
wells |>
  ggplot(aes(x = borehole_depth)) +
  geom_histogram(binwidth = 100)
```

![Borehole depth histogram](images/explore-histogram.png)

While there are wells deeper than 1000 feet, they are so few they don't even show on the graphic.

You'll rarely use a histogram as a graphic with a story because they are more difficult to explain to readers. But they do help you to understand how much values differ within a column.

### More on histograms

If you google around, you might see other ways to create a histogram, including `hist()` and `qplot()`. You might stick with the ggplot's `geom_histogram()` since you already are familiar with the syntax.

-   [Tutorial on histograms](https://www.datacamp.com/community/tutorials/make-histogram-ggplot2) using ggplot from DataCamp.
-   [R Cookbook](http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)/) on histograms.

## Same ideas using spreadsheets

Check out [this resource by David Eads](https://training.npr.org/2015/10/15/what-to-do-with-a-big-pile-of-data/) on the same topic, with some more specifics about Google Sheets.