rwdir/categorization.qmd at main · utdata/rwdir · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
---
title: "Categorization"
---

## Overview

Sometimes we have data where a variable has too many unique values to group/sum properly. It could be because we have a more specific need, or because the data is "dirty" with misspellings and minor differences which mean the same thing.

I want to outline several ways to deal with this, with growing complexity (and flexibility). We'll end with an example from the military surplus data.

All of these solutions will use `mutate()` because you are changing or creating data. But the methods I'm considering are:

- `if_else()` that requires only one test. We'll make it `TRUE` based on a certain circumstance, and `FALSE` if not. I'll also show how you could use values other than T/F.
- `case_match()` is where you are looking inside a single column and changing one value to another, but we can do this with any number of values. They just all have to be 1-to-1. If this, then that.
- `case_when()` allows you to perform a series of tests on multiple variables, setting new values based on the results.

I'll use the [Starwars characters](https://dplyr.tidyverse.org/reference/starwars.html) dataset for my examples.

## Setup

We need our basic libraries:

```{r}
#| label: setup
#| message: false
#| warning: false

library(tidyverse)
library(janitor)
```

The starwars character data is included with Tidyverse, but I'm going to use a simplified version without the series data about movies, etc.

```{r}
starwars <- starwars |> select(name:species)

starwars
```

## if_else(): Creating a flag variable

I actually have a good example of this in the [Denied Cleaning](denied-cleaning.qmd#create-an-audit-benchmark-column) chapter when we make the audit benchmark column, but I'll include a JedR version here.

Within our starwars data, we have a variable called `species` that has a number of values.

```{r}
starwars |> count(species)
```

Most of the characters are "Human", but let's say I want to do a series of analysis based on whether a species is "Human" vs all the other options. I can create a _flag_ variable (True/False) based on whether that. So, I want `TRUE` if "Human" and `FALSE` if not.

The [`if_else()`](https://dplyr.tidyverse.org/reference/if_else.html) function is perfect for this.

```{r}
sw_human <- starwars |>
  mutate( #<1>
    human = if_else(species == "Human", TRUE, FALSE), # <2>
    .after = species # <3>
  )

# selecting specific variables so we can see them easily
sw_human |> select(name, species, human)
```

1. I'm using `mutate()` to creat the new variable.
2. `if_else()` takes three arguments. The first is the test, the second is the value to insert if it is true (and i'm using an actual `TRUE` value here) and the third is to insert if it is false (and I'm using `FALSE`)

So now I can easily count how many characters are Human vs Not Human.

```{r}
sw_human |>
  count(human, name = "cnt_human")
```

I don't have to use real `TRUE` and `FALSE` values here. I can insert anything.

```{r}
starwars |>
  mutate(
    human_text = if_else(species == "Human", "Human", "Not Human")
  ) |>
  select(name, species, human_text)
```

## case_match(): Recategorize values

But what if the test isn't so simple ... that it isn't either one or the other.

Using [case_match()](https://dplyr.tidyverse.org/reference/case_match.html) we can make 1v1 switches for some values within a variable, and then choose what to do with the rest of them en masse.

One thing about `case_match()` ... we are only affecting values in a single variable. It's good for cleaning those, but not very flexible beyond that because we can't consider other variable in our tests. It's just "change this into that."

Let's say I want to update "Yoda's species" to "Yoda", keep "Human" as such and then make everything else "Other".

```{r}
sw_species_simple <- starwars |>
  mutate(
    new_species = case_match( # <1>
      species, # <2>
      "Yoda's species" ~ "Yoda", # <3>
      "Human" ~ "Human", # <4>
      .default = "Other" # <5>
    )
  )

# selecting specific variables to see the results
sw_species_simple |> select(name, species, new_species)
```

1. We set the name of the new variable first, then set the value to the result of the `case_match()` function. I could replace the same variable, but then I wouldn't be able to inspect the changes.
2. `case_match()`works on the values from single variable, so you have to define which column you are working with. We are using the `species` variable here.
3. Here we change the "Yoda's species" value to just "Yoda".
4. Here we set "Human" as itself so we can preserve it. Otherwise it would also be changed to "Other"
5. Here we set what all the other values we have not specified should be changed to. We set them to "Other."

Using this method, those `species` values that were `NA` are also changed to "Other" since they didn't fit the other two roles.

If we count on the new variable, this is what we get.

```{r}
sw_species_simple |>
  count(new_species, name = "cnt_species_simple")
```

An option that can make this real useful is I can choose to only a couple of values in the variable and then leave the others as-is. I'm going to do this all at once to save time and just show the result.

Here I change "Yoda's species" to just "Yoda" and all the `NA`s to "Other", but leave the rest as it was, using the existing `species` values.

```{r}
starwars |>
  mutate(
    newer_species = case_match(
      species,
      "Yoda's species" ~ "Yoda", # <1>
      NA ~ "Other", # <2>
      .default = species # <3>
    )
  ) |>
  slice(10:20) |> # <4>
  select(name, species, newer_species) # <5>
```

1. We are changing "Yoda's species" to "Yoda".
2. We set all the NA values to "Other".
3. We set the remain rows to be their original `species` values.
4. I'm using `slice()` here just to show you the row that includes "Yoda" so you can see it. It's just a display thing.
5. Also a display thing: I'm selecting the relevant variables so we can see them.

## case_when(): More power

If we need more logic, we can use [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to consider tests in any column to affect values in a single one.

This example is a bit contrived, but hopefully you can follow. We want Luke Skywalker and the Lars family to be classified as "Lars Farmers", along with the famous droids who worked there.

We'll start with this: Let's create a new variable call `lars_farm` that defines the following:

- Anyone Lars family and Luke Skywalker will be "Lars Farmers".
- We'll also add R2D2 and C-3PO to the same "Lars Farmers" group.
- Everyone else will get their original homeworld.


```{r}
sw_lars <- starwars |>
  mutate(
    lars_farm = case_when( # <1>
      str_detect(name, "Lars|Luke Skywalker") ~ "Lars Farmers", # <2>
      name %in% c("R2-D2", "C-3PO") ~ "Lars Farmers", # <3>
      .default = homeworld # <4>
    )
  )

# select relevant variables
sw_lars |> select(name, homeworld, lars_farm)
```

1. One difference between `case_match()` and `case_when()` is we can consider any test from any column. Here we create the new variable `lars_farm` to and set it to use the results from the `case_when()` function.
2. Here we use use `str_detect()` to look for "Luke Skywalker" and any name that includes "Lars". We set those to "Lars Farmers".
3. Here we add the two droids by specifically looking for those names and also set it to "Lars Farmers".
4. We set everyone else to use their `homeworld` value.

## Military surplus: control_type

In the [Military Surplus Cleaning chapter](leso-cleaning.qmd#controlled) we created a `control_type` variable, creating a category to designate if our items have to be returned to the Department of Defense for disposal. Refer back to that chapter about why were need to do this.

I skipped the detailed explanation of the code there because it is a more advanced tactic and would mire the flow of the lesson. In this appendix, I go into the details for those who might be interested.

### Catching up with the data

I'll start here with the Military Surplus data after the shipment totals have been calculated.

```{r}
leso_total <- read_rds("data-processed/rwdir-leso-total.rds")

leso_total |> head()

leso_total |> glimpse()
```

### Categorization logic with case_when()

We will use the `mutate()` function to create a new column called `control_type`. We've used mutate before, but this time we will fill in values in the new column based on other data inside each row. `case_when()` allows us to create a test (or a number of tests) and then mark the new value based on the answer. Once new data has been written the function evaluates the next row, so we write the most specific rules first.

I usually approach this by thinking of the logic first, then writing some code, then testing it. Sometimes my logic is faulty and I have to try again, which is why we test the results. Know this could go on for many cycles.

Here is the basic logic:

-   We want to create a new column to denote if the item is controlled. In that column we want it to be TRUE when an item is controlled, and FALSE when it is not.
-   We know that items with "AIRPLANE" are always controlled, no matter their demil designations.
-   Otherwise we know that items that have a `demil_code` of "A", OR a `demil_code` of "Q" AND a `demil_id` of "6", are non-controlled.
-   Everything else is controlled.

I've noted this logic in a specific order for a reason: It's the order that we write the logic in the code based on how the function `case_when()` works. This process is powerful and can get complicated depending on the logic needed. This example is perhaps more complicated than I like to use when explaining this concept, but this is real data and we *need* this, so here we go.

Here is the code and annotations.

```{r control-engine}
leso_control <- leso_total |> # <1>
  mutate( # <2>
    control_type = case_when( # <2>
      str_detect(item_name, "AIRPLANE") ~ TRUE, # <3>
      (demil_code == "A" | (demil_code == "Q" & demil_ic == 6)) ~ FALSE, # <4>
      .default = TRUE # <5>
    )
  )

leso_control |> glimpse() # <6>
```

1.  Our first line creates a new tibble `leso_control` and fills it with the result of the rest of our expression. We start with the `leso_total` tibble.
2.  We mutate the data and start with the name of new column: `control_type`. We are filling that column with the result of the `case_when()` function for each row. Within the `case_when()` we are making the determination if the item is controlled or not. The left side of the `~` is the test, and the right side of `~` is what we enter into the column if the test passes. But we have more than one test:
3.  The first test is we use the [`str_detect()`](https://stringr.tidyverse.org/reference/str_detect.html) function to look inside the `item_name` field looking for the term "AIRPLANE". If the test finds the term, then the `control_type` field gets a value of `TRUE` and we move to the next row. If not, it moves to the next rule to see if that is a match. (We could fill this column with any text or number we want, but we are using `TRUE` and `FALSE` because that is the most basic kind of data to keep. If the item is controlled, set the value is TRUE. If not, it should be set to FALSE.)
4.  Our second rule has two complex tests and we want to mark the row FALSE if either are true. (Remember, this is based on what the DLA told me: items with A or Q6 are non-controlled.) Our `case_when()` logic first looks for the value "A" in the `demil_code` field. If it is yes, then it marks the row FALSE. If no it goes to the next part: Is there a "Q" in the `demil_code` field AND a "6" in the `demil_ic` field? Both "Q" and "6" have to be there to get marked as FALSE. If both fail, then we move to the next test.
5.   The last test is our catch-all. If none of the other rules apply, then we set the default value as `TRUE`, which means it is controlled. So our default in the end is to mark everything TRUE if any of the other rules don't mark it first.
6.   Lastly we glimpse at the data just so we can see the column was created.

As I said, there was a lot of trial and error to figure that out, but I'll show some tests here to show that we did what we were intending.

This shows airplanes are marked as controlled with `TRUE`.

```{r check-airplane}

# showing the results and some columns that determined them
leso_control |>
  select(item_name, demil_code, demil_ic, control_type) |>
  filter(str_detect(item_name, "AIRPLANE"))
```

This shows how many items are marked TRUE vs FALSE for each `demil_code` and `demil_ic` combination. I used it to check that most A records were FALSE, along with Q6.

```{r check-type}
leso_control |>
  count(demil_code, demil_ic, control_type, name = "cnt") |> # <1>
  pivot_wider(names_from = control_type, values_from = cnt) # <2>
```

1. This is the count shortcut using three variables. I name counted column `cnt`. It's hard to visualize that result here but it counts how many times each unique combination of `demil_code`, `demil_ic` and `control_type`.
2. Here I use pivot_wider() to show the TRUE and FALSE counts on the same row. Just makes it easier to see.

We're done with this extended explanation.