index.html

<!DOCTYPE html>
<html lang="" xml:lang="">
  <head>
    <title>Strings and regex in R</title>
    <meta charset="utf-8" />
    <meta name="date" content="2020-11-16" />
    <link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
    <link rel="stylesheet" href="custom-fonts.css" type="text/css" />
  </head>
  <body>
    <textarea id="source">
class: center, middle, inverse, title-slide

# Strings and regex in R
## Maria Novosolov
### 2020-11-16

---


# When does it come up?

- data stored as notes
- non-uniformly formatted data
- filenames
- almost everywhere

---

# What are strings?

Strings (character types) = pretty much anything surrounded by quotes

**Double quotes** are the preferred style unless your text contains double quotes

.center[
![](https://media.giphy.com/media/xTiN0z1dTpeedhds8o/giphy.gif)
]

---

.center[.img-med[
![](https://media.giphy.com/media/HUjc0sbfgaInK/giphy.gif)
]]

### `"Keeps away the nargles."`  

--
### `'Luna whispered, "Keeps away the nargles."'`  

--
### `"Luna's eyes widened, and she whispered, \"Keeps away the nargles.\""`


---

# What's with the backslash?

.center[.img-big[
![](https://media.giphy.com/media/gCwXh8EHKKfMk/giphy.gif)
]]

---

# Escaped characters

Special characters that may have an alternate meaning

In our previous case: is `"` literally a double-quote or is it marking a character string?  

--

Some other examples:

`\n` = new line  
`\t` = tab  
`\\` = backslash  

---

# Check your escapes with `writeLines()`


```r
writeLines("Nitwit! \n\tBlubber! \n\t\tOddment! \n\U1F9D9\t\t\tTweak!")
```

```
## Nitwit! 
## 	Blubber! 
## 		Oddment! 
## 🧙			Tweak!
```

_You can also write emojis this way!_


---

# UTF-8 

Character-encoding system of choice but not fully supported by Windows (yet); this is how you can write in Hebrew, English, Chinese, and emojis in one sentence

It is also possible to use letter code to write non-Latin alphabet languages. For example:

Hebrew with letter codes:


```r
writeLines("\U05DB\U05EA\U05D1") 
```

```
## כתב
```

```r
# write the letters-codes left-to-right but it prints right-to-left
```

---
class: exercise, center, middle

Use `writeLines()` to write your name, degree, and favorite day of the week. Add an emoji in the end. Each one should be in a new line and tab every other line.

---

class: inverse

# stringr

.center[
![](https://stringr.tidyverse.org/logo.png)

https://stringr.tidyverse.org  

_A consistent, simple and easy-to-use set of wrappers around the `stringi` package._
]

---

# General pattern

.center[.xlarge[
str_.gray[verb](.gray[string], .gray[...]) 
]] 

.large[
string = character string or vector

... = additional arguments include `pattern` to match or replacement string
]
---

# Functions for the day

.large[
- single-input functions: `str_verb(string)`
- multi-input functions: `str_verb(string, pattern, ...)`
- `*_all` variants
- `str_glue()` for interpolating strings
- `str_view()` for checking regex
]

If you're already familiar with the base R equivalents, check out [this vignette](https://stringr.tidyverse.org/articles/from-base.html) for the "translations"

---

# Let's get some data


```r
spells &lt;- read_csv("data/spells.csv") # from the rcorpora pkg

(incantation &lt;- spells$incantation[1:5])
```

```
## [1] "Accio"     "Aguamenti" "Alohomora" "Anapneo"   "Aparecium"
```

.center[![](https://media.giphy.com/media/kSjGBipdCXdHW/giphy.gif)]

---

.center[.xlarge[
str_.gray[verb](.gray[string]) 
]] 


```r
# often a useful first step to avoid dealing with capitals
str_to_lower(incantation) 
```

```
## [1] "accio"     "aguamenti" "alohomora" "anapneo"   "aparecium"
```

```r
str_length(incantation)
```

```
## [1] 5 9 9 7 9
```

---

.center[.xlarge[
str_.gray[verb](.gray[string], .gray[pattern]) 
]] 


```r
str_detect(incantation, "o") # similar to grepl
```

```
## [1]  TRUE FALSE  TRUE  TRUE FALSE
```

```r
str_subset(incantation, "o")
```

```
## [1] "Accio"     "Alohomora" "Anapneo"
```

```r
str_count(incantation, "o")
```

```
## [1] 1 0 3 1 0
```

---

.center[.xlarge[str_.gray[verb]_all(.gray[string], .gray[...]) ]]


```r
str_extract_all(incantation, ".o")
```

```
## [[1]]
## [1] "io"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "lo" "ho" "mo"
## 
## [[4]]
## [1] "eo"
## 
## [[5]]
## character(0)
```

---

class: exercise

# Your turn

How many `object`s are there among the effect descriptions?   
Replace `object` with a different noun. 


```r
spells
```

```
## # A tibble: 91 x 3
##   incantation effect                     type 
##   &lt;chr&gt;       &lt;chr&gt;                      &lt;chr&gt;
## 1 Accio       Summons an object          Charm
## 2 Aguamenti   Shoots water from wand     Charm
## 3 Alohomora   Opens locked objects       Charm
## 4 Anapneo     Clears the target's airway Spell
## 5 Aparecium   Reveals invisible ink      Spell
## # … with 86 more rows
```

---

.center[.xlarge[
str_glue(.gray[...]) 
]] 


```r
# works similarly to `paste()` or `sprintf()`
str_glue('Hermione shouted, "{incantation}!"')
```

```
## Hermione shouted, "Accio!"
## Hermione shouted, "Aguamenti!"
## Hermione shouted, "Alohomora!"
## Hermione shouted, "Anapneo!"
## Hermione shouted, "Aparecium!"
```

---

class: inverse, center, middle

.xlarge[regex]

## regular expressions

---

class: exercise

# How would you extract out amounts of money?

&gt;"We'll bet 37 Galleons, 15 Sickles, 3 Knuts"  
"George and I invented them - 7 Sickles each, a bargain!"  
"True, both of them had paid 2 Sickles for a S.P.E.W. badge"  
"And 1000 Galleons prize money!"

--
Pseudocode:


```r
money_types &lt;- c("Galleons", "Sickles", "Knuts") 

sentences %&gt;% 
  extract(number_before(money_types), money_types) 
```

---

class: exercise

&gt;"We'll bet 37 Galleons, 15 Sickles, 3 Knuts"  
"George and I invented them - 7 Sickles each, a bargain!"  
"True, both of them had paid 2 Sickles for a S.P.E.W. badge"  
"And 1000 Galleons prize money!"

&gt;"I pull down about 100 sacks of galleons a year!"

--

Pseudocode:


```r
money_types &lt;- c("Galleons", "Sickles", "Knuts") %&gt;% 
* to_lower_case()

sentences %&gt;% 
* to_lower_case() %&gt;%
* extract(closest_number_before(money_types)), money_types)
```

---

# regex helpers

.center[.xlarge[`str_view_all()`]] 

and `str_view()` are your friends

Other resources:

- [regex101](https://regex101.com) - interpret regex
- [regexplain](https://www.garrickadenbuie.com/project/regexplain/) - interpret and write regex (lots of cheatsheets)
- [rex](https://github.com/kevinushey/rex) package - "friendly" regex
- [rebus](https://cran.r-project.org/web/packages/rebus/rebus.pdf) package - "friendly" regex
- [regexpal](https://www.regexpal.com/) - interpret regex

---

# Key expressions 

_Note:_ There are often several ways of writing the same regex. For this presentation, I chose my favorite style, which I find the most flexible

`^` = start  
`$` = end  
`.` = anything  


```r
str_subset(c("football", "baseball", "ballroom"), "ball$")
```

```
## [1] "football" "baseball"
```

---

.center[.xlarge[[.gray[one of]]]]

brackets = _one of_ the characters specified within the brackets (in the example, an `o`, `n`, `e`, space, or `f`)

`[a-z]` = any lower case letters  
`[0-9]` = any number from 0 to 9 (also `\\d`)

variations on the theme:  
`[09-]` = 0, 9, or -

---

class: exercise

# Guess the regex

.center[.xlarge[any vowel]]


---

class: exercise

# Guess the regex

.center[.xlarge[starts with a capital letter]]


---

class: exercise

# Guess the regex

.center[.xlarge[a vowel and the two characters next to it]]


---

# Numbers of things

`*` = 0 or more  
`+` = 1 or more  

`{n}` = exactly `n` number of times  
`{n,}` = at least `n` number of times  
`{n,m}` = between `n` and `m` times

---

class: exercise

# Guess the regex

.center[.xlarge[at least one number]]


---

class: exercise

# Guess the regex

.center[.xlarge[ends with `jpg` or `jpeg`]]


---

class: exercise

# Guess the regex

.center[.xlarge[at least 2 #s in a row]]

---

class: inverse, center, middle

# How do we say NOT?

---

class: center, middle

.img-big[
![](https://media.giphy.com/media/gbErpwcLlizvi/giphy.gif)
]

---

_Inside_ of brackets, a caret ("hat"/`^`) means **NOT**

_Outside_ of brackets and at the beginning of a string, it means **begins with**

.center[.img-big[![](https://media.giphy.com/media/FvJ4fmyI6gHLi/giphy.gif)]]

---

class: exercise

# Which is which?

.center[.xlarge[`^[a-z]` versus `[^a-z]`]]

---

class: inverse, center, middle

# Working in a dataframe

---


```r
spells
```

```
## # A tibble: 91 x 3
##   incantation effect                     type 
##   &lt;chr&gt;       &lt;chr&gt;                      &lt;chr&gt;
## 1 Accio       Summons an object          Charm
## 2 Aguamenti   Shoots water from wand     Charm
## 3 Alohomora   Opens locked objects       Charm
## 4 Anapneo     Clears the target's airway Spell
## 5 Aparecium   Reveals invisible ink      Spell
## # … with 86 more rows
```

---


```r
spells %&gt;% 
* mutate(effect = str_to_lower(effect))
```

```
## # A tibble: 91 x 3
##   incantation effect                     type 
##   &lt;chr&gt;       &lt;chr&gt;                      &lt;chr&gt;
## 1 Accio       summons an object          Charm
## 2 Aguamenti   shoots water from wand     Charm
## 3 Alohomora   opens locked objects       Charm
## 4 Anapneo     clears the target's airway Spell
## 5 Aparecium   reveals invisible ink      Spell
## # … with 86 more rows
```

---


```r
spells %&gt;% 
  mutate(effect = str_to_lower(effect)) %&gt;% 
* mutate(effect = str_split(effect, " "))
```

```
## # A tibble: 91 x 3
##   incantation effect    type 
##   &lt;chr&gt;       &lt;list&gt;    &lt;chr&gt;
## 1 Accio       &lt;chr [3]&gt; Charm
## 2 Aguamenti   &lt;chr [4]&gt; Charm
## 3 Alohomora   &lt;chr [3]&gt; Charm
## 4 Anapneo     &lt;chr [4]&gt; Spell
## 5 Aparecium   &lt;chr [3]&gt; Spell
## # … with 86 more rows
```

### Creates a _list-column_; supports vector results of different lengths

---


```r
spells %&gt;% 
  mutate(effect = str_to_lower(effect)) %&gt;% 
  mutate(effect = str_split(effect, " ")) %&gt;% 
* unnest(effect)
```

```
## # A tibble: 351 x 3
##   incantation effect  type 
##   &lt;chr&gt;       &lt;chr&gt;   &lt;chr&gt;
## 1 Accio       summons Charm
## 2 Accio       an      Charm
## 3 Accio       object  Charm
## 4 Aguamenti   shoots  Charm
## 5 Aguamenti   water   Charm
## # … with 346 more rows
```

### _Note:_ this is very similar to the `unnest_tokens()` function in the [`tidytext`](https://www.tidytextmining.com/) package

---


```r
spells %&gt;% 
  mutate(effect = str_to_lower(effect)) %&gt;% 
  mutate(effect = str_split(effect, " ")) %&gt;% 
  unnest(effect) %&gt;% 
* count(effect, sort = TRUE)
```

```
## # A tibble: 231 x 2
##   effect      n
##   &lt;chr&gt;   &lt;int&gt;
## 1 a          11
## 2 an         11
## 3 to          9
## 4 object      7
## 5 objects     7
## # … with 226 more rows
```

---

class: inverse, center, middle 

# A few more miscellaneous tricks

.x-large[![](https://media.giphy.com/media/54Yiwc8TSvoJO/giphy.gif)]

---

.center[.xlarge[(.gray[capture groups])]]

Extract groups within a pattern


```r
phone_numbers &lt;- c("058 222 1234", "054-121 1221")

str_match(phone_numbers, "([0-9]+)[ -]([0-9]+)[ -]([0-9]+)")
```

```
##      [,1]           [,2]  [,3]  [,4]  
## [1,] "058 222 1234" "058" "222" "1234"
## [2,] "054-121 1221" "054" "121" "1221"
```

```r
# same as "(\\d+)[ -](\\d+)[ -](\\d+)"
```

---

.pull-left[
## look behinds `(?&lt;=)` 

![](https://static.boredpanda.com/blog/wp-content/uploads/2016/10/newborn-baby-harry-potter-photo-shoot-kayla-glover-4.jpg)
]

.pull-right[
## look aheads `(?=)`

![](https://imagesvc.meredithcorp.io/v3/mm/image?url=https%3A%2F%2Fewedit.files.wordpress.com%2F2015%2F01%2Fharry-potter_510.jpg&amp;w=400&amp;c=sc&amp;poi=face&amp;q=85)
]

---

class: inverse, center, middle 
 
Some more practice
---
str_extract("It does not do to dwell on dreams and forget to live.", )

---

str_extract("It does not do to dwell on dreams and forget to live.", )

---

# Bonus question

How do you extract the currency conversions from this text?


```r
text &lt;- c("There are 29 Knuts in 1 silver Sickle",
          "and there are 493 Knuts in 1 golden Galleon.")
```


--

Shortcut if you only care about the first number...


```r
parse_number(text)
```

```
## [1]  29 493
```

---

# Summary

- strings/characters = things that are quoted in R
- backslashes are used for "escape characters" (`\n`, `\\`)
- use `writeLines()` to show how strings will display
- `stringr` provides a simple, consistent interface to work with strings in R
- regular expressions describe the logic of a pattern in text
- check regex in R with `str_view()`/`str_view_all()`
- use `unnest()` to expand list-columns

Cheatsheets: [Basic regular expressions in R](https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf), [Working with strings with stringr](https://resources.rstudio.com/rstudio-cheatsheets/stringr-cheat-sheet)

Santa Barbara Eco-Data-Science [text workshop materials](https://github.com/eco-data-science/text_workshop) (covers pdftools, sentiment analysis, etc.)

---

# You are now ready to face the world of strings and regex

.center[.xlarge[
![](https://media.giphy.com/media/ffynNaSYx2yTC/giphy.gif)]]
    </textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script src="https://platform.twitter.com/widgets.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "github",
"highlightLines": true,
"ratio": "16:9"
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
  window.dispatchEvent(new Event('resize'));
});
(function(d) {
  var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
  if (!r) return;
  s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
  d.head.appendChild(s);
})(document);

(function(d) {
  var el = d.getElementsByClassName("remark-slides-area");
  if (!el) return;
  var slide, slides = slideshow.getSlides(), els = el[0].children;
  for (var i = 1; i < slides.length; i++) {
    slide = slides[i];
    if (slide.properties.continued === "true" || slide.properties.count === "false") {
      els[i - 1].className += ' has-continuation';
    }
  }
  var s = d.createElement("style");
  s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
  d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
  var deleted = false;
  slideshow.on('beforeShowSlide', function(slide) {
    if (deleted) return;
    var sheets = document.styleSheets, node;
    for (var i = 0; i < sheets.length; i++) {
      node = sheets[i].ownerNode;
      if (node.dataset["target"] !== "print-only") continue;
      node.parentNode.removeChild(node);
    }
    deleted = true;
  });
})();
(function() {
  "use strict"
  // Replace <script> tags in slides area to make them executable
  var scripts = document.querySelectorAll(
    '.remark-slides-area .remark-slide-container script'
  );
  if (!scripts.length) return;
  for (var i = 0; i < scripts.length; i++) {
    var s = document.createElement('script');
    var code = document.createTextNode(scripts[i].textContent);
    s.appendChild(code);
    var scriptAttrs = scripts[i].attributes;
    for (var j = 0; j < scriptAttrs.length; j++) {
      s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
    }
    scripts[i].parentElement.replaceChild(s, scripts[i]);
  }
})();
(function() {
  var links = document.getElementsByTagName('a');
  for (var i = 0; i < links.length; i++) {
    if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
      links[i].target = '_blank';
    }
  }
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
  const hlines = d.querySelectorAll('.remark-code-line-highlighted');
  const preParents = [];
  const findPreParent = function(line, p = 0) {
    if (p > 1) return null; // traverse up no further than grandparent
    const el = line.parentElement;
    return el.tagName === "PRE" ? el : findPreParent(el, ++p);
  };

  for (let line of hlines) {
    let pre = findPreParent(line);
    if (pre && !preParents.includes(pre)) preParents.push(pre);
  }
  preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>

<script>
slideshow._releaseMath = function(el) {
  var i, text, code, codes = el.getElementsByTagName('code');
  for (i = 0; i < codes.length;) {
    code = codes[i];
    if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
      text = code.textContent;
      if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
          /^\$\$(.|\s)+\$\$$/.test(text) ||
          /^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
        code.outerHTML = code.innerHTML;  // remove <code></code>
        continue;
      }
    }
    i++;
  }
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.src  = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
  if (location.protocol !== 'file:' && /^https?:/.test(script.src))
    script.src  = script.src.replace(/^https?:/, '');
  document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
  </body>
</html>