---
title       : "Scraping with R"
subtitle    : Test subtitle
author      : "Eugene Pyatigorsky"
job         : Ahalogy
framework   : revealjs
revealjs    : {theme: night}
highlighter : highlight.js
hitheme     : Github
widgets     : []
mode        : selfcontained
knit        : slidify::knit2slides
---

```{r setup, eval = TRUE, include = FALSE}
knitr::opts_chunk$set(cache = TRUE, eval = TRUE, warning = FALSE, message = FALSE)
```

```{r, echo = FALSE, message = FALSE}
# library() fails loudly if a package is missing; require() only warns
library(dplyr)
library(rvest)
library(httr)
library(knitr)
library(tidyr)
library(ggplot2)
library(shiny)
library(htmlwidgets)
# Author's local checkout; adjust to your own path
setwd("~/Repos/Rscraping/")
```

# Scraping with **R**

<br><br><br><br>
<footer>
Cincinnati R Users Group<br>
June 21, 2016  <br><br>
**Eugene Pyatigorsky**
<br><br>
This presentation and supporting materials available at:<br>
https://github.com/epspi/Rscraping
</footer>

---

## Agenda
* Overview of packages
* A look at how to scrape
* Working example
* Best practices

--- .chapter

# Overview of packages

--- &vertical
## rvest

Hadley Wickham's `rvest` package does most of the work.

* Inspired by Python's Beautiful Soup
* Extracts elements from the DOM using CSS selectors or XPath

.fragment **e.g.<br>`rvest::html_table()`**
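
A minimal sketch of both selector styles. The inline HTML is made up for illustration so the code runs without a network connection:

```r
library(rvest)

# Inline HTML stands in for a real page; the table contents are illustrative
page <- read_html(paste0(
  "<table>",
  "<tr><th>package</th><th>role</th></tr>",
  "<tr><td>rvest</td><td>parsing</td></tr>",
  "<tr><td>httr</td><td>requests</td></tr>",
  "</table>"
))

# The same cells selected with CSS and with XPath
css_cells   <- page %>% html_nodes("td") %>% html_text()
xpath_cells <- page %>% html_nodes(xpath = "//td") %>% html_text()

# html_table() converts a <table> node into a data frame
pkg_table <- page %>% html_node("table") %>% html_table()
```

CSS is usually shorter to write; XPath helps when you need to select by position or by text.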

***

##  httr

`httr` is Hadley Wickham's wrapper for `curl`.

* Really useful for making customized calls to APIs
* Can also be used for writing your own API wrappers

.fragment **e.g.<br>`httr::GET("some_endpoint", config)`**
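
A rough sketch of a customized call. The helper `fetch_page()` and the user agent string are hypothetical, and the helper is defined but never called, so nothing is sent over the network:

```r
library(httr)

# Build a search URL from named query parameters instead of pasting strings
search_url <- modify_url(
  "http://www.bing.com/search",
  query = list(q = "Cincinnati R users group")
)

# Hypothetical helper: send the request with a custom user agent and timeout
fetch_page <- function(url) {
  resp <- GET(url, user_agent("Rscraping-demo"), timeout(10))
  stop_for_status(resp)  # raise an R error on HTTP 4xx/5xx
  content(resp, as = "text", encoding = "UTF-8")
}
```

Letting `httr` encode the query avoids hand-escaping spaces and special characters.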

--- .chapter

# How to Scrape: An Example

--- &vertical
## Let's ask Bing about the R Users Group

```{r}
lnk <- 'http://www.bing.com/search?q=Cincinnati+R+users+group&go=Submit&qs=n&form=QBLH&pq=cincinnati+r+users+g&sc=0-20&sp=-1&sk=&cvid=4A13A7CB066B419B9F7BD75777D68F09'

read_html(lnk) %>%
    html_nodes("h2 a") %>%
    html_text()
```
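
The same `"h2 a"` selector can also return the link targets. A small sketch, assuming made-up markup with the same heading-and-link structure (the `example.com` URLs are placeholders):

```r
library(rvest)

# Made-up markup with the same "h2 a" structure as a results page
results <- read_html(paste0(
  "<ol>",
  "<li><h2><a href='https://example.com/one'>First result</a></h2></li>",
  "<li><h2><a href='https://example.com/two'>Second result</a></h2></li>",
  "</ol>"
))

links <- html_nodes(results, "h2 a")

# One row per result: the visible text and the link target
hits <- data.frame(
  title = html_text(links),
  url   = html_attr(links, "href"),
  stringsAsFactors = FALSE
)
```

`html_text()` reads what the user sees; `html_attr()` reads attributes such as `href`.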

***

## Common CSS Selectors

* `#name` selects the element with `id="name"`
* `.name` selects elements with `class="name"`
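
A quick way to try these selectors, assuming a tiny made-up page (the `main` id and `note` class are illustrative names):

```r
library(rvest)

# Tiny made-up page with one id and one repeated class
page <- read_html(paste0(
  "<div id='main'>",
  "<p class='note'>First note</p>",
  "<p>Plain paragraph</p>",
  "<p class='note'>Second note</p>",
  "</div>",
  "<p class='note'>Outside note</p>"
))

# "#main p": every <p> inside the element with id="main"
in_main <- html_text(html_nodes(page, "#main p"))

# ".note": every element with class="note", wherever it sits
all_notes <- html_text(html_nodes(page, ".note"))

# Selectors combine: <p class="note"> inside #main only
main_notes <- html_text(html_nodes(page, "#main p.note"))
```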

.fragment **OR you can use SelectorGadget for Chrome**
<br>
https://chrome.google.com/webstore/detail/selectorgadget/

--- .chapter


# A Working Site

--- &vertical

[Cincinnati Foreclosures - A Real Estate Scraper](http://cincyreal.followthenumbers.com)

--- .chapter


# Best Practices

--- &vertical

## Authentication

Use an API instead of scraping whenever possible. Documentation for `rvest` is sparse, and cookie-based authentication can be tricky.
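
When an API is available, authentication is usually a matter of passing a config to `httr`. A sketch under assumptions: `get_api_json()` is a hypothetical helper, the credentials are placeholders, and HTTP basic auth is only one of several schemes an API might use:

```r
library(httr)

# Hypothetical helper: call an endpoint protected by HTTP basic auth.
# `url`, `user`, and `password` are placeholders, not a real service.
get_api_json <- function(url, user, password) {
  resp <- GET(url, authenticate(user, password))
  stop_for_status(resp)         # turn HTTP errors into R errors
  content(resp, as = "parsed")  # parse the body based on its content type
}

# The auth config can be built and inspected without sending a request
auth_cfg <- authenticate("demo_user", "demo_password")
```

In a real script, read the credentials from environment variables with `Sys.getenv()` rather than hard-coding them.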

***

## Automation

* The real power of `R` and `rvest` shines when used with `shiny` (no pun intended).
* Put your scraping code in a standalone R script and automate with `cron`.
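
A sketch of what such a standalone script might look like. The file name `scrape.R`, the crontab paths, and the helper names are all hypothetical, and inline HTML stands in for the real page so the sketch runs offline:

```r
#!/usr/bin/env Rscript
# scrape.R: hypothetical standalone script meant to be run by cron.
# Example crontab entry (paths are placeholders), daily at 06:00:
#   0 6 * * * Rscript /path/to/scrape.R >> /path/to/scrape.log 2>&1
library(rvest)

# Extract link titles and stamp each row with the scrape time
scrape_titles <- function(page) {
  titles <- html_text(html_nodes(page, "h2 a"))
  stamp  <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  data.frame(
    scraped_at = rep(stamp, length(titles)),
    title      = titles,
    stringsAsFactors = FALSE
  )
}

# Append to a CSV, writing the header only when the file is new
append_results <- function(results, path) {
  is_new <- !file.exists(path)
  write.table(results, path, sep = ",", qmethod = "double",
              row.names = FALSE, col.names = is_new, append = !is_new)
}

# Inline HTML stands in for read_html("<real URL>")
page <- read_html(paste0(
  "<h2><a href='/a'>First</a></h2>",
  "<h2><a href='/b'>Second</a></h2>"
))
out_file <- file.path(tempdir(), "titles.csv")
append_results(scrape_titles(page), out_file)
```

Appending with a timestamp keeps a history across runs, which is what makes a scheduled scrape useful.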

***

## End