---
title: "CFB Stats Generic Scraper"
---
```{r}
#| label: setup
#| message: false
library(rvest)
library(dplyr)
library(janitor)
```
## Generic Scraper Function
Pass `scrape_cfbstats()` the path of any stats page linked from [cfbstats.com/2025/national/index.html](https://cfbstats.com/2025/national/index.html),
along with a year. The path is the portion of the URL **after** the year segment, including the leading slash.
For example, for this URL:
```
https://cfbstats.com/2025/leader/national/team/offense/split01/category09/sort01.html
```
The path argument would be:
```
/leader/national/team/offense/split01/category09/sort01.html
```
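If you are copying URLs straight from the browser, the split described above can be automated. The following is a minimal sketch in base R; `split_cfbstats_url()` is a hypothetical helper that is not part of the scraper itself, and it assumes every URL has a four-digit year as its first path segment.

```{r}
#| label: sketch-split-url
# Hypothetical helper (an assumption, not part of the original scraper):
# split a full cfbstats.com URL into the `year` and `path` arguments.
split_cfbstats_url <- function(url) {
  # Group 2 captures the four-digit year, group 3 everything after it
  pattern <- "^https?://(www\\.)?cfbstats\\.com/([0-9]{4})(/.*)$"
  if (!grepl(pattern, url)) {
    stop("Not a cfbstats.com URL with a four-digit year segment: ", url)
  }
  list(
    year = as.integer(sub(pattern, "\\2", url)),
    path = sub(pattern, "\\3", url)
  )
}

parts <- split_cfbstats_url(
  "https://cfbstats.com/2025/leader/national/team/offense/split01/category09/sort01.html"
)
parts$path
parts$year
```

The two list elements can then be passed on as the `path` and `year` arguments.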
```{r}
#| label: scraper-function
#' Scrape any stat leaderboard from cfbstats.com
#'
#' @param path Character. The URL path after the year, starting with "/".
#' Example: "/leader/national/team/offense/split01/category09/sort01.html"
#' @param year Integer. Season year (e.g., 2025). Available years: 2016–2025.
#' @return A data frame of the stats table with a `year` column appended.
scrape_cfbstats <- function(path, year) {
  # Strip leading slash if present so sprintf doesn't double up
  path <- sub("^/", "", path)
  url <- sprintf("https://cfbstats.com/%d/%s", year, path)

  page <- tryCatch(
    read_html(url),
    error = function(e) stop("Failed to fetch: ", url, "\n", e$message)
  )

  tbl_node <- html_element(page, "table.leaders")
  if (is.na(tbl_node)) {
    stop("No table with class 'leaders' found at: ", url)
  }

  tbl_node |>
    html_table(header = TRUE) |>
    clean_names() |> # snake_case column names via janitor
    mutate(year = year, .before = 1)
}
```
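To see what the parsing steps inside the function do without touching the network, the same pipeline can be run on a small inline page. This is only a sketch: the HTML below is made up for illustration (the teams and numbers are not real stats), and it assumes the real pages use the same `table.leaders` markup that the function targets.

```{r}
#| label: sketch-offline-parse
library(rvest)
library(dplyr)
library(janitor)

# Made-up HTML standing in for a cfbstats.com page; the rows are not real stats
demo_page <- minimal_html('
  <table class="leaders">
    <tr><th>Rank</th><th>Team Name</th><th>Points</th></tr>
    <tr><td>1</td><td>Team A</td><td>40</td></tr>
    <tr><td>2</td><td>Team B</td><td>38</td></tr>
  </table>
')

# Same steps as the body of scrape_cfbstats(), minus the download
demo_tbl <- demo_page |>
  html_element("table.leaders") |>
  html_table(header = TRUE) |>
  clean_names() |>
  mutate(year = 2025, .before = 1)

names(demo_tbl)

# A selector that matches nothing yields a missing node, which is what the
# is.na() check in scrape_cfbstats() detects
is.na(html_element(demo_page, "table.not-there"))
```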
## Example Usage
### Single page + year
```{r}
#| label: example-single
scoring_offense <- scrape_cfbstats(
  path = "/leader/national/team/offense/split01/category09/sort01.html",
  year = 2025
)
head(scoring_offense)
```
### Loop over multiple years
```{r}
#| label: example-multi-year
rushing_defense_multi <- lapply(2022:2025, function(yr) {
  Sys.sleep(0.5) # be polite to the server
  scrape_cfbstats(
    path = "/leader/national/team/defense/split01/category01/sort01.html",
    year = yr
  )
}) |>
  bind_rows()
glimpse(rushing_defense_multi)
```
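Because `scrape_cfbstats()` calls `stop()` on any failure, the loop above aborts at the first year that cannot be fetched or parsed. One possible variation, sketched below, skips failed years with a warning and keeps the rest. `scrape_years()` and its `scrape_fun` and `pause` arguments are hypothetical names introduced here for illustration; `scrape_fun` defaults to the `scrape_cfbstats()` function defined earlier.

```{r}
#| label: sketch-safe-multi-year
library(dplyr)

# Hypothetical wrapper (an assumption, not part of the original document):
# scrape several years and skip any year that raises an error.
scrape_years <- function(path, years, scrape_fun = scrape_cfbstats, pause = 0.5) {
  results <- lapply(years, function(yr) {
    Sys.sleep(pause) # be polite to the server
    tryCatch(
      scrape_fun(path = path, year = yr),
      error = function(e) {
        warning("Skipping ", yr, ": ", conditionMessage(e), call. = FALSE)
        NULL # failed years contribute nothing to the result
      }
    )
  })
  # Drop the failed years, then stack what remains
  bind_rows(Filter(Negate(is.null), results))
}
```

Calling `scrape_years("/leader/national/team/defense/split01/category01/sort01.html", 2022:2025)` would then return the rows for whichever of those years succeeded.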
### Player stats example
```{r}
#| label: example-player
passing_leaders <- scrape_cfbstats(
  path = "/leader/national/player/split01/category02/sort01.html",
  year = 2025
)
head(passing_leaders)
```
### Try another
I tried using Copilot to complete this example, but it got the URL wrong, so I fixed it manually.
```{r}
#| label: example-total-offense
total_offense <- scrape_cfbstats(
  path = "leader/national/team/offense/split01/category10/sort01.html",
  year = 2025
)
# show the result
head(total_offense)
```