---
output: github_document
---
```{r, echo = FALSE, warning=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = '#>',
fig.path = 'README-',
error = TRUE,
eval = TRUE,
fig.height = 5
)
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(forcats))
suppressPackageStartupMessages(library(survivoR))
suppressPackageStartupMessages(library(paletteer))
suppressPackageStartupMessages(library(glue))
suppressPackageStartupMessages(library(rvest))
good_pal <<- c("#ffffff", "#f2fbd2", "#c9ecb4", "#93d3ab", "#35b0ab")
bad_pal <<- rev(c("#ef6351", "#f38375", "#f7a399", "#fbc3bc", "#ffe3e0", "white"))
max_eps <- viewers |>
filter(episode_title != "Reunion") |>
nrow()
max_seasons <- season_summary |>
nrow()
# in progress seasons
in_prog_eps <- survivoR::castaways |>
filter(version_season == "US43") |>
pull(episode) |>
max(na.rm = TRUE)
n_people <- survivoR::castaways |>
distinct(version_season, castaway_id) |>
nrow()
version <- str_replace(readLines("DESCRIPTION")[4], 'Version: ', 'v')
url <- "https://cran.r-project.org/web/packages/survivoR/index.html"
cran_version <- read_html(url) |>
html_text() |>
str_extract("Version:\n[:digit:]{1}\\.[:digit:]+\\.[:digit:]+") |>
str_remove("Version:\n")
cran_version <- paste0("v", cran_version)
```
<img src='https://cranlogs.r-pkg.org/badges/survivoR'/><img src='https://cranlogs.r-pkg.org/badges/grand-total/survivoR'/><img src='https://www.r-pkg.org/badges/version/survivoR'/>
# survivoR <img src='dev/images/hex-flame-final.png' align="right" height="240" />
`r max_seasons` seasons. `r n_people` people. 1 package!
survivoR is a collection of data sets detailing events across `r max_seasons` seasons of Survivor US, Australia, South Africa, New Zealand and UK. It includes castaway information, vote history, immunity and reward challenge winners, jury votes, advantage details and a lot more.
For analysis and updates you can follow me on Bluesky [@danoehm.bsky.social](https://bsky.app/profile/danoehm.bsky.social) or Threads [@_survivordb](https://www.threads.net/@_survivordb)
# Installation
Install from CRAN (**`r cran_version`**) or Git (**`r version`**).
If the Git version is ahead of CRAN, I'd suggest installing from Git. We are constantly improving the data sets, so the GitHub version is likely to be slightly more up to date.
```{r, eval = FALSE}
install.packages("survivoR")
```
```{r, eval = FALSE}
devtools::install_github("doehm/survivoR")
```
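As a rough way to decide between the two, base R's `utils::compareVersion()` can compare two version strings. The version numbers below are placeholders for illustration, not the actual released versions.

```r
# Placeholder version strings (assumptions for illustration only)
cran_ver <- "2.3.4"
git_ver <- "2.3.5"

# compareVersion() returns 1 if the first version is later,
# 0 if they are equal and -1 if the first is earlier
git_is_newer <- utils::compareVersion(git_ver, cran_ver) == 1

if (git_is_newer) {
  message("Git is ahead of CRAN: consider installing from Git")
} else {
  message("CRAN is up to date: install.packages() is enough")
}
```

The installed version can be checked with `packageVersion("survivoR")`.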
# News: survivoR 2.3.5
<img src='https://img.shields.io/badge/col-new-green'/>
* Adding complete US47 data
* Adding new `castaway_scores` dataset
* Adding new `add_*` functions:
  * `add_alive()`: Adds a logical flag if the castaway is alive at the start or end of an episode.
  * `add_bipoc()`: Adds a BIPOC flag to the data frame. If any of African American (or Canadian), Asian American, Latin American, or Native American is `TRUE` then BIPOC is `TRUE`.
  * `add_castaway()`: Adds castaway to a data frame. Input data frame must have `castaway_id`.
  * `add_demogs()`: Adds demographics, including age, gender, race/ethnicity, and LGBTQIA+ status, to a data frame with `castaway_id`.
  * `add_finalist()`: Adds a finalist flag to the data set.
  * `add_full_name()`: Adds full name to the data frame. Useful for plotting and making tables.
  * `add_gender()`: Adds gender to a data frame.
  * `add_jury()`: Adds a jury member flag to the data set.
  * `add_lgbt()`: Adds the LGBTQIA+ flag to the data frame.
  * `add_result()`: Adds the `result` and `place` to the data frame.
  * `add_tribe()`: Adds tribe to a data frame for a specified stage of the game e.g. `original`, `swapped`, `swapped_2`, etc.
  * `add_tribe_colour()`: Adds tribe colour to the data frame. Useful for preparing the data for plotting with ggplot2.
  * `add_winner()`: Adds a winner flag to the data set.
* Adding new `filter_*` functions:
  * `filter_alive()`: Filters a given data set to those that are still alive in the game at the start or end of a user-specified episode.
  * `filter_final_n()`: Filters to the final `n` players e.g. the final 5.
  * `filter_finalist()`: Filters a data set to the finalists of a given season.
  * `filter_jury()`: Filters a data set to the jury members of a given season.
  * `filter_new_era()`: Filters a data set to all New Era seasons.
  * `filter_us()`: Filters a data set to a specified US season or list of US seasons. A shorthand version of `filter_vs()` for the US seasons.
  * `filter_vs()`: Filters a data set to a specified version season or list of version seasons.
  * `filter_winner()`: Filters a data set to the winners of a given season.
# [The Sanctuary](https://gradientdescending.com/the-sanctuary/)
[The Sanctuary](https://gradientdescending.com/the-sanctuary/) is the survivoR package's companion. It holds interactive tables and charts detailing the castaways, challenges, vote history, confessionals, ratings, and more. Confessional counts are contributed by [myself](https://twitter.com/danoehm), [Carly Levitz](https://twitter.com/carlylevitz), [Sam](https://twitter.com/survivorfansam), and Grace.
[<img src='dev/images/flame.png' height="240"/>](https://gradientdescending.com/the-sanctuary/)
### Confessional timing
Included in the package is a confessional timing app to record the length of confessionals while watching the episode.
To launch the app, first install the package and then run:
```{r, eval = FALSE}
library(survivoR)
launch_confessional_app()
```
<a href='https://github.com/doehm/survivoR/tree/master/inst'><img src='dev/images/conf-app-gif.gif'></a>
To try it out online 👉 [Confessional timing app](https://danieloehm.shinyapps.io/survivorDash/)
More info [here](https://github.com/doehm/survivoR/tree/master/inst).
# Dataset overview
There are 20 data sets included in the package:
1. `advantage_movement`
2. `advantage_details`
3. `boot_mapping`
4. `castaway_details`
5. `castaway_scores`
6. `castaways`
7. `challenge_results`
8. `challenge_description`
9. `challenge_summary`
10. `confessionals`
11. `jury_votes`
12. `season_summary`
13. `survivor_auction`
14. `tribe_colours`
15. `tribe_mapping`
16. `episodes`
17. `vote_history`
18. `auction_details`
19. `screen_time`
20. `season_palettes`
See the sections below for more details on the key data sets.
<details>
<summary><strong>Season summary</strong></summary>
## Season summary
A table containing summary details of each season of Survivor, including the winner, runners-up and location.
```{r}
season_summary
```
</details>
<details>
<summary><strong>Castaways</strong></summary>
## Castaways
This data set contains season and demographic information about each castaway. It is structured to view their results for each season. Castaways that have played in multiple seasons will feature more than once with the age and location representing that point in time. Castaways that re-entered the game will feature more than once in the same season as they technically have more than one boot order e.g. Natalie Anderson - Winners at War.
Each castaway has a unique `castaway_id` which links the individual across all data sets and seasons. It also links to the following IDs found on the `vote_history`, `jury_votes` and `challenges` data sets.
* `vote_id`
* `voted_out_id`
* `finalist_id`
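As a rough illustration of that linkage, the sketch below joins a toy vote table to a toy details table twice: once on `castaway_id` for the voter, and once on `vote_id` for the person they voted for. The IDs, names and the `vote_name` column are made up for the example; only the ID column names come from the description above.

```r
library(dplyr)

# Toy tables with made-up IDs and names (assumptions for illustration)
toy_details <- tibble(
  castaway_id = c("US0001", "US0002", "US0003"),
  short_name = c("Ana", "Ben", "Cy")
)
toy_votes <- tibble(
  castaway_id = c("US0001", "US0002", "US0003"),
  vote_id = c("US0003", "US0003", "US0002")
)

# First join names the voter, second join names who they voted for
named_votes <- toy_votes |>
  left_join(toy_details, by = "castaway_id") |>
  left_join(
    toy_details |> rename(vote_id = castaway_id, vote_name = short_name),
    by = "vote_id"
  )
named_votes
```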
```{r}
castaways |>
filter(season == 45)
```
## Castaway details
A few castaways have changed their name from season to season or have been referred to by a different name during the season e.g. Amber Mariano; in season 8 Survivor All-Stars there was Rob C and Rob M. That information has been retained here in the `castaways` data set.
`castaway_details` contains unique information for each castaway. It takes the full name from their most current season and their most verbose short name which is handy for labelling.
It also includes gender, date of birth, occupation, race, ethnicity and other data. If no source was found to determine a castaway's race and ethnicity, the data is kept as missing rather than making an assumption.
`african_american`, `asian_american`, `latin_american`, `native_american`, `race`, `ethnicity`, and `bipoc` data is complete only for the US. `bipoc` is `TRUE` when any of the `*_american` fields are `TRUE`. These fields have been recorded as per the [Survivor wiki](https://survivor.fandom.com/wiki/Main_Page). Other versions have been left blank as the data is not complete and the term 'people of colour' is typically only used in the US.
I have deprecated the old field `poc` in order to be more inclusive and to make using the race/ethnicity fields simpler.
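The `bipoc` rule described above can be sketched on a toy data frame. The rows and IDs here are made up; in the package the flag is already provided on `castaway_details`.

```r
library(dplyr)

# Toy rows with hypothetical values (assumptions for illustration)
toy_flags <- tibble(
  castaway_id = c("US0001", "US0002", "US0003"),
  african_american = c(FALSE, TRUE, FALSE),
  asian_american = c(FALSE, FALSE, FALSE),
  latin_american = c(FALSE, FALSE, TRUE),
  native_american = c(FALSE, FALSE, FALSE)
)

# bipoc is TRUE when any of the *_american flags is TRUE
toy_flags <- toy_flags |>
  mutate(
    bipoc = african_american | asian_american | latin_american | native_american
  )
toy_flags
```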
```{r}
castaway_details
```
## Castaway scores
I have created measures for challenge success, vote history (or tribal council) success, and advantage success. For more details please follow the links:
[Challenge score methodology](https://gradientdescending.com/the-sanctuary/full-challenges-list-all.html#details)
[Vote history methodology](https://gradientdescending.com/the-sanctuary/full-vote-list.html#details)
```{r}
castaway_scores
```
</details>
<details>
<summary><strong>Vote history</strong></summary>
## Vote history
This data frame contains a complete history of votes cast across all seasons of Survivor. This allows you to see who voted for whom at each Tribal Council. It also includes details on who had individual immunity as well as who had their votes nullified by a hidden immunity idol. This details the key events for the season.
There is some information on split votes to help calculate if a player engaged in a split vote but ultimately hit their target. There are events which influence the vote e.g. Extra votes, safety without power, etc. These are recorded here as well.
```{r}
vh <- vote_history |>
filter(
season == 45,
episode == 9
)
vh
```
```{r}
vh |>
count(vote)
```
</details>
<details>
<summary><strong>Challenges</strong></summary>
## Challenge results
Note: From v1.1 the `challenge_results` dataset has been improved but could break existing code. The old table is maintained at `challenge_results_dep`.
There are 3 tables `challenge_results`, `challenge_description`, and `challenge_summary`.
### Challenge results
A tidy data frame of immunity and reward challenge results. The winners and losers of the challenges are found recorded here.
```{r}
challenge_results |>
filter(season == 45) |>
group_by(castaway) |>
summarise(
won = sum(result == "Won"),
lost = sum(result == "Lost"),
total_challenges = n(),
chosen_for_reward = sum(chosen_for_reward)
)
```
The `challenge_id` is the primary key for the `challenge_description` data set. The `challenge_id` will change as the data or descriptions change.
### Challenge description
_Note: This data frame is going through a massive revamp. Stay tuned._
This data set contains the name, description, and descriptive features for each challenge where it is known. Challenges can go by different names, so I have included the unique name and the recurring challenge name. These are taken directly from the [Survivor Wiki](https://survivor.fandom.com/wiki/Category:Recurring_Challenges). Sometimes a challenge is varied but goes by the same name, or the challenge is integrated with a longer obstacle. In these cases the challenge may share the same recurring challenge name but have a different challenge name. Even if they share the same names the description could be different.
The features of each challenge have been determined largely through string searches of key words that describe the challenge. They may not be 100% accurate due to the different and inconsistent descriptions, but for the most part they will provide a good basis for analysis.
If any descriptive features need altering please let me know in the [issues](https://github.com/doehm/survivoR/issues).
```{r}
challenge_description
challenge_description |>
summarise_if(is_logical, ~sum(.x, na.rm = TRUE)) |>
glimpse()
```
See the help manual for more detailed descriptions of the features.
### Challenge Summary
The `challenge_summary` table solves an annoying problem with `challenge_results` and the way some challenges are constructed. You may want to count how many individual challenges someone has won, or tribal immunities, etc. To do so you have to use the `challenge_type`, `outcome_type`, and `result` fields. Some challenges are combined, e.g. `Team / Individual` challenges, which means summarising the table is not straightforward.
This is why `challenge_summary` exists. The `category` column consists of the following categories:
* All: All challenge types
* Reward
* Immunity
* Tribal
* Tribal Reward
* Tribal Immunity
* Individual
* Individual Reward
* Individual Immunity
* Team
* Team Reward
* Team Immunity
* Duel
There is obviously overlap with the categories but this structure makes it simple to summarise the table how you desire e.g.
```{r challenge summary}
challenge_summary |>
group_by(category, version_season, castaway) |>
summarise(
n_challenges = n(),
n_won = sum(won)
)
```
To add the challenge scores to the challenge summary:
```{r challenge summary scores}
challenge_summary |>
  group_by(category, version_season, castaway_id, castaway) |>
  summarise(
    n_challenges = n_distinct(challenge_id),
    n_won = sum(won),
    .groups = "drop"
  ) |>
  left_join(
    castaway_scores |>
      select(version_season, castaway_id, starts_with("score_chal")) |>
      pivot_longer(c(-version_season, -castaway_id), names_to = "category", values_to = "score") |>
      mutate(
        category = str_remove(category, "score_chal_"),
        category = str_replace_all(category, "_", " "),
        category = str_to_title(category)
      ) |>
      select(category, version_season, castaway_id, score),
    join_by(category, version_season, castaway_id)
  )
```
See the R docs for more details on the fields. Join to `challenge_results` with `version_season` and `challenge_id`.
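A minimal sketch of that join on toy data is shown below. The rows and ID values are made up, and reducing the results table to one row per challenge before joining is an assumption made here so the summary rows are not duplicated; adapt it to the columns you actually need.

```r
library(dplyr)

# Toy tables with made-up values (assumptions for illustration)
toy_summary <- tibble(
  version_season = c("US45", "US45"),
  challenge_id = c(1, 2),
  castaway = c("Ana", "Ana"),
  category = c("Tribal Immunity", "Individual Reward"),
  won = c(1, 0)
)
toy_results <- tibble(
  version_season = c("US45", "US45", "US45"),
  challenge_id = c(1, 1, 2),
  castaway = c("Ana", "Ben", "Ana"),
  challenge_type = c("Immunity", "Immunity", "Reward")
)

# Keep one row per challenge, then join on the two key columns
joined <- toy_summary |>
  left_join(
    toy_results |> distinct(version_season, challenge_id, challenge_type),
    by = c("version_season", "challenge_id")
  )
joined
```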
</details>
<details>
<summary><strong>Jury votes</strong></summary>
## Jury votes
History of jury votes. It is more verbose than it needs to be, however having a 0-1 column indicating if a vote was placed or not makes it easier to summarise castaways that received no votes.
```{r jury votes}
jury_votes |>
filter(season == 45)
```
```{r jury votes sum}
jury_votes |>
filter(season == 45) |>
group_by(finalist) |>
summarise(votes = sum(vote))
```
</details>
<details>
<summary><strong>Advantages</strong></summary>
## Advantage Details
This dataset lists the hidden idols and advantages in the game for all seasons. It details where it was found, if there was a clue to the advantage, location and other advantage conditions. This maps to the `advantage_movement` table.
```{r}
advantage_details |>
filter(season == 45)
```
## Advantage Movement
The `advantage_movement` table tracks who found the advantage, who they may have handed it to and who they played it for. Each step is called an event. The `sequence_id` tracks the logical step of the advantage. For example in season 41, JD found an Extra Vote advantage. JD gave it to Shan in good faith who then voted him out keeping the Extra Vote. Shan gave it to Ricard in good faith who eventually gave it back before Shan played it for Naseer. That movement is recorded in this table.
```{r}
advantage_movement |>
filter(advantage_id == "USEV4102")
```
</details>
<details>
<summary><strong>Confessionals</strong></summary>
## Confessionals
A dataset containing the number of confessionals for each castaway by season and episode. There are multiple contributors to this data. Where there are multiple sets of counts for a season the average is taken and added to the package. The aim is to establish consistency in confessional counts in the absence of official sources. Given the subjective nature of the counts and the potential for clerical error no single source is more valid than another. So it is reasonable to average across all sources.
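The averaging across sources can be sketched as follows. The two contributors, the `source` column and the counts are hypothetical; this only illustrates the idea and is not the package's actual build script.

```r
library(dplyr)

# Made-up counts from two hypothetical contributors for one episode
toy_counts <- tibble(
  source = c("A", "A", "B", "B"),
  castaway = c("Ana", "Ben", "Ana", "Ben"),
  episode = 1,
  confessional_count = c(4, 2, 5, 2)
)

# Average the counts across sources for each castaway and episode
avg_counts <- toy_counts |>
  group_by(castaway, episode) |>
  summarise(confessional_count = mean(confessional_count), .groups = "drop")
avg_counts
```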
Confessional time exists for a few seasons. This is the total cumulative time for each castaway in seconds. This is a much more accurate indicator of the 'edit'.
```{r}
confessionals |>
filter(season == 45) |>
group_by(castaway) |>
summarise(
count = sum(confessional_count),
time = sum(confessional_time)
)
```
The confessional index is available on this data set. The index is a standardised measure of the number of confessionals a player has received compared to the others. It is stratified by tribe, so it measures how many confessionals each player gets relative to an even share within their tribe, e.g. an index of 1.5 means that player has received 50% more than an even share in their tribe.
The tribe grouping is important since the tribe that attends tribal council typically gets more screen time, which is fair enough. I don't think we should expect an even share across everyone in the pre-merge stage of the game.
The index is cumulative with episode, so the player's final index is the index in their final episode.
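To make the idea concrete, here is a toy calculation of an index for a single episode: each player's count divided by the even share within their tribe. The data is made up, and the package's exact calculation, including how it accumulates across episodes and tribe changes, may differ.

```r
library(dplyr)

# Made-up confessional counts for two hypothetical tribes
toy_conf <- tibble(
  tribe = c("Red", "Red", "Red", "Blue", "Blue"),
  castaway = c("Ana", "Ben", "Cy", "Dee", "Eli"),
  confessional_count = c(3, 2, 1, 4, 4)
)

# Even share = tribe total / tribe size; index = count / even share
toy_index <- toy_conf |>
  group_by(tribe) |>
  mutate(
    even_share = sum(confessional_count) / n(),
    index_count = confessional_count / even_share
  ) |>
  ungroup()
toy_index
```

In this toy example Ana has an index of 1.5, meaning 50% more confessionals than an even share of her tribe's total.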
```{r}
confessionals |>
filter(season == 45) |>
group_by(castaway) |>
slice_max(episode) |>
arrange(desc(index_time)) |>
select(castaway, episode, confessional_count, confessional_time, index_count, index_time)
```
</details>
<details>
<summary><strong>Screen time</strong></summary>
## Screen time [EXPERIMENTAL]
This dataset contains the estimated screen time for each castaway during an episode. Please note that this is still in the early days of development. There is likely to be misclassification and other sources of error. The model will be refined over time.
An individual's screen time is calculated, at a high level, via the following process:
1. Frames are sampled from episodes on a 1 second time interval
2. MTCNN detects the human faces within each frame
3. VGGFace2 converts each detected face into a 512d vector space
4. A training set of labelled images (1 for each contestant + 3 for Jeff Probst) is processed in the same way to determine where they sit in the vector space. TODO: This could be made more accurate by increasing the number of training images per contestant.
5. The Euclidean distance is calculated for the faces detected in the frame to each of the contestants in the season (+Jeff). If the minimum distance is greater than 1.2 the face is labelled as "unknown". TODO: Review how robust this distance cutoff truly is - currently based on manual review of Season 42.
6. A multi-class SVM is trained on the training set to label faces. For any face not identified as "unknown", the vector embedding is run into this model and a label is generated.
7. All labelled faces are aggregated together, with an assumption of 1-5 full second of screen time each time a face is seen and factoring in time between detection capping at a max of 5 seconds.
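The distance cutoff in step 5 can be sketched in base R. This is a toy stand-in, not the actual model: 3-d vectors replace the 512-d embeddings, the names are made up, and the nearest reference label is used in place of the SVM from step 6.

```r
# Toy 3-d embeddings stand in for the 512-d face vectors (assumption)
reference <- rbind(
  Ana = c(0.0, 0.0, 0.0),
  Ben = c(1.0, 1.0, 1.0),
  Jeff = c(3.0, 0.0, 0.0)
)
detected <- rbind(
  face_1 = c(0.1, 0.0, 0.1),
  face_2 = c(5.0, 5.0, 5.0)
)

label_face <- function(face, reference, cutoff = 1.2) {
  # Euclidean distance from the face to every reference embedding
  dists <- sqrt(rowSums(sweep(reference, 2, face)^2))
  # Too far from everyone: label as unknown, otherwise take the nearest
  if (min(dists) > cutoff) "unknown" else names(which.min(dists))
}

labels <- apply(detected, 1, label_face, reference = reference)
labels
```

Here `face_1` sits close to Ana's reference vector and is labelled accordingly, while `face_2` is further than the cutoff from every reference and is labelled "unknown".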
<img src='dev/images/cast-detect1.png' align="center"/>
```{r}
screen_time |>
filter(version_season == "US45") |>
group_by(castaway_id) |>
summarise(total_mins = sum(screen_time)/60) |>
left_join(
castaway_details |>
select(castaway_id, castaway = short_name),
by = "castaway_id"
) |>
arrange(desc(total_mins))
```
Currently it only includes data for season 42. More seasons will be added as they are completed.
</details>
<details>
<summary><strong>Boot mapping</strong></summary>
## Boot mapping
A mapping table to detail who is still alive at each stage of the game. It is useful for easy filtering to, say, the final `n` players.
```{r}
# filter to season 45 and when there are 6 people left
# 18 people in the season, therefore 12 boots
still_alive <- function(.version, .season, .final_n) {
  survivoR::boot_mapping |>
    filter(
      version == .version,
      season == .season,
      # use the argument rather than a hard-coded 6
      final_n == .final_n,
      game_status %in% c("In the game", "Returned")
    )
}
still_alive("US", 45, 6)
```
</details>
<details>
<summary><strong>Episodes</strong></summary>
## Episodes
Episodes is an episode level table. It contains the episode information such as episode title, air date, length, IMDb rating and the viewer information for every episode across all seasons.
```{r episodes}
episodes |>
filter(season == 45)
```
</details>
<details>
<summary><strong>Survivor Auction</strong></summary>
## Survivor Auction
There are 2 data sets, `survivor_auction` and `auction_details`. `survivor_auction` simply shows who attended the auction and `auction_details` holds the details of the auction e.g. who bought what and at what price.
```{r auction}
auction_details |>
filter(season == 45)
```
</details>
# Issues
Given the variable nature of the game of Survivor and the changing of the rules, there are bound to be edge cases where the data is not quite right. Before logging an issue please install the Git version to see if it has already been corrected. If not, please log an issue and I will correct the data sets.
New features will be added, such as details on exiled castaways across the seasons. If you have a request for specific data let me know in the issues and I'll see what I can do.
# Showcase
## Survivor Dashboard
[**Carly Levitz**](https://twitter.com/carlylevitz) has developed a fantastic [dashboard](https://public.tableau.com/app/profile/carly.levitz/viz/SurvivorCBSData-Acknowledgements/Tableofcontents) showcasing the data and allowing you to drill down into seasons, castaways, voting history and challenges.
[<img src='dev/images/dash.png' align="center"/>](https://public.tableau.com/app/profile/carly.levitz/viz/SurvivorCBSData-Acknowledgements/Tableofcontents)
## Data viz
This looks at the number of immunity idols won and votes received for each winner.
[<img src='dev/images/torches_png.png' align="center"/>](https://gradientdescending.com/survivor/torches_png.png)
# Contributors
A big thank you to:
#### Package contributors and maintainers
* [**Carly Levitz**](https://twitter.com/carlylevitz) for ongoing data collection and curation
#### Data contributors
* [**Sam**](https://twitter.com/survivorfansam) for contributing to the confessional counts
* [**Dario Mavec**](https://github.com/dariomavec) for developing the face detection model for estimating total screen time
* [**Matt Stiles**](https://github.com/stiles/survivor-voteoffs?tab=readme-ov-file) for collecting and contributing the acknowledgment features on the `castaways` data frame.
* **Camilla Bendetti** for collating the personality type data for each castaway.
* **Uygar Sozer** for adding the filming start and end dates for each season.
* **Holt Skinner** for creating the castaway ID to map people across seasons and manage name changes.
* **Kosta Psaltis** for the original race data.
# References
Data was sourced from [Wikipedia](https://en.wikipedia.org/wiki/Survivor_(American_TV_series)) and the [Survivor Wiki](https://survivor.fandom.com/wiki/Main_Page). Other data, such as the tribe colours, was manually recorded and entered by myself and contributors.
<!-- Torch graphic in hex: [Fire Torch Vectors by Vecteezy](https://www.vecteezy.com/free-vector/fire-torch) -->