[
{
"objectID": "shinyapps.html#learning-objectives",
"href": "shinyapps.html#learning-objectives",
"title": "11 Shiny apps",
"section": "11.1 Learning objectives",
"text": "11.1 Learning objectives\nThis chapter introduces you to ShinyApps using R code to create and publish data, research materials, or informative tutorials. Chapter 11 is divided into two parts, 11a and 11b. In chapter 11a, you will learn:\n\nwhat a ShinyApp is and how to find resources,\nhow to use RStudio and shinyapps.io to publish your application and,\nhow to recognize the main components of a Shiny application.\n\nChapter 11b will take a deeper look at many of the commonly used functions for input and output so that you can begin to customize and build your own applications."
},
{
"objectID": "shinyapps.html#what-is-shiny",
"href": "shinyapps.html#what-is-shiny",
"title": "11 Shiny apps",
"section": "11.2 What is Shiny?",
"text": "11.2 What is Shiny?\nShiny is an R package that builds interactive web applications using R and RStudio to compile HTML files. The HTML files can be used locally, (just opened up using your browser from a local directory), or placed on a server to make public for anyone to use. Its not just for data, either! They are used as an education tool, a map for finding resources, business intelligence (BI) dashboards, tracking activities, calculators, file type conversion, and of course, for displaying the results of data analysis through interactive visualizations.\nYou are not restricted to just the web browser. You can configure your shiny app to export other document types, like PDFs, Word, or Excel.\nThe most important aspects of Shiny are that it is interactive and it is responsive. Interactivity provides an opportunity for your viewers to engage with your materials, rather than just read it. Responsive applications resize automatically to adjust to the device being used. This means you design once, and it can be viewed on a laptop, desktop, tablet, or phone.\nIn Shiny, we code in R. In the following sections, code will be shown in cells:\n\n# this is written in R\nprint(\"Hello lovely people!\")\n\n[1] \"Hello lovely people!\""
},
{
"objectID": "shinyapps.html#what-can-it-do",
"href": "shinyapps.html#what-can-it-do",
"title": "11 Shiny apps",
"section": "11.3 What can it do?",
"text": "11.3 What can it do?\nThe Shiny Gallery is really helpful when learning as the examples provided include their code, so you can see how they created the interfaces. Below, is an example of where we’re heading with this chapter and where we’ll be going in the next chapter which will lead you through the steps to creating a dashboard.\n\n\n\nAn example of a simple application from this chapter.\n\n\n\n11.3.1 Examples of dashboards and data visualization websites\nA dashboard is one that uses data that has been placed on the server, (or you have uploaded) or streams real-time and shows a variety of graphs, charts, and tables. In our examples case, we’ll use standard datasets from R as well as some from the fivethirtyeight GitHub repository. You can see the all the R data sets by typing the following in the console:\ndata()\nThe term ‘dashboard’ does apply to specific libraries made for Shiny that use a template HTML file. We are calling any website for the purpose of showing, exploring, or manipulating data with multiple visualization methods - a dashboard. So, we use it rather casually, rather than specifying specific libraries."
},
{
"objectID": "shinyapps.html#how-does-it-integrate-with-other-things",
"href": "shinyapps.html#how-does-it-integrate-with-other-things",
"title": "11 Shiny apps",
"section": "11.4 How does it integrate with other things?",
"text": "11.4 How does it integrate with other things?\n\n\n\n\n\n\n11.4.1 RStudio\nThis is the IDE, the integrated development environment, for R, (among many other programming languages). We’ll be using RStudio to create our shiny apps, test the code, preview results, and manage files.\n\n\n11.4.2 Shiny\nShiny apps are built on Bootstrap which is an open-source, template HTML file with CSS and Javascript that makes it adapt to different browsers and devices. The Shiny app code does all the work to place the elements you design into an HTMLL files and automatically manages all file dependencies. We’ll briefly discuss files when getting started with your first app.\n\n\n11.4.3 shinyapps.io\nWhen running a shiny application from RStudio, you can publish directly to a server. Thankfully, the R community has a server location where you can host up to 5 apps for free at shinyapps.io. Of course, you can pay for more space, but for now, this works for us! Later on, we’ll walk you through publishing to the server directly from RStudio. Its surprisingly easy!"
},
{
"objectID": "shinyapps.html#how-do-i-get-started",
"href": "shinyapps.html#how-do-i-get-started",
"title": "11 Shiny apps",
"section": "11.5 How do I get started?",
"text": "11.5 How do I get started?\nThis link to the tutorials at shiny.rstudio.com can also guide you through steps to creating your own apps, including more detailed information on how to change elements to meet your needs.\nInstall the package down below in the console. You only need to do this once. If you uninstall and re-install your RStudio, you may need to reinstall packages.\ninstall.packages(\"shiny\")\n\n11.5.1 Starting a new app\nfrom Files>New File > Shiny Web app OR select new file from the pull down menu in the top toolbar.\n\n\n\nStarting up a new Shiny app.\n\n\nThis brings up a dialog box. Enter your project name. This will become the name of the folder that is created in the location you select.\n\n\n\nSelecting your new directory location.\n\n\nOnce you select Create you will see the template shiny app. This has three sections which we will cover next.\n\n\n\nThe starting template for a Shiny app.\n\n\n\n\n11.5.2 Basic anatomy of a shiny app\nThere are three parts of a shiny app: the ui, the server, and the shinyApp() function that calls the ui and server.\nThe ui (the user interface object) specifies the layout: where items are placed on the webpage.\nThe server function is the instructions for how shiny builds your app.\nThe shinyApp() function creates the actual app from the ui and server components\n\n\n\n\n\n\nWarning\n\n\n\n\nPast versions of shiny used multi-page layouts with the ui and server components saved as separate files, such as ui.R and server.R. While this is still supported, the rest of this chapter will assume that we’re using the single-page version in which the ui and server are together in one app.R file.\n\n\n\nThe UI and the server both have inputs and outputs that will be assigned and called. We’ll go over those in the Inputs & Widgets section later on.\n\n\n11.5.3 Set up a working directory\n\nFirst set up a new working directory. In RStudio you can do this in the Files tab (usually on right lower panel). 
You can navigate to where you want a directory and create new directories or modify existing ones.\n\nUnder More, select Set as working directory. You can also do this in the console with:\nsetwd(\"~/your_folder_name\")\n\nDownload the following file from GitHub using Save Page As... in your browser to save it to this directory. Then rename the file from FirstAppDemo.txt to FirstAppDemo.R.\nLet’s open up a shiny app file called FirstAppDemo.R. You can see there is the ui, the server, and the shinyApp() function.\n\n\n\n\nFigure 1: The basic components of a shiny app\n\n\nWe’ll be going through each of these items in more detail, but for now, you can see the basic structure.\nTo run the app, use the Run icon shown below. This will run in the RStudio viewer by default, but you can also run it in your browser with the Run external option checked.\n\n\n\nFigure 2: Image of the run icon. Note the pull-down menu arrow on the right.\n\n\nThat’s it to run a basic Shiny app!!! The shiny library comes with many examples you can explore as well as even more sophisticated examples on the Gallery page.\n\n\n\nFigure 3: Examples of shiny apps and functions. Image source: https://shiny.rstudio.com/tutorial/written-tutorial/lesson1/\n\n\nThat’s it! You’ve created a new app. It’s simple right now, but we’ll explore more options later for designing something that works for your needs. Let’s discuss publishing next so that you get a sense of the entire workflow.\n\n\n11.5.4 Publishing on shinyapps.io\nNow we can start publishing online for the world to experience your brilliant research! A free account at shinyapps.io provides space for 5 applications and 25 active hours of server time. (what are active hours?)\n\nCreate a shinyapps.io account. You will need your login info to set up your RStudio to publish directly to shinyapps.io. Click here to set up a shinyapps.io account. I recommend connecting via your GitHub account, if you have one. 
Otherwise, just start a new account.\nBack in RStudio, we need to connect to your shinyapps.io account. To the right of the Run icon, select the pull-down menu.\n\nSelect Connect… and then choose shinyapps.io\n\nFollow the instructions to find the token and paste it into the space provided.\n\nSo exciting! Now, you are ready to publish to shinyapps.io!\n\nCheck all the files you want uploaded to the shinyapps.io server. In our case with this demo, it should only be one file, FirstAppDemo.R.\n\n\nThat’s it! You’ve done it!! Now sit back and watch the cash and attention roll in! If you want to find your app on the shinyapps.io dashboard, you can log in here or below.\n\n\n\n\n\nAt this point, you know enough to open existing apps on the Shiny Gallery, copy and modify their code, and publish them on shinyapps.io. In chapter 11b, we’ll be going over many of the inputs and outputs with an exercise that will help you design apps based on your needs."
},
{
"objectID": "shinyapps.html#conclusion",
"href": "shinyapps.html#conclusion",
"title": "11 Shiny apps",
"section": "11.6 Conclusion",
"text": "11.6 Conclusion\nIn this chapter, we’ve covered what shiny is and how to get started. You can open existing Shiny apps, and start exploring their code, or adapting them to your own needs.\nThere is an extensive amount of tutorials and documentation on making apps with Shiny and you can find some of those in the References section below. Chapters in this course are a living document, so if you have suggestions for materials you found helpful, please send them along to us.\nThe next section of chapter 11 moves into a more detailed view of the inputs and outputs, layout options, and discusses reactivity in more depth."
},
{
"objectID": "shinyapps.html#references",
"href": "shinyapps.html#references",
"title": "11 Shiny apps",
"section": "11.7 References:",
"text": "11.7 References:\nI took inspiration and guidance from many sources which I’ve tried to include here:\nThe Bookdown library for R, at bookdown.org, provides a tutorial on how to build Shiny apps.\nMastering Shiny by Hadley Wickham https://mastering-shiny.org/index.html\nW3 color resources: https://www.w3schools.com/colors/default.asp"
},
{
"objectID": "working_with_strings.html#learning-objectives",
"href": "working_with_strings.html#learning-objectives",
"title": "4 Working with strings",
"section": "4.1 Learning objectives",
"text": "4.1 Learning objectives\n\nTransform strings\nCombine strings\nSplit strings\nSubset strings"
},
{
"objectID": "working_with_strings.html#dealing-with-messy-or-unstructured-data",
"href": "working_with_strings.html#dealing-with-messy-or-unstructured-data",
"title": "4 Working with strings",
"section": "4.2 Dealing with messy or unstructured data",
"text": "4.2 Dealing with messy or unstructured data\nIn the last chapter, we focused on importing data into tibbles and then reshaping them to fit the tidy data criteria. In most cases, we had data with some structure, which we transformed into a different structure. This week, we look at working with strings for three reasons: cleaning messy data, filtering rows based on part of string matches, and extracting data from text.\n\n4.2.1 Cleaning messy data\nSometimes, you may have data with a correct tidy structure, but the data itself is not clean and contains errors, unnecessary characters, or unwanted spelling or formatting variants. We need to clean that data before we can produce our analysis or report. Here is an example:\n\n\n\nfull_names (messy)\nfull_names (tidy)\n\n\n\n\nColin Conrad, PhD\nColin Conrad\n\n\nMACDONALD, Betrum\nBertrum MacDonald\n\n\nDr. Louise Spiteri\nLouise Spiteri\n\n\nMongeon, Philippe\nPhilippe Mongeon\n\n\njennifer grek-martin\nJennier Grek-Martin\n\n\n\n\n\n4.2.2 Filtering rows based on string matches\nIn the last chapter, we learned how to filter rows of a tibble based on the value contained in a cell or based on the row number. This week, we will add to our toolbox some string matching functions that check if a string of characters is found within a larger string of characters. One example could be retrieving a set of course codes starting with INFO or MGMT in a vector containing the course codes of all offerings of Dalhousie University.\n\n\n4.2.3 Extracting data from text\nSometimes, you may have to deal with unstructured data such as a long character string containing data elements we wish to extract. This string, for example:\n\nI am taking several courses offered at SIM this Winter. 
There is INFO6270 (Introduction to Data Science) and also INFO6540 and the information policy one, which I think has the course code INFO6610.\n\nMaybe you had the brilliant idea to use a free text field in a survey to collect information about the courses that students are taking this Winter, and you now have three thousand responses that look like this one. This unstructured data needs to be structured before it can be analyzed, and in this specific example, R can help! This kind of task can be relatively simple but can get quite complex. In this chapter, we will not do very complex data extractions from strings."
},
{
"objectID": "working_with_strings.html#the-stringr-package",
"href": "working_with_strings.html#the-stringr-package",
"title": "4 Working with strings",
"section": "4.3 The stringr package",
"text": "4.3 The stringr package\nThe stringr package (https://stringr.tidyverse.org) is part of the tidyverse and contains a collection of functions that perform all kinds of operations on strings. Let’s go through some of those tasks and some code examples.\n\n4.3.1 Transforming strings\n\n4.3.1.1 change string character case\nOne simple transformation you may want to perform on a string is changing its case. This is very easily done with the str_to_lower(), str_to_upper(), str_to_sentence(), and str_to_title() functions.\n\n\n\nStatement\nOutput\n\n\n\n\nstr_to_lower(“HeLlO WoRlD!”)\nhello world!\n\n\nstr_to_upper(“HeLlO WoRlD!”)\nHELLO WORLD!\n\n\nstr_to_sentence(“HeLlO WoRlD!”)\nHello world!\n\n\nstr_to_title(“HeLlO WoRlD!”)\nHello World!\n\n\n\n\n4.3.1.1.1 Vector example\n\n# I create a vector with character strings\nvector <- c(\"I like coding with R\",\"i like coding in R\",\"R IS AMAZING!\",\"I LoVe R\")\n\n# I convert them all to lowercase.\nstr_to_lower(vector)\n\n[1] \"i like coding with r\" \"i like coding in r\" \"r is amazing!\" \n[4] \"i love r\" \n\n\n\n\n4.3.1.1.2 Tibble example\n\n# I create a tibble with inconsistent strings\nt <- tibble(comments = c(\"I like coding with R\",\"i like coding in R\",\"R IS AMAZING!\",\"I LoVe R\"))\n\n# I use the mutate() and str_to_lower function to modify the messy column and make the strings consistent. \nt %>% \n mutate(comments = str_to_lower(comments))\n\n# A tibble: 4 × 1\n comments \n <chr> \n1 i like coding with r\n2 i like coding in r \n3 r is amazing! \n4 i love r \n\n\n\n\n\n4.3.1.2 Replacing parts of strings\nThe functions str_replace() and str_replace_all() modify strings by replacing a pattern with another. 
The difference between the two is that str_replace() will only replace the first instance of the pattern in the string, while str_replace_all() will replace all the instances.\n\n4.3.1.2.1 Vector example\n\n# I create a vector with two strings.\nnames <- c(\"dr Mike Smit\",\"dr Sandra Toze\")\n\n# I replace the first instance of the pattern \"dr\" with \"doctor\". \nnames %>% \n str_replace(\"dr\",\"doctor\")\n\n[1] \"doctor Mike Smit\" \"doctor Sandra Toze\"\n\n\nLet’s see what happens if I use the same example but use str_replace_all() instead of str_replace().\n\n# I create a vector with two strings.\nnames <- c(\"dr Mike Smit\",\"dr Sandra Toze\")\n\n# I replace ALL instances of the pattern \"dr\" with \"doctor\". \nnames %>% \n str_replace_all(\"dr\",\"doctor\")\n\n[1] \"doctor Mike Smit\" \"doctor Sandoctora Toze\"\n\n\nThe second string got messed up because the second “dr” pattern in Sandra also got replaced with the pattern “doctor”.\n\n\n\n4.3.1.3 Removing parts of strings\nThe str_remove() and str_remove_all() functions are the equivalent of str_replace(\"some pattern\", \"\") and str_replace_all(\"some pattern\", \"\"). They can make our code a little cleaner by not requiring that we specify that we want to replace a given pattern with nothing.\n\n# I create a vector with names\nnames <- c(\"dr Mike Smit\",\"dr Sandra Toze\")\n\n# I remove the first instance of the pattern \"dr\" from the names.\nnames %>% \n str_remove(\"dr\")\n\n[1] \" Mike Smit\" \" Sandra Toze\"\n\n\n\n4.3.1.3.1 Tibble example\n\n# I create a tibble with professor names.\nt <- tibble(names = c(\"dr Mike Smit\",\"dr Sandra Toze\"))\n\n# I remove all instances of the pattern \"dr\" in the names. 
\nt %>% \n mutate(names = str_remove_all(names, \"dr\"))\n\n# A tibble: 2 × 1\n names \n <chr> \n1 \" Mike Smit\"\n2 \" Sana Toze\"\n\n\nWe can see that again removing all the “dr” patterns from the strings caused a problem because the pattern is also found in the name “Sandra”.\n\n\n\n\n4.3.2 Removing extra spaces\nThe str_squish() function is a quick and easy way to remove unwanted spaces before or after a string, as well as consecutive spaces within a string.\n\nmessy_string <- \" My cat just stepped on the spacebar as I was writing this \"\n\n# Let's print the string to see what it looks like\nmessy_string\n\n[1] \" My cat just stepped on the spacebar as I was writing this \"\n\n# Let's squish it!\nstr_squish(messy_string)\n\n[1] \"My cat just stepped on the spacebar as I was writing this\"\n\n\nThe str_trim() function is similar to str_squish() but allows you to specify which extra spaces you wish to remove. However, it only handles spaces at the beginning or end of strings and cannot remove extra spaces in the middle of a string.\n\nstring <- \" hello world \"\n\n# remove spaces at the beginning\nstring %>% \n str_trim(\"left\")\n\n[1] \"hello world \"\n\n# remove spaces at the end\nstring %>% \n str_trim(\"right\")\n\n[1] \" hello world\"\n\n# remove spaces at the beginning and at the end\nstring %>% \n str_trim(\"both\")\n\n[1] \"hello world\"\n\n\n\n\n4.3.3 Combine strings\nWe already learned how to use the unite() function of the tidyr package to concatenate multiple data frame columns into one. However, the unite() function works only with data frames as input, which can be limiting. 
The stringr package offers a str_c() function that works with vectors, so it’s good to know how to use both functions.\n\n4.3.3.0.1 Vector example\n\n# I create a vector with first names\nfirst_names = c(\"Bertrum\", \"Colin\", \"Louise\")\n\n# I create a vector with last names\nlast_names = c(\"MacDonald\", \"Conrad\", \"Spiteri\")\n\n# I combined my vectors into a new vector with full names\nfull_names <- str_c(first_names, last_names, sep = \" \")\n\n# I print the vector\nprint(full_names)\n\n[1] \"Bertrum MacDonald\" \"Colin Conrad\" \"Louise Spiteri\" \n\n\nAnother advantage of the str_c() over the unite() function is that it is more flexible in terms of the strings that get concatenated. You could combine the content of two vectors and add any pattern you want to any string.\n\n# I create a tibble with two columns containing first and last names.\nmy_tibble = tibble(first_name = c(\"Bertrum\", \"Colin\", \"Louise\"),\n last_name = c(\"MacDonald\", \"Conrad\", \"Spiteri\"))\n\n# I add a column to my tibble with full_names\nmy_tibble %>% \n mutate(full_name = str_c(first_name, last_name, sep=\" \"))\n\n# A tibble: 3 × 3\n first_name last_name full_name \n <chr> <chr> <chr> \n1 Bertrum MacDonald Bertrum MacDonald\n2 Colin Conrad Colin Conrad \n3 Louise Spiteri Louise Spiteri \n\n# I add a column to my tibble with full_names and include the Dr. pattern at the beginning of the name.\nmy_tibble %>% \n mutate(full_name = str_c(\"Dr.\", first_name, last_name, sep=\" \"))\n\n# A tibble: 3 × 3\n first_name last_name full_name \n <chr> <chr> <chr> \n1 Bertrum MacDonald Dr. Bertrum MacDonald\n2 Colin Conrad Dr. Colin Conrad \n3 Louise Spiteri Dr. Louise Spiteri \n\n\n\n\n\n4.3.4 Splitting strings\nThe str_split() function does the same thing as the separate() function that we learned about in chapter 3. 
They have slightly different syntax and arguments, but the main difference between the two functions is that str_split() works with vectors and returns a list, while separate() works with data frames and returns a data frame. In other words, if you want to split a string contained in a data frame column, you need to use separate(), and if you want to split a character vector into a list of character vectors, you need to use str_split(). The basic syntax is str_split(character_vector, separator).\n\ncourses = c(\"INFO5500, INFO6540, INFO6270\",\n \"INFO5500\",\n \"INFO5530, INFO5520\")\n\n# str_split separates the vectors based on a specified delimiter.\n# the outcome is a list of three vectors with 3, 1 and 2 elements.\ncourses %>% \n str_split(\", \")\n\n[[1]]\n[1] \"INFO5500\" \"INFO6540\" \"INFO6270\"\n\n[[2]]\n[1] \"INFO5500\"\n\n[[3]]\n[1] \"INFO5530\" \"INFO5520\"\n\n\nWe can also use the n argument to specify the maximum number of pieces we want to split the string into.\n\n# Here I split the courses vector into a list of vectors that can have a maximum of 2 elements.\ncourses %>% \n str_split(\", \",n=2)\n\n[[1]]\n[1] \"INFO5500\" \"INFO6540, INFO6270\"\n\n[[2]]\n[1] \"INFO5500\"\n\n[[3]]\n[1] \"INFO5530\" \"INFO5520\"\n\n\nWe can also specify the exact number of pieces we want to split the string into with str_split_fixed(). This function does not return a list but a matrix.\n\n# I split the courses vector into a matrix with 4 columns.\ncourses %>% \n str_split_fixed(\", \",n=4)\n\n [,1] [,2] [,3] [,4]\n[1,] \"INFO5500\" \"INFO6540\" \"INFO6270\" \"\" \n[2,] \"INFO5500\" \"\" \"\" \"\" \n[3,] \"INFO5530\" \"INFO5520\" \"\" \"\" \n\n\n\n4.3.4.1 str_flatten\nThe str_flatten() function takes a character vector of length x and concatenates all the elements into a character vector of length 1 (a single string) with a specified separator between the elements. In a sense, it is the opposite of str_split(). 
Its basic syntax is str_flatten(vector, separator).\n\n4.3.4.1.1 Vector example\n\nx <- c(\"a\",\"b\",\"c\")\nstr_flatten(x,\"|\")\n\n[1] \"a|b|c\"\n\n\n\n\n4.3.4.1.2 Tibble example\nUsing str_flatten() in a tibble is tricky (we need to use the group_by() function that we briefly mentioned in the previous chapter but haven’t thoroughly explored yet) but also counterintuitive, since it likely means that we are taking a tibble in a tidy format and making it untidy.\n\n# Here is a tibble\nmy_tibble <- tibble(instructor = c(\"Mongeon, Philippe\", \"Mongeon, Philippe\", \"Mongeon, Philippe\",\"Spiteri, Louise\",\"Spiteri, Louise\"),\n course = c(\"INFO5500\",\"INFO6540\",\"INFO6270\",\"INFO6350\",\"INFO6480\"))\n\nprint(my_tibble)\n\n# A tibble: 5 × 2\n instructor course \n <chr> <chr> \n1 Mongeon, Philippe INFO5500\n2 Mongeon, Philippe INFO6540\n3 Mongeon, Philippe INFO6270\n4 Spiteri, Louise INFO6350\n5 Spiteri, Louise INFO6480\n\n\nNow I want to flatten my course column so that I have all the courses taught by the same instructor in a single row and separated with a “|”.\n\nmy_tibble %>% \n group_by(instructor) %>% \n mutate(course = str_flatten(course, \" | \")) %>% \n unique()\n\n# A tibble: 2 × 2\n# Groups: instructor [2]\n instructor course \n <chr> <chr> \n1 Mongeon, Philippe INFO5500 | INFO6540 | INFO6270\n2 Spiteri, Louise INFO6350 | INFO6480 \n\n\n\n\n\n\n\n\nImportant\n\n\n\nThe unique() function at the end of the previous code removes the duplicates that are typically created with the str_flatten() function. You can try it yourself and see what happens when you don’t include the unique() step at the end.\n\n\n\n\n\n\n4.3.5 Subsetting strings\n\n4.3.5.1 str_sub\nWe can retrieve, for example, the first three characters of a string (e.g., a postal code) with the str_sub() function. 
Its basic syntax is str_sub(string, start, end).\n\n4.3.5.1.1 Vector example\n\npostal_code <- \"B3H 4R2\"\n\n# get the first three characters of the postal code\npostal_code %>% \n str_sub(1,3)\n\n[1] \"B3H\"\n\n\nYou can also retrieve the last characters of the string using negative numbers. Let’s get the last three characters of the postal code.\n\npostal_code %>% \n str_sub(-3,-1)\n\n[1] \"4R2\"\n\n\n\n\n4.3.5.1.2 Tibble example\n\n# I create my tibble \nt <- tibble(postal_code = c(\"B3H 4R2\", \"B3H 7K7\"))\n\n# I print my tibble\nt\n\n# A tibble: 2 × 1\n postal_code\n <chr> \n1 B3H 4R2 \n2 B3H 7K7 \n\n# I add two new columns with the first three characters and the last three characters of the postal code. \nt <- t %>% \n mutate(first_three_digits = str_sub(postal_code, 1, 3),\n last_three_digits = str_sub(postal_code, -3, -1)) \n\n# I print my new tibble\nt\n\n# A tibble: 2 × 3\n postal_code first_three_digits last_three_digits\n <chr> <chr> <chr> \n1 B3H 4R2 B3H 4R2 \n2 B3H 7K7 B3H 7K7 \n\n\nNotice how I created two new columns with the same mutate()? You can mutate as many things as you want in a single mutate() function. You simply need to add a comma to separate each mutation.\n\n\n\n4.3.5.2 str_subset\nThe str_sub() function should not be confused with the str_subset() function, which returns the elements of a vector that contain a given string. Its basic syntax is str_subset(character_vector, string_to_find).\n\n# I create a vector with course codes\ncourse_codes <- c(\"INFO5500\", \"BUSI6500\", \"MGMT5000\", \"INFO6270\")\n\n# I print a vector of course codes that contain the pattern \"INFO\"\nstr_subset(course_codes, \"INFO\")\n\n\n\n\n\n\n\nCaution\n\n\n\nNote that you should not try to use the str_subset() function with a tibble. 
It is possible, but requires the combination of multiple functions, and it’s not something that you are likely to need to do anyway.\n\n\n\n\n\n4.3.6 Locating a pattern in a string\nThe str_locate() function allows you to find the position of a pattern in a string. This can be useful, for instance, in combination with str_sub() if you want to extract the part of a string that comes before or after the pattern. Let’s explore the str_locate() function with a few examples.\n\n4.3.6.0.1 Vector examples\n\n# I create a string with an email\nemail <- \"info@somewebsite.ca\"\n\n# I locate the @ character\nemail %>% \n str_locate(\"@\")\n\n start end\n[1,] 5 5\n\n\nYou can see that str_locate() returns a matrix with the beginning and the end of the “@” pattern in the email. If we want to get the part of the string that comes before the “@”, then we can do this:\n\n# I get the first part of the email\nstr_sub(email, 1,str_locate(email,\"@\")[,1]-1)\n\n[1] \"info\"\n\n\nWe did three things there:\n\nWe used 1 as the first argument of str_sub() to specify that we want to extract a subset of the email starting with the 1st character.\nWe used [,1] to obtain the first column in the matrix, which is where our pattern starts (the 5th position).\nWe subtracted 1 because we don’t want to print characters 1 to 5, which would be “info@”, but characters 1 to 4.\n\nSo our statement, in English, would read like this: “extract the subset of the email string that starts at the first position and ends one position before where the ‘@’ pattern is located”.\nWe can get the part that comes after the pattern “@” like this:\n\nemail %>% \n str_sub(str_locate(email,\"@\")[,2]+1,-1)\n\n[1] \"somewebsite.ca\"\n\n\nThis reads as “give me the subset of the email string that starts one position after the location of the ‘@’ pattern (str_locate(email,\"@\")[,2]+1), and ends with the last character of the string (-1)”. 
Note that the “,-1” part is optional since, by default, the str_sub() function will output the rest of the string when no end position is provided.\n\n\n4.3.6.0.2 Tibble example\nLet’s just repeat the same example but working with a tibble.\n\n# We create a tibble that contains some emails\nmy_tibble <- tibble(emails = c(\"info@somewebsite.ca\",\"support@datascienceisfun.com\"))\n\n# We print the tibble\nprint(my_tibble)\n\n# A tibble: 2 × 1\n emails \n <chr> \n1 info@somewebsite.ca \n2 support@datascienceisfun.com\n\n# We remove the part of the emails after the @\nmy_tibble %>% \n mutate(emails = str_sub(emails, 1, str_locate(emails,\"@\")[,1]-1))\n\n# A tibble: 2 × 1\n emails \n <chr> \n1 info \n2 support\n\n# We remove the part of the emails before the @\nmy_tibble %>% \n mutate(emails = str_sub(emails, str_locate(emails,\"@\")[,2]+1))\n\n# A tibble: 2 × 1\n emails \n <chr> \n1 somewebsite.ca \n2 datascienceisfun.com\n\n# Let's make this a bit more complex, and print only the part between the \"@\" and the \".\"\nmy_tibble %>% \n mutate(emails = str_sub(emails, # string to subset\n str_locate(emails,\"@\")[,2]+1, # starting position\n str_locate(emails,\"\\\\.\")[,1]-1)) # ending position\n\n# A tibble: 2 × 1\n emails \n <chr> \n1 somewebsite \n2 datascienceisfun\n\n\n\n\n4.3.7 Testing strings\nRather than extracting parts of strings, or modifying strings, you may just want to test whether a string contains a specific pattern and get a logical (TRUE, FALSE) in return.\n\n4.3.7.1 str_detect\nThe str_detect() function allows us to identify strings that contain a specific pattern. Its syntax is str_detect(character_vector, string_to_detect).\n\n4.3.7.1.1 Vector example\n\npostal_code = c(\"B3H 1H5\",\"B3H 382\",\"H2T 1H2\",\"J8P 9R2\")\nstr_detect(postal_code, \"B3H\")\n\n[1] TRUE TRUE FALSE FALSE\n\n\nThis can be useful if we want to filter a tibble based on pattern matches. 
Here’s an example where we have a list of postal codes and would like to keep only those that are in Halifax.\n\n\n4.3.7.1.2 Tibble example\n\n# I create a tibble with postal codes\nmy_tibble <- tibble(postal_code = c(\"B3H 1H5\",\"B3H 382\",\"H2T 1H2\",\"J8P 9R2\"))\n\n# I print the rows for which the postal code contains the pattern \"B3H\"\nmy_tibble %>% \n filter(str_detect(postal_code,\"B3H\"))\n\n# A tibble: 2 × 1\n postal_code\n <chr> \n1 B3H 1H5 \n2 B3H 382 \n\n\n\n\n\n4.3.7.2 str_starts and str_ends\nThe str_starts() and str_ends() functions do the same thing as str_detect(), but look for the pattern specifically at the beginning or the end of the strings.\n\n# I create a tibble with postal codes\nt <- tibble(postal_code = c(\"B3H 1H5\",\"B3H 382\",\"H2T 1H2\",\"J8P 9R2\"))\n\n# I print the postal codes that begin with \"B3H\"\nt %>% \n filter(str_starts(postal_code, \"B3H\"))\n\n# A tibble: 2 × 1\n postal_code\n <chr> \n1 B3H 1H5 \n2 B3H 382 \n\n# I print the postal codes that end with \"1H2\"\nt %>% \n filter(str_ends(postal_code, \"1H2\"))\n\n# A tibble: 1 × 1\n postal_code\n <chr> \n1 H2T 1H2 \n\n\n\n\n\n4.3.8 Regular expressions (regex)\nRegular expressions are a powerful way to search for patterns in text. A full understanding of regex is far beyond the scope of this course, but you should at least be aware of them. Below is a very superficial introduction to regular expressions. The cheat sheet for the stringr (https://github.com/rstudio/cheatsheets/blob/main/strings.pdf) package is a great place to look for guidance on using regular expressions (as well as all other functions in the stringr package, several of which I didn’t mention in this chapter but might still be useful). 
It shows a list of the basic character classes, and all the operators that you can use to search for patterns in strings, so remember that it’s there to help you.\n\n4.3.8.1 Literal expressions\nIn the code examples above, we used several functions of the stringr package to search for patterns in strings (e.g., searching for the pattern “INFO” in a vector of strings.). “INFO” is a literal expression. We can also search for more than one pattern combined with the Boolean operator OR (represented by “|” in a search pattern).\n\n\n4.3.8.2 Character classes\nCharacter classes allow you to search for a range of characters or types of patterns using character classes (e.g., numbers, punctuation, symbols, letters, or a user specified set or range of characters). These classes are represented by square brackets “[ ]”.\n\n4.3.8.2.1 Example: remove unwanted characters from strings\nYou can use regular expressions to filter out of a string all the non-alphanumeric characters like this:\n\nmessy_string <- \" what-is%going*on/with!my(keyboard)\"\n\nmessy_string %>% \n str_replace_all(\"[^[:alnum:]]\",\" \") \n\n[1] \" what is going on with my keyboard \"\n\n\nWhy does this work? 
Because:\n\n[:alnum:] is a character class containing all characters that are alphabetical or numerical (letters and numbers).\n^ at the start of a character class means everything but.\n\nSo the statement reads: replace everything but alphanumeric characters with a space.\n\n\n4.3.8.2.2 Example: find sequences of characters belonging to specific classes\nWe can search for specific sequences of character classes, which can be useful to retrieve things like postal codes from a string.\n\n# We create a vector with an address\naddress <- c(\"5058 King St, Halifax, NS H2T 1J2\",\"427 Queen Avenue, Halifax, NS, B3H1H4\") \n\n# We extract the postal code from the address \naddress %>% \n str_extract(\"[:alpha:][:digit:][:alpha:] ?[:digit:][:alpha:][:digit:]\")\n\n[1] \"H2T 1J2\" \"B3H1H4\" \n\n\nThe pattern [:alpha:][:digit:][:alpha:] reads as “any letter, followed by any number, followed by any letter”. The [:digit:][:alpha:][:digit:] pattern reads as “any number, followed by any letter, followed by any number”.\nYou might have noticed that there is a space and a question mark between my two sets of three character classes. This reads as 0 or 1 space (see the quantifiers section in the stringr cheatsheet). This allows the query to extract postal codes that are written with no space between the two sets of three characters.\n\n\n4.3.8.2.3 Example: search for spelling variants\nAnother convenient way of using character classes is when you want to match a word in a text that is or isn’t capitalized. 
Here’s an example.\n\n# We create a tibble with 3 strings\nmy_tibble <- tibble(text = c(\"Information management is great\", \"I love information management\", \"Wayne Gretzky was the best hockey player of all time\"))\n\n# We print the tibble\nprint(my_tibble)\n\n# A tibble: 3 × 1\n text \n <chr> \n1 Information management is great \n2 I love information management \n3 Wayne Gretzky was the best hockey player of all time\n\n# We select the texts that contain \"information management\" or \"Information management\".\nmy_tibble %>% \n filter(str_detect(text, \"[Ii]nformation management\"))\n\n# A tibble: 2 × 1\n text \n <chr> \n1 Information management is great\n2 I love information management \n\n\n\n\n4.3.8.2.4 Example: combining multiple search terms with “|” (boolean OR)\nInstead of using character classes, we could combine multiple search terms with the “|” that represents the Boolean operator OR.\n\nmy_tibble %>% \n filter(str_detect(text, \"information management|Information management\"))\n\n# A tibble: 2 × 1\n text \n <chr> \n1 Information management is great\n2 I love information management \n\n\nThis works, but even with just two variants, you can already tell that it makes for longer statements.\n\n\n4.3.8.2.5 Example: searching for a range of characters\n\n# I create a tibble containing letters from a to g\nmy_tibble <- tibble(letters = c(\"a\",\"b\",\"c\",\"d\",\"e\",\"f\",\"g\"))\n\n# I retrieve rows that contain letters from a to f\nmy_tibble %>% \n filter(str_detect(letters,\"[a-f]\"))\n\n# A tibble: 6 × 1\n letters\n <chr> \n1 a \n2 b \n3 c \n4 d \n5 e \n6 f \n\n\nAgain, we could have used “a|b|c|d|e|f” but this is less efficient. 
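Ranges are not limited to letters; digit ranges work the same way. Here is a minimal sketch with a made-up vector (ids is hypothetical, just for illustration):

```r
library(stringr)

ids <- c("room 2", "room 7", "room 9")

# TRUE only where the string contains a digit between 0 and 5
str_detect(ids, "[0-5]")
#> [1]  TRUE FALSE FALSE
```
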
Here’s a similar example where we have lowercase and uppercase letters.\n\n# I create a tibble containing letters from a to g in lowercase and uppercase.\nmy_tibble <- tibble(letters = c(\"a\",\"b\",\"c\",\"d\",\"e\",\"f\",\"g\",\n \"A\",\"B\",\"C\",\"D\",\"E\",\"F\",\"G\"))\n\n# I retrieve rows that contain the letters a to d in lowercase or uppercase\nmy_tibble %>% \n filter(str_detect(letters, \"[a-dA-D]\"))\n\n# A tibble: 8 × 1\n letters\n <chr> \n1 a \n2 b \n3 c \n4 d \n5 A \n6 B \n7 C \n8 D \n\n\n\n\n\n4.3.8.3 Beware of the dot, it’s a wild card\nWhen matching character patterns, the “.” means any character.\n\nstring <- \"This is a string\" \n\n# I extract every character\nstr_extract_all(string, \".\")[[1]]\n\n [1] \"T\" \"h\" \"i\" \"s\" \" \" \"i\" \"s\" \" \" \"a\" \" \" \"s\" \"t\" \"r\" \"i\" \"n\" \"g\"\n\n# I replace every character with a space\nstring %>% \n str_replace_all(\".\",\" \")\n\n[1] \" \"\n\n\n\n\n\n4.3.9 Dealing with special characters in strings\nHere are some of the characters that you might come across when working with strings in R. When you want to insert these characters in a string, you need to precede them with the escape character “\\”. 
Here is a table adapted from the stringr cheatsheet.\n\n\n\nString\nRepresents\nHow to search in a pattern\n\n\n\n\n\\.\n.\n\\\\.\n\n\n\\!\n!\n\\\\!\n\n\n\\?\n?\n\\\\?\n\n\n\\(\n(\n\\\\(\n\n\n\\)\n)\n\\\\)\n\n\n\\{\n{\n\\\\{\n\n\n\\}\n}\n\\\\}\n\n\n\\n\nnewline\n\\\\n\n\n\n\\t\ntab\n\\\\t\n\n\n\\\\\nbackslash \\\n\\\\\\\\\n\n\n\\’\napostrophe ’\n\\\\’\n\n\n\\”\nquotation mark ”\n\\\\”\n\n\n\\`\nbacktick `\n\\\\`\n\n\n\nHere are just a few examples so you can see how R deals with these special characters.\n\nstring <- \"Dear diary\\nWhat is wrong with me\\nMy code never works as I intend\"\n\n# If we just print the string, we see it exactly as written.\nprint(string)\n\n[1] \"Dear diary\\nWhat is wrong with me\\nMy code never works as I intend\"\n\n\nThe writeLines() function can be used to print the string where escaped characters are interpreted.\n\nwriteLines(string)\n\nDear diary\nWhat is wrong with me\nMy code never works as I intend\n\n\nLet’s read a text file (.txt) in R and see what happens.\n\nurl <- \"https://pmongeon.github.io/info6270/files/boring_story.txt\"\n\n# reads the file and produces a vector with one element for each line\nread_lines(url)\n\n[1] \"This is a \\\"story\\\" that I wrote just for the INFO6270 course.\" \n[2] \"It's a bit of a boring story, but it's just an example. So please forgive me.\"\n[3] \"...and they were happy ever after.\\tThe end.\" \n\n# reads the file and produces a vector with a single element containing the entire content\nread_file(url) \n\n[1] \"This is a \\\"story\\\" that I wrote just for the INFO6270 course.\\nIt's a bit of a boring story, but it's just an example. So please forgive me.\\n...and they were happy ever after.\\tThe end.\"\n\n# Let's read the whole file and print it with writeLines()\nread_file(url) %>% \n writeLines()\n\nThis is a \"story\" that I wrote just for the INFO6270 course.\nIt's a bit of a boring story, but it's just an example. So please forgive me.\n...and they were happy ever after. 
The end.\n\n\n\n\n4.3.10 Summary\nThis chapter introduced you to the stringr package and the general principles of manipulating and matching character patterns in R. The goal was to give you enough of the basics so that you can fix small issues with strings in the data that you might encounter in this course, and in your professional or personal lives."
},
{
"objectID": "shinyapps_part2.html#learning-objecives",
"href": "shinyapps_part2.html#learning-objecives",
"title": "12 ShinyApps",
"section": "12.1 Learning objecives",
"text": "12.1 Learning objecives\nIn this chapter, we’ll be looking in more detail at the input and output functions using examples. This chapter introduces you to:\n\ninput functions for text, numbers, dates, choices, file uploads and actions,\noutput functions for text, tables, plots, images, and download formats,\nan introduction to themes to change the appearance of your application, and\nan explanation of reactivity and how its used.\n\nWe’ll be using some of the dataset standards included in R, such as mtcars and iris, as example datasets. Feel free to explore the datasets on your own, or even replace with your own data as you play with these inputs.\n\n\n\n\n\n\nTip\n\n\n\nYou can see all the datasets included in R by typing data() in the console.\nYou may also find the RStudio and Shiny cheatsheets helpful."
},
{
"objectID": "shinyapps_part2.html#inputs-widgets",
"href": "shinyapps_part2.html#inputs-widgets",
"title": "12 ShinyApps",
"section": "12.2 Inputs & Widgets",
"text": "12.2 Inputs & Widgets\nInputs are the ways users can enter, filter, or select information within your app. Widgets are a type of input that requires a different mode of input than text, such as a slider or button. There is a basic format to the inputs & widgets. First there is the input type, in this case, a textAreaInput(). In the first position is the inputID parameter, followed by the label parameter. These two will be consistent with all inputs. After the inputID and the label, arguments that may be unique to each follow. In this case, there is an argument for the number of rows. In general, all inputs, (including widgets) keep this order of inputID, label, arguments.\ntextAreaInput(\"story\", \"Tell me about yourself\", rows = 3)\nWhile we will address a few of these, you can find all of them and their code on the shiny gallery.\nThe syntax of the widget is the same as other inputs. Let’s look at the code below:\n\nThe inputID has some rules, just like any variable in R. It must be:\n\nFirst, a string consisting of letters, numbers, and/or underscores. Other characters like spaces, symbols, dashes, periods, etc., won’t work.\nSecond, it must be unique as you will call this in output functions.\n\nYou can find all of these in the input section in the shiny references documentation here. The following selection are some of the more commonly used ones to get you started. Inputs and outputs will be shown in context of the ui or server component. You can build an app as you go along. You may want to start with the basic structure first and try different inputs as your read through the chapter.\nlibrary(shiny)\n\nui <- fluidPage(\n\n)#close fluidPage\n\nserver <- function(input, output) {\n\n} #close server function\n\nshinyApp(ui, server)\n\n\n\n\n\n\nTip\n\n\n\nIf you ever want to know more about a function, you can always use the help section in RStudio or use the console to place a ? 
before the function, such as ?fluidPage.\n\n\n\n12.2.1 Text input\n\n12.2.1.0.1 textInput()\nThis is for small amounts of text, like asking for someone’s name, address, or what type of donut they like. You can find the documentation on this input here. To format text, for size, emphasis, color, etc, please see the Formatting text section at the end of the Outputs section.\nui <- fluidPage(\n textInput(\"input_1\", \"What's your favorite donut?\"),\n )#close fluidPage\n\n\n12.2.1.0.2 passwordInput()\nThis is for entering passwords. You can find more info here on its arguments.\nui <- fluidPage(\n textInput(\"input_1\", \"What's your favorite donut?\"),\n passwordInput(\"pword_1\", \"If a donut was your password, what would it be?\")\n)#close fluidPage\n\n\n12.2.1.0.3 textAreaInput()\nThis one is better for longer sections of text, like bio’s for websites, brief passages, comments, special instructions, etc. You can find more info here on its arguments.\nui <- fluidPage(\n textInput(\"input_1\", \"What's your favorite donut?\"),\n passwordInput(\"pword_1\", \"If a donut was your password, what would it be?\"),\n textAreaInput(\"bio\", \"Please describe yourself as a donut\", rows = 3)\n \n)#close fluidPage\nSo, let’s see these inputs as a complete application.\nlibrary(shiny)\n\nui <- fluidPage(\n textInput(\"input_1\", \"What's your favorite donut?\"),\n passwordInput(\"pword_1\", \"If a donut was your password, what would it be?\"),\n textAreaInput(\"bio\", \"Please describe yourself as a donut\", rows = 3)\n)#close fluidPage\n\nserver <- function(input, output){\n}\n\n# Run the application \nshinyApp(ui = ui, server = server)\n\n\n\n12.2.2 Number inputs\nHere are three inputs for numbers. 
You can find documentation here for the arguments.\n\n12.2.2.0.1 numericInput()\nui <- fluidPage(\n numericInput(\"num_1\", \"Enter the quantity of donuts\", value = 0, min = 0, max = 12)\n)#close fluidPage\n\n\n12.2.2.0.2 sliderInput()\nSlider inputs can be used to select a single number or specify a range. Note the list argument passed in the second sliderInput() function named num_3. Documentation is here.\nui <- fluidPage(\n sliderInput(\"num_2\", \"Enter the maximum number you can eat in one go\", value = 6, min = 0, max = 12),\n sliderInput(\"num_3\", \"Enter the range of donuts you have been known to eat\", value=c(3,9), min=0, max=12 )\n)#close fluidPage\n\n\n12.2.2.0.3 dateInput() and dateRangeInput()\nFor single date entry, use the dateInput() function. For a range of dates, use the dateRangeInput(). Easy, right? There are format options for date inputs, such as format, language, and value which defines the starting date. The default starting date is today’s date on your system. You can use the help section to find out more or the documentation here for dateInput() and here for dateRangeInput().\nui <- fluidPage(\n dateInput(\"order_1\", \"What date do you want to order donuts?\"),\n dateRangeInput(\"delivery_1\",\"Between what dates do you want the donuts delivered?\")\n\n)#close fluidPage\n\n\n\n12.2.3 Choices from a list\n\n12.2.3.0.1 selectInput()\nThis provides a drop down list based on a list. In the following example, the list has been defined first, but this list could also be passed within the selectInput() function.\nflavors <- c(\"chocolate\", \"plain\", \"raspberry\", \"maple\", \"unicorn\", \"creme-filled\", \"sprinkles\", \"chef's choice\")\n\nui <- fluidPage(\n selectInput(\"flavor_1\", \"What flavor of donut would you like?\", flavors, multiple=TRUE)\n\n)#close fluidPage\n\n\n12.2.3.0.2 radioButtons()\nRadio buttons provide a specified list of options that can be chosen. 
It is possible to change the text to other display types like images, icons, or HTML using the choiceNames and choiceValues arguments.\nflavors <- c(\"chocolate\", \"plain\", \"raspberry\", \"maple\", \"unicorn\", \"creme-filled\", \"sprinkles\", \"chef's choice\")\n\nui <- fluidPage(\n radioButtons(\"flavor_button\", \"What is your second favorite flavor?\", flavors)\n\n)#close fluidPage\n\n\n12.2.3.0.3 checkboxInput() and checkboxGroupInput()\nAn alternative to radio buttons is check boxes which can be used for lists, surveys, or yes/no decisions. For checkboxInput(), the value argument is a boolean TRUE or FALSE that determines if it’s automatically checked. checkboxGroupInput() also lets you select multiple choices, which the radio button does not.\nflavors <- c(\"chocolate\", \"plain\", \"raspberry\", \"maple\", \"unicorn\", \"creme-filled\", \"sprinkles\", \"chef's choice\")\n\nui <- fluidPage(\n checkboxInput(\"choice_1\", \"Eat here\", value=TRUE),\n checkboxInput(\"choice_2\", \"Take home\"),\n checkboxGroupInput(\"multiple_choice\", \"What flavors would you like?\", flavors, selected = NULL)\n\n)#close fluidPage\n\n\n\n\n\n\nTip\n\n\n\nBy now, you should be using the ?function syntax (such as ?checkboxGroupInput) in the console or searching for the function in the help page (usually on the right in RStudio). Or use cheatsheets such as this one for shiny. This is a normal part of workflow and will help you add arguments to control input behaviours.\n\n\n\n\n\n12.2.4 Action buttons\nActions are usually used with the observeEvent() or eventReactive() functions that trigger a server-side function. However, action buttons can also be used for simple tasks without reactivity.\n\n\n\n\n\n\nNote\n\n\n\nClient side? Server side? What does this mean?\nThe client-server relationship is the basic framework for how the internet works. Your laptop (or tablet, desktop, or phone) is considered a client and requests information from a server when you ‘go to a webpage’. 
This is also loosely termed front-end (client) and back-end (server). In our case, when you go to a webpage that is run by Shiny, your device (the client) requests information from the server, and the webpage loads once. If you refresh, it loads again. These computations happen on the server. Servers are faster, routinely maintained, and always on, so running computations on the server can be desirable. We’ll learn more about this in the Reactivity section.\n\n\n\n12.2.4.0.1 actionButton()\nThere are different button types already formatted for you. These include btn-primary, btn-success, btn-info, btn-warning, or btn-danger. You can modify these with sizes, such as btn-lg, btn-sm, btn-xs.\nui <- fluidPage(\n fluidRow(\n actionButton(\"btn_1\", \"Place your order\", class = \"btn-primary\"),\n actionButton(\"btn_2\", \"Reset your order\", class = \"btn-warning\"),\n actionButton(\"btn_3\", \"Preview your order\", class = \"btn-info\"),\n actionButton(\"btn_4\", \"Pay for your order\", class = \"btn-warning\")\n ),#close fluidRow\n fluidRow(\n actionButton(\"btn_5\", \"I can't eat any more donuts\", class = \"btn-block\")\n )#close fluidRow\n\n)#close fluidPage\nYou can also pass icons from the FontAwesome library to buttons and connect to web links. Shiny currently only supports the v4 FontAwesome library, but if you want to use later versions (it’s currently up to 6.3), then you can try the fontawesome R library.\nui <- fluidPage(\n actionButton(inputId='link_1', label=\"See our location\", \n icon = icon(\"heart\"), \n onclick =\"window.open('https://goo.gl/maps/PrUi3qKEc3WFg9Hz7', '_blank')\")\n)#close fluidPage\nThere are many more inputs available on the Shiny reference documentation; these common ones will get you started making apps. OK! Enough with the donuts! Next, let’s go over types of outputs and how these work with inputs using a standard R dataset."
},
{
"objectID": "shinyapps_part2.html#outputs",
"href": "shinyapps_part2.html#outputs",
"title": "12 ShinyApps",
"section": "12.3 Outputs",
"text": "12.3 Outputs\nOutputs are paired functions with one assigned in the ui to define spaces where outputs will be seen and the other in the server function to render the result. They include a unique ID in the first position of its arguments.\nOutput ID’s are called from the server side preceded by output$outputID in which outputID is the ID (like a variable name) you’ve assigned it. You’ll see these are always calling a render* function, such as renderText(). As an example:\n\nSome important new fuctions are called. In the server() function, you see renderText(). This calls the values you assigned in input_1 and places it where you assigned it in output_1. In addition to text, you can also render tables, data tables, plots, images, and text. *Output() and render*() functions work together, with *Output() in the ui to show where output will be displayed, and render*() in the server to produce the desired information.\n\n12.3.1 Text output\n\n12.3.1.0.1 textOutput() & renderText()\nThis outputs regular text. You can see the placeholder textOutput() in the ui, and the renderText() function passed in the server. Here, we use the input of input_1 as the output for text.\nlibrary(shiny)\n\nui <- fluidPage(\n textInput(\"input_1\", \"What's your favorite donut?\"),\n h4(\"Your favorite donut is: \"),\n textOutput(\"text\")\n \n)#close fluidPage\n\nserver <- function(input, output) {\n output$text <- renderText({input$input_1})\n \n}#close server\n\nshinyApp(ui, server)\n\n\n12.3.1.0.2 verbatimTextOutput() & renderPrint()\nThis creates a console like output in the application. Lets add the dataset summary. renderPrint() prints the results of expressions, where renderText() prints text together in a string.\nui <- fluidPage(\n textOutput(\"text\"),\n verbatimTextOutput(\"code\")\n \n)#close fluidPage\n\nserver <- function(input, output) {\n output$text <- renderText({\n \"Hello. 
The following is a summary of a standard R dataset, mtcars\"\n })\n output$code <- renderPrint({\n summary(mtcars)\n })\n}#close server\n\n\n\n12.3.2 Tables\nOh, tables! Tables are powerful. There are two types of tables, static tables of data, and dynamic tables that are interactive.\n\n12.3.2.0.1 tableOutput() & renderTable()\nStatic tables are great for summaries or concise results. They’re good at preserving data just the way you made it.\nui <- fluidPage(\n tableOutput(\"static\")\n)#close fluidPage\n\nserver <- function(input, output) {\n output$static <- renderTable(head(mtcars))\n}#close server\n\n\n12.3.2.0.2 dataTableOutput() & renderDataTable()\nDataTables are much more dynamic and can be customized in numerous ways.\nui <- fluidPage(\n dataTableOutput(\"dynamic\")\n\n)#close fluidPage\n\nserver <- function(input, output) {\n output$dynamic <- renderDataTable(mtcars, options=list(pageLength = 6)\n )#close renderDataTable\n}#close server\nDataTables are more appropriate for larger dataframes where someone may need to explore, filter, and sort data. You can find more information on modifying DataTables here.\n\n\n\n\n\n\nImportant\n\n\n\nDataTables refers to both functions in Shiny and from the DT library. Unless the library(DT) is called, references to dataTables are for the version in the Shiny library. While dataTables in the Shiny library provides only server-side tables, the DT package can provide both server-side and client-side tables. This may be an important consideration if you are trying to reduce how many working hours your application runs on shinyapps.io, for example.\n\n\n\n\n\n12.3.3 DataTables\nIn this section, we are going to explore importing data and displaying it in a DataTable using the DT library. This is different from the dataTable functions included in the Shiny library. The DT library makes a lovely table that can be searched, filtered, and sorted which is great for data exploration. 
You can find more info here and here for the DT documentation.\nlibrary(shiny)\nlibrary(DT)\n\nui <- fluidPage(\n h2(\"Some data about flowers\"),\n DT::dataTableOutput(\"table_1\")\n\n)#close fluidPage\n\nserver <- function(input, output) {\n output$table_1 = DT::renderDataTable({\n iris\n })#close output\n\n}#close server\n\nshinyApp(ui, server)\n\n\n\n\n\n\nImportant\n\n\n\nYou can see in the code above that the DT library has been called. It’s also good practice to call directly from the library for functions that are present in more than one library. In this case, renderDataTable() exists in both Shiny and DT. To make it more human readable and prevent conflicts, it’s advised to call the function directly from the library as in the following:\nDT::renderDataTable()\n\n\n\n\n12.3.4 Computation output\nBut what if we wanted to perform some operations on the data, such as the summarizing functions you used in earlier chapters on the mtcars dataset? In the code below, we first import some data from a remote location, then create a DataTable for exploration. Then we summarize the data.\nlibrary(shiny)\nlibrary(DT)\nlibrary(tidyverse)\n\n\npath <- \"https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv\"\n\n# This will read the csv file\ncomics_data <- read_csv(path)\n\nui <- fluidPage(\n sidebarLayout(\n sidebarPanel(\n h2(\"How to use DataTables from the DT library\"),\n br(),\n p(\"To the right is a dataset displayed as a DataTable\"), \n p(\"This dataset is from the fivethirtyeight GitHub repository.\")\n ),#close sidebarPanel\n mainPanel(\n h2(\"the dataset\"),\n br(),\n \n DT::dataTableOutput(\"table_1\"), \n \n \n )#close mainPanel\n )#close sidebarLayout\n)#close fluidPage\n\nserver <- function(input, output) {\n data_to_display <- comics_data\n\n output$table_1 <- renderDataTable({\n (data_to_display)\n })\n \n}#close server\n\nshinyApp(ui, server)\nFirst, we can drop urlslug and page_id. 
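As an aside, instead of listing every column we want to keep, select() can also drop columns by name using minus notation. Here is a minimal sketch on a toy tibble (df is hypothetical, just for illustration):

```r
library(dplyr)
library(tibble)

df <- tibble(page_id = 1:2, urlslug = c("a", "b"), name = c("x", "y"))

# Drop columns by prefixing their names with a minus sign; only name remains
df %>% select(-page_id, -urlslug)
```
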
We also want to correct the column names.\nlibrary(shiny)\nlibrary(DT)\nlibrary(tidyverse)\n\n\npath <- \"https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv\"\n\n# This will read the csv file\ncomics_data <- read_csv(path)\n\ncomics_data <- select(comics_data, \"name\", \"ID\", \"ALIGN\", \"EYE\", \n \"HAIR\", \"SEX\", \"GSM\", \"ALIVE\", \"APPEARANCES\",\n \"FIRST APPEARANCE\", \"Year\" ) %>% \n rename(Name = name,\n Alignment=ALIGN,\n Eye = EYE, \n Hair = HAIR,\n Gender = SEX, \n Gender_or_sexual_identity = GSM,\n Status = ALIVE, \n Appearances = APPEARANCES,\n First_appearance = 'FIRST APPEARANCE')\n\nui <- fluidPage(\n sidebarLayout(\n sidebarPanel(\n h2(\"How to use DataTables from the DT library\"),\n br(),\n p(\"To the right is a dataset displayed as a DataTable\"), \n p(\"This dataset is from the fivethirtyeight GitHub repository.\")\n ),#close sidebarPanel\n mainPanel(\n h2(\"the dataset\"),\n br(),\n DT::dataTableOutput(\"table_1\")#close dataTableOutput\n \n )#close mainPanel\n )#close sidebarLayout\n)#close fluidPage\n\nserver <- function(input, output) {\n data_to_display <- comics_data\n\n output$table_1 <- renderDataTable(\n data_to_display,\n options = list(\n scrollX = TRUE, \n scrollY = TRUE,\n autoWidth = TRUE,\n rownames = FALSE)\n ) #close renderDataTable\n}#close server\n\nshinyApp(ui, server)\nYou can see from this example that there are new options added to the renderDataTable() function. We also modified our data before the ui. You can find an excellent explanation of options as well as beautiful integration with the formattable library from this blog.\n\n\n\n\n\n\n\n12.3.5 Plots\nPlots are graphs and charts from packages like ggplot2, plotly, r2d3, and many others. Mastering Shiny has an excellent chapter dedicated to ggplot2 for more info on making ggplot2 interactive. 
For now, we’re going to stick with some simple ones to explain the basic plot output functions.\n\n12.3.5.0.1 plotOutput() & renderPlot()\nThese two work together to generate an R graphic, usually based on ggplot2 or a similar graphics library.\nlibrary(shiny)\nlibrary(ggplot2)\n\nui <- fluidPage(\n plotOutput(\"plot\", width = \"400px\")\n)#close fluidPage\nserver <- function(input, output) {\n output$plot <- renderPlot(plot(1:10), res = 96)\n}#close server\n\nshinyApp(ui, server)\n\n\n\nA rather boring plot.\n\n\nLet’s look at something more interesting. plotOutput() supports interactive mouse inputs, such as click, dblclick, hover, and brush (rectangular select). This time we’re looking at the iris dataset.\n\nlibrary(shiny)\nlibrary(ggplot2)\n\nui <- fluidPage(\n plotOutput(\"plot\", click = \"plot_click\"),\n tableOutput(\"data\")\n)\nserver <- function(input, output) {\n output$plot <- renderPlot({\n ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) + \n geom_point()})\n \n output$data <- renderTable({\n nearPoints(iris, input$plot_click)\n })\n}\nshinyApp(ui, server) \nBelow is just a placeholder image. Try running the code above yourself to see how the mouse click function works.\n\nMore plot types can be found on the R Graph Gallery with helpful example code to get you started. That site focuses on ggplot2 and the tidyverse, so you have already learned what you need to make and modify these!\n\n\n\n\n\n\n\n\n12.3.6 Images\nYou can place images or logos in your app! Of course, you can also use fluidRow() and column arguments to organize your space with text and images. In this case, we are pointing to an image located at a website.\nlibrary(\"shiny\")\nui <- fluidPage(\n mainPanel(\n img(src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Glazed-Donut.jpg/800px-Glazed-Donut.jpg\", align = \"center\")\n )\n)#close fluidPage\n\nserver <- function(input, output) {\n \n}#close server\nshinyApp(ui, server)\nYou can also source your image from a directory. 
Your files should be saved in a folder called ‘www’ inside your working (or root) directory.\n shinyApp/\n app.R\n www/\n glazedDonut.jpg\n sprinkleDonut.jpg\n cakeDonut.jpg\nA simple way to call a locally saved image, such as a logo or banner image, might be this:\nlibrary(shiny)\n\nui <- fluidPage(\n \n img(src=\"glazedDonut.jpg\", align = \"right\")\n\n)\n\nserver <- function(input, output) {}\n\nshinyApp(ui, server)\nHowever, if we wanted to call those locally saved images from a choice list, our code would need to look like this:\nlibrary(shiny)\n\ndonuts <- tibble::tribble(\n ~type, ~ id, ~donut, \n \"glazed\", \"glazedDonut\",\"GlazedDonut\",\n \"sprinkles\", \"sprinkleDonut\", \"SprinkleDonut\",\n \"cake\", \"cakeDonut\", \"CakeDonut\"\n)\n\nui <- fluidPage(\n selectInput(\"id\", \"Pick a donut\", choices = setNames(donuts$id, donuts$type)),\n imageOutput(\"photo\")\n)\nserver <- function(input, output, session) {\n output$photo <- renderImage({\n list(\n src = file.path(\"www\", paste0(input$id, \".jpg\")),\n contentType = \"image/jpeg\",\n width = 800,\n height = 650\n )#close list\n }, deleteFile = FALSE)\n}#close function\n \nshinyApp(ui, server)\n\n\n\n\n\n\nTip\n\n\n\nPlacing many logos across a page can be very tedious and difficult to control. For example, including all the logos of universities involved in a research project. It may be easier to combine them all in one image file using an image editor, such as Photoshop or GIMP. Then place that one image on your page.\n\n\n\n\n\n12.3.7 File uploads\n\n12.3.7.0.1 fileInput()\nFile uploads and downloads are more complicated types of inputs and outputs. There is a special chapter dedicated to them on the Mastering Shiny web book here. Loading data, usually in the form of a csv, is a very common need. The following code lets the user choose and upload a csv file. 
The code also makes the upload available to the server function, which reads it into a table.\nlibrary(shiny)\n\nui <- fluidPage(\n sidebarLayout(\n sidebarPanel(\n fileInput(\"file1\", \"Choose a CSV format File\", accept = \".csv\"),\n checkboxInput(\"header\", \"Header\", TRUE)\n ),\n mainPanel(\n tableOutput(\"contents\")\n ) #close mainPanel\n )#close sidebarLayout\n )#close fluidPage\n \nserver <- function(input, output) {\n output$contents <- renderTable({\n file <- input$file1\n ext <- tools::file_ext(file$datapath)\n \n req(file)\n validate(need(ext == \"csv\", \"Please upload a csv format file\"))\n \n read.csv(file$datapath, header = input$header)\n })#close renderTable\n }#close function\n \nshinyApp(ui, server)\n\n\n\n12.3.8 Downloads\nThe download button is a special case: it is a very useful output that lets people download a dataset or other file, such as manipulated data from a DataTable. The downloadHandler() function is what makes this work. In this case, downloadHandler() passes the dataset data to the write.csv() function, which writes it to the downloaded file.\nui <- fluidPage(\n downloadButton(\"downloadData\", \"Download\")\n)\n\nserver <- function(input, output) {\n # Our dataset\n data <- mtcars\n\n output$downloadData <- downloadHandler(\n filename = function() {\n paste(\"data-\", Sys.Date(), \".csv\", sep=\"\")\n },\n content = function(file) {\n write.csv(data, file)\n }#close function\n )#close downloadHandler\n}#close server function\n\nshinyApp(ui, server)\nWith other libraries you can also write Excel files, such as with writexl. 
This version writes to an Excel file.\nlibrary(shiny)\nlibrary(writexl)\n\n\nui <- fluidPage(\n downloadButton(\"downloadData\", \"Download\")\n )\n \nserver <- function(input, output) {\n # Our dataset\n data <- mtcars\n \n output$downloadData <- downloadHandler(\n filename = function() {\n #paste(\"data-\", Sys.Date(), \".csv\", sep=\"\")\n paste(\"data-\", Sys.Date(), \".xlsx\", sep=\"\")\n },\n content = function(file) {\n #write.csv(data, file)\n writexl::write_xlsx(data, file)\n } #close function\n ) #close downloadHandler\n }#close server function\n \nshinyApp(ui, server)\nNext, let’s play with themes."
},
{
"objectID": "shinyapps_part2.html#themes",
"href": "shinyapps_part2.html#themes",
"title": "12 ShinyApps",
"section": "12.4 Themes",
"text": "12.4 Themes\nThemes control the styling of the application, unifying colors and fonts, for example. Themes are assigned in the ui(). You can find the shinythemes library here.\nlibrary(shinythemes)\n\nui = fluidPage(theme = shinytheme(\"cerulean\")\n )#close fluidPage\n\nserver = function(input,output) {}\n\nshinyApp(ui, server)\nOr use the theme picker option until you decide on one!\nlibrary(shinythemes)\n\nui = fluidPage(\n shinythemes::themeSelector()\n )#close fluidPage\n \nserver = function(input, output) {}\n\nshinyApp(ui, server)\n \nIt is possible to create your own themes, use themes other than Bootstrap, or modify Bootstrap themes to your own aesthetic needs. Mastering Shiny has a brief chapter on themes here, and you can also find more information here on Bootswatch themes. Here is a link for the hex codes or names you’ll need for colors.\nYou can even start making your own theme by specifying items to be used throughout your app in the ui. Colors are hex or by HTML names. Collections of web safe fonts can be found here. Note how the bslib library is called and the theme is identified in the first part of the page layout. You can find the documentation on bslib here. 
In this case, the open-source fonts came from Google Fonts.\nlibrary(shiny)\nlibrary(bslib)\n\nui <- fluidPage(\n theme = bs_theme( \n bg = \"#175d8d\", \n fg = \"#d5fbfc\", \n primary = \"#e3fffc\", \n base_font = font_google(\"Atkinson Hyperlegible\"),\n code_font = font_google(\"Roboto Mono\")\n ),\n sidebarLayout(\n sidebarPanel(\n fileInput(\"file1\", \"Choose a CSV format File\", accept = \".csv\"),\n checkboxInput(\"header\", \"Header\", TRUE)\n ),\n mainPanel(\n tableOutput(\"contents\")\n ) #close mainPanel\n )#close sidebarLayout\n)#close fluidPage\n\nserver <- function(input, output) {\n output$contents <- renderTable({\n file <- input$file1\n ext <- tools::file_ext(file$datapath)\n \n req(file)\n validate(need(ext == \"csv\", \"Please upload a csv format file\"))\n \n read.csv(file$datapath, header = input$header)\n })#close renderTable\n}#close function\n\nshinyApp(ui, server)\nThere is also the shiny.semantic library for a different look. You can find the link here."
},
{
"objectID": "shinyapps_part2.html#formatting-text",
"href": "shinyapps_part2.html#formatting-text",
"title": "12 ShinyApps",
"section": "12.5 Formatting text",
"text": "12.5 Formatting text\n\n12.5.0.1 HTML functions\nYou can apply HTML equivalent functions in shiny to format your text by creating distinct paragraphs, emphasizing text, or changing the font or color. The table below shows each shiny function, such as p(), along with its HTML equivalent and what it creates.\nTable source: https://shiny.rstudio.com/tutorial/written-tutorial/lesson2/\n\n\n\n\n\n\n\n\nshiny function\nHTML5 equivalent\ncreates\n\n\n\n\np\n<p>\nA paragraph of text\n\n\nh1\n<h1>\nA first level header\n\n\nh2\n<h2>\nA second level header\n\n\nh3\n<h3>\nA third level header\n\n\nh4\n<h4>\nA fourth level header\n\n\nh5\n<h5>\nA fifth level header\n\n\nh6\n<h6>\nA sixth level header\n\n\na\n<a>\nA hyperlink\n\n\nbr\n<br>\nA line break (e.g. a blank line)\n\n\ndiv\n<div>\nA division of text with a uniform style\n\n\nspan\n<span>\nAn in-line division of text with a uniform style\n\n\npre\n<pre>\nText ‘as is’ in a fixed width font\n\n\ncode\n<code>\nA formatted block of code\n\n\nimg\n<img>\nAn image\n\n\nstrong\n<strong>\nBold text\n\n\nem\n<em>\nItalicized text\n\n\nHTML\n \nDirectly passes a character string as HTML code\n\n\n\nBelow is an example using many of these HTML modifiers.\nlibrary(shiny)\n\nui <- fluidPage(\n titlePanel(\"My Shiny App\"),\n sidebarLayout(\n sidebarPanel(),\n mainPanel(\n h1(\" h1() creates a level 1 header.\"),\n h2(\" h2() creates a level 2 header, and so on...\"),\n p(\"Use p() to create a new paragraph. \"),\n p(\"You can apply style to a paragraph using style\", style = \"font-family: 'times'; font-size: 16pt\"),\n strong(\"Using strong() bolds text.\"),\n em(\"Italics can be applied with em(). 
\"),\n br(),\n p(\"Use br() to apply a line break.\"),\n br(),\n code(\"You can create a code box with code().\"),\n div(\"div() creates a container that can apply styles within it using 'style = color:blue'\", style = \"color:blue\"),\n br(),\n p(\"span() is similar to div() but can affect smaller sections\",\n span(\"such as words or phrases\", style = \"color:purple\"),\n \"within a paragraph or body of text.\"),\n h3(p(\"You can also combine \", \n em(span(\"HTML\", style=\"color:magenta\")), \n \"functions.\"))\n )#close mainPanel\n )#close sidebarLayout\n)#close fluidPage\nserver <- function(input, output) {\n \n}\n\nshinyApp(ui, server)"
},
{
"objectID": "shinyapps_part2.html#reactive-pages",
"href": "shinyapps_part2.html#reactive-pages",
"title": "12 ShinyApps",
"section": "12.7 Reactive pages",
"text": "12.7 Reactive pages\nIn this section, we will look at how to make your app respond to inputs and do things!\n\n\n\n\n\n\nImportant\n\n\n\nThis section is difficult to understand. Take your time with it, and it’s OK if it doesn’t make sense right away. We’ll be working with this more in class.\n\n\n\n12.7.1 Basics of reactivity\nReactivity refers to the ability of an app to recalculate or perform some action when an input is changed. In a non-reactive Shiny app, all calculations are run when the session begins. However, if you want a user to change an input and display a new calculated result, then reactivity in Shiny apps allows only that part of the app to be recomputed and re-displayed, not the entire app. This is important for keeping the app fast and able to display changes immediately. So, in a sense, reactivity is something that happens in the background, or more accurately, the back-end. But knowing about it will help you use the following functions successfully.\n\n12.7.1.1 Two different types of programming: imperative and declarative\nImperative programming is like the analysis functions you’ve been running so far in this class. You write a series of commands and R executes them. You see errors if something doesn’t match. The UI layout is imperative. You write code that says where something will be, like a radio button, sidebar text, or text output in the main panel.\nDeclarative code provides options for the program if conditions are right. So, it will execute the code if conditions are met. This is what we see in the server function. It is a series of code blocks that will run if something happens in the UI.\nIn the context of reactive expressions, all code is run when you start a session and results are stored in the cache. This means that if you move a slider and a new value can be retrieved from that cache, your app will show the result. However, if you move a slider and a new calculated value needs to be shown, this value will not be in your cache. 
You need some part of the app to calculate and store a new value to be shown. This is where reactivity steps in. Only the part of the code that needs to run is executed, not the entire app. These declarative bits of code are wrapped in reactive functions.\n\n\n12.7.1.2 Code execution order\nUnlike imperative code, declarative code, such as that in the server function, runs only when needed. So order is not as important in the server function. However, this can make code very difficult for humans to read, as we tend to assume top-to-bottom execution. In practice, it’s best to keep your code human readable. For example, following the order of the ui can be helpful in navigating through the server code.\nData can be very large, and reactive code in the server function re-runs every time a new input is sent to it. To prevent this from slowing down the application, do as much of your data cleaning and manipulation outside of the Shiny app as possible. Of course, there are some situations where this is not possible, but as a general rule, keep it minimal inside the Shiny app. If your data does need to be cleaned, you can run that code before the ui and server functions.\n#insert example code\ntake mtcars and apply some calculation based on existing data, maybe $ to drive 100 miles or km today.\nIn the following section, let’s look at two reactive functions commonly found. More information on reactive functions can be found here in the Shiny documentation.\n\n\n12.7.1.3 observeEvent()\nobserveEvent() has two arguments: eventExpr and handlerExpr. The first is the input or expression to take a dependency on; the second is the code that will be executed.\nobserveEvent() functions provide the set-up for eventReactive() by observing the data to check for updates. When updates are noticed, eventReactive() pulls the new updated data and the HTML updates. 
Note that the following code both changes the HTML on the webpage through the reactive and prints something in the console.\n\n\n12.7.1.4 eventReactive()\nWhen you select the actionButton(), just the part of the code that needs to run updates what you see. This is great for click events like action buttons, where a user may expect something to occur. Details of the options and examples of eventReactive() and observeEvent() can be found here in the documentation.\nBelow, we’ve used the example from the documentation to illustrate basic reactivity. When the app first runs, nothing is displayed as there was no information in the cache for the button ID. Once the button is pressed, the x value is attached to it and the first x rows of data are shown. It does nothing until the cache is updated again, when both the x value and the button value have changed.\n\n\n\n\n\n\n\nImportant\n\n\n\nReactivity in Shiny is regrettably complex. However, the two functions above cover the most common ones you’ll see. The free, online book, Mastering Shiny, has 4 chapters dedicated to reactivity in Shiny. For now, when you see observeEvent() and eventReactive() functions in code you are getting from other sources, you should recognize these as reactive parts of the app that will update when some input has been changed.\n\n\n\n\n12.7.1.5 What is session?\nSession is an optional argument passed to the server function that enables inputs or outputs related to the current instance of the app to be used. You may see some examples with this. If you see it in a code example, it’s likely there to pull an input or output that is unique to this instance and do something with it. You can read more in its documentation here.\n\nThe session state is used when we need to retrieve some data that has been stored as a result of calculations or inputs from a user’s session. It could be user data, or it could be a file they uploaded that your Shiny app then performed an operation on."
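The documentation example described above is not reproduced in this text, so here is a minimal sketch of the same pattern (the input and output IDs are illustrative, not from the original): observeEvent() runs a side effect when the button is pressed, while eventReactive() recomputes the table only on a button press.

```r
library(shiny)

ui <- fluidPage(
  numericInput("x", "Number of rows to show", value = 5),
  actionButton("button", "Update"),
  tableOutput("table")
)

server <- function(input, output) {
  # observeEvent(): watch the button and run a side effect (a console message)
  observeEvent(input$button, {
    message("Button pressed; x = ", input$x)
  })

  # eventReactive(): recompute the data only when the button is pressed,
  # not every time x changes
  df <- eventReactive(input$button, {
    head(cars, input$x)
  })

  output$table <- renderTable(df())
}

shinyApp(ui, server)
```

Until the button is pressed for the first time, df() has no cached value, so the table area stays empty; changing x alone does nothing until the next button press.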
},
{
"objectID": "shinyapps_part2.html#summary",
"href": "shinyapps_part2.html#summary",
"title": "12 ShinyApps",
"section": "12.8 Summary",
"text": "12.8 Summary\nReactivity is different due to the nature of declarative programming. It makes code less complex and provides conditions that could be met, rather than a linear progression of code to be executed. We use reactivity so that people can change inputs and receive different outputs or make external actions happen.\nRecognizing the reactive functions eventReactive() and observeEvent(), and why they are used on the back-end of a Shiny app, will help in designing and debugging your code.\n\nYou can put code before the Shiny ui code to reduce calculation time.\nobserveEvent() sets up the reactivity by signaling which inputs the app should watch for changes.\neventReactive() is used to render or provide some action based on the updated input values."
},
{
"objectID": "shinyapps_part2.html#wrap-up",
"href": "shinyapps_part2.html#wrap-up",
"title": "12 ShinyApps",
"section": "12.9 Wrap up",
"text": "12.9 Wrap up\nThat was a massive chapter, but it is intended to serve as a reference for your future explorations in Shiny. By now, you have learned about many of the Shiny input and output functions, how they work together, and the basics of themes and formatting text. We also introduced reactivity in Shiny, which is an important concept as you move forward into more complex interactions. You now have enough information to create and publish web-based applications. Combined with what you have covered in the previous chapters, you are now well prepared to create, modify, calculate, and publish your own research!"
},
{
"objectID": "references.html",
"href": "references.html",
"title": "References",
"section": "",
"text": "Blei, David M., and Padhraic Smyth. 2017. “Science and Data\nScience.” Proceedings of the National Academy of\nSciences 114 (33): 8689–92. https://doi.org/10.1073/pnas.1702076114.\n\n\nHealy, Kieran. 2018. Data Visualization: A Practical\nIntroduction. Princeton, NJ: Princeton University Press.\n\n\nRhys, Hefin. 2020. Machine Learning with r, the Tidyverse, and\nMlr. Shelter Island, NY: Manning publications.\n\n\nWickham, Hadley. 2014. “Tidy Data.” Journal of\nStatistical Software 59 (September): 1–23. https://doi.org/10.18637/jss.v059.i10.\n\n\n———. 2016. Ggplot2. Springer International Publishing. https://doi.org/10.1007/978-3-319-24277-4.\n\n\nWickham, Hadley, and Garrett Grolemund. 2016. R for Data Science:\nImport, Tidy, Transform, Visualize, and Model Data. First edition.\nSebastopol, CA: O’Reilly.\n\n\nWilke, C. 2019. Fundamentals of Data Visualization: A Primer on\nMaking Informative and Compelling Figures. First edition.\nSebastopol, CA: O’Reilly Media."
},
{
"objectID": "index.html",
"href": "index.html",
"title": "Introduction to Data Science",
"section": "",
"text": "Course overview\nIntroduction to Data Science is a hands-on course for students with no or minimal coding experience. We will learn to use R to collect, manipulate, analyze, and visualize data.",
"crumbs": [
"Course overview"
]
},
{
"objectID": "index.html#schedule",
"href": "index.html#schedule",
"title": "Introduction to Data Science",
"section": "Schedule",
"text": "Schedule\n\n\n\n\n\nDate of Class\nTopics\nChapter\n\n\n\n\nWeek 1 (Jan 12-16)\nWhat is data science? + Getting started with R (part 1)\n1 + 2 (up to section 2.9 inclusively)\n\n\nWeek 2 (Jan 19-23)\nGetting started with R (part 2)\n2 (from section 2.10)\n\n\nWeek 3 (Jan 26-30)\nImporting and tidying data (part 1)\n3 (up to section 3.4 inclusively)\n\n\nWeek 4 (Feb 2-6)\nImporting and tidying data (part 2)\n3 (from section 3.5)\n\n\nWeek 5 (Feb 9-13)\nWorking with strings\n4\n\n\nReading week (Feb 16-20)\n\n\n\n\nWeek 6 (Feb 23-27)\nPublishing in R + Summarizing data\n5 + 6\n\n\nWeek 7 (Mar 2-6)\nVisualizing data\n7\n\n\nWeek 8 (Mar 9-13)\nTopic modelling\n8\n\n\nWeek 9 (Mar 16-20)\nLogistic regression\n9\n\n\nWeek 10 (Mar 23-27)\nLinear regression\n10\n\n\nWeek 11 (Mar 30-Apr 3)\nInteractive dashboards (part 1)\n11\n\n\nWeek 12 (Apr 6-Apr 10)\nInteractive dashboards (part 2)\n12",
"crumbs": [
"Course overview"
]
},
{
"objectID": "index.html#assignments",
"href": "index.html#assignments",
"title": "Introduction to Data Science",
"section": "Assignments",
"text": "Assignments\n\n\n\n\n\n\nImportant\n\n\n\nImportant\nUse the course’s Brightspace to:\n\nAccess the syllabus (under the content - overview tab).\nAccess assignment instructions and due dates.\nSubmit your assignments.",
"crumbs": [
"Course overview"
]
},
{
"objectID": "index.html#bibliography",
"href": "index.html#bibliography",
"title": "Introduction to Data Science",
"section": "Bibliography",
"text": "Bibliography\nThe course website will contain everything you need, including references to resources that can be valuable to further develop your understanding and skills, most of which are all available online for free. They are all available in this Zotero library.",
"crumbs": [
"Course overview"
]
},
{
"objectID": "index.html#useful-resources",
"href": "index.html#useful-resources",
"title": "Introduction to Data Science",
"section": "Useful resources",
"text": "Useful resources\n\nTeams channel (only accessible to students currently registered in the course).",
"crumbs": [
"Course overview"
]
},
{
"objectID": "intro.html#learning-objectives",
"href": "intro.html#learning-objectives",
"title": "1 What is data science?",
"section": "1.1 Learning objectives",
"text": "1.1 Learning objectives\n\nWhat is data science?\nHow to draft a problem statement?\nWhere to find data?"
},
{
"objectID": "intro.html#what-is-data-science",
"href": "intro.html#what-is-data-science",
"title": "1 What is data science?",
"section": "1.2 What is data science?",
"text": "1.2 What is data science?\nAccording to a short perspective paper by Blei and Smyth (2017), which I recommend reading, data science:\n\nfocuses on exploiting the modern deluge of data for prediction, exploration, understanding, and intervention. It emphasizes the value and necessity of approximation and simplification. It values effective communication of the results of a data analysis and of the understanding about the world that we glean from it. It prioritizes an understanding of the optimization algorithms and transparently managing the inevitable trade-off between accuracy and speed. It promotes domain-specific analyses, where data scientists and domain experts work together to balance appropriate assumptions with computationally efficient methods."
},
{
"objectID": "intro.html#the-data-science-process",
"href": "intro.html#the-data-science-process",
"title": "1 What is data science?",
"section": "1.3 The data science process",
"text": "1.3 The data science process\n\nProblem statement: The data science process is filled with decisions. And there is no better way to get lost and frustrated than to not have an adequate and shared understanding of the problem that needs to be solved or the knowledge gap that needs to be filled. Problems are all around us, but not all of them are good data science problems. Good data science problems are relevant (they have a clear purpose) and are solvable with available data. At the end of this step, you have a clear research objective and clear research questions.\nData collection involves a series of steps aimed at gathering all the data that you will need for your project, such as finding relevant data sources, importing the data, and assessing its suitability for the problem at hand. At the end of this step, you have at hand all the data pieces that you will need for your project.\nPre-processing (tidying) involves structuring your data in a format suitable for analysis, and cleaning your data to remove errors, duplicates, etc. At the end of this step, you have a data set that is ready to produce valid answers to your research questions.\nAnalysis is about describing, analyzing, and visualizing the data. At the end of this step, you have produced tables and graphs that are informative in the context of your problem statement.\nInterpretation is about assigning meaning to the analyzed data and drawing conclusions from it. At the end of this step, you have answers to your research questions.\nCommunication is about transferring the new knowledge to its intended audience(s). At the end of this step, you have a clear, transparent, and effective report."
},
{
"objectID": "intro.html#writing-a-problem-statement",
"href": "intro.html#writing-a-problem-statement",
"title": "1 What is data science?",
"section": "1.4 Writing a problem statement",
"text": "1.4 Writing a problem statement\nIn the context of this course, the problem statement encompasses the identification of a problem/knowledge gap that has some relevance, the project objectives, and the research questions.\n\n1.4.1 Context\nThe context introduces the issue that needs to be solved or the knowledge gap that needs to be filled (what is it?) and an explanation of its relevance (why does it matter?).\nHere is an example:\n\nIn Quebec, academic research is supported by national and provincial research councils that select, after peer review, the individuals or teams that receive funding. The number of researchers that are able to receive research funds is constrained by the limited funds available and the size of the grants. Past research showed that 20% to 45% of Quebec’s researchers had no external funding between 1999 and 2006, while 10% of researchers accumulated between 50% and 80% of the available funds. While we know how funding is distributed, we do not know how optimal that distribution is for producing research output and impact. Optimizing our funding policies and programs could increase the production of scientific knowledge required to solve local, national and global issues.\n\n\n\n1.4.2 Objectives\nThe objective, of course, is to fix the problem or fill the gap identified in the problem statement. But here the goal is expressed as a research objective. It also provides information on the data set, and more details that help delineate the project. Here is an example:\n\nThe goal of this project is to help determine the optimal distribution of National research funds by applying data science methods to analyze data on the research funding, output and impact of Quebec researchers over a period of 15 years (1998-2012).\n\n\n\n1.4.3 Research questions\nThe research questions express your research objective in the form of answerable questions. The best research questions tend to start with the words “how, why, what, which”. 
Here is an example.\n\n1) What is the relationship between the amount of research funding of individual researchers and their research output?\n2) What is the relationship between the amount of research funding of individual researchers and their research impact?\n3) How does the relationship between funding and research impact and output vary between research fields?\n\nNote that depending on your project, your questions may look quite different. For instance, in this example research funding and field are selected as predictors based on past knowledge and theory, while research output and research impact are the predicted variables. You should always know what you are trying to predict, but perhaps you have a lot of potential predictors in your data and your goal is to identify which ones are good predictors. In this case, you could have a question that looks like:\n\nWhat are the best predictors of X?\n\nWhat matters most is that:\n\nYour questions are clear.\nYour questions can be answered with data.\nYou actually provide an answer to the questions in your report."
},
{
"objectID": "intro.html#getting-data",
"href": "intro.html#getting-data",
"title": "1 What is data science?",
"section": "1.5 Getting data",
"text": "1.5 Getting data\nIf you are working for a company your client or employer will most likely have an internal database storing information related to its activities (e.g., clients, products, inventories, sales, employees, financial performance, etc.) which you may be using for your data science project.\n\n1.5.1 Open government data\nGovernments and public organizations are also increasingly making the data they collect openly accessible for the benefit of the public. Here are some sources:\n\nCanada (https://open.canada.ca/en/open-data)\nNova Scotia (https://data.novascotia.ca/)\nUnited States (https://www.data.gov/)\nWorld Bank (https://data.worldbank.org)\nToronto Public Library (https://opendata.tpl.ca/)\n\n\n\n1.5.2 Research data\nThe Open Science movement also emphasizes the importance for researchers to share the data that they used for their published work, which can be found in repositories such as:\n\nDataCite (https://datacite.org/). This is an aggregator that allows you to search hundreds of research data repositories.\nZenodo (https://zenodo.org/).\nFigshare (https://figshare.com/).\n\n\n\n1.5.3 Bibliographic records\nBibliographic records and other metadata related to different types of works can be used for data science projects. However, because of their enriched metadata and their inclusion of bibliometric indicators like citations, citation indices provide more opportunities. Examples of citation indices are:\n\nScopus (available through the Dal libraries)\nDimensions.\nOpenAlex. The search engine does not allow you to easily download data, but there is a free API that can be used quite easily in R with the openalexR package (we will learn how to use R and R packages in Chapter 2 and how to use an API in Chapter 3).\nGoogle Scholar. 
The easiest way to download data from Google Scholar is to use the Publish or Perish software.\n\n\n\n1.5.4 Miscellaneous datasets\nThere is an overwhelming amount of data available on the Web, so here is a non-exhaustive list of data sources that you might find useful.\n\nKaggle (https://www.kaggle.com)\nAwesome public datasets (https://github.com/awesomedata/awesome-public-datasets)\nInternet Movie Database (IMDb) (https://www.imdb.com/interfaces/)\n\nPlease note that you are free to use any data you wish for this course, the only restriction being that you must be able to share the data with your instructor."
},
{
"objectID": "intro.html#homework",
"href": "intro.html#homework",
"title": "1 What is data science?",
"section": "1.6 Homework",
"text": "1.6 Homework\nYour homework is to start gathering ideas and data for your research project proposal. This includes:\n\nThinking about a topic for your research project.\nFinding data sources that might be suitable for that topic.\nStarting a draft of your problem statement.\n\n\n\n\n\nBlei, David M., and Padhraic Smyth. 2017. “Science and Data Science.” Proceedings of the National Academy of Sciences 114 (33): 8689–92. https://doi.org/10.1073/pnas.1702076114."
},
{
"objectID": "intro.html#references",
"href": "intro.html#references",
"title": "1 What is data science?",
"section": "1.7 References",
"text": "1.7 References\n\n\n\n\nBlei, David M., and Padhraic Smyth. 2017. “Science and Data Science.” Proceedings of the National Academy of Sciences 114 (33): 8689–92. https://doi.org/10.1073/pnas.1702076114."
},
{
"objectID": "reading_and_tidying_data.html#learning-objectives",
"href": "reading_and_tidying_data.html#learning-objectives",
"title": "3 Reading and tidying data",
"section": "3.1 Learning objectives",
"text": "3.1 Learning objectives\n\nImport data in R\nMake data tidy with the tidyverse\nExport data\nUse the pipe to write clearer code"
},
{
"objectID": "reading_and_tidying_data.html#tidy-data",
"href": "reading_and_tidying_data.html#tidy-data",
"title": "3 Reading and tidying data",
"section": "3.2 Tidy data",
"text": "3.2 Tidy data\nData is stored in all kinds of places, can be accessed in many different ways, and comes in all kinds of shapes and forms. Therefore, much of the data scientist’s work is related to collecting, processing, and cleaning data to get it ready for analysis.\nTidy data is a set of principles adapted from the relational model (those of you who took my data management course will be familiar with the relational model). According to Wickham (2014), those principles are:\n\nEach variable forms a column.\nEach observation forms a row.\nEach type of observational unit forms a table.\n\nWhile it may be implicit in principles 1 and 2, it is perhaps worth adding as a fourth principle that each cell should contain a single value. For example, comma-separated strings are not tidy.\nHere’s an example of what tidy data looks like, taken from the Titanic dataset.\n\n\n\n\n\nPassengerId\nSurvived\nPclass\nSex\nAge\nTicket\nFare\n\n\n\n\n1\n0\n3\nmale\n22\nA/5 21171\n7.2500\n\n\n2\n1\n1\nfemale\n38\nPC 17599\n71.2833\n\n\n3\n1\n3\nfemale\n26\nSTON/O2. 3101282\n7.9250\n\n\n4\n1\n1\nfemale\n35\n113803\n53.1000\n\n\n5\n0\n3\nmale\n35\n373450\n8.0500\n\n\n6\n0\n3\nmale\nNA\n330877\n8.4583\n\n\n\n\n\n\n\nWe can see that each row is a single observation (a passenger) and that each column is a single variable, and that no cell contains multiple values.\nNow let’s take a look at a dataset that is not tidy.\n\n\n\n\n\ncourse\nstudents\ngrades\n\n\n\n\nINFO6270\nFrancis, Sam, Amy\n92 (A+), 86 (A), 84 (A-)\n\n\nINFO6540\nFrancis, Sam, Amy\n77 (B+), 100 (A+), 74(B)\n\n\nINFO5500\nFrancis, Sam, Amy\n86 (A), 70 (B-), 99 (A+)\n\n\n\n\n\n\n\nWe can see that for each course, students are listed in a single cell, and their grades as well. 
To make this data tidy, a first thing we might want to do is separate the students and their grades so that each student-grade pair occupies its own row, and cells contain only one student or grade.\n\n\n\n\n\ncourse\nstudents\ngrades\n\n\n\n\nINFO6270\nFrancis\n92 (A+)\n\n\nINFO6270\nSam\n86 (A)\n\n\nINFO6270\nAmy\n84 (A-)\n\n\nINFO6540\nFrancis\n77 (B+)\n\n\nINFO6540\nSam\n100 (A+)\n\n\nINFO6540\nAmy\n74 (B)\n\n\nINFO5500\nFrancis\n86 (A)\n\n\nINFO5500\nSam\n70 (B-)\n\n\nINFO5500\nAmy\n99 (A+)\n\n\n\n\n\n\n\nThis is much better, although we still have the issue of the numeric and letter grades being lumped together in a cell. To make this data truly tidy, we want to separate the numeric and the letter grades, like this:\n\n\n\n\n\ncourse\nstudents\ngrade_numeric\ngrade_letter\n\n\n\n\nINFO6270\nFrancis\n92\nA+\n\n\nINFO6270\nSam\n86\nA\n\n\nINFO6270\nAmy\n84\nA-\n\n\nINFO6540\nFrancis\n77\nB+\n\n\nINFO6540\nSam\n100\nA+\n\n\nINFO6540\nAmy\n74\nB\n\n\nINFO5500\nFrancis\n86\nA\n\n\nINFO5500\nSam\n70\nB-\n\n\nINFO5500\nAmy\n99\nA+\n\n\n\n\n\n\n\nThat’s it, now we have a tidy data set of grades! You can read more about tidy data in the R for Data Science book Wickham and Grolemund (2016), or the journal article by Wickham (2014)."
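The tidying steps just shown can also be scripted with the tidyverse, which is introduced in the next section. Here is a minimal sketch, assuming the untidy course table is rebuilt as a tribble with the same column names as above: separate_rows() splits the comma-separated students and grades in parallel (pairing the first student with the first grade, and so on), and separate() splits each grade into its numeric and letter parts.

```r
library(tidyverse)

# The untidy table from above, one comma-separated string per cell
grades_untidy <- tribble(
  ~course,    ~students,           ~grades,
  "INFO6270", "Francis, Sam, Amy", "92 (A+), 86 (A), 84 (A-)",
  "INFO6540", "Francis, Sam, Amy", "77 (B+), 100 (A+), 74(B)",
  "INFO5500", "Francis, Sam, Amy", "86 (A), 70 (B-), 99 (A+)"
)

grades_tidy <- grades_untidy %>%
  # one student-grade pair per row; both columns are split in parallel
  separate_rows(students, grades, sep = ",\\s*") %>%
  # split "92 (A+)" (or the space-less "74(B)") on the opening parenthesis
  separate(grades, into = c("grade_numeric", "grade_letter"), sep = " ?\\(") %>%
  mutate(
    grade_numeric = as.numeric(grade_numeric),
    grade_letter  = str_remove(grade_letter, "\\)")
  )

grades_tidy
```

Note that separate_rows() aligns the columns positionally, which is why it matters that each row lists the same number of students and grades.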
},
{
"objectID": "reading_and_tidying_data.html#the-tidyverse",
"href": "reading_and_tidying_data.html#the-tidyverse",
"title": "3 Reading and tidying data",
"section": "3.3 The tidyverse",
"text": "3.3 The tidyverse\nThe tidyverse is a collection of R packages and functions designed to help you make data tidy and work with tidy data. You can read more about the tidyverse and its packages here: https://www.tidyverse.org. The code below loads the tidyverse and returns the list of packages it includes.\n\nlibrary(tidyverse)\ntidyverse_packages()\n\n [1] \"broom\" \"conflicted\" \"cli\" \"dbplyr\" \n [5] \"dplyr\" \"dtplyr\" \"forcats\" \"ggplot2\" \n [9] \"googledrive\" \"googlesheets4\" \"haven\" \"hms\" \n[13] \"httr\" \"jsonlite\" \"lubridate\" \"magrittr\" \n[17] \"modelr\" \"pillar\" \"purrr\" \"ragg\" \n[21] \"readr\" \"readxl\" \"reprex\" \"rlang\" \n[25] \"rstudioapi\" \"rvest\" \"stringr\" \"tibble\" \n[29] \"tidyr\" \"xml2\" \"tidyverse\" \n\n\nIn the next sections, we will use a few of the tidyverse packages (but not exclusively) to import data into R from multiple types of sources. Because the tidy format is pretty standard and not R-specific, you will often find that the data sets you work with already respect the tidy principles. But you’ll also come across different data structures and formats, and so we’ll learn how to tidy up data."
},
{
"objectID": "reading_and_tidying_data.html#import-data",
"href": "reading_and_tidying_data.html#import-data",
"title": "3 Reading and tidying data",
"section": "3.4 Import data",
"text": "3.4 Import data\nThe following sections show how to load data from different sources and formats into R.\n\n\n\n\n\n\nImportant note\n\n\n\nSome of the processes and code in this section are beyond the level of R proficiency that you are expected to have at this point, or even at the end of the course. They are included here so that:\n\nYou can gain awareness of the different ways in which data can be accessed through R.\nYou can gain awareness of the different file formats and data structures that R can handle.\nYou can have access to working pieces of code that you can use to get whatever data you need into R.\n\nSo if you do not feel like you thoroughly understand all the code and processes shown in section 3.4 and its subsections, do not worry, and focus on the general tasks that the code accomplishes.\n\n\n\n3.4.1 Delimited file formats\nThe readr package (https://readr.tidyverse.org/) has a few useful functions for reading delimited file formats like comma-separated values (.csv), tab-delimited values (.tsv), or any other type of delimiter. Here are a few examples (if you want to run the examples on your computer, you can download the titanic dataset in different formats here).\n\n# Imports a comma-separated file and saves it into a data frame called titanic\ntitanic <- read_csv(\"titanic.csv\")\n\n# Same, but with a tab-separated file (works with tab-separated .txt files also)\ntitanic <- read_tsv(\"titanic.tsv\")\n\n# Same, but with a txt file in which the columns are separated with a vertical bar.\ntitanic <- read_delim(\"titanic.txt\", delim=\"|\") \n\n# Same, but reading the file directly from a URL.\ntitanic <- read_csv(\"https://pmongeon.github.io/info6270/files/data/titanic.csv\")\n\nAs you can see from the third example above, you can specify any delimiter using the delim argument of the read_delim() function. You should also note that tab-delimited text files (.txt) are extremely common. 
You can read these files with the read_tsv() function even if the file has the .txt extension. Alternatively, you can use the read_delim() function and use delim = \"\\t\".\nOften, you will see examples where the path to a file is first stored in an object, so the object name, rather than the whole path, can be used in a function, like this:\n\npath <- \"https://pmongeon.github.io/info6270/files/data/titanic.csv\"\ntitanic <- read_csv(path)\n\nAnother useful package is readxl (https://readxl.tidyverse.org/), which has functions to help you read data from Excel files, such as read_xlsx().\n\nlibrary(readxl)\n\npath <- \"https://pmongeon.github.io/info6270/files/data/halifax_weather.xlsx\"\n\n# This will read the first sheet of the Excel file\nhalifax_weather <- read_xlsx(path)\n\n# This will read the second sheet of the Excel file\nhalifax_weather <- read_xlsx(path, sheet = 2)\n\n# This will also read the second sheet of the Excel file \nhalifax_weather <- read_xlsx(path, sheet = \"out\")\n\n# If you don't know what the names of the sheets are, you can read them like this \nexcel_sheets(path)\n\n\n\n3.4.2 JSON files\nThe JSON format is very popular for exchanging information on the web, and is the typical format of the data that we retrieve from APIs (next). However, the processes for reading a JSON file and for reading JSON data from an application programming interface (API) are slightly different. This is how to convert a JSON file into a data frame using the fromJSON() function from the jsonlite package (included in the tidyverse).\n\nlibrary(jsonlite)\ndata <- fromJSON(\"https://pmongeon.github.io/info6270/files/data/public_housing_ns.json\")\n\nImportant note: importing and working with JSON files with simple structures is relatively easy. However, reading more complex JSON files might require a little more work. 
Reading complex JSON structures is beyond the scope of this chapter.\n\n\n3.4.3 Application programming interface (API)\nAPIs allow you to interact with computers, servers or software by making different requests, such as sending or retrieving data through the web. Some APIs can be used for free and anonymously; others might require that you identify yourself with your email address, for instance. Finally, some APIs will require that you create an account to obtain an API key for you to use in your code. Fully understanding how APIs work and being proficient with them is beyond the scope of this course, but you should at least know that they exist and that they can be used to collect data for your data science projects. The R package that helps you work with APIs is httr. Below is an example of a request to retrieve data from a free anonymous API.\nImportant note: You do not need to understand how all of the code works. I mainly want you to have a working piece of code that you can use as a template if you ever need to retrieve data from an API.\n\n# Load the httr package\nlibrary(httr)\n\n# Make a request to an API to GET data\ndata = GET(\"https://openlibrary.org/api/books?bibkeys=OLID:OL22123296M&format=json\")\n\n# Isolate the content part of the information received from the API. This step is\n# necessary because the GET request returns additional information about the request \n# along with the requested data \ndata = content(data)\n\n# Load the data.table package for the rbindlist() function used in the code below.\n# You may need to install the package first if you don't have it installed already.\nlibrary(data.table)\n\n# Converts the data retrieved from the API call to a tibble\ndata = as_tibble(rbindlist(data))\n\n\n\n3.4.4 XML files\nThe xml2 package (also included in the tidyverse) provides a set of functions to work with data in the XML format. 
The conversion of XML data into a data frame is not so straightforward because of the nested structure of XML. If you need to import XML data into R, the following code performs a series of steps that convert an XML file from the data.novascotia.ca website into a tibble. We will explore this code in more detail later in this chapter, when we look more closely at the unnest() function.\nImportant note: the goal here is mainly to provide you with a piece of code that works, so that you can use it as a template to read XML files if you need to and convert them into data frames.\n\nlibrary(xml2)\n\npath <- \"https://data.novascotia.ca/api/views/2d4m-9e6x/rows.xml\"\n\npublic_housing_xml = as_list(read_xml(path))\npublic_housing = as_tibble(public_housing_xml) \npublic_housing = unnest_longer(public_housing, colnames(public_housing)[1])\npublic_housing = unnest_wider(public_housing, colnames(public_housing)[1]) \npublic_housing = unnest(public_housing, cols = names(public_housing)) \npublic_housing = unnest(public_housing, cols = names(public_housing))\npublic_housing = type_convert(public_housing)\n\n\n\n3.4.5 Connections\nImportant note: This section is a little bit advanced, so please do not feel like you need to master database and file connections at this point in your R journey. Again, I include this here to raise your awareness of connections and give you some working pieces of code you can use if you ever want or need to access data stored in a large file or database.\n\n3.4.5.1 Connect to a file\nIf you are trying to read a very large file, you may run into issues because the objects in your environment are stored in your computer’s memory (RAM). So if your computer has 4GB of RAM, then files larger than 4GB can’t be imported into a data frame. One solution is to create a connection to the file using the file() function, and then process the data a certain number of lines at a time with the readLines() function. 
Here is an example.\n\n# path to a text file containing data\npath <- \"https://pmongeon.github.io/info6270/files/data/titanic.txt\"\n\n# open a connection to the file (note that connections are stored as objects)\ncon <- file(path, open=\"rb\")\n\n# Read the first 5 lines and print them\nreadLines(con, n = 5)\n\n[1] \"PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked\" \n[2] \"1|0|3|Braund, Mr. Owen Harris|male|22|1|0|A/5 21171|7.25|NA|S\" \n[3] \"2|1|1|Cumings, Mrs. John Bradley (Florence Briggs Thayer)|female|38|1|0|PC 17599|71.2833|C85|C\"\n[4] \"3|1|3|Heikkinen, Miss. Laina|female|26|0|0|STON/O2. 3101282|7.925|NA|S\" \n[5] \"4|1|1|Futrelle, Mrs. Jacques Heath (Lily May Peel)|female|35|1|0|113803|53.1|C123|S\" \n\n# Read the NEXT 5 lines and print them.\nreadLines(con, n=5) \n\n[1] \"5|0|3|Allen, Mr. William Henry|male|35|0|0|373450|8.05|NA|S\" \n[2] \"6|0|3|Moran, Mr. James|male|NA|0|0|330877|8.4583|NA|Q\" \n[3] \"7|0|1|McCarthy, Mr. Timothy J|male|54|0|0|17463|51.8625|E46|S\" \n[4] \"8|0|3|Palsson, Master. Gosta Leonard|male|2|3|1|349909|21.075|NA|S\" \n[5] \"9|1|3|Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)|female|27|0|2|347742|11.1333|NA|S\"\n\n# Close the connection to the file.\nclose(con)\n\nHere is how you could go through the entire file, 50 lines at a time, and do something with each chunk.\n\n# open the connection to a text file containing data.\npath <- \"https://pmongeon.github.io/info6270/files/data/titanic.txt\"\ncon <- file(path, open=\"rb\")\n\n# this will read 50 lines and print them until the end of the file has been reached.\nrepeat { \n x<-readLines(con, n = 50)\n if (is_empty(x) == TRUE) {\n break\n } else { \n print(x)\n }\n} \n\n# Close the connection to the file.\nclose(con)\n\n\n\n3.4.5.2 Connect to a database\nYou can also connect to practically all types of local or remote databases and servers as long as you have the required credentials. 
For example, the RMySQL package is great for working with MySQL databases. You can connect to a MySQL database (provided that you have a username and password and the other information required by the dbConnect() function). With the RSQLite package, you can also work with local SQLite databases stored in a single file on your computer. The code below shows you how to connect to MySQL or SQLite databases in R.\n\n## This creates a connection to a MySQL database for which you would need \n## access credentials.\n\nlibrary(RMySQL)\ncon = dbConnect(MySQL(),\n user = \"username\",\n password = \"password\",\n host = \"host\",\n port = port_number, \n dbname = \"database name\")\n\n# This opens a connection called con to the info6270 SQLite database.\n# If the database doesn't exist, this code will also create it.\nlibrary(RSQLite)\ncon <- dbConnect(drv = SQLite(), dbname= \"C:/info6270.db\")\n\nOnce the connection is established, you can interact with the database with a set of functions, some of which are presented in the following code:\n\n# lists the tables in the database accessed through the con connection.\ndbListTables(con)\n\n# lists the fields of a table\ndbListFields(con,\"table_name\")\n\n# import the results of a SQL query \ndata <- dbGetQuery(con, \"SQL query\")\n\n# import all the data from a table\ndata <- dbReadTable(con, \"table name\")\n\n# upload a data frame to a table (change options as needed)\ndbWriteTable(con, \"table name\", data, row.names=FALSE, overwrite = FALSE, append = TRUE)"
},
{
"objectID": "reading_and_tidying_data.html#tidy-your-data",
"href": "reading_and_tidying_data.html#tidy-your-data",
"title": "3 Reading and tidying data",
"section": "3.5 Tidy your data",
"text": "3.5 Tidy your data\nNow we know how to import data into R. However, not all data comes in a nice and tidy shape, even if the data is already shaped like a table. In this section, we’ll explore how to change the structure of your data frames with tidyr (https://tidyr.tidyverse.org/).\n\n3.5.1 Reshape data\nThe reshape functions include pivot_longer() and its opposite, pivot_wider(). Let’s look at how they work. First, we create a tibble with fictional amounts of funding received from the Canadian Tri-Council by some Nova Scotian universities.\n\n# This creates a funding tibble with 4 columns and some data.\nfunding <- tibble(university = as.character(c(\"DAL\",\"SMU\",\"SFX\")),\n SSHRC = as.numeric(sample(1:100,3)),\n NSERC = as.numeric(sample(1:100,3)),\n CIHR = as.numeric(sample(1:100,3)))\n\n# Note: the sample(1:100,3) function randomly chooses three values between 1 and 100.\nfunding\n\n# A tibble: 3 × 4\n university SSHRC NSERC CIHR\n <chr> <dbl> <dbl> <dbl>\n1 DAL 70 96 44\n2 SMU 8 12 100\n3 SFX 59 97 79\n\n\n\n3.5.1.1 pivot_longer\nThe pivot_longer() function makes your data longer by collapsing multiple columns into two: one containing the column names and one containing the values.\n\nfunding = pivot_longer(funding, cols = c(\"SSHRC\",\"NSERC\",\"CIHR\"), names_to = \"funder\", values_to = \"funding_amount\")\n\nfunding\n\n# A tibble: 9 × 3\n university funder funding_amount\n <chr> <chr> <dbl>\n1 DAL SSHRC 70\n2 DAL NSERC 96\n3 DAL CIHR 44\n4 SMU SSHRC 8\n5 SMU NSERC 12\n6 SMU CIHR 100\n7 SFX SSHRC 59\n8 SFX NSERC 97\n9 SFX CIHR 79\n\n\n\n\n3.5.1.2 pivot_wider\nThe pivot_wider() function does the opposite of pivot_longer(): it takes a column of names and a column of values, creating a new column for each name and storing the corresponding values in it. 
The following example will perhaps make this clearer.\n\nfunding = pivot_wider(funding, names_from = funder, values_from = funding_amount)\n\nfunding\n\n# A tibble: 3 × 4\n university SSHRC NSERC CIHR\n <chr> <dbl> <dbl> <dbl>\n1 DAL 70 96 44\n2 SMU 8 12 100\n3 SFX 59 97 79\n\n\n\n\n\n3.5.2 Expand tables\nLet’s create a new funding tibble to explore some more tidyr functions.\n\nfunding <- tibble(university = as.character(c(\"DAL\",\"DAL\",\"DAL\",\"SMU\",\"SMU\",\"SFX\",\"SFX\")),\n funder = as.character(c(\"SSHRC\",\"NSERC\",\"CIHR\",\"SSHRC\",\"CIHR\",\"NSERC\",\"CIHR\")),\n n_grants = as.numeric(sample(1:100, 7))) \n\nprint(funding)\n\n# A tibble: 7 × 3\n university funder n_grants\n <chr> <chr> <dbl>\n1 DAL SSHRC 78\n2 DAL NSERC 58\n3 DAL CIHR 95\n4 SMU SSHRC 82\n5 SMU CIHR 30\n6 SFX NSERC 71\n7 SFX CIHR 43\n\n\n\n3.5.2.1 expand\nThe expand() function creates all possible combinations of the values in two or more columns indicated in the function’s arguments and drops the other columns. The example below returns the possible combinations of universities and funders in the funding tibble we created. 
You can see that the SMU - NSERC combination and the SFX - SSHRC combination appear even though they are not present in the original funding tibble.\n\nexpand(funding, university, funder)\n\n# A tibble: 9 × 2\n university funder\n <chr> <chr> \n1 DAL CIHR \n2 DAL NSERC \n3 DAL SSHRC \n4 SFX CIHR \n5 SFX NSERC \n6 SFX SSHRC \n7 SMU CIHR \n8 SMU NSERC \n9 SMU SSHRC \n\n\n\n\n3.5.2.2 complete\nThe complete() function does the same thing as the expand() function but it keeps all the columns of the original tibble.\n\nfunding = complete(funding, university, funder)\n\nprint(funding)\n\n# A tibble: 9 × 3\n university funder n_grants\n <chr> <chr> <dbl>\n1 DAL CIHR 95\n2 DAL NSERC 58\n3 DAL SSHRC 78\n4 SFX CIHR 43\n5 SFX NSERC 71\n6 SFX SSHRC NA\n7 SMU CIHR 30\n8 SMU NSERC NA\n9 SMU SSHRC 82\n\n\n\n\n\n3.5.3 Handle missing values\nSometimes your dataset is incomplete for one reason or another. For example, a survey participant may not have completed the survey or answered all the questions. What to do with the missing values or incomplete records depends on the nature of the data and the goal of your analysis.\n\n3.5.3.1 drop_na\nThe drop_na() function is useful to remove incomplete observations in the data (i.e., rows for which one of the columns contains no value). 
The code below includes a statement that removes all incomplete observations, and one where we specify the column names that should be scanned for NA values (the rows won’t be deleted if there are NAs in the other columns).\n\n# This removes all incomplete observations\ndrop_na(funding) \n\n# A tibble: 7 × 3\n university funder n_grants\n <chr> <chr> <dbl>\n1 DAL CIHR 95\n2 DAL NSERC 58\n3 DAL SSHRC 78\n4 SFX CIHR 43\n5 SFX NSERC 71\n6 SMU CIHR 30\n7 SMU SSHRC 82\n\n# This would remove rows where the n_grants column is NA and leave NA values in other\n# columns.\ndrop_na(funding, n_grants)\n\n# A tibble: 7 × 3\n university funder n_grants\n <chr> <chr> <dbl>\n1 DAL CIHR 95\n2 DAL NSERC 58\n3 DAL SSHRC 78\n4 SFX CIHR 43\n5 SFX NSERC 71\n6 SMU CIHR 30\n7 SMU SSHRC 82\n\n\n\n\n3.5.3.2 replace_na\nInstead of deleting rows with missing values, we may wish to replace NAs with some other value (e.g., 0). We can do that with replace_na(). You have to provide, in a list, the columns for which you want to replace NAs and the values to replace the NAs with.\n\nreplace_na(funding, list(n_grants = 0))\n\n# A tibble: 9 × 3\n university funder n_grants\n <chr> <chr> <dbl>\n1 DAL CIHR 95\n2 DAL NSERC 58\n3 DAL SSHRC 78\n4 SFX CIHR 43\n5 SFX NSERC 71\n6 SFX SSHRC 0\n7 SMU CIHR 30\n8 SMU NSERC 0\n9 SMU SSHRC 82\n\n\nIf we want to replace NAs in multiple columns, we simply need to put additional column names and replacement values in the list. The code below replaces the NAs with 0 in two columns of some tibble.\n\nreplace_na(some_tibble, list(some_column = 0, some_other_column = 0))\n\n\n\n\n\n\n\nCaution\n\n\n\nReplacing null values with zeros for a numerical variable may seem logical, and it may be appropriate for your intended purpose. 
However, NA and 0 are not logically equivalent: NA implies the absence of data (something that we did not or could not observe), while 0 implies an observed or measured value of zero.\n\n\n\n\n\n3.5.4 Merge and split cells\nLet’s create a simple tibble with some names.\n\nmy_tibble <- tibble(first_name = c(\"Jos\",\"May\"),\n last_name = c(\"Louis\",\"West\"))\n\nprint(my_tibble)\n\n# A tibble: 2 × 2\n first_name last_name\n <chr> <chr> \n1 Jos Louis \n2 May West \n\n\n\n3.5.4.1 unite\nIn some scenarios, it may be more useful to have the first names and last names combined into a full_name column. This can be achieved with the unite() function.\n\nmy_tibble = unite(my_tibble,\n \"first_name\",\"last_name\", # columns to unite.\n col = \"full_name\", # name of the new column.\n sep = \" \") # Tells R to separate the first and last name with a space.\n\nprint(my_tibble)\n\n# A tibble: 2 × 1\n full_name\n <chr> \n1 Jos Louis\n2 May West \n\n\n\n\n3.5.4.2 separate\nIn the opposite scenario, where you have a full_name column but would rather have a first_name and a last_name column, you can use the separate() function to do that.\n\nmy_tibble = separate(my_tibble,\n full_name, # column to separate.\n sep = \" \", # character string to use as separator.\n into = c(\"first_name\", \"last_name\")) # names of the new columns.\n\nprint(my_tibble)\n\n# A tibble: 2 × 2\n first_name last_name\n <chr> <chr> \n1 Jos Louis \n2 May West \n\n\n\n\n3.5.4.3 separate_rows\nThe separate_rows() function is similar to the separate() function, but instead of creating new columns it creates new rows. 
In the following example, we have a tibble containing two articles, with a title column and an authors column in which the authors are separated with a semi-colon.\n\n# This creates a tibble containing the names of authors in a single cell with a separator.\nmy_tibble = tibble(title = c(\"awesome article\",\"boring article\"),\n authors = c(\"Toze, S.; Brown, A.\",\"Smith, J.; Roberts, J.\"))\n\nprint(my_tibble)\n\n# A tibble: 2 × 2\n title authors \n <chr> <chr> \n1 awesome article Toze, S.; Brown, A. \n2 boring article Smith, J.; Roberts, J.\n\n\nThis is not tidy data, so we want to separate the authors so that there is only one author per cell. This will create a tidy associative table between articles and authors. Let’s use separate_rows() to do that.\n\nmy_tibble = separate_rows(my_tibble, authors, sep = \"; \")\n\nprint(my_tibble)\n\n# A tibble: 4 × 2\n title authors \n <chr> <chr> \n1 awesome article Toze, S. \n2 awesome article Brown, A. \n3 boring article Smith, J. \n4 boring article Roberts, J.\n\n\n\n\n\n\n\n\nChoose your separator carefully\n\n\n\nNotice how our separator in the code above was not just a semi-colon, but a semi-colon followed by a space. That is because in the original tibble, there were two characters between the end of an author’s name and the beginning of the next: a semi-colon and a space. If we had just used the semi-colon as our separator in separate_rows(), the space would have been kept as the first character of the names of all authors except the first one.\n\n\n\n\n\n3.5.5 Nested data\nThe XML file import mentioned earlier is a good opportunity to explore the process of tidying nested data. Converting an XML file to the tidy format requires quite a few steps. 
Here is a pretty standard, step-by-step process that you can follow whenever you are dealing with XML files.\n\n# load the xml2 package\nlibrary(xml2)\n\n# read an xml file \nxml = read_xml(\"https://data.novascotia.ca/api/views/2d4m-9e6x/rows.xml\")\n\nThe xml object stores the data in its original XML format. You can verify that the file is indeed XML by writing the xml object to a file and looking at it.\n\nwrite_xml(xml, file=\"c:/data/xml.xml\")\n\nThe xml2 package provides a function (as_list) to convert XML documents into an equivalent R list. Beware: as_list() should not be confused with as.list(), which is a completely different base R function.\n\nlist <- as_list(xml)\n\nThe next step is converting that list into a tibble. So let’s do that.\n\npublic_housing <- as_tibble(list)\nprint(public_housing)\n\n# A tibble: 1 × 1\n response \n <named list> \n1 <named list [342]>\n\n\nAt the moment it has only one column called “response” of the “named_list” type, containing a single cell with 342 named lists. This is a nested structure, where a single cell contains a list that contains other lists that contain other lists.\n\n3.5.5.1 unnest_longer\nSo, let’s unnest the data in the response column using the unnest_longer() function. This function makes the tibble longer by expanding it vertically (adding rows) with the unnested data.\n\npublic_housing <- unnest_longer(public_housing, response)\n\nprint(public_housing)\n\n# A tibble: 342 × 2\n response response_id\n <named list> <chr> \n 1 <named list [22]> row \n 2 <named list [22]> row \n 3 <named list [21]> row \n 4 <named list [22]> row \n 5 <named list [22]> row \n 6 <named list [21]> row \n 7 <named list [22]> row \n 8 <named list [22]> row \n 9 <named list [22]> row \n10 <named list [22]> row \n# ℹ 332 more rows\n\n\nNow we’re making progress. We have 342 observations, each occupying one line. This is one of the tidy criteria! However, the columns are not right. We can see that there are two columns. 
The first one appears to contain 342 lists of 22 elements, which are probably the 22 variables that we want as our columns.\n\n\n3.5.5.2 unnest_wider\nLet’s unnest these 342 lists using the unnest_wider() function. This function makes the tibble wider by expanding it horizontally (adding columns) with the unnested data.\n\npublic_housing <- unnest_wider(public_housing, response)\n\nprint(public_housing)\n\n# A tibble: 342 × 23\n id property_project pid name address city postal_code\n <list> <list> <list> <list> <list> <list> <list> \n 1 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]> \n 2 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]> \n 3 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <NULL> \n 4 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]> \n 5 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]> \n 6 <list [1]> <list [1]> <list [1]> <list [1]> <NULL> <list> <list [1]> \n 7 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]> \n 8 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]> \n 9 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]> \n10 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]> \n# ℹ 332 more rows\n# ℹ 16 more variables: number_of_floors <list>, residential_units <list>,\n# housing_authority <list>, county <list>, elevator <list>, oil_heat <list>,\n# electric_heat <list>, public_water <list>, well <list>, sewer <list>,\n# onsite_septic <list>, municipality <list>, x_coordina <list>,\n# y_coordina <list>, location <lgl>, response_id <chr>\n\n\nWe can now see that each cell is a list of one element.\n\n\n3.5.5.3 unnest\nNow let’s unnest the data contained in each cell with the unnest() function.\n\npublic_housing = unnest(public_housing, cols = c(id, property_project, pid, name, address, city, postal_code, number_of_floors, residential_units, housing_authority, county, elevator, 
oil_heat, electric_heat, public_water, well, sewer, onsite_septic, municipality, x_coordina, y_coordina))\n\nThis seems silly… there has to be a leaner way of writing this. Yes, there is. The names() function returns a vector of the column names of a data frame. So we can use this much leaner code:\n\n#create the vector with the column names\ncolnames = names(public_housing)\n\n#use that vector for the cols argument of the unnest() function.\npublic_housing = unnest(public_housing, cols = colnames)\n\nprint(public_housing)\n\n# A tibble: 342 × 23\n id property_project pid name address city postal_code number_of_floors\n <lis> <list> <lis> <lis> <list> <lis> <list> <list> \n 1 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]> \n 2 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]> \n 3 <chr> <chr [1]> <chr> <chr> <chr> <chr> <NULL> <chr [1]> \n 4 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]> \n 5 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]> \n 6 <chr> <chr [1]> <chr> <chr> <NULL> <chr> <chr [1]> <chr [1]> \n 7 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]> \n 8 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]> \n 9 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]> \n10 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]> \n# ℹ 332 more rows\n# ℹ 15 more variables: residential_units <list>, housing_authority <list>,\n# county <list>, elevator <list>, oil_heat <list>, electric_heat <list>,\n# public_water <list>, well <list>, sewer <list>, onsite_septic <list>,\n# municipality <list>, x_coordina <list>, y_coordina <list>, location <lgl>,\n# response_id <chr>\n\n\nWe now have a tibble with 342 observations of 22 variables. Looks good! But the values in the cells still don’t look right. 
There’s just one more round of unnesting to apply to all the cells, using the same code as in the last step.\n\npublic_housing = unnest(public_housing, cols = colnames)\n\nprint(public_housing)\n\n# A tibble: 342 × 23\n id property_project pid name address city postal_code number_of_floors\n <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> \n 1 117 330101X 6503… Rive… 19/27/… Pict… B0K 1S0 2 \n 2 324 200214X 5006… Bell… 10 Gra… Gran… B0E 1L0 2 \n 3 247 100215X 1511… Sydn… 77 Ain… Sydn… <NA> 1 \n 4 337 390201X 2522… Elgi… 22 Elg… Spri… B0M 1X0 1 \n 5 224 100201X 1513… Will… 10 Wil… Sydn… B1N 1R4 2 \n 6 289 150226X 1524… 198 … <NA> Flor… B0C 1J0 2 \n 7 214 612301 9021… Trin… 3 Trin… Yarm… B5A 1P3 2 \n 8 143 350801 2017… Youn… 130 Yo… Truro B2N 3X4 2 \n 9 104 740101 6018… Ritc… 3809 H… Rive… B0J 2W0 2 \n10 174 430901 4017… Dr. … 3792 N… Hali… B3K 3G5 9 \n# ℹ 332 more rows\n# ℹ 15 more variables: residential_units <chr>, housing_authority <chr>,\n# county <chr>, elevator <chr>, oil_heat <chr>, electric_heat <chr>,\n# public_water <chr>, well <chr>, sewer <chr>, onsite_septic <chr>,\n# municipality <chr>, x_coordina <chr>, y_coordina <chr>, location <lgl>,\n# response_id <chr>\n\n\nWow. Now it looks really good! But wait… there’s one more thing. Look at the data types of the columns. They are all characters, which doesn’t seem right because some of our columns appear to contain only numerical data. 
The readr package comes to the rescue with its easy-to-use type_convert() function.\n\npublic_housing = type_convert(public_housing)\n\nprint(public_housing)\n\n# A tibble: 342 × 23\n id property_project pid name address city postal_code\n <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> \n 1 117 330101X 65032708 Riverton Heights I 19/27/… Pict… B0K 1S0 \n 2 324 200214X 50067362 Bellevue 10 Gra… Gran… B0E 1L0 \n 3 247 100215X 15113962 Sydney Senior Dupl… 77 Ain… Sydn… <NA> \n 4 337 390201X 25227687 Elgin Street Villa 22 Elg… Spri… B0M 1X0 \n 5 224 100201X 15130651 William Street S/C 10 Wil… Sydn… B1N 1R4 \n 6 289 150226X 15242738 198 Pitt Street <NA> Flor… B0C 1J0 \n 7 214 612301 90212812 Trinity Place 3 Trin… Yarm… B5A 1P3 \n 8 143 350801 20178653 Young St. Lodge 130 Yo… Truro B2N 3X4 \n 9 104 740101 60187283 Ritcey's Cove Mano… 3809 H… Rive… B0J 2W0 \n10 174 430901 40177982 Dr. Prince Manor (… 3792 N… Hali… B3K 3G5 \n# ℹ 332 more rows\n# ℹ 16 more variables: number_of_floors <dbl>, residential_units <dbl>,\n# housing_authority <chr>, county <chr>, elevator <chr>, oil_heat <chr>,\n# electric_heat <chr>, public_water <chr>, well <chr>, sewer <chr>,\n# onsite_septic <chr>, municipality <chr>, x_coordina <dbl>,\n# y_coordina <dbl>, location <lgl>, response_id <chr>\n\n\nWe have now converted an XML document into an easy-to-use tidy dataset! 
Here is the entire process that we just went through in a single chunk of code.\n\nxml = read_xml(\"https://data.novascotia.ca/api/views/2d4m-9e6x/rows.xml\")\nlist = as_list(xml)\npublic_housing = as_tibble(list) \npublic_housing = unnest_longer(public_housing, col = response)\npublic_housing = unnest_wider(public_housing, col = response) \npublic_housing = unnest(public_housing, cols = names(public_housing)) \npublic_housing = unnest(public_housing, cols = names(public_housing))\npublic_housing = type_convert(public_housing)\n\n\n\n\n3.5.6 Combine data frames\nSometimes the data will come in separate files, so you will have to combine the pieces into a single, tidy dataset.\n\n3.5.6.1 bind_rows\nThe bind_rows() function concatenates data frames vertically, adding new columns when column names differ between data frames.\n\n# Create two tibbles with student info\nstudents1 <- tibble(student_id = c(1,2,3),\n name = c(\"Francis\",\"Sam\",\"Amy\"))\n\nstudents2 <- tibble(name = c(\"Sue\",\"Bill\",\"Lucy\"),\n email = c(\"Sue123@fakemail.com\",\"bill@info620.com\",\"lucy@lucy.org\"))\n\n# Apply bind_rows function\nbind_rows(students1, students2)\n\n# A tibble: 6 × 3\n student_id name email \n <dbl> <chr> <chr> \n1 1 Francis <NA> \n2 2 Sam <NA> \n3 3 Amy <NA> \n4 NA Sue Sue123@fakemail.com\n5 NA Bill bill@info620.com \n6 NA Lucy lucy@lucy.org \n\n\n\n\n3.5.6.2 bind_cols\nThe bind_cols() function concatenates entire data frames horizontally. 
The number of rows in the two data frames must be the same.\n\n# Create a tibble with student info\nstudents3 <- tibble(student_id = c(4,5,6),\n name = c(\"Bill\",\"Lucy\",\"Sue\"))\n\n\n# Apply bind_cols function\nbind_cols(students3, students2)\n\n# A tibble: 3 × 4\n student_id name...2 name...3 email \n <dbl> <chr> <chr> <chr> \n1 4 Bill Sue Sue123@fakemail.com\n2 5 Lucy Bill bill@info620.com \n3 6 Sue Lucy lucy@lucy.org \n\n\nThe result is a tibble with four columns: the two columns from the students3 tibble and the two columns from the students2 tibble. So the function did its job. However, notice how the order of the students is not the same in the two tibbles? Also notice how there are now two columns with the names of the students? That’s because bind_cols() does nothing more than stick the two tibbles together without making any attempt to connect the data from one tibble to the data in the other tibble (e.g. matching the name so that the student_id and the emails are accurately paired). This also means that bind_cols() requires that both tibbles have the same number of rows, otherwise we get an error.\nAn often better way of bringing together two tibbles is to use joins.\n\n\n\n3.5.7 Join data frames\nThe join family of functions is used to combine rows and columns of data frames based on a matching criterion. This means that the number of rows and their order in the joined tibbles do not matter.\nLet’s create two new tibbles to illustrate how the joins work.\n\ninstructors_schools <- tibble(name = c(\"Colin\",\"Sam\",\"Dominika\",\"Alana\"),\n school = c(\"SIM\",\"RSB\",\"SPA\",\"SRES\"))\n \ninstructors_emails <- tibble(name = c(\"Colin\", \"Sam\", \"Sandy\", \"Lindsey\"),\n email = c(\"cc@dalhousie.uni\",\"st@dalhousie.uni\",\"ss@dalhousie.uni\",\"lm@dalhousie.uni\")) \n\n\n3.5.7.1 left_join\nThe left_join() function returns all records from the first data frame and only records from the second data frame for which the matching condition is TRUE. 
Here is an example where we start with a list of instructors and their school (instructors_schools tibble) and want to obtain their email (instructors_emails tibble).\n\nleft_join(instructors_schools, instructors_emails, by=\"name\")\n\n# A tibble: 4 × 3\n name school email \n <chr> <chr> <chr> \n1 Colin SIM cc@dalhousie.uni\n2 Sam RSB st@dalhousie.uni\n3 Dominika SPA <NA> \n4 Alana SRES <NA> \n\n\nNotice how the name column is not duplicated? That is because the name is the key used for the matching of the two tibbles.\n\n\n3.5.7.2 right_join\nThe right_join() function returns all records from the second data frame and only records from the first data frame for which the matching condition is TRUE. Here is an example:\n\nright_join(instructors_schools, instructors_emails, by=\"name\")\n\n# A tibble: 4 × 3\n name school email \n <chr> <chr> <chr> \n1 Colin SIM cc@dalhousie.uni\n2 Sam RSB st@dalhousie.uni\n3 Sandy <NA> ss@dalhousie.uni\n4 Lindsey <NA> lm@dalhousie.uni\n\n\nSince we used a right_join() here, R kept all rows from the right table (the second one listed in the function call), and kept only the names and schools of the instructors that were found in the instructors_emails tibble.\n\n\n3.5.7.3 inner_join\nThe inner_join() function returns only records for which the matching condition is TRUE. It excludes all records that were not matched from both tibbles. 
Here is an example:\n\ninner_join(instructors_schools, instructors_emails, by=\"name\")\n\n# A tibble: 2 × 3\n name school email \n <chr> <chr> <chr> \n1 Colin SIM cc@dalhousie.uni\n2 Sam RSB st@dalhousie.uni\n\n\nColin and Sam were the only instructors for which we had both a school and an email.\n\n\n3.5.7.4 anti_join\nThe anti_join() function is useful if you want to exclude observations (rows) from your tibble when the matching condition is TRUE.\n\nanti_join(instructors_schools, instructors_emails, by=\"name\")\n\n# A tibble: 2 × 2\n name school\n <chr> <chr> \n1 Dominika SPA \n2 Alana SRES \n\n\n\n\n3.5.7.5 full_join\nThe full_join() function is essentially a combination of the left_join() and right_join() functions. It will keep all the rows from both tibbles that are being joined.\n\nfull_join(instructors_schools, instructors_emails, by=\"name\")\n\n# A tibble: 6 × 3\n name school email \n <chr> <chr> <chr> \n1 Colin SIM cc@dalhousie.uni\n2 Sam RSB st@dalhousie.uni\n3 Dominika SPA <NA> \n4 Alana SRES <NA> \n5 Sandy <NA> ss@dalhousie.uni\n6 Lindsey <NA> lm@dalhousie.uni\n\n\nNote: You can use the functions without specifying the column(s) to use for the matching. In this case, all columns with the same names will be used for the matching. Because the name column is the only one that exists in both the tibbles used in this example, the result of the following code is the same as in our inner_join() example above.\n\ninner_join(instructors_schools, instructors_emails)\n\n# A tibble: 2 × 3\n name school email \n <chr> <chr> <chr> \n1 Colin SIM cc@dalhousie.uni\n2 Sam RSB st@dalhousie.uni\n\n\n\n\n\n3.5.8 Rename columns\nFor various reasons, we may want to rename columns in our tibble. We can do this with the rename() function. 
Let’s capitalize the first letter of the column names from our instructors_schools tibble.\nImportant: The syntax of the rename() function goes new_column_name = original_column_name.\n\ninstructors_schools <- rename(instructors_schools, Name = name, School = school)\n\nprint(instructors_schools)\n\n# A tibble: 4 × 2\n Name School\n <chr> <chr> \n1 Colin SIM \n2 Sam RSB \n3 Dominika SPA \n4 Alana SRES \n\n\n\n\n3.5.9 Select columns\nThe select() function is used to retrieve a subset of columns from a data frame or tibble.\n\n# Select the School column from the instructors_schools tibble.\nselect(instructors_schools, School)\n\n# A tibble: 4 × 1\n School\n <chr> \n1 SIM \n2 RSB \n3 SPA \n4 SRES \n\n\nWe can select and rename columns at the same time, like this:\n\n# Select the School column from the instructors_schools tibble \n# and rename it to Department.\nselect(instructors_schools, Department = School)\n\n# A tibble: 4 × 1\n Department\n <chr> \n1 SIM \n2 RSB \n3 SPA \n4 SRES \n\n\n\n\n3.5.10 Filter rows\nThe filter() function is used to retrieve a subset of rows from a data frame or tibble based on some criteria. 
For example, we can filter the instructors_schools tibble to include only the rows where the School is \"RSB\".\n\nfilter(instructors_schools, School == \"RSB\")\n\n# A tibble: 1 × 2\n Name School\n <chr> <chr> \n1 Sam RSB \n\n\nWe can also use the %in% operator to select observations for which the value is in a vector.\n\nfilter(instructors_schools, School %in% c(\"RSB\",\"SPA\"))\n\n# A tibble: 2 × 2\n Name School\n <chr> <chr> \n1 Sam RSB \n2 Dominika SPA \n\n\n\n\n3.5.11 Modify or create new variables\nThe mutate() function is used to create a new variable or update an existing one.\n\n# This adds a column called Faculty that contains the string \"Faculty of Management\".\nmutate(instructors_schools, Faculty = \"Faculty of Management\")\n\n# A tibble: 4 × 3\n Name School Faculty \n <chr> <chr> <chr> \n1 Colin SIM Faculty of Management\n2 Sam RSB Faculty of Management\n3 Dominika SPA Faculty of Management\n4 Alana SRES Faculty of Management\n\n# This updates the already existing variable Faculty with the new value \"FoM\".\nmutate(instructors_schools, Faculty = \"FoM\")\n\n# A tibble: 4 × 3\n Name School Faculty\n <chr> <chr> <chr> \n1 Colin SIM FoM \n2 Sam RSB FoM \n3 Dominika SPA FoM \n4 Alana SRES FoM \n\n\nNote that you can also calculate the value assigned to a variable with the mutate() function.\nYou can also use conditional statements in the mutate() function. For example, say we want to add a new column with the name of the schools spelled out. Of course, the values in that new column will depend on the value in the School column. 
Since there are more than two schools, we should use the case_when() function, like in the following example.\n\ninstructors_schools <- mutate(instructors_schools, \n school_name = case_when(School == \"RSB\" ~ \"Rowe School of Business\",\n School == \"SIM\" ~ \"School of Information Management\",\n School == \"SPA\" ~ \"School of Public Administration\",\n School == \"SRES\" ~ \"School for Resources and Environmental Studies\"))\n\nWe can also use ifelse(condition, value if TRUE, value if FALSE) when there are only two possible outcomes. For example, say some of our instructors are on leave, and we want to add a column to our instructors_schools tibble that indicates whether the instructor is on leave or not. We could do it this way.\n\n# vector of people on leave\ninstructors_on_leave <- c(\"Dominika\",\"Colin\")\n\n# Creates a variable called on_leave that is 1 if the person is on leave and 0 otherwise\ninstructors_schools <- mutate(instructors_schools, \n on_leave = ifelse(Name %in% instructors_on_leave,1,0))\n\nprint(instructors_schools)\n\n# A tibble: 4 × 4\n Name School school_name on_leave\n <chr> <chr> <chr> <dbl>\n1 Colin SIM School of Information Management 1\n2 Sam RSB Rowe School of Business 0\n3 Dominika SPA School of Public Administration 1\n4 Alana SRES School for Resources and Environmental Studies 0
},
{
"objectID": "reading_and_tidying_data.html#export-data",
"href": "reading_and_tidying_data.html#export-data",
"title": "3 Reading and tidying data",
"section": "3.6 Export data",
"text": "3.6 Export data\nNow that you have a nice tidy data set, you may want to export it so you can save it for future use and not have to repeat the data collection and tidying process, or maybe because you want to share it with others.\n\n3.6.1 Export data frames\nExporting data frames is very easy in R (the hardest part was getting there). The readr package has a bunch of functions to write files. The most common are:\n\nwrite_csv() for comma-separated files (and it’s cousin write_csv2() that uses the european format (commas instead of dot for decimals, and semicolon as delimiter);\nwrite_tsv() for tab-delimited files;\nwrite_delim() which allows you to specify the separator (it can be anything you want, even the word “banana”)\nwrite_xlsx() from the writexl package can be used to write Excel files.\n\nHere’s how we would use these functions to write data to a file. Note that, in the example, we use the col_names = TRUE argument to ensure that the first line of the created files will contain the column names. the default setting of write_xlsx() bolds and centers the column names. In the example below, we chose to disable this feature with the format_headers = FALSE argument.\n\nlibrary(readr)\nwrite_csv(my_tibble, file = \"my_data.csv\", col_names = TRUE)\n\nwrite_csv2(my_tibble, file = \"my_data.csv\", col_names = TRUE) \n\nwrite_tsv(my_tibble, file = \"my_data.tsv\", col_names = TRUE)\n\nwrite_delim(my_tibble, file = \"my_data.txt\", delim = \"\\t\", col_names = TRUE)\n\nlibrary(writexl)\nwrite_xlsx(my_tibble, path = \"my_data.xlsx\", col_names = TRUE, format_headers = FALSE)\n\nThe write_xlsx() function can also write multiple tibbles into different sheets of a single Excel file, which can be quite useful. To do that, we need to provide a list of tibbles rather than a single tibble as the intput to the function.\nAssuming that we have already created several tibbles and named them my_tibble1, my_tibble2, and my_tibble3. 
Here’s how we would create a list containing the three tibbles and export it into an Excel file.\n\n# Create a list of tibbles\nmy_tibble_list <- list(my_tibble1, my_tibble2, my_tibble3)\n\nwrite_xlsx(my_tibble_list, path=\"my_data.xlsx\", col_names = TRUE, format_headers = FALSE)\n\n\n\n3.6.2 JSON\nWriting JSON files from a data frame with the toJSON() function is just as easy as reading them with the fromJSON() function. The following code reads a CSV file containing the Titanic dataset, and then exports the data to a JSON file. Note that the pretty=TRUE argument spaces out the JSON file so it is more readable for humans.\n\n# Reads the CSV file into a data frame.\ntitanic <- read_csv(\"https://pmongeon.github.io/info6270/files/data/titanic.csv\")\n\n# Writes the data frame to a JSON file.\nwrite_file(toJSON(titanic, pretty=TRUE), \"titanic.json\")"
},
{
"objectID": "reading_and_tidying_data.html#writing-readable-code",
"href": "reading_and_tidying_data.html#writing-readable-code",
"title": "3 Reading and tidying data",
"section": "3.7 Writing readable code",
"text": "3.7 Writing readable code\nThere are many ways of writing code to achieve the same process. You can structure the data in different ways, use different functions from different packages, but we can also use the exact same functions in exactly the same way but with very different code. Most of our examples so far were relatively simple and the processes did not involve a large number of steps. However, when they did, most of the steps were coded sequentially, one statement at a time. This makes all the steps explicit and clear, but it also makes the code longer than it needs to be. We can also write our code so that the entire process is in a single statement that contains other statements, that contain other statements, an so on. Let’s use an example to compare the two approaches. Say we want to use the Titanic dataset to create a CSV file that contains only the name, sex, age of the survivors. The steps would be:\n\nLoad the Titanic dataset.\nConvert it into a tibble (this step is not actually necessary, we use it to add an extra step to the process).\nFilter the dataset to include only the survivors.\nSelect only the Name, Sex, and Age columns.\nWrite the tibble to a .csv file.\n\nWe can write this in a single statement, like this.\n\nwrite_csv(select(filter(as_tibble(read_csv(\"https://pmongeon.github.io/info6270/files/data/titanic.csv\")),Survived == 1), Name,Sex,Age), file = \"titanic_survivors.csv\", col_names = TRUE)\n\nJust like in mathematical equations, the operations are completed from the inside out. So the first step, read_csv(\"titanic\") is in the middle, and it is inside the statement for the second step, as_tibble(), which is inside the third statemnt, filter(, Survived == 1), and so on. 
The outer statement is the write_csv() function, with its arguments, file = \"titanic_survivors.csv\" and col_names = TRUE, appearing at the very end of the code.\nThis may be efficient in terms of the number of characters needed to write the code, but perhaps a little difficult to read.\nNow let’s do the same thing by writing distinct statements for each step.\n\n# Load the Titanic dataset.\ndata <- read_csv(\"https://pmongeon.github.io/info6270/files/data/titanic.csv\")\n\n# Convert it into a tibble\ndata <- as_tibble(data)\n\n# Filter the dataset to include only the survivors.\ndata <- filter(data, Survived == 1)\n\n# Select only the Name, Sex, and Age columns.\ndata <- select(data, Name, Sex, Age)\n\n# Write the tibble to a .csv file.\nwrite_csv(data, \"titanic_survivors.csv\")\n\nThis is very clear but perhaps not so efficient in terms of characters (the name of the tibble gets repeated many times). Also, all five statements will need to be executed one by one, or we will need to select all the code to run it all at once. This is perhaps not such a big deal, but the single statement version does have an edge there.\n\n3.7.1 The pipe\nThe dplyr and magrittr packages (automatically loaded with the tidyverse) include an extremely useful and popular operator called the pipe, which looks like this in R: %>%. The keyboard shortcut is CTRL+SHIFT+M. The pipe is a wonderful tool for building lean code that is easier to read and to debug. It allows you to write multiple steps sequentially into a single statement. It combines the readability of the step-by-step approach with the efficiency of the single “embedded statement” approach. Here’s the same process written with the pipe.\n\nread_csv(\"https://pmongeon.github.io/info6270/files/data/titanic.csv\") %>% \n as_tibble() %>% \n filter(Survived == 1) %>% \n select(Name, Sex, Age) %>% \n write_csv(\"titanic_survivors.csv\", col_names = TRUE)\n\nWhen reading this code, I like to read the %>% as “and then”. 
So this would read: read the CSV, and then make it a tibble, and then limit the tibble to rows for which the Survived column contains 1, and then select only the Name, Sex, and Age columns, and then write the tibble to a file named “titanic_survivors.csv”.\nAs you can see, this is a lean and easily readable process which doesn’t require us to create an object and explicitly read it, process it, and overwrite it for each step of the process. Note that having each step on a separate line is common and good practice, but it is not required. In fact, R does not care about extra spaces, indentations, or line breaks in the code. We could have written this statement in a single line like this.\n\nread_csv(\"https://pmongeon.github.io/info6270/files/data/titanic.csv\") %>% as_tibble() %>% filter(Survived == 1) %>% select(Name, Sex, Age) %>% write_csv(\"titanic_survivors.csv\", col_names = TRUE)"
},
{
"objectID": "reading_and_tidying_data.html#additional-resources",
"href": "reading_and_tidying_data.html#additional-resources",
"title": "3 Reading and tidying data",
"section": "3.8 Additional resources",
"text": "3.8 Additional resources\nThere are many cheatsheets available for R packages that can be very useful to quickly see what the different functions of the packages can do.\n\n\n\n\nWickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (September): 1–23. https://doi.org/10.18637/jss.v059.i10.\n\n\nWickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition. Sebastopol, CA: O’Reilly."
},
{
"objectID": "reading_and_tidying_data.html#references",
"href": "reading_and_tidying_data.html#references",
"title": "3 Reading and tidying data",
"section": "3.9 References",
"text": "3.9 References\n\n\n\n\nWickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (September): 1–23. https://doi.org/10.18637/jss.v059.i10.\n\n\nWickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition. Sebastopol, CA: O’Reilly."
},
{
"objectID": "publishing_with_r.html#learning-objectives",
"href": "publishing_with_r.html#learning-objectives",
"title": "5 Publishing with R",
"section": "",
"text": "Quarto documents",
"crumbs": [
"<span class='chapter-number'>5</span> <span class='chapter-title'>Publishing with R</span>"
]
},
{
"objectID": "publishing_with_r.html#introduction",
"href": "publishing_with_r.html#introduction",
"title": "5 Publishing with R",
"section": "5.2 Introduction",
"text": "5.2 Introduction\nSo far, we have learned how to import data in all kinds of format into R, convert it into a tibble that follow the first two tidy principles (one observation per row, one variable per column), and clean up messy strings with the stringr package. These steps are the 80 in the 80/20 rules of data science, which states that 80% of the time is spent preparing the data, and 20% is spent analyzing it.\nWe are embarking on our journal to draw insights from the data and write clear and compelling data stories. This chapter introduces Quarto, a suite of tools that will enable you to perform all the data processing AND produce beautiful reports in one simple yet powerful interface: RStudio.",
"crumbs": [
"<span class='chapter-number'>5</span> <span class='chapter-title'>Publishing with R</span>"
]
},
{
"objectID": "publishing_with_r.html#quarto-documents",
"href": "publishing_with_r.html#quarto-documents",
"title": "5 Publishing with R",
"section": "5.3 Quarto documents",
"text": "5.3 Quarto documents\nSo far, we have been working with R scripts (.r files, like the ones we used for Labs 1 to 3). In R scripts, everything is interpreted by R as code unless we specify that we are writing a comment with #. This works fine, and there is no limit to what we can do with these kinds of R scripts regarding data processing and analysis. However, R scripts are not a great tool for making sharing our data stories with the world.\nQuarto brings coding, writing, and formatting together by allowing you to write nicely formatted HTML, Word or PDF documents, with embedded chunks of code that are executed to process the data and create beautiful tables and figures for your document. It uses the markdown (https://en.wikipedia.org/wiki/Markdown) syntax. This course website is an example of the type of output that you can produce with Quarto. It’s a Quarto book.\nThere is a lot that you can do with Quarto that is beyond the scope of this course, but if you want to explore the possibilities, I recommend that you consult the Quarto website, which contains plenty of documentation.\n\n5.3.1 Creating a Quarto document in RStudio\nTo create a new Quarto document (.qmd), you can click on file, then new file, then select the second option (Quarto Document…). Then RStudio will allow you to provide some information about the document you are creating.\n\n\n\n\n\nYou can select different types of outputs, like documents, presentations, and Shiny applications. In this course, we will use R Markdown to make documents. You can choose to output an HTML document, a PDF document, or a Word document. I recommend sticking with the default (HTML). Word is also a good option if you want to be able to modify the format or add text in Word. 
The PDF format is awesome but can sometimes be a little buggy as it requires additional packages and a LaTeX installation, so if you are struggling with PDF rendering, I recommend sticking to HTML documents and Word documents for this course.\nWhen you create a new Quarto document, it comes with some content that can get you started with the components of the document and the syntax. This can give you a quick overview of Quarto and the markdown syntax.\n\n\n5.3.2 Three components of a Quarto document\nThere are three components to a Quarto document: metadata, text, and code. All components are optional, so you can write a Quarto document without metadata, or include only code or text.\n\n5.3.2.1 Metadata\nThe metadata is located at the top of the Quarto document and is delineated by three dashes --- at the beginning and the end. When you create a new Quarto document in RStudio, the metadata will look like this:\n---\ntitle: \"My document\"\nauthor: \"Philippe Mongeon\"\nformat: html\neditor: visual\n---\nTo change the output type to a PDF or a Word document, we simply need to replace html with docx or pdf. We can also add features such as a table of contents and specify where on the page we want it to appear. Note that toc-location does not work for PDF and Word documents, which always place the table of contents in the main body of the document. The number-sections: true option numbers the sections. Here’s what this looks like.\n---\ntitle: \"My document\"\nauthor: \"Philippe Mongeon\"\nformat: \n html:\n toc: true\n toc-location: left \n number-sections: true\neditor: visual\n---\n\n\n\n\n\n\nImportant\n\n\n\nNote that the html tag was moved to a new line and is now followed by a colon. Also, pay attention to the indentation of the different parts, which is essential to follow. 
When facing errors, it is always a good idea to fall back on a working example like the one above or to consult the Quarto documentation.\n\n\n\n\n\n\n\n\nIMPORTANT: Save yourself some pain by doing this\n\n\n\nWhen rendering an HTML document, a .html file gets created, along with a folder that contains the images and other files that are necessary for formatting the document when it is opened in a browser. This means that you need to send a bunch of files every time you want to share your work. This can be largely avoided by adding self-contained: true in your metadata, like this:\n\n\n---\ntitle: \"My document\"\nauthor: \"Philippe Mongeon\"\nformat: html\nself-contained: true\n---\nNote: if your code reads from a file on your computer, you will still need to share that file along with your QMD file if you want someone to be able to run your code and re-render your document on their end.\n\n\n5.3.2.2 Text\nWhatever you write after the metadata statement is text, which can be formatted using markdown syntax. Here is an example, with the code on the left and the HTML output on the right.\n\n\n\n\n\n\n\n5.3.2.3 Code\nWhat makes Quarto so great is that we can embed code chunks into our document. We have seen a lot of those code chunks throughout this website so far. They start with three backticks (`) followed by curly brackets with the coding language used in the chunk and end with three backticks.\n```{r}\nsome r code\n```\nWe can also set options for the code chunks to tell R whether to display the code itself, its output, or both. #| echo: FALSE tells R not to display the code in the published document. #| eval: FALSE tells R not to execute the code (so the output of the code will not be displayed in the published document because there will be no output). Here is an example of each option. The code is on the left, and the HTML output is on the right.\n\n\n\n\n\nOther useful code chunk options are #| message: FALSE, #| warning: FALSE, and #| error: FALSE. 
Sometimes your code will produce error messages, warnings, or other messages. Typically, we do not want these to show in our documents, so setting these options to FALSE is the way to go.\n\n\n\n5.3.3 Visual mode\nThe visual mode provides an interface and tools to make writing and formatting your Quarto document easier. Here is the same .qmd file edited in normal mode (left) and visual mode (right).\n\n\n\n\n\n\n\n5.3.4 Rendering your document\nWhen you are ready to produce your HTML, Word or PDF document, you click the Render button in RStudio. This will execute all the code and output the HTML, Word or PDF file in your working directory. You can also choose, in the settings, to preview the output in a popup window or the viewer pane (as in the screenshots above).",
"crumbs": [
"<span class='chapter-number'>5</span> <span class='chapter-title'>Publishing with R</span>"
]
},
{
"objectID": "publishing_with_r.html#making-beautiful-tables",
"href": "publishing_with_r.html#making-beautiful-tables",
"title": "5 Publishing with R",
"section": "5.4 Making beautiful tables",
"text": "5.4 Making beautiful tables\nA lot of times, we write statements that print tibbles as their output. So part of making beautiful documents is learning how to print our tibbles as nicely formatted tables.\nHere are the key rules for table layout proposed by Wilke (2019):\n\nDo not use vertical lines.\nDo not use horizontal lines between data rows. (Horizontal lines as the separator between the title row and the first data row or as a frame for the entire table are fine.)\nText columns should be left aligned.\nNumber columns should be right-aligned and use the same number of decimal digits throughout.\nColumns containing single characters are centred.\nThe header fields are aligned with their data, i.e., the heading for a text column will be left aligned, and the heading for a number column will be right aligned.\nCaptions are placed above the table.\n\n\n5.4.1 the kableExtra package\nThere are a bunch of popular packages that you can explore to make nice tables. The kableExtra package is very versatile, and I recommend it as a start. You can find all the information you need to fully exploit the package in the official vignette.\nThe following code uses the kbl() function to print the first 6 rows (this is done with the head() function) of the mpg dataset.\n\nlibrary(kableExtra)\nmpg %>% \n head() %>% \n kbl()\n\n\n\n\nmanufacturer\nmodel\ndispl\nyear\ncyl\ntrans\ndrv\ncty\nhwy\nfl\nclass\n\n\n\n\naudi\na4\n1.8\n1999\n4\nauto(l5)\nf\n18\n29\np\ncompact\n\n\naudi\na4\n1.8\n1999\n4\nmanual(m5)\nf\n21\n29\np\ncompact\n\n\naudi\na4\n2.0\n2008\n4\nmanual(m6)\nf\n20\n31\np\ncompact\n\n\naudi\na4\n2.0\n2008\n4\nauto(av)\nf\n21\n30\np\ncompact\n\n\naudi\na4\n2.8\n1999\n6\nauto(l5)\nf\n16\n26\np\ncompact\n\n\naudi\na4\n2.8\n1999\n6\nmanual(m5)\nf\n18\n26\np\ncompact\n\n\n\n\n\n\n\nAs you can see, the table meets most of the criteria listed above. Our table lacks a caption, and the drv column should be centred. 
We can fix that by using the caption arguments and manually setting the columns’ alignment with the align argument (“l” = left, “r” = right, and “c” = centre).\n\nmpg %>% \n head() %>% \n kbl(caption = \"This is the caption of the table\",\n align=c(\"l\",\"l\",\"r\",\"r\",\"r\",\"l\",\"c\",\"r\",\"r\",\"c\",\"l\"))\n\n\nThis is the caption of the table\n\n\nmanufacturer\nmodel\ndispl\nyear\ncyl\ntrans\ndrv\ncty\nhwy\nfl\nclass\n\n\n\n\naudi\na4\n1.8\n1999\n4\nauto(l5)\nf\n18\n29\np\ncompact\n\n\naudi\na4\n1.8\n1999\n4\nmanual(m5)\nf\n21\n29\np\ncompact\n\n\naudi\na4\n2.0\n2008\n4\nmanual(m6)\nf\n20\n31\np\ncompact\n\n\naudi\na4\n2.0\n2008\n4\nauto(av)\nf\n21\n30\np\ncompact\n\n\naudi\na4\n2.8\n1999\n6\nauto(l5)\nf\n16\n26\np\ncompact\n\n\naudi\na4\n2.8\n1999\n6\nmanual(m5)\nf\n18\n26\np\ncompact\n\n\n\n\n\n\n\nThe kableExtra package includes some standard themes that you can use to format your tables: kable_classic(), kable_classic_2, kable_minimal(), kable_material(), kable_material_dark(), and kable_paper(). Here’s the same table to which we apply the classic theme.\n\nmpg %>% \n head() %>% \n kbl(caption = \"This is the caption of the table\",\n align=c(\"l\",\"l\",\"r\",\"r\",\"r\",\"l\",\"c\",\"r\",\"r\",\"c\",\"l\")) %>% \n kable_classic()\n\n\nThis is the caption of the table\n\n\nmanufacturer\nmodel\ndispl\nyear\ncyl\ntrans\ndrv\ncty\nhwy\nfl\nclass\n\n\n\n\naudi\na4\n1.8\n1999\n4\nauto(l5)\nf\n18\n29\np\ncompact\n\n\naudi\na4\n1.8\n1999\n4\nmanual(m5)\nf\n21\n29\np\ncompact\n\n\naudi\na4\n2.0\n2008\n4\nmanual(m6)\nf\n20\n31\np\ncompact\n\n\naudi\na4\n2.0\n2008\n4\nauto(av)\nf\n21\n30\np\ncompact\n\n\naudi\na4\n2.8\n1999\n6\nauto(l5)\nf\n16\n26\np\ncompact\n\n\naudi\na4\n2.8\n1999\n6\nmanual(m5)\nf\n18\n26\np\ncompact\n\n\n\n\n\n\n\n\n\n5.4.2 keeping it simple\nMy main advice with tables is to keep them clean, simple, and informative. Usually, simple tables are enough to convey the message efficiently. 
If you want to impress with fancy tables, this can certainly be done with kableExtra or some other table packages (an overview of some of the main packages is available on this website).",
"crumbs": [
"<span class='chapter-number'>5</span> <span class='chapter-title'>Publishing with R</span>"
]
},
{
"objectID": "publishing_with_r.html#summary",
"href": "publishing_with_r.html#summary",
"title": "5 Publishing with R",
"section": "5.5 Summary",
"text": "5.5 Summary\nQuarto documents combine the data processing, analysis and reporting elements of the data science workflow in a single location. The visual tool brings the Quarto document writing experience closer to writing in Word or by hiding the code and letting you visualize the format of your document as you write them.\nTo get you used to Quarto, we will use it for the remaining course labs. You are also encouraged to use it for your project.\n\n\n\n\nWilke, C. 2019. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. First edition. Sebastopol, CA: O’Reilly Media.",
"crumbs": [
"<span class='chapter-number'>5</span> <span class='chapter-title'>Publishing with R</span>"
]
},
{
"objectID": "publishing_with_r.html#references",
"href": "publishing_with_r.html#references",
"title": "5 Publishing with R",
"section": "5.6 References",
"text": "5.6 References\n\n\n\n\nWilke, C. 2019. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. First edition. Sebastopol, CA: O’Reilly Media."
},
{
"objectID": "visualizing_data.html#introduction",
"href": "visualizing_data.html#introduction",
"title": "7 Visualizing data",
"section": "7.1 Introduction",
"text": "7.1 Introduction\nOne of the most widely used package for data visualization in R is called ggplot2 (https://ggplot2.tidyverse.org). It has a bit of a learning curve but it is extremely powerful and can visualize almost anything. The R for data science book by Wickham and Grolemund (2016) has a chapter on ggplot2, and the same author also wrote an entire book on ggplot2 (Wickham 2016) , which I also recommend. The book Data visualization: a practical introduction book by Healy (2018) is another good resource. These books cover a lot of ground on how to write code to visualize data in R, but ultimately, an understanding of of the fundamentals of data visualization (what does a good visualization look like?) is also an invaluable asset for you as a data scientist. For that, I recommend the Fundamentals of data visualization book by Wilke (2019). Another wonderful resource is the R Graph Gallery website, which contains a large collection of ggplot graph examples that include the code so you can easily use it in your own scripts.\nThis chapter draws from these resources to provide a short introduction to visualizing data with ggplot2. First, we will explore the ggplot syntax, and then explore different types of visualizations with some examples and some exercises. Like most of the other packages that we used so far, there is a ggplot 2 cheatsheet that can be useful for a quick reference."
},
{
"objectID": "visualizing_data.html#the-ggplot-syntax",
"href": "visualizing_data.html#the-ggplot-syntax",
"title": "7 Visualizing data",
"section": "7.2 The ggplot syntax",
"text": "7.2 The ggplot syntax\nThe minimum requirements to make a plot is data, an aesthetic argument aes(), and a geometric object geom(). Just like the pipe (%>%) is used to build R statements layer by layer, ggplot follows the same principle but uses the plus sign “+” to add layers to the plot. Let’s construct a basic dot plot to see how it works.\n\n7.2.1 The data\nThis is the starting point. We tell ggplot what object contains the data that we want to visualize.\n\nggplot(mpg)\n\n\n\n\nTelling R to just plot the data produces nothing because we did not provide information on the dimensions of the graph (x and y axis) or how to show the data in the graph.\n\n\n7.2.2 Aesthetic mappings\nWe can use aes() to tell R what variables we want to use for the x and y axes.\n\nggplot(mpg) +\n aes(x=displ, y=hwy)\n\n\n\n\nThe provided aes() were used by ggplot to generate a blank plot. We now need to tell how we want the data to be visualized in this plot with a geom() layer.\n\n\n7.2.3 Geoms\nThe geom_ function family allow you to specify how you want the data to be displayed on the graph. There are often called layers and you can have multiple layers in a single figure.\n\nggplot(mpg) +\n aes(x=displ, y=hwy) + \n geom_point()\n\n\n\n\nThat’s it! Now you know how to make a plot in R. Let’s look at a few more examples on adding layers and making our graph look better. First, let’s add a second geom() layer on top of this graph to make it more informative.\n\n\n7.2.4 Adding a second geom\n\nggplot(mpg) +\n aes(x=displ, y=hwy) + \n geom_point() +\n geom_smooth()"
},
{
"objectID": "visualizing_data.html#making-pretty-figures",
"href": "visualizing_data.html#making-pretty-figures",
"title": "7 Visualizing data",
"section": "7.3 Making pretty figures",
"text": "7.3 Making pretty figures\n\n7.3.1 Modifying labels\n\nggplot(mpg) +\n aes(x=displ, y=hwy) + \n geom_point() +\n geom_smooth() +\n labs(x = \"Engine displacement (litres)\",\n y = \"Highway miles per gallon\",\n title = \"Mileage by engine size and cylinders\",\n subtitle = \"Source: http://fueleconomy.gov\")\n\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n\n\n\n\n\n\n\n7.3.2 Themes\nin ggplot, you can modify the layout of your graphs with themes. There are standard themes available that you can use like this:\n\nggplot(mpg) +\n aes(x=displ, y=hwy) + \n geom_point() +\n geom_smooth() +\n labs(x = \"Engine displacement (litres)\",\n y = \"Highway miles per gallon\",\n title = \"Mileage by engine size and cylinders\",\n subtitle = \"Source: http://fueleconomy.gov\") +\n theme_classic()\n\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n\n\n\n\n\nThere are a lot of things you can modify and a whole range of possible arguments for the theme() function. The best way to find out how to do what you are trying to do will often be a Google search, or the theme() documentation that you can view in RStudio ?theme(). 
For example, if we wanted to move the legend to the bottom of the graph, center the title, and make it bigger, we could do it like this:\n\nggplot(mpg) +\n aes(x=displ, y=hwy) + \n geom_point(aes(colour=class)) +\n geom_smooth() +\n labs(x = \"Engine displacement (litres)\",\n y = \"Highway miles per gallon\",\n title = \"Mileage by engine size and cylinders\",\n subtitle = \"Source: http://fueleconomy.gov\") +\n theme_classic() +\n theme(legend.position = \"bottom\") +\n theme(plot.title = element_text(size = 20, hjust = .5))\n\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n\n\n\n\n\n\n\n7.3.3 Colours\nWe can specify the colour of our geoms.\n\nggplot(mpg) +\n aes(x=displ, y=hwy) + \n geom_point(colour = \"green\") +\n geom_smooth(colour = \"red\") +\n labs(x = \"Engine displacement (litres)\",\n y = \"Highway miles per gallon\",\n title = \"Mileage by engine size and cylinders\",\n subtitle = \"Source: http://fueleconomy.gov\") +\n theme_classic()\n\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n\n\n\n\n\nPerhaps we’d rather have the colour be based on a variable in the dataset. In this case the colour goes into an aes() function within the geom_point() function, like this:\n\nggplot(mpg) +\n aes(x=displ, y=hwy) + \n geom_point(aes(colour=class)) +\n geom_smooth() +\n labs(x = \"Engine displacement (litres)\",\n y = \"Highway miles per gallon\",\n title = \"Mileage by engine size and cylinders\",\n subtitle = \"Source: http://fueleconomy.gov\") +\n theme_classic()\n\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n\n\n\n\n\nYou can customize the colours in your graphs by using one of many palettes provided by different packages. Here is a great place to explore palettes and where to get them: <https://emilhvitfeldt.github.io/r-color-palettes/discrete.html>. 
In the following example, I use the colorblind palette from the ggthemes package.\n\n# First I load the ggthemes package\nlibrary(ggthemes)\n\nWarning: package 'ggthemes' was built under R version 4.3.2\n\nggplot(mpg) +\n aes(x=displ, y=hwy, colour=class) + \n geom_point() +\n labs(x = \"Engine displacement (litres)\",\n y = \"Highway miles per gallon\",\n title = \"Mileage by engine size and cylinders\",\n subtitle = \"Source: http://fueleconomy.gov\") +\n scale_color_colorblind()\n\n\n\n\nHere’s how you can specify the colours yourself using scale_colour_manual() (or scale_fill_manual() when working with the fill aesthetic).\n\nggplot(mpg) +\n aes(x=displ, y=hwy, colour=class) + \n geom_point() +\n labs(x = \"Engine displacement (litres)\",\n y = \"Highway miles per gallon\",\n title = \"Mileage by engine size and cylinders\",\n subtitle = \"Source: http://fueleconomy.gov\"\n ) +\n scale_color_manual(values = c(\"blue\",\"red\",\"green\",\"yellow\",\"purple\",\"orange\",\"pink\"))\n\n\n\n\nThe code above assigned the colours in order (the first group gets the first colour, the second group the second colour, etc.). You can also specify the group colours like this:\n\nggplot(mpg) +\n aes(x=displ, y=hwy, colour=class) + \n geom_point() +\n labs(x = \"Engine displacement (litres)\",\n y = \"Highway miles per gallon\",\n title = \"Mileage by engine size and cylinders\",\n subtitle = \"Source: http://fueleconomy.gov\"\n ) +\n scale_color_manual(values = c(\"2seater\" = \"pink\",\n \"suv\"=\"red\",\n \"minivan\" = \"green\",\n \"pickup\" = \"yellow\",\n \"compact\" = \"blue\",\n \"subcompact\" = \"orange\",\n \"midsize\"=\"purple\"))\n\n\n\n\n\n\n7.3.4 Adjusting axes scales\nWe can adjust the limits and breaks of our axes with scale_x_continuous() and scale_y_continuous(). 
The limits argument needs two values (the start and the end of the axis), and the breaks argument is a vector of the value labels you want to show on the axis.\n\nggplot(mpg) +\n aes(x=displ, y=hwy) + \n geom_point(aes(colour=class)) +\n labs(x = \"Engine displacement (litres)\",\n y = \"Highway miles per gallon\",\n title = \"Mileage by engine size and cylinders\",\n subtitle = \"Source: http://fueleconomy.gov\") +\n theme(legend.position = \"bottom\") +\n theme(plot.title = element_text(size = 20, hjust = .5)) +\n scale_x_continuous(limits = c(0,10), breaks = c(0,2,4,6,8,10)) +\n scale_y_continuous(limits = c(0,100), breaks = c(0,20,40,60,80,100))"
},
{
"objectID": "visualizing_data.html#choosing-the-right-visualization-for-your-data",
"href": "visualizing_data.html#choosing-the-right-visualization-for-your-data",
"title": "7 Visualizing data",
"section": "7.4 Choosing the right visualization for your data",
"text": "7.4 Choosing the right visualization for your data\nThe ggplot2 website provides a comprehensive list of the geoms and all the other functions that you can use with ggplot. You can can also click on any function listed to get more information, including the arguments that the function accepts or requires, as well as an example.\nAs you will see, there is a lot of things that you can do with ggplot. That’s because ggplot is meant to fulfill the needs of a large community of users working in completely different industries with completely different kinds of data and completely different objectives. Ideally, when you are attempting to visualize your data, you already have a pretty clear idea of what the data looks like and what you are trying to accomplish, so you can start by asking these few questions, so that you can identify the limited set of options that are relevant for your data and your goals. Figuring out your options “on paper” before jumping in the code will help guide you and protect you from information overload and potentially a lot of wasted time trying to create plots using geoms that simply do not work for you data. Here are the questions:\n\nWhat question am I trying to answer with this plot?\nHow many variables do I want to visualize?\nWhat type of variables do I want to visualize?\n\nThe directory of visualizations proposed by Wilke (2019) is a great resource to help you think about this. 
You can also use the following table that lists geoms based on the number and types of data to be plotted.\nMapping of graph types to data number and types\n\n\n\n\n\n\n\nVariables\nTypical graph\n\n\n\n\nSingle discrete/categorical\nBar chart\n\n\nSingle continuous\nHistogram\n\n\nTwo continuous\nScatter plot, line graph\n\n\nTwo discrete/categorical\nBar chart\n\n\nOne discrete/categorical, one continuous\nBar chart, box plot, dot plot\n\n\n\n\n7.4.1 Single discrete/categorical variable\n\n7.4.1.1 Bar chart\nAs we saw in chapter 6, categorical variables can often be best represented with frequency tables. Put simply, a bar chart is nothing more than the visual representation of a frequency table. So let’s look at our categorical and discrete variables to see which ones we might want to visualize with a bar chart. The only categorical variables that we should really avoid visualizing with a bar chart are those with too many possible values, so let’s count the number of possible values for each of our categorical variables.\n\nmpg %>%\n pivot_longer(cols = c(\"manufacturer\",\"model\",\"trans\",\"drv\",\"fl\",\"class\"), \n names_to = \"variable\",\n values_to = \"value\") %>% \n select(variable, value) %>% \n unique() %>% \n group_by(variable) %>% \n summarize(possible_value = n()) %>% \n kbl()\n\n\n\n\nvariable\npossible_value\n\n\n\n\nclass\n7\n\n\ndrv\n3\n\n\nfl\n5\n\n\nmanufacturer\n15\n\n\nmodel\n38\n\n\ntrans\n10\n\n\n\n\n\n\n\nThese are all reasonable numbers of possible values, so all the variables are good candidates for bar charts. 
Let’s use ggplot to make a bar chart representing the frequency of observations for each manufacturer in the mpg dataset.\n\nggplot(mpg) +\n aes(manufacturer) +\n geom_bar()\n\n\n\n\nWe can see that the names of the manufacturers overlap, so let’s fix that by giving a 45 degree angle to these labels.\n\nggplot(mpg) +\n aes(manufacturer) +\n geom_bar() +\n theme(axis.text.x = element_text(angle = 45))\n\n\n\n\n\n\n\n7.4.2 Single continuous variable\n\n7.4.2.1 Histogram\n\nggplot(mpg) +\n aes(hwy) +\n geom_histogram()\n\n\n\n\n\n\n\n7.4.3 Two continuous variables\n\n7.4.3.1 Scatter plot\nThe example we used earlier happens to be a good example of a scatter plot on which we added a trend line with the geom_smooth function.\n\nggplot(mpg) +\n aes(x=displ, y=hwy) + \n geom_point(aes(colour=class)) +\n geom_smooth(method = \"loess\")\n\n\n\n\n\n\n7.4.3.2 Line graph\n\nggplot(mpg) +\n aes(x=hwy, y=cty) +\n geom_line() +\n ylab(\"Miles per gallon (city)\") +\n xlab(\"Miles per gallon (highway)\")\n\n\n\n\n\n\n\n7.4.4 Two categorical or discrete variables\n\n7.4.4.1 Grouped bar chart\n\nmpg %>% \n select(manufacturer, cyl) %>% \n mutate(cyl = as.character(cyl)) %>% \n group_by(manufacturer, cyl) %>% \n mutate(count = n()) %>% \n ggplot() + \n aes(x=manufacturer, y=count, fill=cyl) + \n geom_bar(position=\"dodge\", stat=\"identity\") +\n theme(axis.text.x = element_text(angle = 45))\n\n\n\n\n\n\n7.4.4.2 Stacked bar chart\nThe previous example isn’t looking too great. Maybe if we stacked the bars? 
We only have to make one change to the code: geom_bar(position = \"stack\") instead of “dodge”.\n\nmpg %>% \n select(manufacturer, cyl) %>% \n mutate(cyl = as.character(cyl)) %>% \n group_by(manufacturer, cyl) %>% \n mutate(count = n()) %>% \n ggplot() + \n aes(x=manufacturer, y=count, fill=cyl) + \n geom_bar(position=\"stack\", stat=\"identity\") +\n theme(axis.text.x = element_text(angle = 45))\n\n\n\n\n\n\n7.4.4.3 Percent stacked bar chart\nWe can easily do a percent stacked bar chart by changing the position argument of the geom_bar() to “fill”.\n\nmpg %>% \n select(manufacturer, cyl) %>% \n mutate(cyl = as.character(cyl)) %>% \n group_by(manufacturer, cyl) %>% \n mutate(count = n()) %>% \n ggplot() + \n aes(x=manufacturer, y=count, fill=cyl) + \n geom_bar(position=\"fill\", stat=\"identity\") +\n theme(axis.text.x = element_text(angle = 45))\n\n\n\n\n\n\n\n7.4.5 One categorical and one continuous variable\n\n7.4.5.1 Box plot\nLet’s use a box plot to compare the miles per gallon performance of cars on the highway depending on the number of cylinders in the engine. As you can see in the code below, the cyl variable needs to be converted to a character (or a factor), otherwise ggplot treats it as a numerical variable instead of a categorical variable.\n\nggplot(mpg) +\n aes(x=as.character(cyl), y = hwy) +\n geom_boxplot() +\n theme(axis.text.x = element_text(angle = 45))\n\n\n\n\n\n\n7.4.5.2 Jitter plot\nLet’s try the same thing with a jitter plot.\n\nggplot(mpg) +\n aes(x=as.character(cyl), y = hwy) +\n geom_jitter(width=0.2)\n\n\n\n\nYou can add the jitter plot on top of the box plot!\n\nggplot(mpg) +\n aes(x=as.character(cyl), y = hwy) +\n geom_boxplot() +\n theme(axis.text.x = element_text(angle = 45)) +\n geom_jitter(width=0.2)\n\n\n\n\n\n\n\n7.4.6 Facets\nYou may want to produce a distinct plot for each of your groups in order to have more space to show your data. For instance, the plot above is a vertical display of 4 different distributions. 
It’s space efficient and makes it easy to compare the distributions, but maybe you have space and would rather show a panel of 4 distributions using histograms, for instance. This is done with the facet_grid() and facet_wrap() functions. Here’s an example with facet_wrap().\n\nmpg %>% \n mutate(cyl = as.character(cyl)) %>% \nggplot() +\n aes(hwy) +\n geom_histogram() +\n facet_wrap(facets = \"cyl\", ncol=2)\n\n`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.\n\n\n\n\n\nWe can use variables to position our facets with facet_grid(). For example, the cyl variable has four possible values (4,5,6,8) and the drv variable has three possible values (4, f, r). We can use those variables to create a 4x3 grid of histograms.\n\nmpg %>% \n mutate(cyl = as.character(cyl)) %>% \nggplot() +\n aes(hwy) +\n geom_histogram() +\n facet_grid(cols = vars(cyl), rows = vars(drv))\n\n`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.\n\n\n\n\n\n\n\n\n\nHealy, Kieran. 2018. Data Visualization: A Practical Introduction. Princeton, NJ: Princeton University Press.\n\n\nWickham, Hadley. 2016. Ggplot2. Springer International Publishing. https://doi.org/10.1007/978-3-319-24277-4.\n\n\nWickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition. Sebastopol, CA: O’Reilly.\n\n\nWilke, C. 2019. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. First edition. Sebastopol, CA: O’Reilly Media."
},
{
"objectID": "visualizing_data.html#references",
"href": "visualizing_data.html#references",
"title": "7 Visualizing data",
"section": "7.5 References",
"text": "7.5 References\n\n\n\n\nHealy, Kieran. 2018. Data Visualization: A Practical Introduction. Princeton, NJ: Princeton University Press.\n\n\nWickham, Hadley. 2016. Ggplot2. Springer International Publishing. https://doi.org/10.1007/978-3-319-24277-4.\n\n\nWickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition. Sebastopol, CA: O’Reilly.\n\n\nWilke, C. 2019. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. First edition. Sebastopol, CA: O’Reilly Media."
},
{
"objectID": "linear_regression.html#introduction",
"href": "linear_regression.html#introduction",
"title": "10 Linear regression",
"section": "10.1 Introduction",
"text": "10.1 Introduction\nIn the previous chapter, we learned that regression is a method used to determine the relationship between a dependent variable (the variable we want to predict) and one or more independent variables (the predictors available to make the prediction). In the last chapter, we used logistic regression to predict a dichotomous variable. In this chapter we will learn how to use linear regression to predict a continuous dependent variable."
},
{
"objectID": "linear_regression.html#linear-regression",
"href": "linear_regression.html#linear-regression",
"title": "10 Linear regression",
"section": "10.2 Linear regression",
"text": "10.2 Linear regression\nLinear regression uses a straight line to model the relationship between categorical or numerical predictors (independent variables) and a numerical predicted value (dependent variable). for a single predictor variable, the formula for the prediction is:\n\\[ y = intercept + slope × x \\]\nIn statistical terms, the same formula is written like this:\n\\[ y = β_0 + βx \\]\nAnd if we have multiple independent variables, the formula becomes:\n\\[ y = β_0 + β_1x_1 + β_2x_2 .... β_nx_n \\]\nTo determine how well the model fits the data (how well does x predict y. the linear model uses the square of the residuals (r2). The residuals are the difference between the predicted values and the real data, measured by the vertical distance between the line and the data points. Here’s a figure from Rhys (2020) to help you visualize this.\n\n\n\n\n\nWe can see in this figure that the intercept is where the line crosses the y-axis. The slope is calculated by dividing the difference in the predicted value of y by the difference in the value of x.\nWhen working with categorical predictors, the intercept is the mean value of the base category, and the slope is the difference between the means of each category. Here’s an example taken again from Rhys (2020)."
},
{
"objectID": "linear_regression.html#building-a-linear-regression-model",
"href": "linear_regression.html#building-a-linear-regression-model",
"title": "10 Linear regression",
"section": "10.3 Building a linear regression model",
"text": "10.3 Building a linear regression model\nLet’s use on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013 to try to build a model that will predicted delayed arrival. For this we will use the flights dataset included in the nycflights13 package. We will consider the following variables in our model:\n\norigin: airport of departure (JFK, LGA, EWR)\ncarrier (we will only compare United Airlines - UA, and American Airlines - AA)\ndistance: flight distance in miles.\ndep_delay: delay of departure in minutes\narr_delay: delay of arrival in minutes (this is our independent variable)\n\n\nlibrary(nycflights13)\ndata <- flights %>% \n filter(carrier %in% c(\"UA\", \"AA\")) %>% \n select(origin, carrier, distance, dep_delay, arr_delay) %>% \n mutate(origin = as_factor(origin),\n carrier = as_factor(carrier)) %>% \n drop_na()\n\nhead(data) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\norigin\ncarrier\ndistance\ndep_delay\narr_delay\n\n\n\n\nEWR\nUA\n1400\n2\n11\n\n\nLGA\nUA\n1416\n4\n20\n\n\nJFK\nAA\n1089\n2\n33\n\n\nEWR\nUA\n719\n-4\n12\n\n\nLGA\nAA\n733\n-2\n8\n\n\nJFK\nUA\n2475\n-2\n7"
},
{
"objectID": "linear_regression.html#visualizing-the-relationship-between-the-variables",
"href": "linear_regression.html#visualizing-the-relationship-between-the-variables",
"title": "10 Linear regression",
"section": "10.4 Visualizing the relationship between the variables",
"text": "10.4 Visualizing the relationship between the variables\nWe can explore the relationship between our independent variables and our independent variable. How we will approach this depends on the type of independent variables we have.\n\n10.4.1 Continuous independent variables\nFor continuous independent variables, we do scatter plots with a fitted regression line. We can see in the plot below that there appears to be a linear relationship between the delay at departure and the delay at arrival (which is of course not so surprising). Here we used the stat_poly_line() and the stat_poly_eq() functions from the ggpmisc package to display the slope and the coefficient of determination (R2) of the regression line.\n\n\n\n\n\n\nThe coefficient of determination (R2)\n\n\n\nThe coefficient of determination should not be mistaken for the square of residuals, even thought they have the same notation (R2). The coefficient of determination tells us how well the regression line fits the data. It’s value ranges from 0 to 1. An R2 of 0 means that the linear regression model doesn’t predict your dependent variable any better than just using the average, and a value of 1 indicates that the model perfectly predicts the exact value of the dependent variable.\nAnother way to interpret the coefficient of determination is to consider it as a measure of the variation in the dependent variable is explained (or determined) by the model. For instance, a R2 of 0.80 indicates that 80% of the variation in the dependent variable is explained by the model.\n\n\n\ndata %>%\n ggplot() +\n aes(dep_delay, arr_delay) + \n stat_poly_line() +\n stat_poly_eq(aes(label = paste(after_stat(eq.label),\n after_stat(rr.label), sep = \"*\\\", \\\"*\"))) +\n geom_point() + \n geom_smooth(method=\"lm\")\n\n\n\n\nWe can see that the the linear regression model using considering only the departure delay explains 79% of the variation in the arrival delay. That makes dep_delay a good predictor in our model. 
On the other hand, the flight distance explains less than 1% of the arrival delays, as we can see in the next graph. This makes distance a very weak predictor of arrival delays.\n\n10.4.1.1 Distance\n\ndata %>%\n ggplot() +\n aes(distance, arr_delay) + \n stat_poly_line() +\n stat_poly_eq(aes(label = paste(after_stat(eq.label),\n after_stat(rr.label), sep = \"*\\\", \\\"*\"))) +\n geom_point() + \n geom_smooth(method=\"lm\")\n\n\n\n\n\n\n\n10.4.2 Categorical independent variables\nTo test the linear relationship between a categorical independent variable and the dependent variable, we use the same approach: a scatterplot with geom_smooth(). However, geom_smooth does not work with factors and requires numerical data. Therefore, we need to convert our factors into numerical variables to visualize the trend line with geom_smooth().\nAlso, when dealing with factor variables with more than two categories (factors with more than 2 levels), we need to choose a base level to which we will compare each of the other levels. In other words, all our graphs should display only two categories.\n\n10.4.2.1 Carrier\nThe plot below shows that there is hardly any relationship between the carrier and the arrival delay, with the variable explaining less than 1% of the variation in arrival delays.\n\ndata %>%\n mutate(carrier = as.numeric(carrier)) %>%\n ggplot() +\n aes(carrier, arr_delay) + \n stat_poly_line() +\n stat_poly_eq(aes(label = paste(after_stat(eq.label),\n after_stat(rr.label), sep = \"*\\\", \\\"*\"))) +\n scale_x_continuous(breaks = c(1,2)) +\n geom_jitter(size = 1) +\n geom_smooth(method=\"lm\") \n\n\n\n\n\n\n10.4.2.2 Origin\nSince origin has three levels (EWR, LGA and JFK), we want to choose a base level and compare each of the other levels to it. 
Let’s see which level represents which airport.\n\nlevels(data$origin)\n\n[1] \"EWR\" \"LGA\" \"JFK\"\n\n\nSo let’s choose level 1 (EWR) as our base and compare it with level 2 (LGA).\n\ndata %>%\n mutate(origin = as.numeric(origin)) %>%\n filter(origin %in% c(1,2)) %>% \n ggplot() +\n aes(origin, arr_delay) + \n stat_poly_line() +\n stat_poly_eq(aes(label = paste(after_stat(eq.label),\n after_stat(rr.label), sep = \"*\\\", \\\"*\"))) +\n scale_x_continuous(breaks = c(1,2)) +\n geom_jitter(size = 1) +\n geom_smooth(method=\"lm\")\n\n\n\n\nAnd then we compare level 1 (EWR) with level 3 (JFK).\n\ndata %>%\n mutate(origin = as.numeric(origin)) %>%\n filter(origin %in% c(1,3)) %>% \n ggplot() +\n aes(origin, arr_delay) + \n stat_poly_line() +\n stat_poly_eq(aes(label = paste(after_stat(eq.label),\n after_stat(rr.label), sep = \"*\\\", \\\"*\"))) +\n scale_x_continuous(breaks = c(1,3)) +\n geom_jitter(size = 1) +\n geom_smooth(method=\"lm\")\n\n\n\n\nAgain, we can see that the airport from which the flight takes off explains less than 1% of the arrival delays and is therefore a poor predictor."
},
{
"objectID": "linear_regression.html#building-the-multiple-linear-regression-model",
"href": "linear_regression.html#building-the-multiple-linear-regression-model",
"title": "10 Linear regression",
"section": "10.5 Building the multiple linear regression model",
"text": "10.5 Building the multiple linear regression model\nThe process to build the model is the same as the one we used for the logistic regression in the previous chapter. In fact, the process is simpler here because we do not need to convert the coefficient into odds ratios to make them easier to interpret. Let’s use the lm() function to build the model that predicts delay at arrival based on the distance of the flight, the carrier and the origin (we’ll leave the delay of departure out of the model for now).\n\nmodel <- lm(arr_delay ~ distance + carrier + origin, \n data = data)\n\n\nsummary(model)$coefficients %>% \n kbl() %>% \n kable_classic()\n\n\n\n\n\nEstimate\nStd. Error\nt value\nPr(>|t|)\n\n\n\n\n(Intercept)\n4.6299672\n0.3577348\n12.942455\n0.0000000\n\n\ndistance\n-0.0007208\n0.0002005\n-3.595645\n0.0003238\n\n\ncarrierAA\n-3.6699113\n0.3907133\n-9.392849\n0.0000000\n\n\noriginLGA\n-0.7238011\n0.4052653\n-1.785993\n0.0741037\n\n\noriginJFK\n1.6725192\n0.4638028\n3.606100\n0.0003110\n\n\n\n\n\n\n\nThe estimate coefficient represents the slope of the linear trend line for each predictor, so we can plug these values into our linear equation.\n\\[ ArrDelay = 4.63 - 0.00distance - 3.67AA - 0.72LGA + 1.67JFK \\]\nWe can see that most coefficients are statistically significant, which appears to indicate that they are good predictors, but let’s hold on for a minute before drawing too hasty conclusions. Look at the Adjusted R-squared (r2). It has a value of 0.001663, which is extremely small and indicate that the model explains less than 1% of the variance in delays. In other words, our model does not at all allow us to make predictions about delays. How can almost all predictors in a model be statistically significant and still be very bad predictors?\n\n\n\n\n\n\nBeware of too large sample\n\n\n\nStatistically significant predictors in a model with low predictive value mostly occur when our data set or sample is too large. What’s a good sample size? 
A good rule of thumb is 10% of the total observations, with at least ten observations per variable in the model but no more than 1000 observations in total. Let’s do this again with a sample of 500 observations. As shown in the code below, we can use the sample_n() function to create a sample of a specific size that will be used in the lm() function.\n\n\n\nmodel <- lm(arr_delay ~ distance + carrier + origin, \n data = sample_n(data, 500))\nsummary(model)\n\n\nCall:\nlm(formula = arr_delay ~ distance + carrier + origin, data = sample_n(data, \n 500))\n\nResiduals:\n Min 1Q Median 3Q Max \n-78.69 -28.36 -14.42 6.09 389.58 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 12.9625267 5.8186783 2.228 0.0263 *\ndistance -0.0005263 0.0032042 -0.164 0.8696 \ncarrierAA -4.6999911 6.4827986 -0.725 0.4688 \noriginLGA 2.2033532 6.8630179 0.321 0.7483 \noriginJFK -8.2434084 7.2785302 -1.133 0.2579 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 50.22 on 495 degrees of freedom\nMultiple R-squared: 0.01057, Adjusted R-squared: 0.002579 \nF-statistic: 1.323 on 4 and 495 DF, p-value: 0.2604\n\n\nWe see that the model still does a terrible job at predicting arrival delays, and that none of the predictors are statistically significant (at the p < 0.05 level).\n\n10.5.0.1 Adding departure delay to the model\nFinally, let’s add the departure delay to the model. We’ve seen in the figure above, in which we plotted the arrival delay against the departure delay, that our data points seemed to follow our trend line, so we can expect that adding this predictor will improve our model.\n\nmodel <- lm(arr_delay ~ distance + carrier + origin + dep_delay, \n data = sample_n(data, 500))\nsummary(model)\n\n\nCall:\nlm(formula = arr_delay ~ distance + carrier + origin + dep_delay, \n data = sample_n(data, 500))\n\nResiduals:\n Min 1Q Median 3Q Max \n-49.860 -11.607 -1.231 9.120 104.131 \n\nCoefficients:\n Estimate Std. 
Error t value Pr(>|t|) \n(Intercept) -8.0518001 2.3909440 -3.368 0.000817 ***\ndistance -0.0007411 0.0013000 -0.570 0.568902 \ncarrierAA -2.1873939 2.3382136 -0.935 0.349989 \noriginLGA -0.5129079 2.4503477 -0.209 0.834284 \noriginJFK 2.9417865 2.7888601 1.055 0.292017 \ndep_delay 0.9910595 0.0176231 56.236 < 2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 18.87 on 494 degrees of freedom\nMultiple R-squared: 0.8662, Adjusted R-squared: 0.8648 \nF-statistic: 639.4 on 5 and 494 DF, p-value: < 2.2e-16\n\n\nWe can see that when we consider the delay in the departure, we can much more accurately predict the delay at arrival, with around 87% of the variance explained by our model!"
},
{
"objectID": "linear_regression.html#homework",
"href": "linear_regression.html#homework",
"title": "10 Linear regression",
"section": "10.6 Homework",
"text": "10.6 Homework\nFor this week’s lab, your task is to complete lab 8 (the template is available on BrightSpace). In this lab, you will build a linear regression model on a dataset of your choice. I encourage you to use the dataset that you will be using for your individual project, if it contains a numerical variable that you can try to predict using other variables in your dataset.\n\n\n\n\nRhys, Hefin. 2020. Machine Learning with r, the Tidyverse, and Mlr. Shelter Island, NY: Manning publications."
},
{
"objectID": "linear_regression.html#references",
"href": "linear_regression.html#references",
"title": "10 Linear regression",
"section": "10.7 References",
"text": "10.7 References\n\n\n\n\nRhys, Hefin. 2020. Machine Learning with r, the Tidyverse, and Mlr. Shelter Island, NY: Manning publications."
},
{
"objectID": "getting_started_with_R.html#learning-objectives",
"href": "getting_started_with_R.html#learning-objectives",
"title": "2 Getting started with R",
"section": "2.1 Learning objectives",
"text": "2.1 Learning objectives\n\nInstalling R and Rstudio\nNavigating the Rstudio interface\nInstalling packages\nWriting basic commands\nUnderstanding R functions\nUnderstanding R data types\nUnderstanding R objects\nUnderstanding conditional statements\nUnderstanding loops\nUnderstanding vectorization"
},
{
"objectID": "getting_started_with_R.html#setting-up-r",
"href": "getting_started_with_R.html#setting-up-r",
"title": "2 Getting started with R",
"section": "2.2 Setting up R",
"text": "2.2 Setting up R\n\n2.2.1 Installing R and Posit\n\nVisit the Comprehensive R Archive Network (r-project.org) and download the appropriate version of R based on your operating system.\nRun the file that you just downloaded to install R. You can accept all the default installation parameters, or customize your installation if you wish.\nDownload the RStudio IDE - RStudio. Again, choose the appropriate version based on your operating system.\nRun the file that you just downloaded to install RStudio. You can accept all the default installation parameters or customize your installation if you wish.\n\n\n\n2.2.2 Running R in the cloud\nPosit Cloud (formerly RStudio Cloud) is a platform that allows you to create R projects and run RStudio on the cloud using your browser. The name was changed to reflect a shift from a focus on R to a multi-language platform. As of the writing of this book, Posit Cloud allows users to create RStudio projects (free) and Jupyter Notebook (Python) projects (only with paid premium accounts). There are some pros and cons to using Posit Cloud:\nPros\n\nYou don’t have to install anything on your computer.\nThe controlled environment can be less prone to bugs.\nAll your projects can be found in one place.\nYou can easily share projects with other Posit Cloud users.\nYou can create shared projects to collaborate with other Posit Cloud users.\n\nCons\n\nThe free account has low CPU and RAM limits (1 CPU and1GB of RAM) which means that some computationally intensive processes might take much longer time to run than they would on your PC.\nThe free account has a limit of 25 project hours per month, so if you run code that takes a very long time to process data, that’s a factor you might want to consider. You should be able to complete the course labs within that 25 project hours limit. Project hours are calculated based on the CPU and RAM allocated using this formula: (RAM + CPUs allocated) / 2 x hours. 
This means that with 1 CPU and 1 GB of RAM, you can run code for 25 hours in a month. You can decrease the allocation to 0.5 CPU and 0.5 GB to get 50 hours.\nYou cannot work with files directly on your computer, so you have to upload your data files to your projects and then download any files generated with your code. There is a limit of 20 GB per project, and a single file cannot exceed 500 MB.\n\n\n\n2.2.3 Exploring the RStudio environment\n\n\n\n\n\n\nNote\n\n\n\nThe screenshots below are from RStudio Desktop. The interface is practically the same whether you use the desktop version or the cloud version, but there might still be minor variations.\n\n\nWhen you open RStudio, you should see three panes as in the image below (note that your background will probably be white; you can select your theme in Tools - Global options - Appearance; the theme I use in these screenshots is “Pastel On Dark”)."
},
{
"objectID": "getting_started_with_R.html#writing-and-executing-code",
"href": "getting_started_with_R.html#writing-and-executing-code",
"title": "2 Getting started with R",
"section": "2.3 Writing and executing code",
"text": "2.3 Writing and executing code\nYou can write code direct in the console gets executed when you press the Enter key.\n\n\n\n\n\n\n2.3.1 Write completing statements\nWhat you write in the console is code. However, not all code is an executable and complete statement. Another way to think about this is letters and words or words and sentences: you can write words (code) without writing a full sentence (statement). R wants to execute statements, which is useful to know for at least two reasons:\n\nWhen you hit CTRL + ENTER or click the Run button, R will execute the entire statement, and not just the specific piece of code where your cursor is located. You can write statements on multiple lines to make your code easier to read.\nSometimes, when you write an incomplete statement, you will not get an error but the + symbol in the console. This is R waiting for you to write code to finish your statement. This often happens because of a missing closing parenthesis.\n\n\n# This statement creates a vector of names\nnames <- c(\"Billy\",\"Qi\",\"Claire\")\n \n# This statement does exactly the same thing\nnames <- c(\"Billy\",\n \"Qi\",\n \"Claire\")\n\n# This statement is imcomplete, it's missing the last paranthesis\nnames <- c(\"Billy\",\n \"Qi\",\n \"Claire\"\n\n\n\n2.3.2 R Scripts\nThe console (pane on the left) is where code gets executed and, depending on your code, where the output gets displayed. Error messages and warnings are also displayed in the console. Note that code that you write in the console and execute will be saved in the session’s history but erased when you close RStudio. 
To avoid losing your code and having to write it all over again every time you want to do something in R, it is recommended to use R scripts to write your code.\nYou can create a new script in RStudio by clicking on File, then New File, then R Script.\nR scripts are basic text files with a “.R” extension in which you can write code (that R can execute) and comments (that R will not execute and are meant to help you and others understand your code and script). The # character indicates the beginning of a comment.\n\n# The following code tells R to calculate the value of 1 + 1 and return the result.\n1+1\n\nWhen writing scripts for projects, it is a good practice to include useful metadata at the beginning (e.g., the project’s name, the script’s purpose, your name, and the date). Here is an example.\n\n# ***********************************************\n# Project: Introduction to data science\n# Purpose: This is an example of a basic script\n# Author: Philippe Mongeon\n# Date: 2022-12-29\n# ***********************************************\n\n# The following code tells R to calculate the value of 1 + 1 and return the result.\n1+1\n\n[1] 2\n\n\n\n2.3.2.1 Executing code in a script\nWhen working with a script, you can write all your code and execute it only when you are ready. You can also execute only parts of your code. To execute the code, you can click the Run button at the top of the script pane, or you can use Ctrl + Enter. That will execute the code where your cursor is located and then move the cursor to the next line with code or a comment. If your cursor is on a blank line between statements, R will execute the next statement in the script and then move the cursor to the next line with code or a comment. You can also highlight a section of your code and run only the highlighted code. If the code contains multiple statements, R will still execute them one after the other in the order in which they appear."
},
{
"objectID": "getting_started_with_R.html#working-directory",
"href": "getting_started_with_R.html#working-directory",
"title": "2 Getting started with R",
"section": "2.4 Working directory",
"text": "2.4 Working directory\nYour working directory is the default folder from your computer where R will read and save files. The default working directory will typically be your R/ folder. You can change the default working directory in Tools - Global options - General. However, a better approach may be to explicitly specify the working directory for your project using the setwd() function.\n\n# ***********************************************\n# Project: Introduction to data science\n# Purpose: This is an example of a basic script\n# Author: Philippe Mongeon\n# Date: 2021-12-29\n# ***********************************************\n\n# Set working directory\nsetwd(\"c:/courses/introduction-to-data-science/\")\n\n# The following code tells R to calculate the value of 1 + 1 and return the result.\n1+1"
},
{
"objectID": "getting_started_with_R.html#projects",
"href": "getting_started_with_R.html#projects",
"title": "2 Getting started with R",
"section": "2.5 Projects",
"text": "2.5 Projects\nJust like Posit Cloud, RStudio allows you to create projects to better organize your files and processes. Creates are associated to a directory in your computer. You can click on the R Project icon in the top right of RStudio screen to create a new project or access existing projects.\n\n\n\n\n\nIf you choose to create a project, you will be asked whether you want to use an existing folder on your computer to store your project files, or create a new one. You can also use version control (e.g. a Git repository) to manage your project, but this is beyond the scope of this book.\n\n\n\n\n\nIf you choose to create a project in a new directory, you will be asked to choose what type of project you are creating. Choose “new project”, unless of course you are creating an R Package, Shiny Application, or Quarto Book (by the way, the book you are reading now is a Quarto Book project that was entirely written in R Studio).\n\n\n\n\n\nThen you will be asked to name your project folder and choose a current folder on your computer where you want this new project folder to be created. If you chose to use an existing folder, you will only be asked to select the folder.\n\n\n\n\n\nYou can see which project you are working on in the top right of the screen. Your default working directory is the project folder, so you don’t have to specify your working directory with the setwd() function every time you switch to a new project."
},
{
"objectID": "getting_started_with_R.html#functions",
"href": "getting_started_with_R.html#functions",
"title": "2 Getting started with R",
"section": "2.6 Functions",
"text": "2.6 Functions\nR functions are called like this: name_of_the_function(some argument, some argument, some other argument, some other argument, …). The arguments are placeholders for which you provide values when calling the functions. Some functions contain many arguments, and some functions contain none. Typically some functions will have default values for most of the arguments but require that you provide values for the other arguments that don’t have default values. In the code chunk above, setwd(path) is the function and \"c:/courses/introduction-to-data-science/\" is the value passed to the argument path.\nArguments have a specific order, so you don’t have to enter the argument’s name if you know that order. It is common practice to not write the argument name for functions with a single argument and to not write the name of the first argument in multi-argument functions. The example below is a script that uses the read_delim function.\n\n# This prints the string \"hello world\"\nprint(x=\"hello world\")\n\n[1] \"hello world\"\n\n# This also prints the string \"hello world\"\nprint(\"hello world\")\n\n[1] \"hello world\"\n\n\nThe print function is the default option that is called if your code does not include a function. So you can print values, the result of operations, or the content of an object without explicitly calling the print() function.\n\n# This prints the string \"hello world\"\n\"hello world\"\n\n# This prints the result of the operation 1+3\n1+3\n\n# This prints the content of the object called \"data\"\ndata\n\n\n\n\n\n\n\nR is case sensitive\n\n\n\nIt is important to know that everything in R is case sensitive to avoid unnecessary and frustrating errors. 
So, the function print() is not the same as the function Print().\nIf you get an error message telling you that the function or object you are trying to access does not exist or was not found, make sure you used the correct case in your code.\n\n\n\n2.6.1 Documentation\nIf you need to consult the documentation of a function to learn how to use it, there is usually excellent documentation with examples online, which you can easily find by searching the name of the function or package. You can also access the official documentation directly in RStudio. Here is how:\n\n# This opens the documentation for the function setwd() in the Help panel.\n?setwd()\n\n# This does the same thing.\nhelp(setwd)\n\n\n\n2.6.2 Writing your own functions\nYou can write your own functions and then use them in your code. Here’s an example of a simple, not so useful, function that returns the square of a number.\n\n# square is the name of the function. \n# x is the argument (information that the user of the function will need to provide).\n# the code between the curly brackets is what the function will do when called.\nsquare <- function(x) {\n x^2\n}\n\nThen we can call the function and see what happens.\n\n# calculate the square of 5\nsquare(5)\n\n[1] 25\n\n\nYou can provide default values for the arguments in your functions. Let’s try that with a new function where the user provides both the number and the exponent, with no default value for the number (x) and 1 as the default exponent (exp).\n\nexponent <- function(x, exp = 1) {\n x^exp\n}\n\n# Then we can call the function.\nexponent(3,3)\n\n[1] 27\n\n\nBy the way, when you create a function it appears under the Environment tab in RStudio.\n\nThat’s it. Now you know how to create a function and specify its arguments with their default value. There are no limits to the number of statements in the body of your function, so you can make them as complex as you want."
},
{
"objectID": "getting_started_with_R.html#packages",
"href": "getting_started_with_R.html#packages",
"title": "2 Getting started with R",
"section": "2.7 Packages",
"text": "2.7 Packages\nPackages are groups of functions that you can install and load so that you can use them in your code. Your basic R installation comes with many pre-installed packages and functions automatically loaded when you open RStudio. The preinstalled packages and functions are often called “base R”.\nUsually, a package contains a set of functions designed for specific purposes. Functions can use other functions from the same or from other packages, so don’t be surprised if you try to install one package but end up installing more than one. The tidyverse, which we will explore in the next chapter, is one of these “meta-packages” that installs and loads a collection of packages and their functions.\n\n2.7.1 Installing packages\nPackages only need to be installed once for all your projects, as they are installed directly on your computer. However, if you are using Posit Cloud, then you will need to install the packages (other than the base R packages) once for each project. To install a package, you can use the install.packages() function, with the name of the package in single or double quotes. The code below installs the tidyverse package.\n\ninstall.packages(\"tidyverse\")\n\nYou can also use the Packages tab in lower right pane of RStudio to see what packages are already installed, install new packages, or update packages.\n\n\n2.7.2 Loading packages\nPackages provide functions for you to use. But first you must load the package using the library() function. I recommend loading all the packages required for your script at the beginning, after you set the working directory. 
Let’s update our script.\n\n# ***********************************************\n# Project: Introduction to data science\n# Purpose: This is an example of a basic script\n# Author: Philippe Mongeon\n# Date: 2022-12-29\n# ***********************************************\n\n# Set working directory\nsetwd(\"c:/courses/introduction-to-data-science/\")\n\n# Load packages\nlibrary(tidyverse)\n\nNotice that my script does not include an install.packages() statement. That’s because I already have the package installed on my machine (or project, if I were using Posit Cloud)."
},
{
"objectID": "getting_started_with_R.html#objects",
"href": "getting_started_with_R.html#objects",
"title": "2 Getting started with R",
"section": "2.8 Objects",
"text": "2.8 Objects\nObjects are containers of data with different data types and data structures. You can create as many objects as you want and store any data that you want in them. Let’s create two objects containing numbers, and then write a command that multiplies the content of both objects.\n\n# create an object called \"a\" containing the value 2\na <- 2 \n\n# create an object caled \"b\" containing the value 4 \nb = 4 \n\n# multiplies the value of object a and b and stores the results \n# in object c\nc <- a*b \n\n# prints the content of the object c\nprint(c) \n\n[1] 8\n\n\nExcellent. Now you know how to store values in objects and use them in your statements. Let’s now take a deeper look at the different data types and structures that R objects can use."
},
{
"objectID": "getting_started_with_R.html#data-types",
"href": "getting_started_with_R.html#data-types",
"title": "2 Getting started with R",
"section": "2.9 Data types",
"text": "2.9 Data types\nThere are four types of data that we will deal with in this course: strings, numbers, logical values, and dates. Let’s look at them one by one.\n\n2.9.1 Character strings\nCharacter strings (or strings) are sequences of any kind of symbols (including space and other non-printed characters like line breaks). They can easily be recognized because they are surrounded by single (') or double (\") quotation marks. Let’s create an object called fruit and store the string “banana” inside it.\n\nfruit <- \"banana\"\n\nNote that the following code would have achieved the same result.\n\nfruit <- 'banana'\n\nWe can check if the object we created contains a string using the class() function, which returns the data type of an object or value. We can also test specifically if the object is a character string with the is.character() function, which returns TRUE or FALSE.\nLet’s try the first method.\n\nclass(fruit)\n\n[1] \"character\"\n\n\nNow let’s try the second is.character() method.\n\nis.character(fruit)\n\n[1] TRUE\n\n\nLet’s say we wanted to store the type of plane 747 into an object. This is not the number 747 as in after 746 and before 748. It is a string that represents the concept of the 747 plane. Therefore, we would logically want to store it as a string rather than a number. This can be done easily using quotation marks plane <- \"747\". However, if the number happens to be already stored in an object as a number, we can convert it into a string using the as.character() function that converts whatever is provided into a character string. Here’s the demo.\n\nplane<-747\nclass(plane)\n\n[1] \"numeric\"\n\n\nWe can see that the number 747 was stored in the plane object. 
Now let’s convert it into a string.\n\n# This code takes the content of the plane object (currently the\n# number 747), converts it into a character string, and stores\n# the result into an object called plane (it overwrites the \n# previous plane object).\nplane <- as.character(plane)\n\n# Then we can check what data type the plane object now contains.\nclass(plane)\n\n[1] \"character\"\n\n\n\n\n2.9.2 Numbers\nAs we saw in some of the previous examples, numbers are another common data type in R. In fact, there are three different number data types: numeric, integer, and complex. However, in this course, we will only consider the numeric and integer types. The only difference between the numeric and integer data types is that numerics can store decimals, but integers cannot. By default, R will use the numeric data type for any number. So if we want to use the integer data type we need to specify it.\n\n# Create a numeric object\nmy_numeric <- 42\n\n# Create an integer object\nmy_integer <- as.integer(42)\n\nAgain, we can check the data type of these objects with class() or with is.numeric() and is.integer().\n\nclass(my_numeric)\n\n[1] \"numeric\"\n\nis.numeric(my_numeric)\n\n[1] TRUE\n\nclass(my_integer)\n\n[1] \"integer\"\n\nis.integer(my_integer)\n\n[1] TRUE\n\n\n\n\n\n\n\n\nPractice\n\n\n\nWrite some code to answer the following questions:\n\nAre all integers numerics in R?\nAre all numerics integers?\nWhat happens if you try to convert a number with decimals to an integer?\n\n\n\n\n2.9.2.1 Arithmetic operators\nWhen working with numbers, you can use arithmetic operators:\n\n# Addition\n1+1\n\n# Subtraction\n4-1 \n\n# Multiplication\n4*2 \n\n# Division\n8/2\n\n# Exponent\n2^4\n\n# Modulo\n11 %% 2\n\n\n\n\n\n\n\nPractice\n\n\n\nLet’s use the Pythagorean theorem to determine the length of the hypotenuse (c) of a right triangle, knowing that the two other sides (a and b) forming the 90 degree angle have lengths of 4 cm and 7 cm and knowing that a^2 + b^2 = c^2. Use only the six arithmetic operators shown above (hint: the answer is 8.06225774829855).\nHere are the recommended steps:\n\nCreate an object called a that contains the value 4.\nCreate an object called b that contains the value 7.\nCreate an object called c that contains the calculated value of c.\nPrint the value of c.\n\n\n\n\n\n2.9.3 Logical\nLogical values are TRUE or FALSE (which are stored internally as the numbers 1 and 0, respectively). We can also use the shortcut T for TRUE and F for FALSE.\n\n# these are different ways to create an object containing \n# a logical value (TRUE or FALSE).\nmy_logical_1 = TRUE\nmy_logical_2 = FALSE\nmy_logical_3 = T\nmy_logical_4 = F\n\nAgain, we can test whether an object or value is logical using class() and is.logical(), and we can convert non-logical values or objects to logical values or objects with as.logical().\n\n\n\n\n\n\nPractice\n\n\n\n\nCreate an object that contains the number 1, an object that contains the number 0, an object that contains the number 42, and an object that contains the number -1.\nUse the as.logical() function to convert each of the objects into a logical object and print them.\nCreate objects that contain the character 1, the string “1”, the string “TRUE”, and the string “banana”, then convert them into logical objects and print them.\n\n\n\nAs you might have realized while practicing, all numbers (negative or positive) other than 0 are equivalent to the logical TRUE, and only the number 0 is equivalent to the logical FALSE. You can also convert logical values stored as strings (like “TRUE” or “false”) into logical values, but not other strings. That is why the conversion of the string “banana” to a logical value returns “NA”.\n\n\n2.9.4 NA values\nIn R, NA is the absence of a value, or missing data. Still, it can be stored in an object like any other value, as any data type. 
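These coercion rules can be checked directly in the console (a quick sketch; the values mirror the practice above):

```r
as.logical(0)        # FALSE: zero is the only number coerced to FALSE
as.logical(42)       # TRUE: any non-zero number
as.logical(-1)       # TRUE
as.logical('TRUE')   # TRUE: a recognized logical string
as.logical('banana') # NA: not a recognized logical string
```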
Let’s give it a try by creating a few objects that contain NA.\n\nnumeric_na <- as.numeric(NA)\ninteger_na <- as.integer(NA)\nlogical_na <- as.logical(NA)\nstring_na <- as.character(NA)\n\nThen we can confirm that the objects’ classes are as planned.\n\nclass(numeric_na)\n\n[1] \"numeric\"\n\nclass(integer_na)\n\n[1] \"integer\"\n\nclass(logical_na)\n\n[1] \"logical\"\n\nclass(string_na)\n\n[1] \"character\"\n\n\nWe can also test whether a value is NA with the is.na() function.\n\nis.na(numeric_na)\n\n[1] TRUE\n\nis.na(integer_na)\n\n[1] TRUE\n\nis.na(logical_na)\n\n[1] TRUE\n\nis.na(string_na)\n\n[1] TRUE\n\n\nAnother important thing to remember about NAs is that whenever they are part of a mathematical operation, the result will be NA. Here’s an example:\n\n1+1+NA\n\n[1] NA\n\n\n\n\n2.9.5 Dates\nDates are also a frequently used data type. They can be somewhat complex to work with depending on what we are trying to do, but there is a package called lubridate that contains a lot of functions to make working with dates easier. Although they are rendered as character strings, dates are stored in memory as numbers. 
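A quick sketch in base R (using Sys.Date(), which needs no extra package; the printed values depend on the day you run it) makes the number behind a date visible:

```r
# A Date is stored as the number of days since 1970-01-01
d <- Sys.Date()  # the current date as a Date object
as.numeric(d)    # the underlying day count
d + 1            # adding 1 moves the date one day forward
d + 7            # adding 7 moves it one week forward
```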
This is why it’s possible to add one day to a date by adding 1 to it, or one week by adding 7, for example.\nIt is beyond the scope of this chapter to provide a thorough guide on how to work with dates, but let us explore the dates data type a little by first loading the lubridate package and using its today() function to store today’s date into an object that we will call today.\n\n# Load the lubridate package to access its functions\nlibrary(lubridate) \n\n# Get today's date\ntoday <- today()\n\n# Print today's date\nprint(today)\n\n[1] \"2024-01-25\"\n\n\nYou can use several functions to extract parts of dates.\n\n# Here we create several objects that contain different parts of the date\nweek_day <- weekdays(today)\nweek <- week(today)\nday <- day(today)\nmonth <- month(today)\nyear <- year(today)\n\n# And now we print them.\nprint(week_day)\n\n[1] \"Thursday\"\n\nprint(week)\n\n[1] 4\n\nprint(day)\n\n[1] 25\n\nprint(month)\n\n[1] 1\n\nprint(year)\n\n[1] 2024\n\n\nJust like other data types, we can use class() and is.Date() to test whether an object or value is a date.\n\nclass(today)\n\n[1] \"Date\"\n\nis.Date(today)\n\n[1] TRUE"
},
{
"objectID": "getting_started_with_R.html#data-structures",
"href": "getting_started_with_R.html#data-structures",
"title": "2 Getting started with R",
"section": "2.10 Data structures",
"text": "2.10 Data structures\nAll the objects created in the last example are storing a single value. But objects can also contain multiple values organized in different kinds of structures. Let’s go through those structures.\n\n2.10.1 Vectors\nVectors are the most basic structure in R and are simply sequences of values of the same type. You can create a vector with the function c().\n\n# This creates a character vector\nmy_character_vector <- c(\"Lucy\",\"Matthew\",\"Ricardo\",\"Adrian\",\n \"Mathilda\", \"Beyoncé\")\n\n# This creates a numeric vector\nmy_numeric_vector <- c(1,42,25,4)\n\n# This creates a integer vector\nmy_integer_vector <- as.integer(c(1,42,25,4))\n\n# This creates a logical vector\nmy_logical_vector <- c(TRUE,T,FALSE,F)\n\n# This prints the character vector\nmy_character_vector\n\n[1] \"Lucy\" \"Matthew\" \"Ricardo\" \"Adrian\" \"Mathilda\" \"Beyoncé\" \n\n\n\n\n2.10.2 Matrices\nA matrix is a bi-dimensional representation of a vector. You can create a matrix with the matrix() function and specify the number of rows and columns for your matrix as in the example below.\n\n#this creates a character vector\nmy_character_vector <- c(\"Lucy\",\"Matthew\",\"Ricardo\",\"Adrian\",\"Mathilda\", \"Beyoncé\") \n\n#this converts the character vector into a matrix with 2 rows and 3 columns.\nmy_matrix <- matrix(my_character_vector, nrow = 2, ncol=3)\n\n#this prints the matrix\nmy_matrix\n\n [,1] [,2] [,3] \n[1,] \"Lucy\" \"Ricardo\" \"Mathilda\"\n[2,] \"Matthew\" \"Adrian\" \"Beyoncé\" \n\n\n\n\n2.10.3 Factors\nFactors are another special type of vector. Think of them as categorical variables that can have a distinct number of possible values (called levels). Say we have a vector of values representing the transmission type of cars. 
The two levels would be “automatic” and “manual”.\n\n# This creates a factor object\ntransmission <- as.factor(c('manual',\"manual\",\"automatic\",\"automatic\",\"automatic\",\"manual\"))\n\n# This prints the factor object\ntransmission\n\n[1] manual manual automatic automatic automatic manual \nLevels: automatic manual\n\n\nFactors are also memory efficient because the values are stored as integers. As you can see in this screenshot of my R objects, the values are recorded as 2, 2, 1, 1, 1, 2.\n\n\n\n\n\n\n\n2.10.4 Lists\nUnlike vectors that are combinations of values of the same type, lists are combinations of objects of any type (including vectors, matrices, data frames, factors or other lists). In the example below, we create a list that contains some of the objects we created so far.\n\n# This creates a list\nmy_list <- list(my_character_vector,my_numeric_vector,my_integer_vector,my_logical_vector,today, transmission)\n\n# This prints the list\nmy_list\n\n[[1]]\n[1] \"Lucy\" \"Matthew\" \"Ricardo\" \"Adrian\" \"Mathilda\" \"Beyoncé\" \n\n[[2]]\n[1] 1 42 25 4\n\n[[3]]\n[1] 1 42 25 4\n\n[[4]]\n[1] TRUE TRUE FALSE FALSE\n\n[[5]]\n[1] \"2024-01-25\"\n\n[[6]]\n[1] manual manual automatic automatic automatic manual \nLevels: automatic manual\n\n\n\n\n2.10.5 Data frames and tibbles\nData frames (and tibbles, which are essentially data frames with improved features) are tabular structures (like Excel spreadsheets) that organize data in rows and columns. The columns of a data frame are essentially vectors. A single column must store data (or objects) of the same type, but since different columns can have different data types, a single row can contain data of different types. 
In the code below, we create a data frame that will contain three vectors with different data types.\n\n# A vector of first names\nfirst_name <- c(\"Rob\",\"June\",\"Lise\")\n\n# A vector of last names\nlast_name <- c(\"Murphy\",\"Jones\",\"MacFly\")\n\n# A vector of ages\nage <- c(42,30,34)\n\n# A data frame that combines the three vectors\nmy_data_frame<- data.frame(first_name, last_name, age)\n\n# This prints the data frame\nmy_data_frame\n\n first_name last_name age\n1 Rob Murphy 42\n2 June Jones 30\n3 Lise MacFly 34\n\n\nAs mentioned above, tibbles are essentially the same as data frames, but with improvements. One of the main improvements is how the data gets displayed when printed in the console. Let’s convert the data frame we just created into a tibble and print it to see what happens.\n\n# Load the tibble library \nlibrary(tibble)\n\n# Create a tibble\nmy_tibble <- as_tibble(my_data_frame)\n\n# Print the tibble\nmy_tibble"
},
{
"objectID": "getting_started_with_R.html#subsetting-data",
"href": "getting_started_with_R.html#subsetting-data",
"title": "2 Getting started with R",
"section": "2.11 Subsetting data",
"text": "2.11 Subsetting data\nYou can select one or multiple elements from an R object by providing its coordinates in brackets [row number, column number]. When the rows or columns have names, the elements can be called by name with the dollar sign ($) Here’s an example.\n\n# Let's create a tibble called t\nt <- tibble(first_name = as.character(c(\"Rob\",\"June\",\"Lise\")),\n last_name = as.character(c(\"Murphy\",\"Jones\",\"MacFly\")),\n age = as.integer(c(42,30,34)))\n\n# Select the first column\nt[,1]\n\n# A tibble: 3 × 1\n first_name\n <chr> \n1 Rob \n2 June \n3 Lise \n\n# Select the first row\nt[1,]\n\n# A tibble: 1 × 3\n first_name last_name age\n <chr> <chr> <int>\n1 Rob Murphy 42\n\n# Select the element in the second row and the second column\nt[2,2]\n\n# A tibble: 1 × 1\n last_name\n <chr> \n1 Jones \n\n# Select the first and second rows\nt[1:2,1]\n\n# A tibble: 2 × 1\n first_name\n <chr> \n1 Rob \n2 June \n\n# Select the first and third rows\nt[c(1,3),]\n\n# A tibble: 2 × 3\n first_name last_name age\n <chr> <chr> <int>\n1 Rob Murphy 42\n2 Lise MacFly 34\n\n# Select the first name column\nt$first_name\n\n[1] \"Rob\" \"June\" \"Lise\"\n\n# Select the first name of the second observation\nt$first_name[2]\n\n[1] \"June\"\n\n# Select the first name of the second observation (the other way around)\nt[2,]$first_name\n\n[1] \"June\"\n\n\nSubsetting from lists is a little different. Let’s take a look.\n\n# Let's create a list called l.\nl <- list(1,\"Banana\",c(\"a\",\"b\",\"c\"),\"Wayne Gretzky\", list(c('x',\"y\",\"z\")))\nprint(l)\n\n[[1]]\n[1] 1\n\n[[2]]\n[1] \"Banana\"\n\n[[3]]\n[1] \"a\" \"b\" \"c\"\n\n[[4]]\n[1] \"Wayne Gretzky\"\n\n[[5]]\n[[5]][[1]]\n[1] \"x\" \"y\" \"z\"\n\n\nWe can see coordinates of the list’s elements are in double brackets, and then the coordinates of the object within the list are in single brackets. 
Let’s subset some of the elements to see how it works.\n\n# select the third element of the list\nl[[3]]\n\n[1] \"a\" \"b\" \"c\"\n\n# select the second element of the third element of the list\nl[[3]][2]\n\n[1] \"b\"\n\n# select the second element of the first element in the fifth element of the list\nl[[5]][[1]][2]\n\n[1] \"y\""
},
{
"objectID": "getting_started_with_R.html#logical-and-boolean-operators",
"href": "getting_started_with_R.html#logical-and-boolean-operators",
"title": "2 Getting started with R",
"section": "2.12 Logical and boolean operators",
"text": "2.12 Logical and boolean operators\nLogical statements compare two values and return 0 (FALSE) or 1 (TRUE).\n\n\n\nOperator\nMeaning\n\n\n\n\n<\nless than\n\n\n<=\nless than or equal to\n\n\n>\ngreater than\n\n\n>=\ngreater than or equal to\n\n\n==\nequal to\n\n\n!=\nnot equal to\n\n\n%in%\nis contained in a vector\n\n\n\nLet’s give them a try.\n\n\"apples\" == \"oranges\"\n\n[1] FALSE\n\n1 > 2\n\n[1] FALSE\n\n\"Albert\" > \"Justin\"\n\n[1] FALSE\n\n6 != 3\n\n[1] TRUE\n\n# This creates a vector of fruits\nfruits <- c(\"apples\",\"organges\",\"bananas\")\n\n# This checks if the fruits vector contains \"bananas\"\n\"bananas\" %in% fruits \n\n[1] TRUE\n\n\nYou can also combine conditions with the Boolean operators AND (&&), OR (\\|\\|), and NOT (!). Here are some examples:\n\nTRUE || FALSE\n\n[1] TRUE\n\nTRUE && FALSE\n\n[1] FALSE\n\n!TRUE\n\n[1] FALSE\n\n!FALSE\n\n[1] TRUE\n\nTRUE && !FALSE\n\n[1] TRUE"
},
{
"objectID": "getting_started_with_R.html#conditional-statements",
"href": "getting_started_with_R.html#conditional-statements",
"title": "2 Getting started with R",
"section": "2.13 Conditional statements",
"text": "2.13 Conditional statements\nConditional statements allow you to tell R what to do if a condition is met. The syntax of an if statement is if (condition to test) {code to execute}. You can also tell R what to do if the condition is not met with else. The syntax is: if (condition) {code to execute} else {code to execute}.\nI recommend breaking the code over multiple lines, like in the code below. Note that the else must be on the same line as the closing curly bracket from the previous section of the statement.\n\nif (1+1==3) {\n print(\"1 + 1 = equals 3\")\n } else {\n print (\"1 + 1 does not equal 3\")\n }\n\n[1] \"1 + 1 does not equal 3\"\n\n\nYou can chain more conditional statements with else if, like this:\n\ndonuts_jim <- 6\ndonuts_lucy <- 3\n\nif (donuts_jim < donuts_lucy) {\n print(\"Jim had fewer donuts than Lucy\")\n} else if (donuts_jim > donuts_lucy) {\n print(\"Jim had more donuts than Lucy\")\n} else {\n print(\"Jim had as many donuts as Lucy\")\n}\n\n[1] \"Jim had more donuts than Lucy\""
},
{
"objectID": "getting_started_with_R.html#loops",
"href": "getting_started_with_R.html#loops",
"title": "2 Getting started with R",
"section": "2.14 Loops",
"text": "2.14 Loops\nLoops are generally used to run the same statement multiple times. Instead of repeating the same code over and over, you write a loop that ends when a certain condition is met. There are different kinds of loop statements (for, while, repeat) and although they work a little differently, you can do the same things with all of them.\n\n2.14.1 For loops\nThe for loop goes through an entire vector, one element at a time. In the example, we have a vector of fruits, and we write a for loop that takes every fruit in the vector one by one, and prints the string “I love [fruit]”.\n\nfruits<-c(\"apples\",\"bananas\",\"pears\")\nfor (i in fruits) { \n print(paste(\"I love\",i))\n}\n\n[1] \"I love apples\"\n[1] \"I love bananas\"\n[1] \"I love pears\"\n\n\nNote: the letters i, j, and k tend to be used by convention for the loop variable in a for loop, but you can use any name you want.\n\n\n2.14.2 While loops\nWhile loops do not specify a vector of values to iterate through; instead, they keep executing until a specified condition is met. The code below initializes the value of n at 0, then loops through code that asks the user to enter a number. It keeps looping until the number 7 is entered.\n\nn <- 0\nwhile (n != 7) {\n n <- readline(prompt=\"Please, choose a number between 1 and 10:\")\n if (n==7) {\n print(\"Lucky seven!\")\n } else {\n print(\"Wrong number\")\n }\n}\n\n\n\n2.14.3 Repeat loops\nWith repeat loops, you don’t specify the condition for the end of the loop, so you have to use the break function inside the loop to exit it. 
The code below does exactly the same thing as the previous example, but uses a repeat loop instead.\n\nrepeat {\n n <- readline(prompt=\"Please, choose a number between 1 and 10:\") \n if (n==7) {\n print(\"Lucky seven!\")\n break\n } else {\n print(\"Wrong number\")\n }\n}\n\n\n\n\n\n\n\nBeware of the dreaded infinite loop\n\n\n\nIf your code is written in such a way that the condition is never met, you will end up with an infinite loop and the process will just keep running until you stop it manually or your computer explodes. (Don’t worry, your computer is not actually going to explode, no matter how terrible your code might be.)"
},
{
"objectID": "getting_started_with_R.html#vectorization",
"href": "getting_started_with_R.html#vectorization",
"title": "2 Getting started with R",
"section": "2.15 Vectorization",
"text": "2.15 Vectorization\nAlmost every object in R is actually a vector or a collection of vectors, and R makes coding and computing more efficient by performing computations on all elements of an object at once. This makes loops (especially the for loop) less useful when the same simple operation needs to be performed for an entire vector. Here’s how we can print “I love apples”, “I love bananas”, and “I love pears” with leaner and quicker code than in the for loop example above.\n\nfruits <- c(\"apples\",\"bananas\",\"pears\")\nprint(paste(\"I love\",fruits))\n\n[1] \"I love apples\" \"I love bananas\" \"I love pears\" \n\n\nYou can also use conditional statements with vectors with the ifelse() function.\n\nvector <- c(1:5)\nifelse(vector==3, # condition to verify\n \"this is 3!\", # what to do if the condition is met\n \"this is not 3!\") # what to do otherwise\n\n[1] \"this is not 3!\" \"this is not 3!\" \"this is 3!\" \"this is not 3!\"\n[5] \"this is not 3!\""
},
{
"objectID": "logistic_regression.html#learning-objectives",
"href": "logistic_regression.html#learning-objectives",
"title": "9 Logistic regression",
"section": "9.1 Learning objectives",
"text": "9.1 Learning objectives\n\nLogistic regression"
},
{
"objectID": "logistic_regression.html#introduction",
"href": "logistic_regression.html#introduction",
"title": "9 Logistic regression",
"section": "9.2 Introduction",
"text": "9.2 Introduction\nRegression is a method used to determine the relationship between a dependent variable (the variable we want to predict) and one or more independent variables (the predictors available to make the prediction). There are a wide variety of regression methods, but in this course we will learn two: logistic regression, which is used to predict a categorical dependent variable, and linear regression, which is used to predict a continuous dependent variable.\nIn this chapter, we focus on the binomial logistic regression (we will refer to it as logistic regression or simply regression in the rest of the chapter), which means that our dependent variable is dichotomous (e.g., yes or no, pass vs fail). Ordinal logistic regression (for ordinal dependent variables) and multinomial logistic regression (for variables with more than two categories) are beyond the scope of the course."
},
{
"objectID": "logistic_regression.html#logistic-regression",
"href": "logistic_regression.html#logistic-regression",
"title": "9 Logistic regression",
"section": "9.3 Logistic regression",
"text": "9.3 Logistic regression\n\n9.3.1 Logistic function\nLogistic regression is so named because it fits a logistic function (an s-shaped curve) to the data to model the relationship between the predictors and a categorical outcome. More specifically, it models the probability of an outcome (the dependent categorical variable) based on the value of the independent variable. Here’s an example:\n\nThe model gives us, for each value of the independent variable (Copper content, in this example), the probability that the painting is an original. The point where the logistic curve reaches .5 (50%) on the y-axis is where the cut-off happens: the model predicts that any painting with a copper content above that point is an original.\n\n\n9.3.2 Odds\nWe can convert probabilities into odds by dividing the probability p of one outcome by the probability of the other outcome; because there are only two outcomes, odds = p / (1 - p). For example, say we have a bag of 10 balls, 2 red and 8 black. If we draw a ball at random, we have an 8/10 = 80% chance of drawing a black ball. The odds of drawing a black ball are thus 0.8/(1-0.8) = 0.8/0.2 = 4. There is a 4 to 1 chance that we’ll draw a black ball over a red one.\nHowever, the output of the logistic regression model is the natural logarithm of the odds: log odds = ln(p / (1 - p)), so it is not so easily interpreted."
},
{
"objectID": "logistic_regression.html#logistic-regression-example",
"href": "logistic_regression.html#logistic-regression-example",
"title": "9 Logistic regression",
"section": "9.4 Logistic regression example",
"text": "9.4 Logistic regression example\n\n9.4.1 Loading and preparing data\n\nlibrary(tidyverse)\ndata <- read_csv(\"https://pmongeon.github.io/info6270/files/data/titanic.csv\")\nhead(data) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nPassengerId\nSurvived\nPclass\nName\nSex\nAge\nSibSp\nParch\nTicket\nFare\nCabin\nEmbarked\n\n\n\n\n1\n0\n3\nBraund, Mr. Owen Harris\nmale\n22\n1\n0\nA/5 21171\n7.2500\nNA\nS\n\n\n2\n1\n1\nCumings, Mrs. John Bradley (Florence Briggs Thayer)\nfemale\n38\n1\n0\nPC 17599\n71.2833\nC85\nC\n\n\n3\n1\n3\nHeikkinen, Miss. Laina\nfemale\n26\n0\n0\nSTON/O2. 3101282\n7.9250\nNA\nS\n\n\n4\n1\n1\nFutrelle, Mrs. Jacques Heath (Lily May Peel)\nfemale\n35\n1\n0\n113803\n53.1000\nC123\nS\n\n\n5\n0\n3\nAllen, Mr. William Henry\nmale\n35\n0\n0\n373450\n8.0500\nNA\nS\n\n\n6\n0\n3\nMoran, Mr. James\nmale\nNA\n0\n0\n330877\n8.4583\nNA\nQ\n\n\n\n\n\n\n\n\n9.4.1.1 Choose a set of predictors (independent variables)\nLooking at our dataset, we can identify some variables that we think might affect the probability that a passenger survived. In our model, we will choose Sex, Age, Pclass, and Fare. The SibSp and Parch variables represent, respectively, the combined number of siblings and spouses and the combined number of parents and children a passenger has on board. 
We will add them together to create a fifth predictor called FamilySize.\n\ndata <- data %>% \n mutate(FamilySize = SibSp + Parch)\n\nWe can now remove the variables that we don’t need in our model by selecting the ones we want to keep.\n\ndata <- data %>% \n select(Survived, Sex, Age, Fare, FamilySize, Pclass)\n\nhead(data) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nSurvived\nSex\nAge\nFare\nFamilySize\nPclass\n\n\n\n\n0\nmale\n22\n7.2500\n1\n3\n\n\n1\nfemale\n38\n71.2833\n1\n1\n\n\n1\nfemale\n26\n7.9250\n0\n3\n\n\n1\nfemale\n35\n53.1000\n1\n1\n\n\n0\nmale\n35\n8.0500\n0\n3\n\n\n0\nmale\nNA\n8.4583\n0\n3\n\n\n\n\n\n\n\n\n\n9.4.1.2 Set categorical variables as factors\nCategorical variables should always be set as factors when fitting regression models in R. In our case there are three: Sex, Survived, and Pclass.\n\ndata <- data %>% \n mutate(Sex = as_factor(Sex),\n Survived = as_factor(Survived),\n Pclass = as_factor(Pclass))\n\n\n\n9.4.1.3 Dealing with missing data\nWe can count the number of empty cells for each variable to see if some data is missing. We do this for each variable in the set.\n\nsum(is.na(data$Survived))\n\n[1] 0\n\nsum(is.na(data$Sex))\n\n[1] 0\n\nsum(is.na(data$Age))\n\n[1] 177\n\nsum(is.na(data$Fare))\n\n[1] 0\n\nsum(is.na(data$FamilySize))\n\n[1] 0\n\nsum(is.na(data$Pclass))\n\n[1] 0\n\n\nOnce we have identified that some columns contain missing data, we have two choices: we can do nothing, in which case these cases will be left out of the regression model, or we can fill the empty cells in some way (this is called imputation). We have many missing values (177 out of 891 observations is quite large) and leaving out these observations could negatively affect the performance of our regression model. 
Therefore, we will assign the average age for all 177 missing age values, which is a typical imputation mechanism to replace NA values with an estimate based on the available data.\n\n# We use na.rm = TRUE otherwise mean(Age) would return NA due to the missing values.\ndata <- data %>% \n mutate(Age = replace_na(Age, round(mean(Age, na.rm=TRUE),0)))\n\nhead(data) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nSurvived\nSex\nAge\nFare\nFamilySize\nPclass\n\n\n\n\n0\nmale\n22\n7.2500\n1\n3\n\n\n1\nfemale\n38\n71.2833\n1\n1\n\n\n1\nfemale\n26\n7.9250\n0\n3\n\n\n1\nfemale\n35\n53.1000\n1\n1\n\n\n0\nmale\n35\n8.0500\n0\n3\n\n\n0\nmale\n30\n8.4583\n0\n3\n\n\n\n\n\n\n\n\n\n9.4.2 Visualizing the relationships\nTo explore the relationship between variables, we can visualize the distribution of independent variable values for each value of the dependent variable. We can use box plots or violin plots for continuous independent variables and bar charts for the categorical variables. To make the process faster, let’s briefly untidy the data and use the gather() function to create key-value pairs for each observation of the dependent variable.\n\ndata_untidy <- gather(data, key = \"Variable\", value = \"Value\",\n -Survived) # Creates key-value pairs for all columns except Survived\nhead(data_untidy) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nSurvived\nVariable\nValue\n\n\n\n\n0\nSex\nmale\n\n\n1\nSex\nfemale\n\n\n1\nSex\nfemale\n\n\n1\nSex\nfemale\n\n\n0\nSex\nmale\n\n\n0\nSex\nmale\n\n\n\n\n\n\n\nWe can now easily create box plots for all our continuous independent variables.\n\ndata_untidy %>% \n filter(Variable != \"Pclass\" & Variable != \"Sex\") %>%\n ggplot(aes(Survived, as.numeric(Value))) +\n facet_wrap(~ Variable, scales = \"free_y\") +\n geom_boxplot()\n\n\n\n\nAnd now we make a bar chart for our categorical independent variables.\n\ndata_untidy %>%\n filter(Variable == \"Pclass\" | Variable == \"Sex\") %>%\n ggplot(aes(Value, fill 
= Survived)) +\n facet_wrap(~ Variable, scales = \"free_x\") +\n geom_bar(position = \"fill\")\n\n\n\n\n\n\n9.4.3 Creating the model\nThe following code generates our logistic regression model using the glm() function (glm stands for generalized linear model). The syntax is glm(predicted variable ~ predictor1 + predictor2 + predictor3..., data, family) where data is our dataset and the family is the type of regression model we want to create. In our case, the family is binomial.\n\nmodel <- glm(Survived ~ Sex + Age + Fare + FamilySize + Pclass,\n data = data,\n family = binomial)\n\n\n9.4.3.1 Model summary\nNow that we have created our model, we can look at the coefficients (estimates) which tell us about the relationship between our predictors and the predicted variable. The Pr(>|z|) column represents the p-value, which determines whether the effect observed is statistically significant. It is common to use 0.05 as the threshold for statistical significance, so all the effects in our model are statistically significant (p < 0.05) except for the fare (p > 0.05).\n\nsummary(model)$coefficients %>% \n kbl() %>% \n kable_classic()\n\n\n\n\n\nEstimate\nStd. Error\nz value\nPr(>|z|)\n\n\n\n\n(Intercept)\n1.0377996\n0.3932615\n2.638956\n0.0083162\n\n\nSexfemale\n2.7763022\n0.1985984\n13.979481\n0.0000000\n\n\nAge\n-0.0388614\n0.0078206\n-4.969087\n0.0000007\n\n\nFare\n0.0032132\n0.0024551\n1.308796\n0.1906036\n\n\nFamilySize\n-0.2435093\n0.0676841\n-3.597733\n0.0003210\n\n\nPclass2\n-1.0021830\n0.2929527\n-3.420972\n0.0006240\n\n\nPclass3\n-2.1318527\n0.2891435\n-7.372993\n0.0000000\n\n\n\n\n\n\n\n\n\n9.4.3.2 Converting log odds to odds ratio\nAs we mentioned above, the coefficients produced by the model are log odds, which are difficult to interpret. We can convert them to odds ratios, which are easier to interpret. This can be done with the exp() function. 
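As a quick sanity check on this conversion, here is the calculation applied by hand to the Sexfemale coefficient (the value 2.7763022 is taken from the model summary above):

```r
# Exponentiating a log-odds coefficient gives the corresponding odds ratio.
log_odds_female <- 2.7763022   # Sexfemale estimate from summary(model)
exp(log_odds_female)           # about 16.06
```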
We can now see that according to our model, the odds of survival for female passengers were about 16 times the odds for male passengers.\n\nbind_rows(exp(model$coefficients)) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\n(Intercept)\nSexfemale\nAge\nFare\nFamilySize\nPclass2\nPclass3\n\n\n\n\n2.822998\n16.05953\n0.961884\n1.003218\n0.7838722\n0.3670772\n0.1186173\n\n\n\n\n\n\n\n\n\n9.4.3.3 Adding confidence intervals\nThe confidence intervals are an estimate of the precision of the odds ratios. In the example below, we use a 95% confidence interval, which means we estimate with 95% confidence that the coefficient for a predictor lies between the 2.5th percentile and the 97.5th percentile values (the two values reported in the table). If we were using a sample to make claims about a population, which does not really apply here due to the unique case of the Titanic, we could then think of the confidence interval as indicating a 95% probability that the true coefficient for the entire population is situated between the two values.\n\nodds_ratio <- cbind(Odds_Ratio = exp(model$coefficients), \n exp(confint(model, level = .95)))\n\nodds_ratio %>% \n kbl() %>% \n kable_classic()\n\n\n\n\n\nOdds_Ratio\n2.5 %\n97.5 %\n\n\n\n\n(Intercept)\n2.8229984\n1.3083312\n6.1328871\n\n\nSexfemale\n16.0595260\n10.9723363\n23.9218257\n\n\nAge\n0.9618840\n0.9469679\n0.9764876\n\n\nFare\n1.0032184\n0.9987147\n1.0086269\n\n\nFamilySize\n0.7838722\n0.6827452\n0.8907643\n\n\nPclass2\n0.3670772\n0.2060871\n0.6511201\n\n\nPclass3\n0.1186173\n0.0670531\n0.2089431\n\n\n\n\n\n\n\n\n\n9.4.3.4 Model predictions\nFirst we add to our data the probability that the passenger survived as calculated by the model.\n\ndata <- tibble(data,\n Probability = model$fitted.values)\n\nhead(data) %>% \n kbl() %>% \n 
kable_classic()\n\n\n\n\nSurvived\nSex\nAge\nFare\nFamilySize\nPclass\nProbability\n\n\n\n\n0\nmale\n22\n7.2500\n1\n3\n0.1025490\n\n\n1\nfemale\n38\n71.2833\n1\n1\n0.9107566\n\n\n1\nfemale\n26\n7.9250\n0\n3\n0.6675927\n\n\n1\nfemale\n35\n53.1000\n1\n1\n0.9153720\n\n\n0\nmale\n35\n8.0500\n0\n3\n0.0810373\n\n\n0\nmale\n30\n8.4583\n0\n3\n0.0968507\n\n\n\n\n\n\n\nThen we obtain the prediction by creating a new variable called Prediction and setting the value to 1 if the calculated probability of survival is greater than 50% and 0 otherwise.\n\ndata <- data %>% \n mutate(Prediction = if_else(Probability > 0.5,1,0))\n\nhead(data) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nSurvived\nSex\nAge\nFare\nFamilySize\nPclass\nProbability\nPrediction\n\n\n\n\n0\nmale\n22\n7.2500\n1\n3\n0.1025490\n0\n\n\n1\nfemale\n38\n71.2833\n1\n1\n0.9107566\n1\n\n\n1\nfemale\n26\n7.9250\n0\n3\n0.6675927\n1\n\n\n1\nfemale\n35\n53.1000\n1\n1\n0.9153720\n1\n\n\n0\nmale\n35\n8.0500\n0\n3\n0.0810373\n0\n\n\n0\nmale\n30\n8.4583\n0\n3\n0.0968507\n0\n\n\n\n\n\n\n\nThen we can print a table comparing the model’s prediction to the real data.\n\ntable(Survived = data$Survived, \n Predicted = data$Prediction)\n\n Predicted\nSurvived 0 1\n 0 476 73\n 1 101 241\n\n\nFinally, we can calculate how well our model fits the data (how good it was at predicting the dependent variable) by testing, for each passenger in the dataset, whether the prediction matched the actual outcome for that passenger. Because that test gives TRUE (1) or FALSE (0) for every row of the dataset, calculating the mean of the test results gives us the percentage of correct predictions, which in our case is about 80.5%.\n\nmean(data$Survived == data$Prediction)\n\n[1] 0.8047138"
},
{
"objectID": "logistic_regression.html#summary",
"href": "logistic_regression.html#summary",
"title": "9 Logistic regression",
"section": "9.5 Summary",
"text": "9.5 Summary\nIn this chapter we introduced the logistic regression model as a useful tool to predict a dichotomous categorical variable based on one or multiple predictors that can be either categorical or continuous. The R scripts provided will work with any dataset, provided it is in a tidy format. So you now have everything you need to try to predict categorical outcomes in all kinds of scenarios and with all kinds of data."
},
{
"objectID": "logistic_regression.html#lab",
"href": "logistic_regression.html#lab",
"title": "9 Logistic regression",
"section": "9.6 Lab",
"text": "9.6 Lab\nFor this week’s lab, your task is to build a logistic regression model on a dataset of your choice. I encourage you to use the dataset that you will be using for your individual project, if it contains a dichotomous variable that you can try to predict using other variables in your dataset."
},
{
"objectID": "topic_modelling.html#learning-objectives",
"href": "topic_modelling.html#learning-objectives",
"title": "8 Topic Modelling",
"section": "8.1 Learning objectives",
"text": "8.1 Learning objectives\n\nTidytext\nTokenization\nStop words\nSentiment analysis\nTF-IDF\nTopic modelling"
},
{
"objectID": "topic_modelling.html#introduction",
"href": "topic_modelling.html#introduction",
"title": "8 Topic Modelling",
"section": "8.2 Introduction",
"text": "8.2 Introduction\nText-mining could be a course of its own. This short introduction covers a few of the most important concepts and provides examples of how to perform a range of text mining tasks. We will introduce a few new packages:\n\ntidytext: a package that contains a lot of functions to work with text in a tidy format.\ntextstem: to perform stemming and lemmatization on tokens\nwordcloud2: to make word clouds\nvader: to perform sentiment analysis\ntopicmodels: to do topic modelling"
},
{
"objectID": "topic_modelling.html#the-dataset",
"href": "topic_modelling.html#the-dataset",
"title": "8 Topic Modelling",
"section": "8.3 The dataset",
"text": "8.3 The dataset\nThe dataset that we will use in this chapter consists of the abstracts of 1672 open access scientific articles published between 2019 and 2020 in 10 journals from different fields of research.\n\n\n\nJournals included in the dataset\n\n\nJournal\n\n\n\n\nAmerican Journal of International Law\n\n\nFood Security\n\n\nJournal of Business Economics and Management\n\n\nJournal of Child Psychology and Psychiatry\n\n\nJournal of Mathematics\n\n\nJournal of the Association for Information Science and Technology\n\n\nNature Astronomy\n\n\nOrganic & Biomolecular Chemistry\n\n\nPublic Health\n\n\nSolar Energy\n\n\n\n\n\n\n\nThe dataset is available here, or can be loaded into R directly from the URL like this:\n\nabstracts <- read_tsv(\"https://pmongeon.github.io/info6270/files/abstracts.txt\") %>% \n filter(journal %in% c(\"A\",\"B\",\"C\",\"D\"))"
},
{
"objectID": "topic_modelling.html#data-formats-for-text-mining",
"href": "topic_modelling.html#data-formats-for-text-mining",
"title": "8 Topic Modelling",
"section": "8.4 Data formats for text mining",
"text": "8.4 Data formats for text mining\nThe two formats that we will consider in this chapter are the tidy text format and the document-term matrix. Let’s start with a tibble that contains two documents with their text.\n\n\n\n\n\nDocument\nText\n\n\n\n\nA\nI love bananas\n\n\nB\nI hate potatoes\n\n\n\n\n\n\n\n\n8.4.1 Tidy text\nThe tidy text format follows the same tidy principles that were introduced earlier in the course. It simply means that each row contains one token. A token is a meaningful unit of text. Most often the tokens are individual words, but they can be sentences, paragraphs, groups of characters or words of a fixed length n (also known as n-grams), etc.\nIn principle the tibble above with the text of our two documents is already in a tidy text format where each row is a meaningful unit of text (the token here would be the entire text). But if we chose words as our tokens, then it would look like this:\n\n\n\n\n\ndocument\nwords\n\n\n\n\nA\ni\n\n\nA\nlove\n\n\nA\nbananas\n\n\nB\ni\n\n\nB\nhate\n\n\nB\npotatoes\n\n\n\n\n\n\n\n\n\n8.4.2 Document-term matrix\nDocument-term matrices contain all documents in a set as rows and all tokens as columns. The cells typically contain the frequency of the term in the document. Here is a document-term matrix for our two documents:\n\n\n\n\n\n\nbananas\nhate\ni\nlove\npotatoes\n\n\n\n\nA\n1\n0\n1\n1\n0\n\n\nB\n0\n1\n1\n0\n1\n\n\n\n\n\n\n\nThe topicmodels package requires data to be formatted in a document-term matrix, but all the other tasks that we will perform in the rest of the chapter work with the tidy text format."
},
{
"objectID": "topic_modelling.html#tokenizing",
"href": "topic_modelling.html#tokenizing",
"title": "8 Topic Modelling",
"section": "8.5 Tokenizing",
"text": "8.5 Tokenizing\nTokenizing is the process of dividing the text of our documents into our chosen units. The unnest_tokens() function of the tidytext package can be used for this task. The main arguments are:\n\noutput: the name of the column that will contain the tokens.\ninput: the name of the column that currently contains the text.\ntoken: the desired type of token (possible values are “words”,“sentences”, “ngrams”)\nn: this specifies the value of n for n-grams.\n\nNote that you could write code that does the exact same thing as the unnest_tokens() function with the set of functions that you learned in chapters 3 and 4. The unnest_tokens() function simply makes that code much, much more efficient.\n\n8.5.1 Words\nThe following code divides the abstracts into individual words with token = \"words\".\n\nabstracts_words <- abstracts %>% \n unnest_tokens(output = word,\n input = abstract,\n token = \"words\")\n\nhead(abstracts_words) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nid\njournal\nword\n\n\n\n\n1\nA\nparenting\n\n\n1\nA\nbehaviors\n\n\n1\nA\nhave\n\n\n1\nA\nbeen\n\n\n1\nA\nshown\n\n\n1\nA\nto\n\n\n\n\n\n\n\n\n\n8.5.2 Sentences\nYou can tokenize by sentence with token = \"sentences\".\n\nabstracts_sentences <- abstracts %>% \n unnest_tokens(output = sentence,\n input = abstract,\n token = \"sentences\")\n\nhead(abstracts_sentences) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nid\njournal\nsentence\n\n\n\n\n1\nA\nparenting behaviors have been shown to moderate the association between sensation seeking and antisocial behaviors.\n\n\n1\nA\ndata were obtained from the boricua youth study, a longitudinal study of 2,491 puerto rican youth living in the south bronx, new york, and the metropolitan area of san juan, puerto rico.\n\n\n1\nA\nfirst, we examined the prospective relationship between sensation seeking and antisocial behaviors across 3 yearly waves and whether this relationship varied by sociodemographic factors.\n\n\n1\nA\nsecond, we 
examined the moderating role of parenting behaviors-including parental monitoring, warmth, and coercive discipline-on the prospective relationship between sensation seeking and antisocial behaviors.\n\n\n1\nA\nsensation seeking was a strong predictor of antisocial behaviors for youth across two different sociocultural contexts.\n\n\n1\nA\nhigh parental monitoring buffered the association between sensation seeking and antisocial behaviors, protecting individuals with this trait.\n\n\n\n\n\n\n\n\n\n8.5.3 N-grams\nYou can tokenize by groups of words of size N with token = \"ngrams\" and specifying how many words you want in each group with n = N.\n\nabstracts_bigrams <- abstracts %>% \n unnest_tokens(output = bigram,\n input = abstract,\n token = \"ngrams\",\n n = 2)\n\nhead(abstracts_bigrams) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nid\njournal\nbigram\n\n\n\n\n1\nA\nparenting behaviors\n\n\n1\nA\nbehaviors have\n\n\n1\nA\nhave been\n\n\n1\nA\nbeen shown\n\n\n1\nA\nshown to\n\n\n1\nA\nto moderate\n\n\n\n\n\n\n\nFor the rest of the chapter, we’ll use the abstracts_words tibble, in which our tokens are individual words."
},
{
"objectID": "topic_modelling.html#removing-stop-words",
"href": "topic_modelling.html#removing-stop-words",
"title": "8 Topic Modelling",
"section": "8.6 Removing stop words",
"text": "8.6 Removing stop words\nThe English language is full of words that carry little to no meaning, and that tend to be the most frequent terms we use. To illustrate this, let’s look at the ten most frequent terms in the abstracts dataset.\n\n\n\n\n\nword\nn\n\n\n\n\nthe\n5626\n\n\nof\n4463\n\n\nand\n4049\n\n\nin\n2519\n\n\nto\n2427\n\n\na\n1932\n\n\nwith\n1217\n\n\nfor\n1106\n\n\nthat\n1087\n\n\nis\n893\n\n\n\n\n\n\n\nWe can see that the dataset is dominated by terms that don’t carry much meaning, such as “a”, “of”, “the”, and “and”. These are called stop words and, fortunately, the tidytext package includes a dataset called stop_words that we can use to eliminate these words from our data. Let’s look at a few rows from the stop_words dataset.\n\nhead(stop_words) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nword\nlexicon\n\n\n\n\na\nSMART\n\n\na's\nSMART\n\n\nable\nSMART\n\n\nabout\nSMART\n\n\nabove\nSMART\n\n\naccording\nSMART\n\n\n\n\n\n\n\nWe can use the anti_join() function to remove any row in a tibble that matches another tibble on a specified variable. In this case, we want to remove all the words that are in the stop_words tibble.\n\nabstracts_words <- abstracts_words %>% \n anti_join(stop_words, by=\"word\")\n\nIf we take a look at the most frequent terms again, we see that most of those little meaningless words are gone.\n\nabstracts_words %>% \n count(word, sort = TRUE) %>% \n top_n(10) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nword\nn\n\n\n\n\nfood\n555\n\n\nchildren\n292\n\n\nstudy\n267\n\n\nsymptoms\n266\n\n\n1\n248\n\n\nage\n230\n\n\nadhd\n215\n\n\nrisk\n208\n\n\ndata\n207\n\n\n2\n179"
},
{
"objectID": "topic_modelling.html#stemming-and-lemmatization",
"href": "topic_modelling.html#stemming-and-lemmatization",
"title": "8 Topic Modelling",
"section": "8.7 Stemming and lemmatization",
"text": "8.7 Stemming and lemmatization\nThe next issue that we typically face with text mining is the presence of word variations that are considered distinct tokens. These variations include plural and singular terms, verb conjugations, etc. Here is an example of a set of tokens from our dataset.\n\n\n\n\n\nword\n\n\n\n\nstudy\n\n\nstudies\n\n\nstudents\n\n\nstudent\n\n\nstudy's\n\n\nstudied\n\n\nunderstudied\n\n\nstudying\n\n\n\n\n\n\n\nStemming and lemmatization are two ways of normalizing words by reducing each word to its base form, called the stem and the lemma, respectively. Here are the stems and the lemmas of the words containing the string “stud”.\n\n\n\n\n\nword\nstem\nlemma\n\n\n\n\nstudy\nstudi\nstudy\n\n\nstudies\nstudi\nstudy\n\n\nstudents\nstudent\nstudent\n\n\nstudent\nstudent\nstudent\n\n\nstudy's\nstudy'\nstudy's\n\n\nstudied\nstudi\nstudy\n\n\n\n\n\n\n\nUnderstanding how stemming and lemmatization work is beyond the scope of this course, but you should know that:\n\nboth methods work well for eliminating the plural form.\nstems are not always an actual word but the root of a word (ex. “moder” is the stem of “moderate”), while lemmas are always actual words.\nstemming usually produces a lower number of unique words than lemmatization.\n\nWe can easily add the stem and the lemma of words in our tibble using the stem_words() and lemmatize_words() functions from the textstem package, which also includes stem_strings() and lemmatize_strings() functions when dealing with tokens that are more than one word.\nFor this chapter, I will use lemmatization to normalize my words. 
I do so by replacing my words with their lemma with the following code.\n\nlibrary(textstem)\nabstracts_words <- abstracts_words %>% \n mutate(word = lemmatize_words(word))\n\nhead(abstracts_words, n=10) %>% \n kbl() %>% \n kable_classic() \n\n\n\n\nid\njournal\nword\n\n\n\n\n1\nA\nparenting\n\n\n1\nA\nbehavior\n\n\n1\nA\nshow\n\n\n1\nA\nmoderate\n\n\n1\nA\nassociation\n\n\n1\nA\nsensation\n\n\n1\nA\nseek\n\n\n1\nA\nantisocial\n\n\n1\nA\nbehavior\n\n\n1\nA\ndatum"
},
{
"objectID": "topic_modelling.html#term-frequency",
"href": "topic_modelling.html#term-frequency",
"title": "8 Topic Modelling",
"section": "8.8 Term frequency",
"text": "8.8 Term frequency\nA first exploration of our data is to calculate the frequency of each term for the whole set of documents, for each journal, and for each individual document.\n\n8.8.1 Overall term frequency\n\nabstracts_words %>% \n count(word, sort = TRUE) %>% \n top_n(20) %>% # same as head(n=20) \n kbl() %>% \n kable_classic()\n\n\n\n\nword\nn\n\n\n\n\nfood\n584\n\n\nchild\n441\n\n\nstudy\n375\n\n\nage\n316\n\n\nmodel\n303\n\n\nreport\n295\n\n\nstar\n293\n\n\nsymptom\n293\n\n\nsystem\n293\n\n\n1\n248\n\n\nlow\n239\n\n\nrisk\n228\n\n\nadhd\n215\n\n\nincrease\n208\n\n\ndatum\n207\n\n\nhousehold\n201\n\n\nmass\n201\n\n\neffect\n199\n\n\nmeasure\n198\n\n\nresult\n194\n\n\n\n\n\n\n\n\n\n8.8.2 Term frequency by journal\n\nabstracts_words %>% \n group_by(journal) %>% \n count(word, sort = TRUE) %>% \n slice(1:8) %>% \n ggplot() + \n aes(x=n, y=reorder(word,n)) +\n geom_col() +\n facet_wrap(facet = ~journal, \n ncol=2, \n scale=\"free\") +\n theme_minimal() +\n labs(y = \"Word\", x=\"Frequency\") +\n scale_y_reordered()\n\n\n\n\n\n\n8.8.3 Term frequency by document\n\nabstracts_words %>%\n filter(id <= 4) %>% # I select papers with ID 1 to 4\n group_by(id) %>% # I chose the papers as my unit of analysis\n count(word, \n sort = TRUE) %>% \n slice(1:8) %>% # get the 8 most frequent terms for each document\n ggplot() + \n aes(x = n, \n y = reorder(word, n)) +\n geom_col() +\n facet_wrap(facet = ~id, # here I use the papers_id for the facet\n ncol=2, \n scale=\"free\") + \n theme_minimal() +\n labs(y = \"Word\", x = \"Frequency\")\n\n\n\n\n\n\n8.8.4 Word clouds\nWhile word clouds are not more informative than term frequency tables, they are a space-efficient means to show a larger number of words and their relative frequency in a set. The wordcloud and wordcloud2 packages can be used to generate word clouds. 
Let’s look at the word cloud for the 100 most frequent terms for each journal.\n\nlibrary(wordcloud2)\nabstracts_words %>%\n filter(journal == \"A\") %>% \n count(word, sort=TRUE) %>% # adds a column n with the frequency of the word.\n slice(1:100) %>% \n wordcloud2(size = 0.3, # Size of the text (default is 1)\n minRotation = -pi/2, # Min rotation is 90 degrees\n maxRotation = -pi/2, # Max rotation is 90 degrees\n rotateRatio = 0, # percentage of words to rotate (none, in this case)\n shape = \"circle\",\n color=\"black\")\n\n\n\nlibrary(wordcloud2)\nabstracts_words %>%\n filter(journal == \"B\") %>% \n count(word, sort=TRUE) %>% # adds a column n with the frequency of the word.\n slice(1:100) %>% \n wordcloud2(size = 0.3, # Size of the text (default is 1)\n minRotation = -pi/2, # Min rotation is 90 degrees\n maxRotation = -pi/2, # Max rotation is 90 degrees\n rotateRatio = 0, # percentage of words to rotate (none, in this case)\n shape = \"circle\",\n color=\"black\")\n\n\n\nlibrary(wordcloud2)\nabstracts_words %>%\n filter(journal == \"C\") %>% \n count(word, sort=TRUE) %>% # adds a column n with the frequency of the word.\n slice(1:100) %>% \n wordcloud2(size = 0.3, # Size of the text (default is 1)\n minRotation = -pi/2, # Min rotation is 90 degrees\n maxRotation = -pi/2, # Max rotation is 90 degrees\n rotateRatio = 0, # percentage of words to rotate (none, in this case)\n shape = \"circle\",\n color=\"black\")\n\n\n\nlibrary(wordcloud2)\nabstracts_words %>%\n filter(journal == \"D\") %>% \n count(word, sort=TRUE) %>% # adds a column n with the frequency of the word.\n slice(1:100) %>% \n wordcloud2(size = 0.5, # Size of the text (default is 1)\n minRotation = -pi/2, # Min rotation is 90 degrees\n maxRotation = -pi/2, # Max rotation is 90 degrees\n rotateRatio = 0, # percentage of words to rotate (none, in this case)\n shape = \"circle\",\n color=\"black\")\n\n\nAs you can see, the word clouds give us a pretty good idea of what fields the journals included in the 
dataset may be from.\n\n\n8.8.5 TF-IDF\nTF-IDF stands for term frequency * inverse document frequency. Essentially, it measures how important (frequent) a term is in a document, discounted by how common that term is across the corpus. A term that appears in every document gets a tf-idf of zero. The higher the tf-idf value, the more specific the term is to a document, and therefore the more indicative of its specific topic.\n\nabstracts_words <- abstracts_words %>%\n group_by(id) %>% \n count(word, sort=TRUE) %>% \n bind_tf_idf(term = word, \n document = id, \n n = n)\n\nLet’s look at the most important words in a set of nine articles. These are the words with the highest tf-idf values.\n\nabstracts_words %>% \n filter(id <= 9) %>% \n group_by(id) %>%\n top_n(8) %>% \n ggplot() + \n aes(x=tf_idf, y=reorder(word, tf_idf)) +\n geom_col(show.legend = FALSE) +\n facet_wrap(~id, ncol = 3, scales = \"free\") +\n labs(y = \"Word\")\n\n\n\n\n\nabstracts_words %>% \n filter(id <= 9) %>% \n group_by(id) %>%\n top_n(8) %>%\n ggplot() + \n aes(x=n, y=reorder(word, n)) +\n geom_col(show.legend = FALSE) +\n facet_wrap(~id, ncol = 3, scales = \"free\") +\n labs(y=\"Word\")\n\n\n\n\nSentiment analysis\nAs the name suggests, sentiment analysis is a technique used to identify sentiment in text. The most basic sentiment analysis methods use lists of terms with an associated sentiment (positive or negative) or a numerical value representing the strength of the sentiment: -2 (very negative) to +2 (very positive), for instance. The tidytext package includes a few of those term lists with associated sentiments, which you can access with the get_sentiments() function. 
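Before importing the real lexicons, the lookup-and-sum principle can be sketched with a made-up lexicon (the words and values below are hypothetical, not from any real lexicon):

```r
library(dplyr)
library(tibble)

# Hypothetical mini-lexicon: each word carries a sentiment value
toy_lexicon <- tibble(word  = c("good", "bad", "happy", "sad"),
                      value = c(2, -2, 2, -2))

# A tokenized "document", one word per row
tokens <- tibble(word = c("good", "news", "sad", "ending"))

# Words missing from the lexicon get NA; na.rm = TRUE ignores them in the sum
tokens %>%
  left_join(toy_lexicon, by = "word") %>%
  summarize(sentiment = sum(value, na.rm = TRUE))
# sentiment = 2 + (-2) = 0
```

This is exactly the pattern applied to the abstracts below, only with a real lexicon in place of toy_lexicon.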
The following example imports the bing list and the afinn list into two tibbles.\n\nsentiment_bing <- get_sentiments(\"bing\")\nsentiment_afinn <- get_sentiments(\"afinn\")\n\nLet’s take a look.\n\nhead(sentiment_bing) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nword\nsentiment\n\n\n\n\n2-faces\nnegative\n\n\nabnormal\nnegative\n\n\nabolish\nnegative\n\n\nabominable\nnegative\n\n\nabominably\nnegative\n\n\nabominate\nnegative\n\n\n\n\n\n\n\n\nhead(sentiment_afinn) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nword\nvalue\n\n\n\n\nabandon\n-2\n\n\nabandoned\n-2\n\n\nabandons\n-2\n\n\nabducted\n-2\n\n\nabduction\n-2\n\n\nabductions\n-2\n\n\n\n\n\n\n\nWe can then assign a sentiment to the words in our abstracts with the left_join() function that we introduced in chapter 3. Let’s use the afinn list for this example.\n\nabstracts_words <- abstracts_words %>%\n left_join(sentiment_afinn, by=\"word\")\n\nhead(abstracts_words) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nid\nword\nn\ntf\nidf\ntf_idf\nvalue\n\n\n\n\n10\nccbt\n20\n0.0847458\n6.228511\n0.5278399\nNA\n\n\n1276\nyield\n19\n0.1210191\n2.932674\n0.3549096\nNA\n\n\n1598\nfood\n19\n0.1043956\n1.706722\n0.1781743\nNA\n\n\n244\nsct\n18\n0.1011236\n6.228511\n0.6298494\nNA\n\n\n256\nanxiety\n18\n0.1208054\n2.794524\n0.3375935\n-2\n\n\n20\nsleep\n17\n0.0949721\n3.589454\n0.3408978\nNA\n\n\n\n\n\n\n\nNow let’s just sum the score for each abstract to have a single sentiment score for each article, and order abstracts from the highest sentiment score to the lowest.\n\nabstracts_sentiment <- abstracts_words %>% \n group_by(id) %>% \n summarize(sentiment = sum(value, na.rm=TRUE)) %>% \n arrange(desc(sentiment))\n\nAnd then let’s look at the most positive abstracts and the most negative abstracts with head() and tail().\n\nabstracts_sentiment %>% \n inner_join(abstracts, by=\"id\") %>% \n select(id, journal, abstract, sentiment) %>% \n head()\n\n# A tibble: 6 × 4\n id journal abstract sentiment\n <dbl> <chr> <chr> <dbl>\n1 
1137 D Rwanda has experienced significant economic growth fo… 16\n2 1284 D Assessing progress towards healthier people, farms an… 15\n3 1138 D This paper concerns Drought-Tolerant Maize (DTM) and … 13\n4 214 A Care for children and adolescents with psychosocial p… 12\n5 1451 D Using survey dataset collected from nearly 9000 farme… 12\n6 73 A BackgroundChildren in the UK go through rigorous teac… 11\n\n\n\nabstracts_sentiment %>% \n inner_join(abstracts, by=\"id\") %>% \n select(id, journal, abstract,sentiment) %>% \n tail()\n\n# A tibble: 6 × 4\n id journal abstract sentiment\n <dbl> <chr> <chr> <dbl>\n1 174 A \"Suicide is the second leading cause of death in youn… -14\n2 1598 D \"This opinion article results from a collective analy… -14\n3 248 A \"Only one-third of young people who experience suicid… -15\n4 1532 D \"Children who experience poor nutrition during the fi… -16\n5 50 A \"Children with developmental disabilities are at heig… -21\n6 320 A \"Post-traumatic stress disorder (PTSD) is a common re… -22\n\n\nLet’s compare the average sentiment for the four journals in the dataset.\n\nabstracts_sentiment %>% \n inner_join(abstracts, by=\"id\") %>% \n select(id, journal, abstract, sentiment) %>% \n group_by(journal) %>% \n summarize(mean_sentiment = mean(sentiment)) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\njournal\nmean_sentiment\n\n\n\n\nA\n-2.268293\n\n\nB\n-1.404255\n\n\nC\n2.010638\n\n\nD\n1.527778\n\n\n\n\n\n\n\n\n8.8.5.1 The vader package\nValence Aware Dictionary and sEntiment Reasoner (VADER) is a popular sentiment analysis tool that takes the context of words into account, so that good (a positive term) can be interpreted as negative when it is preceded by the word not (as in “not good”).\nYou can calculate the sentiment and get a data frame with the vader_df() function. 
This is an example for the first abstract in our dataset.\n\nlibrary(vader)\nvader_df(abstracts$abstract[1])\n\n text\n1 Parenting behaviors have been shown to moderate the association between sensation seeking and antisocial behaviors. Data were obtained from the Boricua Youth Study, a longitudinal study of 2,491 Puerto Rican youth living in the South Bronx, New York, and the metropolitan area of San Juan, Puerto Rico. First, we examined the prospective relationship between sensation seeking and antisocial behaviors across 3 yearly waves and whether this relationship varied by sociodemographic factors. Second, we examined the moderating role of parenting behaviors-including parental monitoring, warmth, and coercive discipline-on the prospective relationship between sensation seeking and antisocial behaviors. Sensation seeking was a strong predictor of antisocial behaviors for youth across two different sociocultural contexts. High parental monitoring buffered the association between sensation seeking and antisocial behaviors, protecting individuals with this trait. Low parental warmth was associated with high levels of antisocial behaviors, regardless of the sensation seeking level. Among those with high parental warmth, sensation seeking predicted antisocial behaviors, but the levels of antisocial behaviors were never as high as those of youth with low parental warmth. Study findings underscore the relevance of person-family context interactions in the development of antisocial behaviors. 
Future interventions should focus on the interplay between individual vulnerabilities and family context to prevent the unhealthy expression of a trait that is present in many individuals.\n word_scores\n1 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.55, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1.65, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.9, 0, 0, 0, 0, 0.15, 0, -3.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}\n compound pos neu neg but_count\n1 0.153 0.058 0.896 0.047 1"
},
{
"objectID": "topic_modelling.html#topic-modelling",
"href": "topic_modelling.html#topic-modelling",
"title": "8 Topic Modelling",
"section": "8.9 Topic modelling",
"text": "8.9 Topic modelling\nIn this section we will use a very common topic modelling technique called Latent Dirichlet Allocation (LDA). It is a dimensionality reduction technique that takes groups a large number of documents into a smaller number of topics.\n\nabstracts_dtm <- abstracts_words %>%\n cast_dtm(document = id, # name of the column with your document id \n term = word, # name of the column with your tokens\n value = n) # name of the column with your token frequencies\n\n\n8.9.1 Generate topic models\nYou can use the code below, and replace abstracts_dtm with the name of your document-term matrix object, and change the value of k to the number of topics that you want.\n\nlibrary(topicmodels)\nabstracts_lda <- abstracts_dtm %>% \n LDA(k = 4, # Number of topics desired\n control = list(seed = 1234)) # Just leave this on as is\n\n\n\n8.9.2 Create a tidy text tibble with the topic-token association\nThe beta value is the strength of the association of a token with a topic. The higher the beta, the most strongly the term is associated to the topic.\n\ntopic_term_beta <- abstracts_lda %>% \n tidy(matrix = \"beta\")\n\n\ntopic_term_beta %>%\n group_by(topic) %>%\n slice_max(beta, n = 10) %>% \n mutate(term = reorder_within(term, beta, topic)) %>% \n ggplot() +\n aes(x=beta, y=term) +\n geom_col(show.legend = FALSE) +\n facet_wrap(~topic, scales = \"free\") +\n scale_y_reordered()\n\n\n\n\n\n\n8.9.3 Create a tidy text tibble with abstract-topic association\nThe gamma value is the strength of the association of a document with a topic. 
The higher the gamma, the more strongly the document is associated with the topic.\n\nabstracts_topic <- abstracts_lda %>% \n tidy(matrix = \"gamma\")\n\n\nabstracts_topic %>% \n filter(as.numeric(document) <= 9) %>% \n group_by(document) %>%\n arrange(document, topic) %>% \n ggplot() + \n aes(x=gamma, y=reorder(topic, gamma)) +\n geom_col(show.legend = FALSE) +\n facet_wrap(~document, ncol = 3, scales = \"free\") +\n labs(y=\"Topic\")"
},
{
"objectID": "topic_modelling.html#other-resources",
"href": "topic_modelling.html#other-resources",
"title": "8 Topic Modelling",
"section": "8.10 Other resources",
"text": "8.10 Other resources\n\nProject gutenberg (https://www.gutenberg.org/) provides access to a large collection of free eBooks that can be downloaded in a plain text (.txt) format that is convenient for text-mining. the gutenbergr package allows you to search the project gutenberg collection and import books directly in R.\nText mining with R (https://www.tidytextmining.com/index.html) by Julia Silge and David Robinson is a great resource for beginners with lots of R examples. The book includes several examples that use the gutenbergr package."
},
{
"objectID": "summarizing_data.html#learning-objectives",
"href": "summarizing_data.html#learning-objectives",
"title": "6 Summarizing data",
"section": "6.1 Learning objectives",
"text": "6.1 Learning objectives\n\nFrequency tables\nDescriptive statistics"
},
{
"objectID": "summarizing_data.html#introduction",
"href": "summarizing_data.html#introduction",
"title": "6 Summarizing data",
"section": "6.2 Introduction",
"text": "6.2 Introduction\nIn this chapter, we use numbers and tables to summarize our dataset. These summaries can sometimes suffice to fulfill the goals of our analysis when these goals are descriptive. The summaries also help us (and our readers) get to know more about our data and help us ensure that our data is adequate to perform the intended analyses. First we will consider what types of variables we have in our dataset (this is related, but not the same as the R data types), and then we will go through the process of generating useful summaries for variables of different types."
},
{
"objectID": "summarizing_data.html#types-of-variable",
"href": "summarizing_data.html#types-of-variable",
"title": "6 Summarizing data",
"section": "6.3 Types of variable",
"text": "6.3 Types of variable\n\n6.3.1 Categorical variable\nCategorical variables are groups or categories. They can be represented by characters or numbers\n\nNominal variables represent categories or groups (e.g., gender, occupation, course, programs, university) and where there is no logical order between the different categories.\nOrdinal variables represent categories or groups that have a logical order (e.g., age groups)\n\n\n\n6.3.2 Numerical variables\nNumerical variables are represented by numbers\n\ndiscrete numerical variables can only take a certain number of values (like the numbers on a die or the number of pets a person has). Another way to think about those is that they are things that can be counted (number of cars, number of students, number of pets).\nContinuous variables can be measured and can theoretically take any value (e.g., the weight or height of a person, the distance between two cities, a price). They are things that can be measured.\n\n\n\n\n\n\n\nCategories represented with numbers\n\n\n\nIt is important to look at your data to understand what the values represent. Sometimes you may have groups that are represented with numbers. 
When deciding what type of statistical analysis is adequate for a given variable, you most likely will want to consider treating those variables as categorical and not numerical.\n\n\nThe following code creates a table that describes the variables included in the mpg dataset.\n\ntibble(\n Variable = colnames(mpg),\n Type = c(\"Nominal\",\n \"Nominal\",\n \"Continuous\",\n \"Discrete\",\n \"Discrete\",\n \"Nominal\",\n \"Nominal\",\n \"Continuous\",\n \"Continuous\",\n \"Nominal\",\n \"Nominal\"),\n Description = c(\"Manufacturer name\",\n \"Model name\",\n \"Engine displacement, in litres\",\n \"Year of manufacture\",\n \"Number of cylinders\",\n \"Type of transmission\",\n \"Type of drive train\",\n \"City miles per gallon\",\n \"Highway miles per gallon\",\n \"Fuel type\",\n \"Type of car\")\n ) %>% \n kbl(\n caption = \"Variables in the mpg dataset.\",\n align = c(\"l\",\"l\",\"l\")\n ) %>% \n kable_classic()\n\n\nVariables in the mpg dataset.\n\n\nVariable\nType\nDescription\n\n\n\n\nmanufacturer\nNominal\nManufacturer name\n\n\nmodel\nNominal\nModel name\n\n\ndispl\nContinuous\nEngine displacement, in litres\n\n\nyear\nDiscrete\nYear of manufacture\n\n\ncyl\nDiscrete\nNumber of cylinders\n\n\ntrans\nNominal\nType of transmission\n\n\ndrv\nNominal\nType of drive train\n\n\ncty\nContinuous\nCity miles per gallon\n\n\nhwy\nContinuous\nHighway miles per gallon\n\n\nfl\nNominal\nFuel type\n\n\nclass\nNominal\nType of car"
},
{
"objectID": "summarizing_data.html#summarizing-categorical-data",
"href": "summarizing_data.html#summarizing-categorical-data",
"title": "6 Summarizing data",
"section": "6.4 Summarizing categorical data",
"text": "6.4 Summarizing categorical data\nThere is not a lot that you can do with a single categorical variable other than reporting the frequency (number) and relative frequency (percentage) of observations for each category.\n\n6.4.1 Frequency\nWe can use the summarize() and n() functions of the dplyr package (included in the tidyverse) to create a table of frequencies for the manufacturer variable. the group_by() function specifies which categorical variable I want to summarize. If we don’t use the group_by() function, we obtain the number of observations (rows) in the whole tibble.\n\nmpg %>% \n group_by(manufacturer) %>% \n summarize(freq = n()) %>% # n() counts the number of observations for each group.\n kbl() %>% \n kable_classic()\n\n\n\n\nmanufacturer\nfreq\n\n\n\n\naudi\n18\n\n\nchevrolet\n19\n\n\ndodge\n37\n\n\nford\n25\n\n\nhonda\n9\n\n\nhyundai\n14\n\n\njeep\n8\n\n\nland rover\n4\n\n\nlincoln\n3\n\n\nmercury\n4\n\n\nnissan\n13\n\n\npontiac\n5\n\n\nsubaru\n14\n\n\ntoyota\n34\n\n\nvolkswagen\n27\n\n\n\n\n\n\n\n\n\n6.4.2 Relative frequency\nThe relative frequency is simply the frequency represented as a percentage rather than count. 
It is obtained by first computing the frequency and then calculating the relative frequency by dividing each count by the sum of the counts: mutate(rel_freq = freq/sum(freq)).\n\nmpg %>% \n group_by(manufacturer) %>% \n summarize(freq = n()) %>% # n() counts the number of observations for each group.\n mutate(rel_freq = freq/sum(freq)) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nmanufacturer\nfreq\nrel_freq\n\n\n\n\naudi\n18\n0.0769231\n\n\nchevrolet\n19\n0.0811966\n\n\ndodge\n37\n0.1581197\n\n\nford\n25\n0.1068376\n\n\nhonda\n9\n0.0384615\n\n\nhyundai\n14\n0.0598291\n\n\njeep\n8\n0.0341880\n\n\nland rover\n4\n0.0170940\n\n\nlincoln\n3\n0.0128205\n\n\nmercury\n4\n0.0170940\n\n\nnissan\n13\n0.0555556\n\n\npontiac\n5\n0.0213675\n\n\nsubaru\n14\n0.0598291\n\n\ntoyota\n34\n0.1452991\n\n\nvolkswagen\n27\n0.1153846\n\n\n\n\n\n\n\n\n6.4.2.1 Rounding the values\nWhen we calculate the relative frequency, we obtain numbers with a lot of decimals. We can use the round() function to specify the number of decimals we want. The syntax is round(value, number of decimals).\n\nmpg %>% \n group_by(manufacturer) %>% \n summarize(freq = n()) %>% # n() counts the number of observations for each group.\n mutate(rel_freq = round(freq/sum(freq),3)) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nmanufacturer\nfreq\nrel_freq\n\n\n\n\naudi\n18\n0.077\n\n\nchevrolet\n19\n0.081\n\n\ndodge\n37\n0.158\n\n\nford\n25\n0.107\n\n\nhonda\n9\n0.038\n\n\nhyundai\n14\n0.060\n\n\njeep\n8\n0.034\n\n\nland rover\n4\n0.017\n\n\nlincoln\n3\n0.013\n\n\nmercury\n4\n0.017\n\n\nnissan\n13\n0.056\n\n\npontiac\n5\n0.021\n\n\nsubaru\n14\n0.060\n\n\ntoyota\n34\n0.145\n\n\nvolkswagen\n27\n0.115\n\n\n\n\n\n\n\n\n\n6.4.2.2 Converting the relative frequency to percentages\nAnother thing we might want to do is show the relative frequency as a percentage. 
This can be done by multiplying the relative frequency by 100.\n\nmpg %>% \n group_by(manufacturer) %>% \n summarize(freq = n()) %>% # n() counts the number of observations for each group.\n mutate(pct = round(freq/sum(freq)*100,1)) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nmanufacturer\nfreq\npct\n\n\n\n\naudi\n18\n7.7\n\n\nchevrolet\n19\n8.1\n\n\ndodge\n37\n15.8\n\n\nford\n25\n10.7\n\n\nhonda\n9\n3.8\n\n\nhyundai\n14\n6.0\n\n\njeep\n8\n3.4\n\n\nland rover\n4\n1.7\n\n\nlincoln\n3\n1.3\n\n\nmercury\n4\n1.7\n\n\nnissan\n13\n5.6\n\n\npontiac\n5\n2.1\n\n\nsubaru\n14\n6.0\n\n\ntoyota\n34\n14.5\n\n\nvolkswagen\n27\n11.5\n\n\n\n\n\n\n\n\n\n6.4.2.3 Ordering the categories\nI can use arrange() to reorder my table using alphabetical order or frequency. The syntax of the arrange function is arrange(x, variable to use for ordering). The variables are ordered in ascending order by default. To arrange your variable in descending order, you use desc(), like this: arrange(x, desc(variable to use for ordering)). 
You can use arrange with numeric values or characters.\n\n6.4.2.3.1 Order by frequency (ascending)\n\nmpg %>% \n group_by(manufacturer) %>% \n summarize(freq = n()) %>% # n() counts the number of observations for each group.\n mutate(pct = round(freq/sum(freq)*100,1)) %>% \n arrange(freq) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nmanufacturer\nfreq\npct\n\n\n\n\nlincoln\n3\n1.3\n\n\nland rover\n4\n1.7\n\n\nmercury\n4\n1.7\n\n\npontiac\n5\n2.1\n\n\njeep\n8\n3.4\n\n\nhonda\n9\n3.8\n\n\nnissan\n13\n5.6\n\n\nhyundai\n14\n6.0\n\n\nsubaru\n14\n6.0\n\n\naudi\n18\n7.7\n\n\nchevrolet\n19\n8.1\n\n\nford\n25\n10.7\n\n\nvolkswagen\n27\n11.5\n\n\ntoyota\n34\n14.5\n\n\ndodge\n37\n15.8\n\n\n\n\n\n\n\n\n\n6.4.2.3.2 Order by frequency (decreasing)\n\nmpg %>% \n group_by(manufacturer) %>% \n summarize(freq = n()) %>% # n() counts the number of observations for each group.\n mutate(pct = round(freq/sum(freq)*100,1)) %>% \n arrange(desc(freq)) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nmanufacturer\nfreq\npct\n\n\n\n\ndodge\n37\n15.8\n\n\ntoyota\n34\n14.5\n\n\nvolkswagen\n27\n11.5\n\n\nford\n25\n10.7\n\n\nchevrolet\n19\n8.1\n\n\naudi\n18\n7.7\n\n\nhyundai\n14\n6.0\n\n\nsubaru\n14\n6.0\n\n\nnissan\n13\n5.6\n\n\nhonda\n9\n3.8\n\n\njeep\n8\n3.4\n\n\npontiac\n5\n2.1\n\n\nland rover\n4\n1.7\n\n\nmercury\n4\n1.7\n\n\nlincoln\n3\n1.3\n\n\n\n\n\n\n\n\n\n\n\n6.4.3 Adding a total\nIn chapter 3 we learned about the bind_rows() function that can be used to append one tibble to another. So the process for adding a new row with a total is:\n\nStore your frequency table in an object\nCreate a new tibble that contains the totals. 
(important: make sure that the column names of this tibble are exactly the same as your frequency table, otherwise, bind_rows() will create new columns).\nUse bind_rows() to add the tibble with the total to the frequency table.\n\n\n# Store your frequency table in an object\ntable <- mpg %>% \n group_by(manufacturer) %>% \n summarize(freq = n()) %>% # n() counts the number of observations for each group.\n mutate(pct = round(freq/sum(freq)*100,1)) %>% \n arrange(desc(freq))\n\n# Create a new tibble that contains the totals.\ntotals <- table %>% \n summarize(freq = sum(freq),\n pct = sum(pct)) %>% \n mutate(manufacturer = \"Total\") %>% \n select(manufacturer, freq, pct)\n \n# Use bind_rows() to combine the tibbles\n \ntable %>% \n bind_rows(totals) %>% \n kbl() %>% \n kable_classic()\n\n\n\n\nmanufacturer\nfreq\npct\n\n\n\n\ndodge\n37\n15.8\n\n\ntoyota\n34\n14.5\n\n\nvolkswagen\n27\n11.5\n\n\nford\n25\n10.7\n\n\nchevrolet\n19\n8.1\n\n\naudi\n18\n7.7\n\n\nhyundai\n14\n6.0\n\n\nsubaru\n14\n6.0\n\n\nnissan\n13\n5.6\n\n\nhonda\n9\n3.8\n\n\njeep\n8\n3.4\n\n\npontiac\n5\n2.1\n\n\nland rover\n4\n1.7\n\n\nmercury\n4\n1.7\n\n\nlincoln\n3\n1.3\n\n\nTotal\n234\n99.9"
},
{
"objectID": "summarizing_data.html#summarizing-numerical-data",
"href": "summarizing_data.html#summarizing-numerical-data",
"title": "6 Summarizing data",
"section": "6.5 Summarizing numerical data",
"text": "6.5 Summarizing numerical data\nSummarizing numerical data is not done with frequency tables but with statistical summaries that include various measures that can be divided into three groups:\n\nMeasures of centrality\nMeasures of dispersion\nMeasures of skewness\n\n\n6.5.1 Measures of centrality\n\n\n\n\n\n\n\n\n\nStatistic\ndescription\nformula\nR function\n\n\n\n\nMean\nThe sum of values divided by the number of observations\n\\[\n\\overline{X} = \\frac{\\sum{X}}{n}\n\\]\nmean(x)\n\n\nMedian\nthe middle value of the variable once sorted in ascending or descending order\nIf n is odd:\n\\[\nM_x = x_\\frac{n + 1}{2}\n\\]\nIf n is even:\n\\[\nM_x = \\frac{x_{(n/2)} + x_{(n/2)+1}}{2}\n\\]\nmedian(x)\n\n\nMode\nMost frequent value(s) of a variable\nN/A\nN/A (see below)\n\n\n\nWhile there are no functions in R to calculate the mode, you can create your own function. source: https://stackoverflow.com/questions/2547402/how-to-find-the-statistical-mode\n\nmodes <- function(x) {\n ux <- unique(x)\n tab <- tabulate(match(x, ux))\n ux[tab == max(tab)]\n}\n\nmodes(c(1,2,3,4,4,5,5,6,7,8,9))\n\n[1] 4 5\n\n\n\n\n6.5.2 Measures of dispersion\n\n\n\n\n\n\n\n\n\nStatistic\nDefinition\nFormula\nR function\n\n\n\n\nVariance (Var)\nExpected squared deviation from the mean. 
Measures how far numbers spread around the average\n\\[\nVar = \\frac{\\sum{(x_i-\\overline{x})^2}}{N}\n\\]\nvar(x)\n\n\nStandard deviation (SD)\nSquare root of the variance.\n\\[\nSD = \\sqrt{\\frac{\\sum{(x_i-\\overline{x})^2}}{N}}\n\\]\nsd(x)\n\n\nMinimum (Min)\nMinimum value of a variable\nN/A\nmin(x)\n\n\nMaximum (Max)\nMaximum value of a variable\nN/A\nmax(x)\n\n\nQuartiles\nThe value under which 25% (Q1), 50% (Q2, also the median), and 75% (Q3) of the data points are found when arranged in increasing order.\n\nQ1 = quantile(x, 0.25)\nQ2 = quantile(x, 0.5)\nQ3 = quantile(x, 0.75)\n\n\n\n\n\n6.5.3 Measures of symmetry\nThe psych package includes two functions to calculate the skewness (skew()) and the kurtosis (kurtosi()). These measures tell you if the values deviate from the normal distribution. A skewness above 1 or below -1 indicates a distribution skewed to the right or the left, respectively. A kurtosis above 1 or below -1 indicates a distribution that is too peaked or too flat, respectively."
},
{
"objectID": "summarizing_data.html#creating-a-descriptive-statistics-summary",
"href": "summarizing_data.html#creating-a-descriptive-statistics-summary",
"title": "6 Summarizing data",
"section": "6.6 Creating a descriptive statistics summary",
"text": "6.6 Creating a descriptive statistics summary\nWe can easily create a table with the descriptive statistics summary for as many numerical variables as we want using the pivot_longer(), group_by() and summarize() functions that you already learned about. In the code below, I reduce the size of the fonts in my table with kable_style(font_size = 10) so that the table can fit on the page.\n\nlibrary(psych) # Load the psych library for the skew() and kurtosi() functions\n\nmpg %>%\n pivot_longer(c(\"displ\",\"hwy\",\"cty\"), # this is where we specify which variables to include\n names_to = \"variable\", \n values_to = \"value\") %>% \n group_by(variable) %>% \n summarize(n = n(),\n mean = mean(value),\n sd = sd(value),\n var = var(value),\n q1 = quantile(value,0.25),\n median = median(value),\n q3 = quantile(value,0.75),\n min = min(value),\n max = max(value),\n skew = skew(value),\n kurtosis = kurtosi(value)\n ) %>% \n kbl() %>% \n kable_styling(font_size = 10)\n\n\n\n\nvariable\nn\nmean\nsd\nvar\nq1\nmedian\nq3\nmin\nmax\nskew\nkurtosis\n\n\n\n\ncty\n234\n16.858974\n4.255946\n18.113074\n14.0\n17.0\n19.0\n9.0\n35\n0.7863773\n1.4305385\n\n\ndispl\n234\n3.471795\n1.291959\n1.669158\n2.4\n3.3\n4.6\n1.6\n7\n0.4386361\n-0.9105615\n\n\nhwy\n234\n23.440171\n5.954643\n35.457779\n18.0\n24.0\n27.0\n12.0\n44\n0.3645158\n0.1369447"
},
{
"objectID": "summarizing_data.html#summary",
"href": "summarizing_data.html#summary",
"title": "6 Summarizing data",
"section": "6.7 Summary",
"text": "6.7 Summary\nIn this chapter, we learned how to produce clear and well presented tables to summarize data and help us and our readers understand the data and ensure that it is adequate for the analyses."
},
{
"objectID": "summarizing_data.html#homework",
"href": "summarizing_data.html#homework",
"title": "6 Summarizing data",
"section": "6.8 Homework",
"text": "6.8 Homework\nThe homework for this week is lab 4, in which you will summarize a dataset of your choice."
},
{
"objectID": "publishing_with_r.html",
"href": "publishing_with_r.html",
"title": "5 Publishing with R",
"section": "",
"text": "5.1 Learning objectives",
"crumbs": [
"<span class='chapter-number'>5</span> <span class='chapter-title'>Publishing with R</span>"
]
}
]