diff --git a/Presentations-Ghana/2024-10/1-introduction-to-r.Rmd b/Presentations-Ghana/2024-10/1-introduction-to-r.Rmd index c54a081..a29622d 100644 --- a/Presentations-Ghana/2024-10/1-introduction-to-r.Rmd +++ b/Presentations-Ghana/2024-10/1-introduction-to-r.Rmd @@ -2,7 +2,7 @@ title: "Session 1 - Introduction to R" subtitle: "R training" author: "María Reyes Retana" -date: "The World Bank | December 2024" +date: "The World Bank | January 2025" output: xaringan::moon_reader: css: ["libs/remark-css/default.css", "libs/remark-css/metropolis.css", "libs/remark-css/metropolis-fonts.css"] @@ -67,7 +67,7 @@ knitr::include_graphics("img/template.png") # Table of contents 1. [Introduction](#intro) -1. [Data work and Statistical Programming](#data-work) +1. [Government Analytics and Statistical Programming](#data-work) 1. [Statistical Programming](#statistical-programming) 1. [Writing R code](#writing-r-code) 1. [Object Types](#object-types) @@ -91,7 +91,7 @@ name: intro ## About this training -- This is an **introduction** to data work and statistical programming in R +- This is an **introduction** to government analytics and statistical programming in R - The training does not require any background in statistical programming @@ -109,7 +109,7 @@ By the end of the training, you will know: - How to write **basic** R code -- A notion of how to conduct data work in R and how it differentiates from Excel +- A notion of how to conduct Government analytics in R and how it differentiates from Excel ![Description of GIF](https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExZWN5OHVrMjkwNHY4YTltZGlqcHhjM2pybmpudWN4YXJ4aDEzN3d0NCZlcD12MV9naWZzX3NlYXJjaCZjdD1n/2IudUHdI075HL02Pkk/giphy.gif) @@ -119,32 +119,32 @@ By the end of the training, you will know: class: inverse, center, middle name: data-work -# Data work and Statistical Programming +# Government Analytics and Statistical Programming

--- -# Data work +# Government Analytics -For the context of this training, we'll call data work everything that: +For the context of this training, we'll call Government analytics everything that: 1. Starts with a data input 1. Runs some process with the data 1. Produces an output with the result -```{r echo = FALSE, out.width="90%"} +```{r echo = FALSE, out.width="70%"} knitr::include_graphics("img/session1/data-work.png") ``` --- -# Data work +# Government Analytics -- It's also possible to do data work with Excel -- However, we will show in this training why using statistical programming (through R) is a better way of conducting data work +- It's also possible to do Government analytics with Excel +- However, we will show in this training why using statistical programming (through R) is a better way of conducting Government analytics -```{r echo = FALSE, out.width="90%"} +```{r echo = FALSE, out.width="70%"} knitr::include_graphics("img/session1/data-work-excel-r.png") ``` @@ -174,7 +174,7 @@ knitr::include_graphics("img/session1/code-workflow.png") # Statistical Programming - Programming consists of producing instructions to a computer to do something -- In the context of data work, that "something" is statistical analysis or mathematical operations +- In the context of Government analytics, that "something" is statistical analysis or mathematical operations - Hence, statistical programming consists of producing instructions so our computers will conduct statistical analysis on data ```{r echo = FALSE, out.width="70%"} @@ -467,7 +467,7 @@ knitr::include_graphics("img/session1/exercise2.png") ## R scripts -- In other words: scripts contain the instructions you give to your computer when doing data work +- In other words: scripts contain the instructions you give to your computer when doing Government analytics ```{r echo = FALSE, out.width="80%"} knitr::include_graphics("img/session1/data-work-script.png") @@ -640,6 +640,8 @@ print(sum_example) ``` +❎In Excel: 
This is like when you have a column of numbers in Excel and want to calculate their total + --- # Functions in R @@ -658,7 +660,7 @@ knitr::include_graphics("img/session1/sum-result.png") - We also know about objects and functions. -- We haven't still introduced the data to our data work. That comes next +- We still haven't introduced the data to our Government analytics. That comes next ![](https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExaXg3NW5jd2MzY2ZweDlnbjI4c3dnMnI3dTVvbml0aTY3ampraDViYyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/y9XCVEKx02Q3tyHSD5/giphy.gif) @@ -678,7 +680,7 @@ name: data-in-r ## Exercise 4: Loading data into R - 1.- Go to this page: https://osf.io/2apht and download the file `department_staff_list.xlsx` + 1.- Go to this page: https://osf.io/g2ezw and download the file `department_staff_list.csv` ```{r echo = FALSE, out.width="60%"} knitr::include_graphics("img/session1/osf-screenshot.png") @@ -692,7 +694,7 @@ knitr::include_graphics("img/session1/osf-screenshot.png") There are different ways of importing data to R, one is using the point and click. Let's start with that one. - 2.- In RStudio, go to `File` > `Import Dataset` > `From Excel` and select the file `department_staff_list.xlsx` + 2.- In RStudio, go to `File` > `Import Dataset` > `From Text (base)` and select the file `department_staff_list.csv` + If you don't know where the file is, check in your `Downloads` folder @@ -720,7 +722,7 @@ knitr::include_graphics("img/session1/downloads.png") 5 - You will see that the second way to read it by code (using functions), and is what R is doing for you in the background. 
-```{r echo = FALSE, out.width="40%"} +```{r echo = FALSE, out.width="30%"} knitr::include_graphics("img/session1/import3.png") ``` @@ -753,7 +755,7 @@ knitr::include_graphics("img/session1/environment2.png") # Data in R -- Since dataframes are also objects, we can refer to them with their names (exm: `department_staff_list.xlsx`) +- Since dataframes are also objects, we can refer to them with their names (exm: `department_staff_list.csv`) - We'll see an example of that in the next exercise @@ -768,8 +770,7 @@ knitr::include_graphics("img/session1/environment2.png") - Let's use another function to see what's in there ```{r, echo=FALSE, include=FALSE, message=FALSE} -library("readxl") -department_staff_list <- read_xlsx("data/department_staff_list.xlsx") +department_staff_list <- read.csv("data/department_staff_list.csv") ``` @@ -781,19 +782,23 @@ glimpse(department_staff_list) # Data in R -## Exercise 5: Subset the data +## Exercise 5: Using our data + +Imagine you want to quickly find out all the distinct departments listed in your staff dataset. In ❎ Excel, you might manually scroll or use 'Remove Duplicates.' In R, you can use the unique() function for this purpose. -1. Use the following code to subset `department_staff_list` and leave only the observations who are "Female": +1. Use the following code to find all the unique departments in `department_staff_list`: ```{r, echo=TRUE, include=TRUE, message=FALSE, warning=FALSE} -df_female <- subset(department_staff_list, sex == "Female") +unique_departments <- unique(department_staff_list$department) ``` - + + Note that $ is used to access the `department` column from the dataset. + Note that we are using the arrow operator (`<-`) to store the result - + Note that there are **two equal signs** in the condition, not one - + Also note that you need to write `"Female"` enclosed in quotes and with uppercase `F`, because that's how it is in the data -2. 
Use `View(df_female)` to visualize the dataframe again and see how it changed (note the uppercase "V") +2.\ Use `print(unique_departments)` to display the unique departments: +```{r, echo=TRUE, include=TRUE, message=FALSE, warning=FALSE} +print(unique_departments) +``` + --- @@ -814,7 +819,7 @@ There is an important difference between using `<-` and not using it - Not using `<-` **simply displays the result** in the console. The input dataframe will remain unchanged and the result **will not be stored** ```{r} -subset(department_staff_list, sex == "Female") +unique(department_staff_list$department) ``` --- @@ -825,18 +830,18 @@ subset(department_staff_list, sex == "Female") - Using `<-` tells R that we want to **store the result in a new object**, which is the object at the left side of the arrow. This time the result will not be printed in the console but the new dataframe will show in the environment panel ```{r echo=FALSE, message=FALSE} -department_staff_list <- read_xlsx("data/department_staff_list.xlsx") +department_staff_list <- read.csv("data/department_staff_list.csv") ``` ```{r, message=FALSE} -df_female <- subset(department_staff_list, sex == "Female") +unique_departments <- unique(department_staff_list$department) ``` --- # Data in R -- R can store multiple dataframes in the environment. This is analogous to having different spreadsheets in the same Excel window +- R can store multiple dataframes in the environment. This is analogous to having different spreadsheets in the same ❎Excel window - Always remember that dataframes are just objects in R. 
R differentiates which dataframe the code refers to with the dataframe name @@ -905,7 +910,7 @@ knitr::include_graphics("img/session1/save.png") This first session focused on the basics for writing R code -```{r echo = FALSE, out.width="90%"} +```{r echo = FALSE, out.width="55%"} knitr::include_graphics("img/session1/session1.png") ``` @@ -916,7 +921,7 @@ knitr::include_graphics("img/session1/session1.png") In the next session we will learn how to get data ready to be exported as outputs -```{r echo = FALSE, out.width="90%"} +```{r echo = FALSE, out.width="60%"} knitr::include_graphics("img/session1/session2.png") ``` diff --git a/Presentations-Ghana/2024-10/1-introduction-to-r.html b/Presentations-Ghana/2024-10/1-introduction-to-r.html index 09c07bd..1e5bf1e 100644 --- a/Presentations-Ghana/2024-10/1-introduction-to-r.html +++ b/Presentations-Ghana/2024-10/1-introduction-to-r.html @@ -34,7 +34,7 @@ ### María Reyes Retana ] .date[ -### The World Bank | December 2024 +### The World Bank | January 2025 ] --- @@ -56,7 +56,7 @@ # Table of contents 1. [Introduction](#intro) -1. [Data work and Statistical Programming](#data-work) +1. [Government Analytics and Statistical Programming](#data-work) 1. [Statistical Programming](#statistical-programming) 1. [Writing R code](#writing-r-code) 1. 
[Object Types](#object-types) @@ -80,7 +80,7 @@ ## About this training -- This is an **introduction** to data work and statistical programming in R +- This is an **introduction** to government analytics and statistical programming in R - The training does not require any background in statistical programming @@ -98,7 +98,7 @@ - How to write **basic** R code -- A notion of how to conduct data work in R and how it differentiates from Excel +- A notion of how to conduct Government analytics in R and how it differentiates from Excel ![Description of GIF](https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExZWN5OHVrMjkwNHY4YTltZGlqcHhjM2pybmpudWN4YXJ4aDEzN3d0NCZlcD12MV9naWZzX3NlYXJjaCZjdD1n/2IudUHdI075HL02Pkk/giphy.gif) @@ -108,30 +108,30 @@ class: inverse, center, middle name: data-work -# Data work and Statistical Programming +# Government Analytics and Statistical Programming <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- -# Data work +# Government Analytics -For the context of this training, we'll call data work everything that: +For the context of this training, we'll call Government analytics everything that: 1. Starts with a data input 1. Runs some process with the data 1. 
Produces an output with the result -<img src="img/session1/data-work.png" width="90%" style="display: block; margin: auto;" /> +<img src="img/session1/data-work.png" width="70%" style="display: block; margin: auto;" /> --- -# Data work +# Government Analytics -- It's also possible to do data work with Excel -- However, we will show in this training why using statistical programming (through R) is a better way of conducting data work +- It's also possible to do Government analytics with Excel +- However, we will show in this training why using statistical programming (through R) is a better way of conducting Government analytics -<img src="img/session1/data-work-excel-r.png" width="90%" style="display: block; margin: auto;" /> +<img src="img/session1/data-work-excel-r.png" width="70%" style="display: block; margin: auto;" /> --- @@ -157,7 +157,7 @@ # Statistical Programming - Programming consists of producing instructions to a computer to do something -- In the context of data work, that "something" is statistical analysis or mathematical operations +- In the context of Government analytics, that "something" is statistical analysis or mathematical operations - Hence, statistical programming consists of producing instructions so our computers will conduct statistical analysis on data <img src="img/session1/data-work-with-instructions.png" width="70%" style="display: block; margin: auto;" /> @@ -431,7 +431,7 @@ ## R scripts -- In other words: scripts contain the instructions you give to your computer when doing data work +- In other words: scripts contain the instructions you give to your computer when doing Government analytics <img src="img/session1/data-work-script.png" width="80%" style="display: block; margin: auto;" /> @@ -602,6 +602,8 @@ ``` +❎In Excel: This is as when you have a column of numbers in Excel and want to calculate their total + --- # Functions in R @@ -618,7 +620,7 @@ - We also know about objects and functions. 
-- We haven't still introduced the data to our data work. That comes next +- We haven't still introduced the data to our Government analytics. That comes next ![](https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExaXg3NW5jd2MzY2ZweDlnbjI4c3dnMnI3dTVvbml0aTY3ampraDViYyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/y9XCVEKx02Q3tyHSD5/giphy.gif) @@ -638,7 +640,7 @@ ## Exercise 4: Loading data into R - 1.- Go to this page: https://osf.io/2apht and download the file `department_staff_list.xlsx` + 1.- Go to this page: https://osf.io/g2ezw and download the file `department_staff_list.csv` <img src="img/session1/osf-screenshot.png" width="60%" style="display: block; margin: auto;" /> @@ -650,7 +652,7 @@ There are different ways of importing data to R, one is using the point and click. Let's start with that one. - 2.- In RStudio, go to `File` > `Import Dataset` > `From Excel` and select the file `department_staff_list.xlsx` + 2.- In RStudio, go to `File` > `Import Dataset` > `From Text (base)` and select the file `department_staff_list.csv` + If you don't know where the file is, check in your `Downloads` folder @@ -674,7 +676,7 @@ 5 - You will see that the second way to read it by code (using functions), and is what R is doing for you in the background. -<img src="img/session1/import3.png" width="40%" style="display: block; margin: auto;" /> +<img src="img/session1/import3.png" width="30%" style="display: block; margin: auto;" /> --- @@ -701,7 +703,7 @@ # Data in R -- Since dataframes are also objects, we can refer to them with their names (exm: `department_staff_list.xlsx`) +- Since dataframes are also objects, we can refer to them with their names (exm: `department_staff_list.csv`) - We'll see an example of that in the next exercise @@ -739,20 +741,40 @@ # Data in R -## Exercise 5: Subset the data +## Exercise 5: Using our data + +Imagine you want to quickly find out all the distinct departments listed in your staff dataset. 
In ❎ Excel, you might manually scroll or use 'Remove Duplicates.' In R, you can use the unique() function for this purpose. -1. Use the following code to subset `department_staff_list` and leave only the observations who are "Female": +1. Use the following code to find all the unique departments in `department_staff_list`: ``` r -df_female <- subset(department_staff_list, sex == "Female") +unique_departments <- unique(department_staff_list$department) ``` - + + Note that $ is used to access the `department` column from the dataset. + Note that we are using the arrow operator (`<-`) to store the result - + Note that there are **two equal signs** in the condition, not one - + Also note that you need to write `"Female"` enclosed in quotes and with uppercase `F`, because that's how it is in the data -2. Use `View(df_female)` to visualize the dataframe again and see how it changed (note the uppercase "V") +2.\ Use `print(unique_departments)`to display the unique departments: + +``` r +print(unique_departments) +``` + +``` +## [1] "State Protocol Dept" "Bureau of Languages" +## [3] "Controller & Accountant General" "Stool Lands" +## [5] "Co-operatives" "Factories Inspectorate" +## [7] "Dept of Labour" "Procurement & Supply CM" +## [9] "Management Services" "Public Records" +## [11] "Public Works" "Rural Housing" +## [13] "Parks & Gardens" "Births & Deaths" +## [15] "Department of Comm Devt" "Feeder Roads" +## [17] "Urban Roads" "Koforidua Training Center" +## [19] "Dept of Children" "Dept of Gender" +## [21] "Social Welfare" "Information Services" +## [23] "Rent Control" +``` + --- @@ -772,27 +794,22 @@ ``` r -subset(department_staff_list, sex == "Female") +unique(department_staff_list$department) ``` ``` -## # A tibble: 3,572 × 7 -## sex current_grade date_of_birth date_of_first_appoin…¹ -## <chr> <chr> <dttm> <dttm> -## 1 Female Director Fin. & Admin. 1964-09-25 00:00:00 1994-08-31 00:00:00 -## 2 Female Chief Exe. 
Officer 1965-06-15 00:00:00 1990-12-01 00:00:00 -## 3 Female Senior Records Officer 1974-04-29 00:00:00 2013-05-01 00:00:00 -## 4 Female Senior Procurement/supply … 1973-08-08 00:00:00 2003-09-15 00:00:00 -## 5 Female Snr. Private Secretary 1979-08-29 00:00:00 2002-01-07 00:00:00 -## 6 Female Private Secretary 1982-12-26 00:00:00 2003-03-12 00:00:00 -## 7 Female Private Secretary 1986-02-13 00:00:00 2019-10-30 00:00:00 -## 8 Female Stenographer Gd I 1986-11-14 00:00:00 2008-08-20 00:00:00 -## 9 Female Records Officer 1981-06-17 00:00:00 2018-06-16 00:00:00 -## 10 Female Principal Exe. Officer 1979-11-20 00:00:00 2019-12-23 00:00:00 -## # ℹ 3,562 more rows -## # ℹ abbreviated name: ¹​date_of_first_appointment -## # ℹ 3 more variables: senior_junior_staff <chr>, department <chr>, -## # years_of_service <dbl> +## [1] "State Protocol Dept" "Bureau of Languages" +## [3] "Controller & Accountant General" "Stool Lands" +## [5] "Co-operatives" "Factories Inspectorate" +## [7] "Dept of Labour" "Procurement & Supply CM" +## [9] "Management Services" "Public Records" +## [11] "Public Works" "Rural Housing" +## [13] "Parks & Gardens" "Births & Deaths" +## [15] "Department of Comm Devt" "Feeder Roads" +## [17] "Urban Roads" "Koforidua Training Center" +## [19] "Dept of Children" "Dept of Gender" +## [21] "Social Welfare" "Information Services" +## [23] "Rent Control" ``` --- @@ -803,21 +820,17 @@ - Using `<-` tells R that we want to **store the result in a new object**, which is the object at the left side of the arrow. 
This time the result will not be printed in the console but the new dataframe will show in the environment panel -``` -## Warning in read_fun(path = path, sheet_i = sheet, limits = limits, shim = shim, -## : NA inserted for an unsupported date prior to 1900 -``` ``` r -df_female <- subset(department_staff_list, sex == "Female") +unique_departments <- unique(department_staff_list$department) ``` --- # Data in R -- R can store multiple dataframes in the environment. This is analogous to having different spreadsheets in the same Excel window +- R can store multiple dataframes in the environment. This is analogous to having different spreadsheets in the same ❎Excel window - Always remember that dataframes are just objects in R. R differentiates which dataframe the code refers to with the dataframe name @@ -880,7 +893,7 @@ This first session focused on the basics for writing R code -<img src="img/session1/session1.png" width="90%" style="display: block; margin: auto;" /> +<img src="img/session1/session1.png" width="55%" style="display: block; margin: auto;" /> --- @@ -889,7 +902,7 @@ In the next session we will learn how to get data ready to be exported as outputs -<img src="img/session1/session2.png" width="90%" style="display: block; margin: auto;" /> +<img src="img/session1/session2.png" width="60%" style="display: block; margin: auto;" /> --- diff --git a/Presentations-Ghana/2024-10/1-introduction-to-r.pdf b/Presentations-Ghana/2024-10/1-introduction-to-r.pdf index f34bf35..ceede65 100644 Binary files a/Presentations-Ghana/2024-10/1-introduction-to-r.pdf and b/Presentations-Ghana/2024-10/1-introduction-to-r.pdf differ diff --git a/Presentations-Ghana/2024-10/2-data-wrangling.Rmd b/Presentations-Ghana/2024-10/2-data-wrangling.Rmd index 308af75..b4013bd 100644 --- a/Presentations-Ghana/2024-10/2-data-wrangling.Rmd +++ b/Presentations-Ghana/2024-10/2-data-wrangling.Rmd @@ -2,7 +2,7 @@ title: "Session 2 - Data wrangling" subtitle: "R training" author: "María Reyes 
Retana" -date: "The World Bank | December 2024" +date: "The World Bank | January 2025" output: xaringan::moon_reader: css: ["libs/remark-css/default.css", "libs/remark-css/metropolis.css", "libs/remark-css/metropolis-fonts.css"] @@ -180,14 +180,13 @@ knitr::include_graphics("img/session2/packages-options.png") # R Packages -We'll use two package in today's session: `dplyr` (really useful library for data wrangling) and `openxlsx` (to work with Excel files) +We'll use one package in today's session: `dplyr` (a really useful library for data wrangling). ## Exercise 2: Installing R Packages 1. Install the R Packages by using `install.packages()` + `install.packages("dplyr")` - + `install.packages("openxlsx")` + Note the quotes (`" "`) in the packages names + **Introduce this code in the console**, not the script panel @@ -197,12 +196,6 @@ knitr::include_graphics("img/session2/dplyr-install.png") ``` ] -.pull-right[ -```{r echo = FALSE, out.width="85%"} -knitr::include_graphics("img/session2/open-install.png") -``` -] - --- @@ -218,7 +211,7 @@ knitr::include_graphics("img/session2/installing-dplyr.png") # R packages -Now that `dplyr` and `openxlsx ` are installed, we only need to load them to start using the functions they have. +Now that `dplyr` is installed, we only need to load it to start using its functions. 
## Exercise 3: Loading packages @@ -229,7 +222,6 @@ Now that `dplyr` and `openxlsx ` are installed, we only need to load them to sta ```{r, warning=FALSE, message=FALSE} library(dplyr) -library(openxlsx) ``` + Run this code from the new script you just opened @@ -290,7 +282,7 @@ knitr::include_graphics("img/session2/data-wrangling.png") ## Getting your data ready -- Data wrangling is one of the most crucial and time-consuming aspects of data work +- Data wrangling is one of the most crucial and time-consuming aspects of Government analytics - It involves not only coding, but also the mental exercise of thinking what is the shape and condition that your dataframe needs to have in order to produce your desired output @@ -324,16 +316,16 @@ knitr::include_graphics("img/session2/dplyr.png") Note that this part of this is the same exercise we did in session 1, but it's okay to repeat it in order to start using a new RStudio session. **If you have RStudio open, start by closing the window and opening RStudio again**. .pull-left[ -1. Inyou new RStudio window, go to `File` > `Import Dataset` > `From Excel` and select again the file `department_staff_list.xslx` +1. Inyou new RStudio window, go to `File` > `Import Dataset` > `From Text(base)` and select again the file `department_staff_list.csv` + if you don't know where the file is, check in the `Downloads` folder - + if you need to download it again, it's here: https://osf.io/2apht + + if you need to download it again, it's here: https://osf.io/g2ezw 1. Make sure to select `Heading` > `Yes` in the next window 1. Select `Import` -1. Download this new file: https://osf.io/v6psa and repeat steps 1-3 with it +1. 
Download this new file: https://osf.io/tvfyr and repeat steps 1-3 with it ] .pull-right[ @@ -348,8 +340,8 @@ knitr::include_graphics("img/session1/import3.png") # Data wrangling ```{r, echo=FALSE} -department_staff_list <- read.xlsx("data/department_staff_list.xlsx") -department_staff_age <- read.xlsx("data/department_staff_age.xlsx") +department_staff_list <- read.csv("data/department_staff_list.csv") +department_staff_age <- read.csv("data/department_staff_age.csv") ``` ```{r echo = FALSE, out.width="85%"} @@ -362,15 +354,29 @@ knitr::include_graphics("img/session2/ex4.png") ## Note: loading data with a function -- You can also load Excel data with the function `read.xlsx()` instead of using this point-and-click approach +- You can also load CSV files with the function `read.csv()` instead of using this point-and-click approach -- The **argument** of `read.xlsx()` is the path in your computer where your data is. For example +- The **argument** of `read.csv()` is the path on your computer where your data is. For example: ```{r, eval=FALSE} -department_staff_list <- read.xlsx("C:/Users/wb614536/Downloads/department_staff_list.xlsx") +department_staff_list <- read.csv("C:/Users/wb614536/Downloads/department_staff_list.csv") ``` -- As usual, you need to save the result of `read.xlsx()` into a dataframe object with the arrow operator (`<-`) for it to be stored in the environment +- As usual, you need to save the result of `read.csv()` into a dataframe object with the arrow operator (`<-`) for it to be stored in the environment + +--- + +# Note on file paths + +- A **file path** tells R where to find your file on your computer. It is like giving R the directions to your file. 
+ +- If you downloaded the data and haven't moved it, the file path for the `department_staff_list` dataset will probably be something like: `"C:/Users/wb614536/Downloads/department_staff_list.csv"` + +- You can find the path of a file by right-clicking it, choosing Properties, and checking the location, or by + +```{r echo = FALSE, out.width="85%"} +knitr::include_graphics("img/session2/path.png") +``` --- @@ -402,78 +408,9 @@ glimpse(department_staff_age) - We will only use this second dataframe in one of the next exercises, but we load it now because it's in general a good practice to have data loaded into the memory so it's ready to be used. -- For the next exercises, we will face mention scenarios that could show up in doing your annual reports on in day-to-day operations. - ---- - -# The pipe ( %>% ) operator - - -- Before diving into data wrangling, we will to introduce a **super useful tool**: - **the pipe (`%>%`)**. +- For the next exercises, we will propose scenarios that could show up while doing your annual reports or in day-to-day operations. -- The pipe is part of the `dplyr` package. It helps to write code in a way that is easier to read and understand. - -- Reading and understanding multiple operations can be difficult. - -- The pipe operator (**%>%**) can help with this. - ---- - -# The pipe operator - - -- With the pipe, code reads from left to right, top to bottom, which is more intuitive. - -- %>% can be read as "then" and simplifies code structure. 
- -For example let's see at this mock get to work sequence example: - -**Without pipe ( %>% )** (hard to read and understand the sequence) - -```{r, eval=FALSE} -go_to_work(make_breakfast(work_out(brush_teeth(wake_up(Mer))))) -``` - -**With pipe ( %>% )** (the order is clear) - -```{r, eval=FALSE} -Mer %>% - wake_up() %>% - brush_teeth() %>% - work_out() %>% - make_breakfast() %>% - go_to_work() -``` - ---- - -# The pipe - -Normally the functions that we will use are organized around a set of verbs, or actions to be taken. 
- -- - -* Most *verbs* work as follows: - - -$$\text{verb}(\underbrace{\text{data.frame}}_{\text{1st argument}}, \underbrace{\text{what to do}}_\text{2nd argument})$$ - --- - -* Alternatively you can (**should**) use the `pipe` operator `%>%`: - -$$\underbrace{\text{data.frame}}_{\text{1st argument}} \underbrace{\text{ %>% }}_{\text{"pipe" operator}} \text{verb}(\underbrace{\text{what to do}}_\text{2nd argument})$$ - -We will start using the pipe from this point. Please ask if something is not clear. - -**Tip**💡: Use Shift + Ctrl/Cmd + M as a shortcut for the pipe operator. - ---- - -# Questions? - -![](https://media.giphy.com/media/XHVmD4RyXgSjd8aUMb/giphy.gif?cid=790b7611auicodssx8z4zly6u6z5k4pd0r1i13drege3yunc&ep=v1_gifs_search&rid=giphy.gif&ct=g) +- We will do everything using **functions** from the `dplyr` package, which we have already loaded. --- @@ -493,15 +430,17 @@ knitr::include_graphics("img/session2/filter.png") ``` -## Data work request +## Government analytics request **Scenario 1**: Imagine you want to include in the annual reports, the number of total female employees in the department staff: +❎In Excel: The command that we will learn here is similar to using a filter in Excel. + --- # Filtering and sorting -## Data work request +## Government analytics request .pull-left[ *We will use the data from our dataframe department_staff_list* @@ -518,13 +457,15 @@ knitr::include_graphics("img/session2/filtering-sorting-planning.png") # Filtering and sorting -## Data work request +## Government analytics request Now let's say that we are also interested at a first glance of the females that recently joined the department. We would have to do the following 1. Keeping only the female employees 1. Sorting by years of service +❎In Excel: We would filter and then sort by years of service. 
+ --- @@ -534,13 +475,12 @@ Now let's say that we are also interested at a first glance of the females that Use `filter()` for this: +Remember how we use functions: in this case, the 1st argument is the **data** and the 2nd argument is the **filter** condition (`sex` takes the value `"Female"`) + ```{r, eval=FALSE} -department_female <- department_staff_list %>% # pipe takes data as first argument "and then" - filter(sex == "Female") # we filter by Female +temp1 <- filter(department_staff_list, sex == "Female") ``` -Remember that we said we were starting to use the pipe (%>%). The pipe takes our data as the first argument, and then we need to pass only the instruction (filter) - ```{r echo = FALSE, out.width="45%"} knitr::include_graphics("img/session2/filtering-sorting1.png") ``` @@ -553,10 +493,7 @@ knitr::include_graphics("img/session2/filtering-sorting1.png") Use the function `arrange()` to sort. Sortings are ascending by default in R (this will get us from less years of service to more years of service). ```{r eval=FALSE} -department_female <- department_staff_list %>% - filter(sex == "Female") %>% # pipe takes previous instruction "and then" - arrange(years_of_service) # arrange by years of service - +department_female <- arrange(temp1, years_of_service) # arrange by years of service ``` ```{r echo = FALSE, out.width="45%"} @@ -576,9 +513,10 @@ We can write the whole code for this in our exercise script. (You can copy and p 2.- Sort by `years_of_service`: ```{r, eval=FALSE} -department_female <- department_staff_list %>% - filter( sex == "Female") %>% - arrange(years_of_service) +temp1 <- filter(department_staff_list, sex == "Female") # filter by female + +department_female <- arrange(temp1, years_of_service) # order by years of service + ``` --- @@ -589,7 +527,7 @@ Some notes: .pull-left[ - `filter()` and `arrange()` are all functions from `dplyr`. 
Remember you have to always load `dplyr` first with `library(dplyr)` to be able to use them -- The resulting dataframe is `department_female` +- The resulting dataframe is `department_female` (the intermediate `temp1` also stays in the environment) ] .pull-right[ ```{r echo = FALSE, out.width="95%"} @@ -636,17 +574,23 @@ knitr::include_graphics("img/session2/mutate.png") ## Example -Using our data frame let's say that we want to include a variable that for any reason instead of years of service we want days of service. We would do this by doing the following: +Using our data frame, let's say that we want to add a variable with days of service instead of years of service. We would do this as follows: ```{r, eval=FALSE} -example_mutate <- department_staff_list %>% - mutate(days_of_service = years_of_service*365) +example_mutate <- mutate(department_staff_list, days_of_service = years_of_service * 365) ``` -We will not use that variable for our datawork examples, but this is a really useful data wrangling function. +We will not use that variable for our government analytics examples, but this is a really useful data wrangling function. + +--- + +# Questions? + +![](https://media.giphy.com/media/XHVmD4RyXgSjd8aUMb/giphy.gif?cid=790b7611auicodssx8z4zly6u6z5k4pd0r1i13drege3yunc&ep=v1_gifs_search&rid=giphy.gif&ct=g) --- + class: inverse, center, middle name: merging @@ -659,21 +603,21 @@ name: merging # Merging dataframes -Merging data is a common task in data analysis, especially when working with large datasets. +Merging data is a common task in data analysis, especially when working with data from multiple departments. Let's see how it would apply to our dataframes. 
-## Data work request +## Government analytics request **Scenario 2:** -*Let's imagine that for our annual report we are also interested in the age distribution from the employees* +*Let's imagine that for our annual report we are also interested in the age distribution from the employees* but these are in different datasets --- # Merging dataframes -## Data work request +## Government analytics request .pull-left[ *Use the data `department_staff_list` that you already know with the department_staff_age* @@ -692,11 +636,10 @@ knitr::include_graphics("img/session2/merging-planning.png") To do this we will use `left_join()` to merge the dataframes: -- The arguments of the function are 1. our previous selected data (passed by the pipe) and 2. the dataframe we want to merge to the first one +- The arguments of the function are 1. the "principal" dataset and 2. the dataframe we want to merge (paste) into that one ```{r eval=FALSE} -deparment_staff_list_age <- deparment_staff_list %>% # Our original data frame " %>% and then" - left_join(department_staff_age) # our second data frame +department_age <- left_join(department_staff_list, department_staff_age) # Our original data frame ``` @@ -713,8 +656,7 @@ knitr::include_graphics("img/session2/merging.png") - You can copy and paste the code to your exercise script. ```{r eval=FALSE} -deparment_age <- department_staff_list %>% # Our original data frame "and then" - left_join(department_staff_age) # our second data frame +deparment_age <- left_join(deparment_staff_list, department_staff_age) # Our original data frame ``` ```{r echo = FALSE, out.width="95%"} knitr::include_graphics("img/session2/exercise6.png") @@ -735,7 +677,7 @@ knitr::include_graphics("img/session2/group_by.png") # Grouping and summarizing -## Data work request +## Government analytics request **Let's say that we are interested again in the number of employees by department** @@ -745,8 +687,8 @@ We will use our department_staff_list dataset to do this. 
```{r, eval=FALSE} -employees_by_deparment <- department_staff_list %>% - group_by(department) +employees_by_deparment <- group_by(department_staff_list, department) + ``` @@ -756,7 +698,7 @@ If we **only** do this, this won't do anything, to complete the function we need # Grouping and summarizing -## Data work request +## Government analytics request summarize works in a similar way to mutate: @@ -767,20 +709,23 @@ variable_name = some_calculation In this case the `some_calculation` will be to count the number of employees ```{r, warning=FALSE, message=FALSE} -employees_by_deparment <- department_staff_list %>% - group_by(department) %>% - summarise(number = n()) +temp1 <- group_by(department_staff_list, department) +employees_by_department <- summarise(temp1, number = n()) + ``` --- # Grouping and summarizing -## Data work request +## Government analytics request This will create the following dataframe/table: ```{r, warning=FALSE, message=FALSE, echo=FALSE} -employees_by_deparment +employees_by_department +``` +```{r, echo=FALSE} +write.csv(employees_by_department, "data/employees_by_department.csv", row.names = FALSE) ``` --- @@ -800,7 +745,7 @@ These were two examples we chose to show different possible data wrangling opera | Append dataframes | `bind_rows()` | | Deduplicate | `distinct()` | | Collapse and create summary indicators | `group_by()`, `summarize()` | -| Pass a result as the first argument for the next function | `%>%` (operator, not function)| +| Pass a result as the first argument for the next function | `%>%` (operator, not function (**tomorrow**))| --- @@ -815,7 +760,7 @@ name: exporting-outputs # Exporting outputs -- Until now, we've seen full examples of part 1 and 2 of the data work pipeline +- Until now, we've seen full examples of part 1 and 2 of the Government analytics pipeline - What about exporting outputs? 
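Before moving on to exports, the grouping and counting steps above can be tried on a small made-up table. This is only a sketch: `staff` and its values are hypothetical, and the two steps are kept separate to show that `group_by()` on its own does not change the rows.

```r
library(dplyr)

# Hypothetical toy data: one row per employee
staff <- data.frame(ID = 1:5,
                    department = c("Labour", "Labour", "Gender", "Labour", "Gender"))

# Step 1: group_by() only marks the groups; the rows are unchanged
staff_grouped <- group_by(staff, department)

# Step 2: summarise() collapses each group into one row;
# n() counts the rows in each group
employees_by_department <- summarise(staff_grouped, number = n())

employees_by_department
```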
```{r echo = FALSE, out.width="90%"} @@ -830,9 +775,9 @@ knitr::include_graphics("img/session2/data-work-progress.png") ## Exporting dataframes -- We can export it as an Excel file with the function `write.xlsx()` +- We can export it as a csv file with the function `write.csv()` -- `write.xlsx()` creates a Excel file with the dataframe +- `write.csv()` creates a csv file with the dataframe - It takes two basic arguments: @@ -848,16 +793,14 @@ knitr::include_graphics("img/session2/data-work-progress.png") 1. Use this code to export the results of the last two exercises: ```{r eval=FALSE} -write.xslx(employees_by_deparment, - "employees_by_deparment", - row.names = FALSE) +write.csv(employees_by_deparment, "employees_by_deparment", row.names = FALSE) ``` --- # Exporting outputs -Now `employees_by_deparment.xlsx` (probably in your `Documents` folder). +Now `employees_by_deparment.csv` (probably in your `Documents` folder). ```{r echo = FALSE, out.width="70%"} knitr::include_graphics("img/session2/exported-csv.png") @@ -869,12 +812,10 @@ knitr::include_graphics("img/session2/exported-csv.png") ## Some notes on file paths -- The second argument of `write.xlsx()` specifies the file path we export the dataframe to +- The second argument of `write.csv()` specifies the file path we export the dataframe to ```{r eval=FALSE} -write.xslx(employees_by_deparment, - "employees_by_deparment", - row.names = FALSE) +write.csv(employees_by_deparment, "employees_by_deparment", row.names = FALSE) ``` - You can include any path in your computer and R will write the file in that location @@ -920,7 +861,7 @@ knitr::include_graphics("img/session2/save.png") # Wrapping up -## Data work pipeline +## Government analytics pipeline ```{r echo = FALSE, out.width="85%"} knitr::include_graphics("img/session2/data-work-final.png") @@ -930,7 +871,7 @@ knitr::include_graphics("img/session2/data-work-final.png") # Wrapping up -## Data work pipeline +## Government analytics pipeline ```{r echo = FALSE, 
out.width="85%"} knitr::include_graphics("img/session2/data-work-day-2.png") @@ -968,8 +909,7 @@ Imaging I only want a list with ID and department. Use `select()` for this: ```{r eval=FALSE} -temp1 <- department_staff_list %>% - select(ID, department) +temp1 <- select(department_staff_list, ID, department) ``` --- @@ -984,10 +924,11 @@ For our current data this does not apply, but this will be really useful for bud This would look something like this ```{r eval=FALSE} -budget_2024 <- budget_data %>% - group_by(year, product) %>% - summarise(total = sum(income), - average = mean(income)) +budget_2024 <- group_by(budget_data, year, product) + +budget_2024 <- summarise(budget_2024, + total = sum(income), + average = mean(income)) ``` --- diff --git a/Presentations-Ghana/2024-10/2-data-wrangling.html b/Presentations-Ghana/2024-10/2-data-wrangling.html index c5728f0..c867453 100644 --- a/Presentations-Ghana/2024-10/2-data-wrangling.html +++ b/Presentations-Ghana/2024-10/2-data-wrangling.html @@ -34,7 +34,7 @@ ### María Reyes Retana ] .date[ -### The World Bank | December 2024 +### The World Bank | January 2025 ] --- @@ -159,14 +159,13 @@ # R Packages -We'll use two package in today's session: `dplyr` (really useful library for data wrangling) and `openxlsx` (to work with Excel files) +We'll use one package in today's session: `dplyr` (really useful library for data wrangling). ## Exercise 2: Installing R Packages 1. 
Install the R Packages by using `install.packages()` + `install.packages("dplyr")` - + `install.packages("openxlsx")` + Note the quotes (`" "`) in the packages names + **Introduce this code in the console**, not the script panel @@ -174,10 +173,6 @@ <img src="img/session2/dplyr-install.png" width="85%" style="display: block; margin: auto;" /> ] -.pull-right[ -<img src="img/session2/open-install.png" width="85%" style="display: block; margin: auto;" /> -] - --- @@ -191,7 +186,7 @@ # R packages -Now that `dplyr` and `openxlsx ` are installed, we only need to load them to start using the functions they have. +Now that `dplyr` is installed, we only need to load them to start using the functions they have. ## Exercise 3: Loading packages @@ -203,7 +198,6 @@ ``` r library(dplyr) -library(openxlsx) ``` + Run this code from the new script you just opened @@ -256,7 +250,7 @@ ## Getting your data ready -- Data wrangling is one of the most crucial and time-consuming aspects of data work +- Data wrangling is one of the most crucial and time-consuming aspects of Government analytics - It involves not only coding, but also the mental exercise of thinking what is the shape and condition that your dataframe needs to have in order to produce your desired output @@ -286,16 +280,16 @@ Note that this part of this is the same exercise we did in session 1, but it's okay to repeat it in order to start using a new RStudio session. **If you have RStudio open, start by closing the window and opening RStudio again**. .pull-left[ -1. Inyou new RStudio window, go to `File` > `Import Dataset` > `From Excel` and select again the file `department_staff_list.xslx` +1. 
Inyou new RStudio window, go to `File` > `Import Dataset` > `From Text(base)` and select again the file `department_staff_list.csv` + if you don't know where the file is, check in the `Downloads` folder - + if you need to download it again, it's here: https://osf.io/2apht + + if you need to download it again, it's here: https://osf.io/g2ezw 1. Make sure to select `Heading` > `Yes` in the next window 1. Select `Import` -1. Download this new file: https://osf.io/v6psa and repeat steps 1-3 with it +1. Download this new file: https://osf.io/tvfyr and repeat steps 1-3 with it ] .pull-right[ @@ -317,16 +311,28 @@ ## Note: loading data with a function -- You can also load Excel data with the function `read.xlsx()` instead of using this point-and-click approach +- You can also load Excel files with the function `read.csv()` instead of using this point-and-click approach -- The **argument** of `read.xlsx()` is the path in your computer where your data is. For example +- The **argument** of `read.csv()` is the path in your computer where your data is. For example ``` r -department_staff_list <- read.xlsx("C:/Users/wb614536/Downloads/department_staff_list.xlsx") +department_staff_list <-read.csv("C:/Users/wb614536/Downloads/department_staff_list.csv") ``` -- As usual, you need to save the result of `read.xlsx()` into a dataframe object with the arrow operator (`<-`) for it to be stored in the environment +- As usual, you need to save the result of `read.csv()` into a dataframe object with the arrow operator (`<-`) for it to be stored in the environment + +--- + +# Note on file paths + +- A **file path** tells R where to find your file on your computer. It is like giving R the directions to your file. 
+ +- If you downloaded the data and haven't moved it, the data path for the deparment_list dataset will be probably something like: `"C:/Users/wb614536/Downloads/department_staff_list.csv"` + +- You can find the path of a file by right-clicking>properties>and seeing the location, or by + +<img src="img/session2/path.png" width="85%" style="display: block; margin: auto;" /> --- @@ -342,9 +348,9 @@ ``` ``` -## Rows: 8,754 +## Rows: 8,714 ## Columns: 6 -## $ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,… +## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,… ## $ sex <chr> "Female", "Male", "Female", "Female", "Female", "F… ## $ current_grade <chr> "Director Fin. & Admin.", "Deputy Director (Admin.… ## $ senior_junior_staff <chr> "senior", "senior", "senior", "senior", "senior", … @@ -366,9 +372,9 @@ ``` ``` -## Rows: 8,754 +## Rows: 8,714 ## Columns: 2 -## $ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,… +## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,… ## $ age <dbl> 60.15058, 41.74949, 59.43053, 50.55989, 51.28268, 45.22656, 41.900… ``` @@ -378,80 +384,9 @@ - We will only use this second dataframe in one of the next exercises, but we load it now because it's in general a good practice to have data loaded into the memory so it's ready to be used. -- For the next exercises, we will face mention scenarios that could show up in doing your annual reports on in day-to-day operations. - ---- - -# The pipe ( %>% ) operator - - -- Before diving into data wrangling, we will to introduce a **super useful tool**: - **the pipe (`%>%`)**. - -- The pipe is part of the `dplyr` package. It helps to write code in a way that is easier to read and understand. - -- Reading and understanding multiple operations can be difficult. +- For the next exercises, we will propose scenarios that could show up while doing your annual reports or in day-to-day operations. 
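The loading steps described above (a file path, `read.csv()`, then `glimpse()`) can be sketched without the training files. The example below writes a tiny made-up CSV to a temporary location first; `csv_path` and the data are hypothetical, and on your computer the path would point to your own `Downloads` folder instead.

```r
library(dplyr)

# Write a small, made-up CSV to a temporary file so the example runs anywhere
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(ID = 1:3, sex = c("Female", "Male", "Female")),
          csv_path, row.names = FALSE)

# The file path is the first argument of read.csv()
staff <- read.csv(csv_path)

file.exists(csv_path)  # check that R can find a file at that path
glimpse(staff)         # column names, types and first values
```

`read.csv()` stores whole-number columns such as `ID` as integers, which `glimpse()` shows as `<int>`.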
-- The pipe operator (**%>%**) can help with this. - ---- - -# The pipe operator - - -- With the pipe, code reads from left to right, top to bottom, which is more intuitive. - -- %>% can be read as "then" and simplifies code structure. - -For example let's see at this mock get to work sequence example: - -**Without pipe ( %>% )** (hard to read and understand the sequence) - - -``` r -go_to_work(make_breakfast(work_out(brush_teeth(wake_up(Mer))))) -``` - -**With pipe ( %>% )** (the order is clear) - - -``` r -Mer %>% - wake_up() %>% - brush_teeth() %>% - work_out() %>% - make_breakfast() %>% - go_to_work() -``` - ---- - -# The pipe - -Normally the functions that we will use are organized around a set of verbs, or actions to be taken. - --- - -* Most *verbs* work as follows: - - -`$$\text{verb}(\underbrace{\text{data.frame}}_{\text{1st argument}}, \underbrace{\text{what to do}}_\text{2nd argument})$$` - --- - -* Alternatively you can (**should**) use the `pipe` operator `%>%`: - -`$$\underbrace{\text{data.frame}}_{\text{1st argument}} \underbrace{\text{ %>% }}_{\text{"pipe" operator}} \text{verb}(\underbrace{\text{what to do}}_\text{2nd argument})$$` - -We will start using the pipe from this point. Please ask if something is not clear. - -**Tip**💡: Use Shift + Ctrl/Cmd + M as a shortcut for the pipe operator. - ---- - -# Questions? - -![](https://media.giphy.com/media/XHVmD4RyXgSjd8aUMb/giphy.gif?cid=790b7611auicodssx8z4zly6u6z5k4pd0r1i13drege3yunc&ep=v1_gifs_search&rid=giphy.gif&ct=g) +- We will do everything using **functions** from the `dplyr` package, that is already in our environment. --- @@ -469,15 +404,17 @@ <img src="img/session2/filter.png" width="85%" style="display: block; margin: auto;" /> -## Data work request +## Government analytics request **Scenario 1**: Imagine you want to include in the annual reports, the number of total female employees in the department staff: +❎In Excel: The command that we will learn here, is similar to using filter in Excel. 
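As a rough R counterpart to the Excel filter, here is a sketch on made-up data. The data frame `staff` is hypothetical; only the column names `sex` and `years_of_service` are taken from the training data.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the staff list
staff <- data.frame(ID = 1:4,
                    sex = c("Female", "Male", "Female", "Female"),
                    years_of_service = c(12, 3, 1, 7))

# filter(): first argument is the data, second is the condition rows must meet
# (== tests for equality; a single = is an error inside filter())
temp1 <- filter(staff, sex == "Female")

# arrange(): sorts ascending by default, so the newest staff come first
staff_female <- arrange(temp1, years_of_service)

nrow(staff_female)  # number of female employees in the toy data
staff_female
```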
+ --- # Filtering and sorting -## Data work request +## Government analytics request .pull-left[ *We will use the data from our dataframe department_staff_list* @@ -492,13 +429,15 @@ # Filtering and sorting -## Data work request +## Government analytics request Now let's say that we are also interested at a first glance of the females that recently joined the department. We would have to do the following 1. Keeping only the female employees 1. Sorting by years of service +❎In Excel: We would filter and then use arrange by years of service. + --- @@ -508,14 +447,13 @@ Use `filter()` for this: +Remember how we use functions in this case 1st argument is **data**, second argument is **filter** (sex takes the value female in this case) + ``` r -department_female <- department_staff_list %>% # pipe takes data as first argument "and then" - filter(sex == "Female") # we filter by Female +temp1 <- filter(department_staff_list, sex == "Female") ``` -Remember that we said we were starting to use the pipe (%>%). 
The pipe takes our data as the first argument, and then we need to pass only the instruction (filter) - <img src="img/session2/filtering-sorting1.png" width="45%" style="display: block; margin: auto;" /> --- @@ -527,9 +465,7 @@ ``` r -department_female <- department_staff_list %>% - filter(sex == "Female") %>% # pipe takes previous instruction "and then" - arrange(years_of_service) # arrange by years of service +department_female <- arrange(temp1, years_of_service) # arrange by years of service ``` <img src="img/session2/filtering-sorting2.png" width="45%" style="display: block; margin: auto;" /> @@ -548,9 +484,9 @@ ``` r -department_female <- department_staff_list %>% - filter( sex == "Female") %>% - arrange(years_of_service) +temp1 <- filter(department_staff_list, sex == "Female") # filter by female + +department_female <- arrange(temp1, years_of_service) # order by years of service ``` --- @@ -561,7 +497,7 @@ .pull-left[ - `filter()` and `arrange()` are all functions from `dplyr`. Remember you have to always load `dplyr` first with `library(dplyr)` to be able to use them -- The resulting dataframe is `department_female` +- The resulting dataframe is `department_female` (and we also have `temp1`) ] .pull-right[ <img src="img/session2/ex5.png" width="95%" style="display: block; margin: auto;" /> @@ -605,18 +541,24 @@ ## Example -Using our data frame let's say that we want to include a variable that for any reason instead of years of service we want days of service. We would do this by doing the following: +Using our data frame let's say that we want to include a variable that instead of years of service we want days of service. 
We would do this by doing the following: ``` r -example_mutate <- department_staff_list %>% - mutate(days_of_service = years_of_service*365) +example_mutate <- mutate(department_staff_list, days_of_service = years_of_service*365) ``` -We will not use that variable for our datawork examples, but this is a really useful data wrangling function. +We will not use that variable for our government analytics examples, but this is a really useful data wrangling function. + +--- + +# Questions? + +![](https://media.giphy.com/media/XHVmD4RyXgSjd8aUMb/giphy.gif?cid=790b7611auicodssx8z4zly6u6z5k4pd0r1i13drege3yunc&ep=v1_gifs_search&rid=giphy.gif&ct=g) --- + class: inverse, center, middle name: merging @@ -629,21 +571,21 @@ # Merging dataframes -Merging data is a common task in data analysis, especially when working with large datasets. +Merging data is a common task in data analysis, especially when working with data from multiple departments. Let's see how it would apply to our dataframes. -## Data work request +## Government analytics request **Scenario 2:** -*Let's imagine that for our annual report we are also interested in the age distribution from the employees* +*Let's imagine that for our annual report we are also interested in the age distribution from the employees* but these are in different datasets --- # Merging dataframes -## Data work request +## Government analytics request .pull-left[ *Use the data `department_staff_list` that you already know with the department_staff_age* @@ -660,12 +602,11 @@ To do this we will use `left_join()` to merge the dataframes: -- The arguments of the function are 1. our previous selected data (passed by the pipe) and 2. the dataframe we want to merge to the first one +- The arguments of the function are 1. the "principal" dataset and 2. 
the dataframe we want to merge (paste) into that one ``` r -deparment_staff_list_age <- deparment_staff_list %>% # Our original data frame " %>% and then" - left_join(department_staff_age) # our second data frame +department_age <- left_join(department_staff_list, department_staff_age) # Our original data frame ``` @@ -681,8 +622,7 @@ ``` r -deparment_age <- department_staff_list %>% # Our original data frame "and then" - left_join(department_staff_age) # our second data frame +deparment_age <- left_join(deparment_staff_list, department_staff_age) # Our original data frame ``` <img src="img/session2/exercise6.png" width="95%" style="display: block; margin: auto;" /> --- @@ -699,7 +639,7 @@ # Grouping and summarizing -## Data work request +## Government analytics request **Let's say that we are interested again in the number of employees by department** @@ -709,8 +649,7 @@ ``` r -employees_by_deparment <- department_staff_list %>% - group_by(department) +employees_by_deparment <- group_by(department_staff_list, department) ``` If we **only** do this, this won't do anything, to complete the function we need to use this with `summarise()` @@ -719,7 +658,7 @@ # Grouping and summarizing -## Data work request +## Government analytics request summarize works in a similar way to mutate: @@ -732,15 +671,14 @@ ``` r -employees_by_deparment <- department_staff_list %>% - group_by(department) %>% - summarise(number = n()) +temp1 <- group_by(department_staff_list, department) +employees_by_department <- summarise(temp1, number = n()) ``` --- # Grouping and summarizing -## Data work request +## Government analytics request This will create the following dataframe/table: @@ -749,19 +687,20 @@ ## # A tibble: 23 × 2 ## department number ## <chr> <int> -## 1 Births & Deaths 179 -## 2 Bureau of Languages 59 +## 1 Births & Deaths 172 +## 2 Bureau of Languages 58 ## 3 Co-operatives 293 ## 4 Controller & Accountant General 3776 ## 5 Department of Comm Devt 109 ## 6 Dept of Children 101 
-## 7 Dept of Gender 101 -## 8 Dept of Labour 293 -## 9 Factories Inspectorate 122 -## 10 Feeder Roads 103 +## 7 Dept of Gender 100 +## 8 Dept of Labour 290 +## 9 Factories Inspectorate 121 +## 10 Feeder Roads 102 ## # ℹ 13 more rows ``` + --- # More wrangling operations @@ -779,7 +718,7 @@ | Append dataframes | `bind_rows()` | | Deduplicate | `distinct()` | | Collapse and create summary indicators | `group_by()`, `summarize()` | -| Pass a result as the first argument for the next function | `%>%` (operator, not function)| +| Pass a result as the first argument for the next function | `%>%` (operator, not function (**tomorrow**))| --- @@ -794,7 +733,7 @@ # Exporting outputs -- Until now, we've seen full examples of part 1 and 2 of the data work pipeline +- Until now, we've seen full examples of part 1 and 2 of the Government analytics pipeline - What about exporting outputs? <img src="img/session2/data-work-progress.png" width="90%" style="display: block; margin: auto;" /> @@ -807,9 +746,9 @@ ## Exporting dataframes -- We can export it as an Excel file with the function `write.xlsx()` +- We can export it as a csv file with the function `write.csv()` -- `write.xlsx()` creates a Excel file with the dataframe +- `write.csv()` creates a csv file with the dataframe - It takes two basic arguments: @@ -826,16 +765,14 @@ ``` r -write.xslx(employees_by_deparment, - "employees_by_deparment", - row.names = FALSE) +write.csv(employees_by_deparment, "employees_by_deparment", row.names = FALSE) ``` --- # Exporting outputs -Now `employees_by_deparment.xlsx` (probably in your `Documents` folder). +Now `employees_by_deparment.csv` (probably in your `Documents` folder). 
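A small sketch of the export step, assuming a made-up summary table. `write.csv()` writes to exactly the path it is given and does not add an extension, so the `.csv` ending is typed out here. `out_path` is a hypothetical name, and `tempdir()` is used only so the example runs on any computer.

```r
# Hypothetical toy summary table standing in for the exercise result
employees_by_department <- data.frame(department = c("Gender", "Labour"),
                                      number = c(2, 3))

# Build the path explicitly; the ".csv" extension is part of the file name
out_path <- file.path(tempdir(), "employees_by_department.csv")

# First argument: the data frame. Second argument: where to write it.
write.csv(employees_by_department, out_path, row.names = FALSE)

getwd()                # the folder R writes to when the path is only a file name
file.exists(out_path)  # confirm the file was written
```

If the second argument is only a file name, the file lands in the working directory that `getwd()` reports.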
<img src="img/session2/exported-csv.png" width="70%" style="display: block; margin: auto;" /> @@ -845,13 +782,11 @@ ## Some notes on file paths -- The second argument of `write.xlsx()` specifies the file path we export the dataframe to +- The second argument of `write.csv()` specifies the file path we export the dataframe to ``` r -write.xslx(employees_by_deparment, - "employees_by_deparment", - row.names = FALSE) +write.csv(employees_by_deparment, "employees_by_deparment", row.names = FALSE) ``` - You can include any path in your computer and R will write the file in that location @@ -893,7 +828,7 @@ # Wrapping up -## Data work pipeline +## Government analytics pipeline <img src="img/session2/data-work-final.png" width="85%" style="display: block; margin: auto;" /> @@ -901,7 +836,7 @@ # Wrapping up -## Data work pipeline +## Government analytics pipeline <img src="img/session2/data-work-day-2.png" width="85%" style="display: block; margin: auto;" /> @@ -936,8 +871,7 @@ ``` r -temp1 <- department_staff_list %>% - select(ID, department) +temp1 <- select(department_staff_list, ID, department) ``` --- @@ -953,10 +887,11 @@ ``` r -budget_2024 <- budget_data %>% - group_by(year, product) %>% - summarise(total = sum(income), - average = mean(income)) +budget_2024 <- group_by(budget_data, year, product) + +budget_2024 <- summarise(budget_2024, + total = sum(income), + average = mean(income)) ``` --- diff --git a/Presentations-Ghana/2024-10/2-data-wrangling.pdf b/Presentations-Ghana/2024-10/2-data-wrangling.pdf index 71d2512..cf93770 100644 Binary files a/Presentations-Ghana/2024-10/2-data-wrangling.pdf and b/Presentations-Ghana/2024-10/2-data-wrangling.pdf differ diff --git a/Presentations-Ghana/2024-10/3-descriptive-statistics.Rmd b/Presentations-Ghana/2024-10/3-descriptive-statistics.Rmd index 51d48cd..3ddc0d9 100644 --- a/Presentations-Ghana/2024-10/3-descriptive-statistics.Rmd +++ b/Presentations-Ghana/2024-10/3-descriptive-statistics.Rmd @@ -1,8 +1,8 @@ --- -title: 
"Session 3 - Descriptive statistics" -subtitle: "R training" +title: "Session 3 - Descriptive Statistics" +subtitle: "R Training" author: "María Reyes Retana" -date: "The World Bank | December 2024" +date: "The World Bank | January 2025" output: xaringan::moon_reader: css: ["libs/remark-css/default.css", @@ -17,6 +17,7 @@ output: countIncrementalSlides: false --- + ```{r setup, include = FALSE} # Load packages library(knitr) @@ -25,6 +26,8 @@ library(here) library(dplyr) library(modelsummary) library(huxtable) +library(readxl) +library(kableExtra) here::i_am("3-descriptive-statistics.Rmd") options(htmltools.dir.version = FALSE) opts_chunk$set( @@ -91,29 +94,46 @@ name: intro # Introduction - We learned yesterday how to conduct statistical programming and export the results in `.csv` files -- However, sometime we might need more refined tables than simple (and ugly) CSVs +- However, sometime we might need more refined tables than simple (and ugly) csvs ```{r echo = FALSE, out.width="95%"} -knitr::include_graphics("img/session3/data-work-descriptive-stats.png") +knitr::include_graphics("img/session3/data-work-final-table.png") ``` --- # Introduction -- That's what today's session is about, along with an explanation of the pipes (`%>%`) +- That's what today's session is about, along with an introduction of the pipe (`%>%`) ```{r echo = FALSE, out.width="95%"} -knitr::include_graphics("img/session3/data-work-descriptive-stats.png") +knitr::include_graphics("img/session3/data-work-final-table.png") ``` --- # Introduction -## Exercise 1a: Getting the libraries for today's session +## Relevance to your work + +In your **annual reports**, you often include: +- Summary tables like **revenue by year**, **spending across categories**, or **staff counts by department**. +- These tables help summarize key trends and patterns for decision-makers. 
+ +```{r echo = FALSE, out.width="45%"} +knitr::include_graphics("img/session3/programmed_psrl_2023.png") +``` + +- Today, we will practice **creating similar summary tables** using mock data. +- While we’re using a simple dataset today, the same steps can be applied to your own data for reports. + +--- + +# Introduction + +## Exercise 1a: Getting the packages for today's session -We're going to use two R libraries in this session: `modelsummary` and `huxtable`. +We're going to use two R packages in this session: `modelsummary`, `huxtable` and ` dplyr`. 1. Install `modelsummary` and `huxtable`: @@ -126,6 +146,30 @@ install.packages("huxtable") knitr::include_graphics("img/session3/install.png") ``` +## Exercise 1b: Download and load the data we'll use + +1. First of all let's open a script. `File`>`New File`>`R Script` + +```{r echo = FALSE, out.width="55%"} +knitr::include_graphics("img/session3/r-script.png") +``` + +2. Add the packages we will use today + +```{r, eval=FALSE} +library(modelsummary) +library(huxtable) +library(readxl) +library(dplyr) +``` + +3. Load the data we will use. (remember our two methods point click and code) + +```{r} +department_staff_final <- read.csv("data/department_staff_final.csv") +``` + + --- # Introduction @@ -133,9 +177,9 @@ knitr::include_graphics("img/session3/install.png") ## Exercise 1b: Download and load the data we'll use .pull-left[ -1. Go to https://osf.io/z8snr and download the file +1. Go to https://osf.io/zqa6j and download the file -1. In RStudio, go to `File` > `Import Dataset` > `From Text (base)` and select the file `small_business_2019_all.csv` +1. In RStudio, go to `File` > `Import Dataset` > `From Text (base)` and select the file `department_staff_final.csv` + If you don't know where the file is, remember to check in your `Downloads` folder @@ -164,14 +208,10 @@ knitr::include_graphics("img/session3/environment.png") ## Recap: always know your data! 
-- This data is similar to the one we used before -- Every row is one business in one tax period (month) -- `modified_id` is a business identifier -- We also have information about the region, firm age, monthly income, VAT liability -- There is one more variable we didn't see before: `group` contains the group the firm was assigned to in a random experiment +- This is the data that we used yesterday! -```{r echo = FALSE, out.width="40%"} -knitr::include_graphics("img/session3/df.png") +```{r, echo=TRUE} +glimpse(department_staff_final) ``` --- @@ -185,98 +225,118 @@ name: piping --- -# Piping +# The pipe ( %>% ) operator -- Before we start producing more refined outputs, we need to cover piping -- You probably remember this piece of code from one of yesterday's exercise: +- Before diving into the contents of today's session, we will to introduce a **super useful tool**: + **the pipe (`%>%`)**. -```{r eval= FALSE} -# Filter only businesses in Tbilisi: -temp1 <- filter(small_business_2019, region == "Tbilisi") +- The pipe is part of the `dplyr` package. It helps to write code in a way that is easier to read and understand. -# Sort previous result by income, descending order: -temp2 <- arrange(temp1, -income) +- Reading and understanding multiple operations can be difficult. -# Keep only the 50 first businesses after sorting: -df_tbilisi_50 <- filter(temp2, row_number() <= 50) -``` +- The pipe operator (**%>%**) can help with this. --- -# Piping +# The pipe operator -This code works, but the problem with it is that it makes us generate unnecessary intermediate dataframes (`temp1`, `temp2`) that store results temporarily -```{r echo = FALSE, out.width="75%"} -knitr::include_graphics("img/session3/temp_dfs.png") +- With the pipe, code reads from left to right, top to bottom, which is more intuitive. + +- %>% can be read as "then" and simplifies code structure. 
+ +For example let's see at this mock get to work sequence example: + +**Without pipe ( %>% )** (hard to read and understand the sequence) + +```{r, eval=FALSE} +go_to_work(make_breakfast(work_out(brush_teeth(wake_up(Mer))))) +``` + +**With pipe ( %>% )** (the order is clear) + +```{r, eval=FALSE} +Mer %>% + wake_up() %>% + brush_teeth() %>% + work_out() %>% + make_breakfast() %>% + go_to_work() ``` --- -# Piping +# The pipe -Instead, we can use pipes to **pass the results of a function and apply a new function on top of it** +As we saw yesterday the functions are normally organized around a set of verbs, or actions to be taken. + +-- + +* Most *verbs* work as follows: -.pull-left[ -```{r eval= FALSE} -# Filter only businesses in Tbilisi: -temp1 <- filter(small_business_2019, - region == "Tbilisi") -# Sort previous result by income, descending order: -temp2 <- arrange(temp1, - -income) +$$\text{verb}(\underbrace{\text{data.frame}}_{\text{1st argument}}, \underbrace{\text{what to do}}_\text{2nd argument})$$ -# Keep only the 50 first businesses after sorting: -df_tbilisi_50 <- filter(temp2, - row_number() <= 50) +-- + +* Alternatively you can (**should**) use the `pipe` operator `%>%`: + +$$\underbrace{\text{data.frame}}_{\text{1st argument}} \underbrace{\text{ %>% }}_{\text{"pipe" operator}} \text{verb}(\underbrace{\text{what to do}}_\text{2nd argument})$$ + +We will start using the pipe from this point. Please ask if something is not clear. + +**Tip**💡: Use Shift + Ctrl/Cmd + M as a shortcut for the pipe operator. 
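The two forms above, `verb(data, what to do)` and `data %>% verb(what to do)`, can be compared directly. This is a minimal sketch on hypothetical toy data, not the training file:

```r
library(dplyr)

# Hypothetical toy data
staff <- data.frame(ID = 1:4,
                    sex = c("Female", "Male", "Female", "Female"),
                    years_of_service = c(12, 3, 1, 7))

# verb(data, what to do), with one call nested inside the other
without_pipe <- arrange(filter(staff, sex == "Female"), years_of_service)

# data %>% verb(what to do): the left-hand side becomes the first argument
with_pipe <- staff %>%
  filter(sex == "Female") %>%
  arrange(years_of_service)

identical(without_pipe, with_pipe)  # both ways give the same data frame
```

The piped version reads in the order the steps happen and needs no temporary object between them.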
+ +--- + +# The pipe + +- You probably remember this piece of code from one of yesterday's exercise: + +```{r eval= FALSE} +# Filter only female employees: +temp1 <- filter(department_staff_list, sex == "Female") # filter by female + +# Sort previous result by years of service +department_female <- arrange(temp1, years_of_service) # order by years of service ``` -] -.pull-right[ -```{r eval=FALSE} -# The same but with pipes: -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) +--- + +# The pipe + +This code works, but the problem with it is that it makes us generate unnecessary intermediate dataframes (`temp1`) that store results temporarily + +```{r echo = FALSE, out.width="75%"} +knitr::include_graphics("img/session3/temp_dfs.png") ``` -] --- -# Piping +# The pipe + +Instead, we can use pipes to **pass the results of a function and apply a new function on top of it** (just like Mer's waking up sequence) .pull-left[ ```{r eval= FALSE} -# Filter only businesses in Tbilisi: -temp1 <- filter(small_business_2019, - region == "Tbilisi") - -# Sort previous result by income, descending order: -temp2 <- arrange(temp1, - -income) +# Filter only female employees: +temp1 <- filter(department_staff_list, sex == "Female") # filter by female -# Keep only the 50 first businesses after sorting: -df_tbilisi_50 <- filter(temp2, - row_number() <= 50) +# Sort previous result by years of service +department_female <- arrange(temp1, years_of_service) # order by years of service ``` ] .pull-right[ ```{r eval=FALSE} # The same but with pipes: -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) +department_female <- filter(department_staff_list, sex == "Female") %>% + arrange(years_of_service) ``` ] -There are several important details to notice here: - -1.- The resulting dataframe `df_tbilisi_50` is **the same in both cases** +- The usefulness 
of the pipe (%>%) becomes more evident when the code starts to get more complicated. --- @@ -284,31 +344,25 @@ There are several important details to notice here: .pull-left[ ```{r eval= FALSE} -# Filter only businesses in Tbilisi: -temp1 <- filter(small_business_2019, - region == "Tbilisi") - -# Sort previous result by income, descending order: -temp2 <- arrange(temp1, - -income) +# Filter only female employees: +temp1 <- filter(department_staff_list, sex == "Female") # filter by female -# Keep only the 50 first businesses after sorting: -df_tbilisi_50 <- filter(temp2, - row_number() <= 50) +# Sort previous result by years of service +department_female <- arrange(temp1, years_of_service) # order by years of service ``` ] .pull-right[ ```{r eval=FALSE} # The same but with pipes: -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) +department_female <- filter(department_staff_list, sex == "Female") %>% + arrange(years_of_service) ``` ] -2.- The name of the resulting dataframe is now defined in the first line of this data wrangling operation. 
This is because **R evaluates lines with consecutive pipes as if they were a single line** +There are several important details to notice here: + +1.- The resulting dataframe `department_female` is **the same in both cases** --- @@ -316,31 +370,23 @@ df_tbilisi_50 <- filter(small_business_2019, .pull-left[ ```{r eval= FALSE} -# Filter only businesses in Tbilisi: -temp1 <- filter(small_business_2019, - region == "Tbilisi") - -# Sort previous result by income, descending order: -temp2 <- arrange(temp1, - -income) +# Filter only female employees: +temp1 <- filter(department_staff_list, sex == "Female") # filter by female -# Keep only the 50 first businesses after sorting: -df_tbilisi_50 <- filter(temp2, - row_number() <= 50) +# Sort previous result by years of service +department_female <- arrange(temp1, years_of_service) # order by years of service ``` ] .pull-right[ ```{r eval=FALSE} # The same but with pipes: -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) +department_female <- filter(department_staff_list, sex == "Female") %>% + arrange(years_of_service) ``` ] -3.- Notice that the functions `arrange()` and `filter()` used after the pipes now have only **one argument instead of two**. This is because when using pipes the first argument is implied to be result of the function before the pipes +2.- Notice that the functions `arrange()` and `filter()` used after the pipes now have only **one argument instead of two**. This is because when using pipes the first argument is implied to be result of the function before the pipes --- @@ -350,128 +396,139 @@ df_tbilisi_50 <- filter(small_business_2019, 1. 
Apply the same filtering and sorting now with pipes +**Solution** + ```{r eval=FALSE} -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) +department_female <- filter(department_staff_final, sex == "Female") %>% + arrange(years_of_service) ``` +* Note that our dataframe is now named `department_staff_final` (it is the joined dataframe from yesterday's session) + --- # Piping Now we will not have any annoying intermediate results stored in our environment! -.pull-left[ -```{r eval=FALSE} -# The same but with pipes: -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) -``` -] +- Good code is code that is both correct (does what it's supposed to) and easy to understand -.pull-right[ -```{r echo = FALSE, out.width="85%"} -knitr::include_graphics("img/session3/env-pipes.png") -``` -] +- Piping is **instrumental for writing good code in R** --- # Piping -Lastly, we can also add more formatting to this code to improve its clarity even more: +## Always use pipes! + +Now that you know about the power of pipes, use them wisely!
.pull-left[ -```{r eval=FALSE} -# Previous solution -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) -``` +- Remember that pipes are part of the library `dplyr`, you need to load it before using them + +- Pipes also improve code clarity drastically + +- Many R coders use pipes and internet examples assume you know them + +- **We'll use pipes now in the next examples and exercises of the rest of this training** ] .pull-right[ -```{r eval=FALSE} -# The same with better spacing -df_tbilisi_50 <- - small_business_2019 %>% - filter(region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) +```{r echo = FALSE, out.width="45%"} +knitr::include_graphics("img/session3/pipes-joke.png") ``` ] --- -# Piping +class: inverse, center, middle +name: quick-summary-stats -.pull-left[ -```{r eval=FALSE} -# Previous solution -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) -``` -] +# Quick summary statistics -.pull-right[ -```{r eval=FALSE} -# The same with better spacing -df_tbilisi_50 <- - small_business_2019 %>% - filter(region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) -``` -] +

-- Good code is code that is both correct (does what it's supposed to) and it's easy to understand +--- -- Piping is **instrumental for writing good code in R** +# Quick Summary Statistics ---- +## Refresher: Grouping and Summarizing Data -# Piping +Yesterday, we learned how to: +1. **Group data** using `group_by()` +2. **Summarize** results with `summarise()` -## Always use pipes! +This is a powerful tool to create **summary tables**—such as totals, averages, or counts—that are essential for your annual reports. -Now that you now about the power of the pipes, use them wisely! +### Example -.pull-left[ -- Remember that pipes are part of the library `dplyr`, you need to load it before using them +```{r, eval=FALSE} +# Summarize total revenue by month +summary_table <- psrl_data %>% + group_by(Month) %>% + summarise(Total_Revenue = sum(Revenue, na.rm = TRUE)) -- Pipes also improve code clarity drastically +print(summary_table) +``` -- Many R coders use pipes and internet examples assume you know them +--- -- **We'll use pipes now in the next examples and exercises of the rest of this training** +# Quick summary statistics + +```{r, echo=FALSE, eval=TRUE, include=FALSE} +# Creating a mock PSRL dataset +psrl_data <- data.frame( + Month = c("Jan-23", "Feb-23", "Mar-23", "Apr-23", "May-23", "Jun-23", + "Jul-23", "Aug-23", "Sep-23", "Oct-23", "Nov-23", "Dec-23"), + Diesel = c(120, 130, 125, 140, 150, 155, 160, 170, 165, 180, 190, 200), + LPG = c(60, 65, 63, 70, 75, 78, 80, 85, 83, 90, 95, 100), + Petrol = c(100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155), + MGO_F = c(30, 35, 33, 40, 42, 45, 50, 48, 47, 55, 58, 60), + Gas_Oil = c(25, 30, 28, 35, 38, 40, 42, 45, 48, 50, 52, 55), + Unified_Naphtha = c(15, 18, 20, 22, 24, 26, 28, 30, 32, 35, 38, 40), + Total = c(350, 383, 379, 422, 449, 469, 490, 513, 515, 555, 583, 610) +) + +# View the dataset +print(psrl_data) + +``` + +Applying This to Your Work +.pull-left[ +**Annual Report Table** +![Programmed PSRL for 
2023](img/session3/programmed_psrl_2023.png) +*Monthly totals for different products.* ] .pull-right[ -```{r echo = FALSE, out.width="45%"} -knitr::include_graphics("img/session3/pipes-joke.png") +**Code to Recreate the Table** + +```{r} +# Summarize all numeric columns by Month +summary_table <- psrl_data %>% + group_by(Month) %>% + summarise(across(where(is.numeric), \(x) sum(x, na.rm = TRUE))) + ``` ] --- -class: inverse, center, middle -name: quick-summary-stats - # Quick summary statistics -

+```{r} +# Print the summary table +print(summary_table) +``` + --- # Quick summary statistics +## Beyond Basic Summaries: Customizing Your Results + We learned yesterday how to produce dataframes with results and export them. ## But what if you want to ... ? @@ -531,14 +588,14 @@ datasummary_skim( 1. Load `modelsummary` with `library(modelsummary)` -1. Use `datasummary_skim()` to create a descriptive statistics table for `small_business_all` +1. Use `datasummary_skim()` to create a descriptive statistics table for `department_staff` ```{r echo=FALSE} -small_business_2019_all <- read.csv("data/small_business_2019_all.csv") +department_staff <- read.csv("data/department_staff_final.csv") ``` -```{r eval=FALSE} -datasummary_skim(small_business_2019_all) +```{r eval=FALSE, warning=FALSE, message=FALSE} +datasummary_skim(department_staff) ``` --- @@ -547,8 +604,8 @@ datasummary_skim(small_business_2019_all) You should be seeing this result in the lower right panel of RStudio. -```{r echo=FALSE} -datasummary_skim(small_business_2019_all) +```{r echo = FALSE, out.width="55%"} +knitr::include_graphics("img/session3/datasummary_skim.png") ``` --- @@ -559,16 +616,16 @@ datasummary_skim(small_business_2019_all) - To summarize categorical variables, use the argument `type = "categorical"` -```{r eval=FALSE} -datasummary_skim(small_business_2019_all, type = "categorical") +```{r eval=FALSE, warning=FALSE, message=FALSE} +datasummary_skim(department_staff, type = "categorical") ``` --- # Quick summary statistics -```{r echo=FALSE} -datasummary_skim(small_business_2019_all, type = "categorical") +```{r echo = FALSE, out.width="55%"} +knitr::include_graphics("img/session3/datasummary_cat.png") ``` --- @@ -577,10 +634,6 @@ datasummary_skim(small_business_2019_all, type = "categorical") - `datasummary_skim()` is convenient because it's fast, easy, and shows a lot of information -```{r echo = FALSE} -datasummary_skim(small_business_2019_all) -``` - - But what if we wanted to 
customize what to show? that's when we use `datasummary()` instead, also from the library `modelsummary` --- @@ -621,36 +674,28 @@ datasummary( ## Exercise 4: -Create a summary statistics table showing the nuber of observations, mean, standard deviation, minimum, and maximum for variables `age`, `income`, and `vat_liability` of the dataframe `small_business_2019_all` +Create a summary statistics table showing the number of observations, mean, standard deviation, minimum, and maximum for variables `years_of_service` of the dataframe `department_staff` 1. Use `datasummary()` for this: ```{r eval=FALSE} datasummary( - age + income + vat_liability ~ N + Mean + SD + Min + Max, - small_business_2019_all + years_of_service ~ N + Mean + SD + Min + Max, + department_staff ) ``` ---- - -# Customized summary statistics - -```{r echo=FALSE} -datasummary( - age + income + vat_liability ~ N + Mean + SD + Min + Max, - small_business_2019_all -) +```{r echo = FALSE, out.width="55%"} +knitr::include_graphics("img/session3/custom.png") ``` - --- # Customized summary statistics ```{r eval=FALSE} datasummary( - age + income + vat_liability ~ N + Mean + SD + Min + Max, # this is the formula - small_business_2019_all # this is the data + years_of_service ~ N + Mean + SD + Min + Max, # this is the formula + department_staff # this is the data ) ``` @@ -661,14 +706,17 @@ Some notes: - The formula should always be defined as: rows ~ columns - The rows and columns in the formula are separated by a plus (`+`) sign + +In Excel ❎ you would need to calculate each of the statistics in a new table, by selecting the data and using the appropriate formula. 
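To make the `rows ~ columns` idea more concrete, here is a small sketch with two variables in the rows. It assumes `modelsummary` is installed; the data frame `staff_toy` and its values are invented for illustration, and `output = "data.frame"` is used only so the result can be inspected as a regular dataframe:

```r
library(modelsummary)

# Hypothetical toy data, invented for this illustration
staff_toy <- data.frame(
  years_of_service = c(3, 12, 7, 20),
  age = c(28, 45, 36, 58)
)

# Variables (rows) go to the left of ~, statistics (columns) to the right;
# items on each side are joined with +
stats_toy <- datasummary(
  years_of_service + age ~ N + Mean + SD + Min + Max,
  data = staff_toy,
  output = "data.frame"
)

stats_toy
```

Adding another variable or statistic only requires one more `+` term on the corresponding side of the formula.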
+ --- # Customized summary statistics ```{r eval=FALSE} datasummary( - age + income + vat_liability ~ N + Mean + SD + Min + Max, # this is the formula - small_business_2019_all # this is the data + years_of_service ~ N + Mean + SD + Min + Max, # this is the formula + department_staff # this is the data ) ``` @@ -699,12 +747,14 @@ Remember that both `datasummary_skim()` and `datasummary()` have an optional arg For example: -```{r eval=FALSE} -datasummary_skim(small_business_2019_all, +```{r warning=FALSE, message=FALSE} +datasummary_skim(department_staff, output = "quick_stats.docx") ``` -Will export the result to the `Documents` folder (in Windows) in a Word file named `quick_stats.docx` +Will export the result to the `Documents` folder (in Windows) in a Word file named `quick_stats.docx` + +*Note* for this code to work we would need to install an extra package `pandoc` --- @@ -740,13 +790,13 @@ Noticed that we're missing Excel? ## Exercise 5: Export a table to Excel -1. Load `huxtable` with `library(huxtable)` +1. Load `huxtable` with `library(huxtable)` (we already did this at the beginning of the session) 1. Run the following code to export the result of `datasummary_skim()` to Excel: ```{r eval=FALSE} # Store the table in a new object -stats_table <- datasummary_skim(small_business_2019_all, output = "huxtable") +stats_table <- datasummary_skim(department_staff, output = "huxtable") # Export this new object to Excel with quick_xlsx() quick_xlsx(stats_table, file = "quick_stats.xlsx") @@ -766,19 +816,21 @@ knitr::include_graphics("img/session3/quick-stats-output.png") # Exporting tables -And you can open it with Excel for further customization if you want +And you can open it with Excel for further customization if you want... ```{r echo = FALSE, out.width="65%"} knitr::include_graphics("img/session3/quick-stats-excel.png") ``` +- However... remember that any manual changes will be hard to track affecting the reproducibility of your work. 
+ --- # Exporting tables ```{r eval=FALSE} # Store the table in a new object -stats_table <- datasummary_skim(small_business_2019_all, output = "huxtable") +stats_table <- datasummary_skim(department_staff, output = "huxtable") # Export this new object to Excel with quick_xlsx() quick_xlsx(stats_table, file = "quick_stats.xlsx") @@ -818,14 +870,14 @@ stats_table %>% # Center cells in first row set_align(1, everywhere, "center") %>% # Set a theme for quick formatting - theme_basic() + theme_blue() ``` ] .pull-right[ .small[ ```{r echo=FALSE, message=FALSE, warning=FALSE} -stats_table <- datasummary_skim(small_business_2019_all, output = "huxtable") +stats_table <- datasummary_skim(department_staff, output = "huxtable") # Format table stats_table %>% @@ -833,7 +885,7 @@ stats_table <- datasummary_skim(small_business_2019_all, output = "huxtable") set_header_cols(1, TRUE) %>% # Use first column as row header set_number_format(everywhere, 2:ncol(.), "%9.0f") %>% # Don't round large numbers set_align(1, everywhere, "center") %>% # Centralize cells in first row - theme_basic() # Set a theme for quick formatting + theme_blue() # Set a theme for quick formatting ``` ] ] @@ -858,7 +910,7 @@ stats_table_custom <- stats_table %>% # Center cells in first row set_align(1, everywhere, "center") %>% # Set a theme for quick formatting - theme_basic() + theme_blue() ``` ] @@ -893,7 +945,7 @@ stats_table_custom <- stats_table %>% # <---- here set_header_cols(1, TRUE) %>% set_number_format(everywhere, 2:ncol(.), "%9.0f") %>% set_align(1, everywhere, "center") %>% - theme_basic() + theme_blue() ``` This is the object that we export later with `quick_xslx()` @@ -928,12 +980,37 @@ knitr::include_graphics("img/session3/stats-custom.png") # Customizing table outputs -We used `theme_basic()` to give a minimalistic, basic theme to the table. Other available themes are: +We used `theme_blue()`. 
Other available themes are: ```{r echo = FALSE, out.width="75%"} knitr::include_graphics("img/session3/themes.png") ``` +--- + +# Use it on your work + +### Key Takeaways: +- This was a **basic example** with a few variables from your staff list, but the **possibilities are endless**. +- With this and the contents from yesterday's session, you can create **summaries** of **anything you can think of**. + +### Real-World Example: + **Annual Report: Programmed vs. Billings for 2023 (In GH¢)** + +```{r echo = FALSE, out.width="40%"} +knitr::include_graphics("img/session3/annual_report.png") +``` + +- If we have the data, you can easily create summaries like this directly in R. +- Once written, the code can be re-used for the next year or quarter. + +--- + +# Questions? + +![](https://media.giphy.com/media/XHVmD4RyXgSjd8aUMb/giphy.gif?cid=790b7611auicodssx8z4zly6u6z5k4pd0r1i13drege3yunc&ep=v1_gifs_search&rid=giphy.gif&ct=g) + + --- class: inverse, center, middle @@ -942,7 +1019,6 @@ name: wrapping-up # Wrapping up

- --- # Wrapping up @@ -1013,3 +1089,6 @@ class: inverse, center, middle # Thanks! // ¡Gracias! // Obrigado!

+```{r echo = FALSE, out.width="80%"} +knitr::include_graphics("img/session3/you-can-r.png") +``` diff --git a/Presentations-Ghana/2024-10/3-descriptive-statistics.html b/Presentations-Ghana/2024-10/3-descriptive-statistics.html index 354d08b..68cb7be 100644 --- a/Presentations-Ghana/2024-10/3-descriptive-statistics.html +++ b/Presentations-Ghana/2024-10/3-descriptive-statistics.html @@ -1,9 +1,9 @@ - Session 3 - Descriptive statistics + Session 3 - Descriptive Statistics - + @@ -16,8 +16,6 @@ - - @@ -28,16 +26,16 @@ class: center, middle, inverse, title-slide .title[ -# Session 3 - Descriptive statistics +# Session 3 - Descriptive Statistics ] .subtitle[ -## R training - Georgia RS-WB DIME +## R Training ] .author[ -### Luis Eduardo San Martin +### María Reyes Retana ] .date[ -### The World Bank | September 2023 +### The World Bank | January 2025 ] --- @@ -45,6 +43,7 @@ + <style type="text/css"> @media print { .has-continuation { @@ -53,7 +52,9 @@ } </style> -# Table of contents // სარჩევი +<img src="img/template.png" width="100%" style="display: block; margin: auto;" /> + +# Table of contents - [Introduction](#intro) - [Piping](#piping) @@ -68,55 +69,94 @@ class: inverse, center, middle name: intro -# Introduction // გაცნობა +# Introduction <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- -# Introduction // გაცნობა +# Introduction - We learned yesterday how to conduct statistical programming and export the results in `.csv` files -- However, sometime we might need more refined tables than simple (and ugly) CSVs +- However, sometime we might need more refined tables than simple (and ugly) csvs -<img src="img/session3/data-work-descriptive-stats.png" width="95%" style="display: block; margin: auto;" /> +<img src="img/session3/data-work-final-table.png" width="95%" style="display: block; margin: auto;" /> --- -# Introduction // გაცნობა +# Introduction -- That's what today's session is about, along with an explanation of the 
pipes (`%>%`) +- That's what today's session is about, along with an introduction of the pipe (`%>%`) -<img src="img/session3/data-work-descriptive-stats.png" width="95%" style="display: block; margin: auto;" /> +<img src="img/session3/data-work-final-table.png" width="95%" style="display: block; margin: auto;" /> + +--- + +# Introduction + +## Relevance to your work + +In your **annual reports**, you often include: +- Summary tables like **revenue by year**, **spending across categories**, or **staff counts by department**. +- These tables help summarize key trends and patterns for decision-makers. + +<img src="img/session3/programmed_psrl_2023.png" width="45%" style="display: block; margin: auto;" /> + +- Today, we will practice **creating similar summary tables** using mock data. +- While we’re using a simple dataset today, the same steps can be applied to your own data for reports. --- -# Introduction // გაცნობა +# Introduction -## Exercise 1a: Getting the libraries for today's session +## Exercise 1a: Getting the packages for today's session -We're going to use two R libraries in this session: `modelsummary` and `huxtable`. +We're going to use two R packages in this session: `modelsummary`, `huxtable` and ` dplyr`. 1. Install `modelsummary` and `huxtable`: -```r +``` r install.packages("modelsummary") install.packages("huxtable") ``` <img src="img/session3/install.png" width="55%" style="display: block; margin: auto;" /> +## Exercise 1b: Download and load the data we'll use + +1. First of all let's open a script. `File`>`New File`>`R Script` + +<img src="img/session3/r-script.png" width="55%" style="display: block; margin: auto;" /> + +2. Add the packages we will use today + + +``` r +library(modelsummary) +library(huxtable) +library(readxl) +library(dplyr) +``` + +3. Load the data we will use. 
(remember our two methods point click and code) + + +``` r +department_staff_final <- read.csv("data/department_staff_final.csv") +``` + + --- -# Introduction // გაცნობა +# Introduction ## Exercise 1b: Download and load the data we'll use .pull-left[ -1. Go to https://osf.io/z8snr and download the file +1. Go to https://osf.io/zqa6j and download the file -1. In RStudio, go to `File` > `Import Dataset` > `From Text (base)` and select the file `small_business_2019_all.csv` +1. In RStudio, go to `File` > `Import Dataset` > `From Text (base)` and select the file `department_staff_final.csv` + If you don't know where the file is, remember to check in your `Downloads` folder @@ -129,7 +169,7 @@ --- -# Introduction // გაცნობა +# Introduction You should have one dataframe loaded in the environment after this. @@ -137,17 +177,27 @@ --- -# Introduction // გაცნობა +# Introduction ## Recap: always know your data! -- This data is similar to the one we used before -- Every row is one business in one tax period (month) -- `modified_id` is a business identifier -- We also have information about the region, firm age, monthly income, VAT liability -- There is one more variable we didn't see before: `group` contains the group the firm was assigned to in a random experiment +- This is the data that we used yesterday! -<img src="img/session3/df.png" width="40%" style="display: block; margin: auto;" /> + +``` r +glimpse(department_staff_final) +``` + +``` +## Rows: 8,645 +## Columns: 6 +## $ sex <chr> "Female", "Male", "Female", "Female", "Female", "F… +## $ current_grade <chr> "Director Fin. 
& Admin.", "Deputy Director (Admin.… +## $ senior_junior_staff <chr> "senior", "senior", "senior", "senior", "senior", … +## $ department <chr> "State Protocol Dept", "State Protocol Dept", "Sta… +## $ years_of_service <dbl> 30.220397, 16.049281, 33.968515, 11.553730, 21.180… +## $ age <dbl> 60.15058, 41.74949, 59.43053, 50.55989, 51.28268, … +``` --- @@ -160,101 +210,121 @@ --- -# Piping +# The pipe ( %>% ) operator -- Before we start producing more refined outputs, we need to cover piping -- You probably remember this piece of code from one of yesterday's exercise: +- Before diving into the contents of today's session, we will to introduce a **super useful tool**: + **the pipe (`%>%`)**. + +- The pipe is part of the `dplyr` package. It helps to write code in a way that is easier to read and understand. + +- Reading and understanding multiple operations can be difficult. + +- The pipe operator (**%>%**) can help with this. + +--- +# The pipe operator -```r -# Filter only businesses in Tbilisi: -temp1 <- filter(small_business_2019, region == "Tbilisi") -# Sort previous result by income, descending order: -temp2 <- arrange(temp1, -income) +- With the pipe, code reads from left to right, top to bottom, which is more intuitive. -# Keep only the 50 first businesses after sorting: -df_tbilisi_50 <- filter(temp2, row_number() <= 50) +- %>% can be read as "then" and simplifies code structure. 
+ +For example let's see at this mock get to work sequence example: + +**Without pipe ( %>% )** (hard to read and understand the sequence) + + +``` r +go_to_work(make_breakfast(work_out(brush_teeth(wake_up(Mer))))) +``` + +**With pipe ( %>% )** (the order is clear) + + +``` r +Mer %>% + wake_up() %>% + brush_teeth() %>% + work_out() %>% + make_breakfast() %>% + go_to_work() ``` --- -# Piping +# The pipe -This code works, but the problem with it is that it makes us generate unnecessary intermediate dataframes (`temp1`, `temp2`) that store results temporarily +As we saw yesterday the functions are normally organized around a set of verbs, or actions to be taken. -<img src="img/session3/temp_dfs.png" width="75%" style="display: block; margin: auto;" /> +-- ---- +* Most *verbs* work as follows: -# Piping -Instead, we can use pipes to **pass the results of a function and apply a new function on top of it** +`$$\text{verb}(\underbrace{\text{data.frame}}_{\text{1st argument}}, \underbrace{\text{what to do}}_\text{2nd argument})$$` -.pull-left[ +-- -```r -# Filter only businesses in Tbilisi: -temp1 <- filter(small_business_2019, - region == "Tbilisi") +* Alternatively you can (**should**) use the `pipe` operator `%>%`: -# Sort previous result by income, descending order: -temp2 <- arrange(temp1, - -income) +`$$\underbrace{\text{data.frame}}_{\text{1st argument}} \underbrace{\text{ %>% }}_{\text{"pipe" operator}} \text{verb}(\underbrace{\text{what to do}}_\text{2nd argument})$$` -# Keep only the 50 first businesses after sorting: -df_tbilisi_50 <- filter(temp2, - row_number() <= 50) -``` -] +We will start using the pipe from this point. Please ask if something is not clear. -.pull-right[ +**Tip**💡: Use Shift + Ctrl/Cmd + M as a shortcut for the pipe operator. 
-```r -# The same but with pipes: -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) +--- + +# The pipe + +- You probably remember this piece of code from one of yesterday's exercise: + + +``` r +# Filter only female employees: +temp1 <- filter(department_staff_list, sex == "Female") # filter by female + +# Sort previous result by years of service +department_female <- arrange(temp1, years_of_service) # order by years of service ``` -] --- -# Piping +# The pipe -.pull-left[ +This code works, but the problem with it is that it makes us generate unnecessary intermediate dataframes (`temp1`) that store results temporarily + +<img src="img/session3/temp_dfs.png" width="75%" style="display: block; margin: auto;" /> -```r -# Filter only businesses in Tbilisi: -temp1 <- filter(small_business_2019, - region == "Tbilisi") +--- + +# The pipe + +Instead, we can use pipes to **pass the results of a function and apply a new function on top of it** (just like Mer's waking up sequence) -# Sort previous result by income, descending order: -temp2 <- arrange(temp1, - -income) +.pull-left[ + +``` r +# Filter only female employees: +temp1 <- filter(department_staff_list, sex == "Female") # filter by female -# Keep only the 50 first businesses after sorting: -df_tbilisi_50 <- filter(temp2, - row_number() <= 50) +# Sort previous result by years of service +department_female <- arrange(temp1, years_of_service) # order by years of service ``` ] .pull-right[ -```r +``` r # The same but with pipes: -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) +department_female <- filter(department_staff_list, sex == "Female") %>% + arrange(years_of_service) ``` ] -There are several important details to notice here: - -1.- The resulting dataframe `df_tbilisi_50` is **the same in both cases** +- The usefulness of the pipe (%>%) becomes more evident when the 
code starts to get more complicated. --- @@ -262,33 +332,27 @@ .pull-left[ -```r -# Filter only businesses in Tbilisi: -temp1 <- filter(small_business_2019, - region == "Tbilisi") - -# Sort previous result by income, descending order: -temp2 <- arrange(temp1, - -income) +``` r +# Filter only female employees: +temp1 <- filter(department_staff_list, sex == "Female") # filter by female -# Keep only the 50 first businesses after sorting: -df_tbilisi_50 <- filter(temp2, - row_number() <= 50) +# Sort previous result by years of service +department_female <- arrange(temp1, years_of_service) # order by years of service ``` ] .pull-right[ -```r +``` r # The same but with pipes: -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) +department_female <- filter(department_staff_list, sex == "Female") %>% + arrange(years_of_service) ``` ] -2.- The name of the resulting dataframe is now defined in the first line of this data wrangling operation. 
This is because **R evaluates lines with consecutive pipes as if they were a single line** +There are several important details to notice here: + +1.- The resulting dataframe `department_female` is **the same in both cases** --- @@ -296,33 +360,25 @@ .pull-left[ -```r -# Filter only businesses in Tbilisi: -temp1 <- filter(small_business_2019, - region == "Tbilisi") - -# Sort previous result by income, descending order: -temp2 <- arrange(temp1, - -income) +``` r +# Filter only female employees: +temp1 <- filter(department_staff_list, sex == "Female") # filter by female -# Keep only the 50 first businesses after sorting: -df_tbilisi_50 <- filter(temp2, - row_number() <= 50) +# Sort previous result by years of service +department_female <- arrange(temp1, years_of_service) # order by years of service ``` ] .pull-right[ -```r +``` r # The same but with pipes: -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) +department_female <- filter(department_staff_list, sex == "Female") %>% + arrange(years_of_service) ``` ] -3.- Notice that the functions `arrange()` and `filter()` used after the pipes now have only **one argument instead of two**. This is because when using pipes the first argument is implied to be result of the function before the pipes +2.- Notice that the functions `arrange()` and `filter()` used after the pipes now have only **one argument instead of two**. This is because when using pipes the first argument is implied to be result of the function before the pipes --- @@ -332,130 +388,141 @@ 1. 
Apply the same filtering and sorting now with pipes +**Solution** + -```r -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) +``` r +department_female <- filter(department_staff_final, sex == "Female") %>% + arrange(years_of_service) ``` +* Note: that our dataframe now is named `depatment_staf_final` (is the joined dataframe from yesterday's session) + --- # Piping Now we will not have any annoying intermediate results stored in our environment! -.pull-left[ - -```r -# The same but with pipes: -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) -``` -] +- Good code is code that is both correct (does what it's supposed to) and it's easy to understand -.pull-right[ -<img src="img/session3/env-pipes.png" width="85%" style="display: block; margin: auto;" /> -] +- Piping is **instrumental for writing good code in R** --- # Piping -Lastly, we can also add more formatting to this code to improve its clarity even more: +## Always use pipes! + +Now that you now about the power of the pipes, use them wisely! 
.pull-left[ +- Remember that the pipe `%>%` is available once you load the library `dplyr`, so load it before using pipes -```r -# Previous solution -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) -``` +- Pipes also improve code clarity drastically + +- Many R coders use pipes and internet examples assume you know them + +- **We'll use pipes in the examples and exercises for the rest of this training** ] .pull-right[ - -```r -# The same with better spacing -df_tbilisi_50 <- - small_business_2019 %>% - filter(region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) -``` +<img src="img/session3/pipes-joke.png" width="45%" style="display: block; margin: auto;" /> ] --- -# Piping +class: inverse, center, middle +name: quick-summary-stats -.pull-left[ +# Quick summary statistics -```r -# Previous solution -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) -``` -] +<html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> -.pull-right[ +--- -```r -# The same with better spacing -df_tbilisi_50 <- - small_business_2019 %>% - filter(region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) -``` -] +# Quick Summary Statistics -- Good code is code that is both correct (does what it's supposed to) and it's easy to understand +## Refresher: Grouping and Summarizing Data -- Piping is **instrumental for writing good code in R** +Yesterday, we learned how to: +1. **Group data** using `group_by()` +2. **Summarize** results with `summarise()` ---- +This is a powerful tool to create **summary tables**, such as totals, averages, or counts, that are essential for your annual reports. -# Piping +### Example -## Always use pipes! -Now that you now about the power of the pipes, use them wisely! 
+``` r +# Summarize total revenue by month +summary_table <- psrl_data %>% + group_by(Month) %>% + summarise(Total_Revenue = sum(Revenue, na.rm = TRUE)) -.pull-left[ -- Remember that pipes are part of the library `dplyr`, you need to load it before using them +print(summary_table) +``` -- Pipes also improve code clarity drastically +--- -- Many R coders use pipes and internet examples assume you know them +# Quick summary statistics -- **We'll use pipes now in the next examples and exercises of the rest of this training** + + +Applying This to Your Work +.pull-left[ +**Annual Report Table** +![Programmed PSRL for 2023](img/session3/programmed_psrl_2023.png) +*Monthly totals for different products.* ] .pull-right[ -<img src="img/session3/pipes-joke.png" width="45%" style="display: block; margin: auto;" /> +**Code to Recreate the Table** + + +``` r +# Summarize all numeric columns by Month +summary_table <- psrl_data %>% + group_by(Month) %>% + summarise(across(where(is.numeric), \(x) sum(x, na.rm = TRUE))) +``` ] --- -class: inverse, center, middle -name: quick-summary-stats +# Quick summary statistics -# Quick summary statistics // სწრაფი შემაჯამებელი სტატისტიკა -<html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> +``` r +# Print the summary table +print(summary_table) +``` + +``` +## # A tibble: 12 × 8 +## Month Diesel LPG Petrol MGO_F Gas_Oil Unified_Naphtha Total +## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> +## 1 Apr-23 140 70 115 40 35 22 422 +## 2 Aug-23 170 85 135 48 45 30 513 +## 3 Dec-23 200 100 155 60 55 40 610 +## 4 Feb-23 130 65 105 35 30 18 383 +## 5 Jan-23 120 60 100 30 25 15 350 +## 6 Jul-23 160 80 130 50 42 28 490 +## 7 Jun-23 155 78 125 45 40 26 469 +## 8 Mar-23 125 63 110 33 28 20 379 +## 9 May-23 150 75 120 42 38 24 449 +## 10 Nov-23 190 95 150 58 52 38 583 +## 11 Oct-23 180 90 145 55 50 35 555 +## 12 Sep-23 165 83 140 47 48 32 515 +``` + --- # Quick summary statistics +## Beyond Basic Summaries: 
Customizing Your Results + We learned yesterday how to produce dataframes with results and export them. ## But what if you want to ... ? @@ -495,7 +562,7 @@ For example: -```r +``` r datasummary_skim( data, output = "default", @@ -514,13 +581,13 @@ 1. Load `modelsummary` with `library(modelsummary)` -1. Use `datasummary_skim()` to create a descriptive statistics table for `small_business_all` +1. Use `datasummary_skim()` to create a descriptive statistics table for `department_staff` -```r -datasummary_skim(small_business_2019_all) +``` r +datasummary_skim(department_staff) ``` --- @@ -529,143 +596,7 @@ You should be seeing this result in the lower right panel of RStudio. -<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> - <thead> - <tr> - <th style="text-align:left;"> </th> - <th style="text-align:right;"> Unique (#) </th> - <th style="text-align:right;"> Missing (%) </th> - <th style="text-align:right;"> Mean </th> - <th style="text-align:right;"> SD </th> - <th style="text-align:right;"> Min </th> - <th style="text-align:right;"> Median </th> - <th style="text-align:right;"> Max </th> - <th style="text-align:right;"> </th> - </tr> - </thead> -<tbody> - <tr> - <td style="text-align:left;"> modified_id </td> - <td style="text-align:right;"> 984 </td> - <td style="text-align:right;"> 0 </td> - <td style="text-align:right;"> 5448915.1 </td> - <td style="text-align:right;"> 3758602.4 </td> - <td style="text-align:right;"> 19832.0 </td> - <td style="text-align:right;"> 5008712.0 </td> - <td style="text-align:right;"> 12296912.0 </td> - <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> - .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { - fill: none; - stroke: #000000; - stroke-linecap: round; - 
stroke-linejoin: round; - stroke-miterlimit: 10.00; - } - .svglite text { - white-space: pre; - } - </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> -</g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.71" y="3.22" width="3.62" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="5.33" y="7.10" width="3.62" height="4.56" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="8.95" y="7.52" width="3.62" height="4.14" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="12.57" y="7.52" width="3.62" height="4.14" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="16.19" y="6.84" width="3.62" height="4.83" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="19.81" y="7.05" width="3.62" height="4.62" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="23.43" y="7.94" width="3.62" height="3.72" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="27.05" y="9.04" width="3.62" height="2.62" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.67" y="9.72" width="3.62" height="1.94" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.29" y="9.20" width="3.62" height="2.47" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="37.91" y="4.95" width="3.62" height="6.71" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="41.53" y="8.04" width="3.62" height="3.62" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="45.15" y="11.03" width="3.62" height="0.63" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> -</td> - </tr> - <tr> - <td style="text-align:left;"> taxperiod </td> - <td style="text-align:right;"> 12 
</td> - <td style="text-align:right;"> 0 </td> - <td style="text-align:right;"> 201906.7 </td> - <td style="text-align:right;"> 3.4 </td> - <td style="text-align:right;"> 201901.0 </td> - <td style="text-align:right;"> 201907.0 </td> - <td style="text-align:right;"> 201912.0 </td> - <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> - .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { - fill: none; - stroke: #000000; - stroke-linecap: round; - stroke-linejoin: round; - stroke-miterlimit: 10.00; - } - .svglite text { - white-space: pre; - } - </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> -</g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.78" y="3.22" width="4.04" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="5.82" y="7.44" width="4.04" height="4.22" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="9.86" y="8.19" width="4.04" height="3.47" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="13.90" y="6.64" width="4.04" height="5.02" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="17.94" y="7.17" width="4.04" height="4.49" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="21.98" y="8.03" width="4.04" height="3.63" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.02" y="6.85" width="4.04" height="4.81" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.06" y="7.12" width="4.04" height="4.54" 
style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.10" y="6.00" width="4.04" height="5.67" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="38.14" y="6.91" width="4.04" height="4.76" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.18" y="7.28" width="4.04" height="4.38" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> -</td> - </tr> - <tr> - <td style="text-align:left;"> age </td> - <td style="text-align:right;"> 30 </td> - <td style="text-align:right;"> 0 </td> - <td style="text-align:right;"> 14.0 </td> - <td style="text-align:right;"> 8.4 </td> - <td style="text-align:right;"> 1.0 </td> - <td style="text-align:right;"> 13.0 </td> - <td style="text-align:right;"> 30.0 </td> - <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> - .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { - fill: none; - stroke: #000000; - stroke-linecap: round; - stroke-linejoin: round; - stroke-miterlimit: 10.00; - } - .svglite text { - white-space: pre; - } - </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> -</g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="0.25" y="6.61" width="3.07" height="5.05" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="3.31" y="5.27" width="3.07" height="6.39" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="6.38" y="3.22" width="3.07" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="9.44" y="5.82" 
width="3.07" height="5.84" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="12.51" y="5.43" width="3.07" height="6.23" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="15.57" y="5.98" width="3.07" height="5.68" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="18.64" y="6.69" width="3.07" height="4.97" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="21.70" y="6.06" width="3.07" height="5.60" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="24.77" y="6.53" width="3.07" height="5.13" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="27.83" y="7.72" width="3.07" height="3.95" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.90" y="7.08" width="3.07" height="4.58" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="33.96" y="6.77" width="3.07" height="4.89" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="37.03" y="6.61" width="3.07" height="5.05" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="40.09" y="7.24" width="3.07" height="4.42" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="43.16" y="8.98" width="3.07" height="2.68" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> -</td> - </tr> - <tr> - <td style="text-align:left;"> income </td> - <td style="text-align:right;"> 721 </td> - <td style="text-align:right;"> 0 </td> - <td style="text-align:right;"> 3283.9 </td> - <td style="text-align:right;"> 8242.4 </td> - <td style="text-align:right;"> 0.0 </td> - <td style="text-align:right;"> 906.8 </td> - <td style="text-align:right;"> 139394.5 </td> - <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> - .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { - fill: none; - stroke: #000000; - stroke-linecap: round; - stroke-linejoin: round; - 
stroke-miterlimit: 10.00; - } - .svglite text { - white-space: pre; - } - </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> -</g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.78" y="3.22" width="3.19" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="4.97" y="11.32" width="3.19" height="0.34" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="8.15" y="11.52" width="3.19" height="0.15" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="11.34" y="11.60" width="3.19" height="0.063" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="14.53" y="11.64" width="3.19" height="0.027" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="17.72" y="11.65" width="3.19" height="0.0091" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.91" y="11.65" width="3.19" height="0.0091" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="24.10" y="11.65" width="3.19" height="0.0091" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="27.28" y="11.65" width="3.19" height="0.0091" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.47" y="11.66" width="3.19" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="33.66" y="11.66" width="3.19" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="36.85" y="11.66" width="3.19" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="40.04" y="11.66" width="3.19" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="43.23" y="11.65" width="3.19" height="0.0091" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> -</td> 
- </tr> - <tr> - <td style="text-align:left;"> vat_liability </td> - <td style="text-align:right;"> 721 </td> - <td style="text-align:right;"> 0 </td> - <td style="text-align:right;"> 591.1 </td> - <td style="text-align:right;"> 1483.6 </td> - <td style="text-align:right;"> 0.0 </td> - <td style="text-align:right;"> 163.2 </td> - <td style="text-align:right;"> 25091.0 </td> - <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> - .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { - fill: none; - stroke: #000000; - stroke-linecap: round; - stroke-linejoin: round; - stroke-miterlimit: 10.00; - } - .svglite text { - white-space: pre; - } - </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> -</g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.78" y="3.22" width="3.54" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="5.32" y="11.38" width="3.54" height="0.29" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="8.86" y="11.56" width="3.54" height="0.099" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="12.41" y="11.60" width="3.54" height="0.063" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="15.95" y="11.64" width="3.54" height="0.018" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="19.49" y="11.65" width="3.54" height="0.0090" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="23.03" y="11.64" width="3.54" height="0.018" 
style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.58" y="11.65" width="3.54" height="0.0090" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.12" y="11.66" width="3.54" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="33.66" y="11.66" width="3.54" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="37.20" y="11.66" width="3.54" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="40.75" y="11.66" width="3.54" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="44.29" y="11.65" width="3.54" height="0.0090" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> -</td> - </tr> -</tbody> -</table> +<img src="img/session3/datasummary_skim.png" width="55%" style="display: block; margin: auto;" /> --- @@ -676,98 +607,15 @@ - To summarize categorical variables, use the argument `type = "categorical"` -```r -datasummary_skim(small_business_2019_all, type = "categorical") +``` r +datasummary_skim(department_staff, type = "categorical") ``` --- # Quick summary statistics -<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> - <thead> - <tr> - <th style="text-align:left;"> </th> - <th style="text-align:left;"> </th> - <th style="text-align:right;"> N </th> - <th style="text-align:right;"> % </th> - </tr> - </thead> -<tbody> - <tr> - <td style="text-align:left;"> region </td> - <td style="text-align:left;"> Guria </td> - <td style="text-align:right;"> 259 </td> - <td style="text-align:right;"> 25.9 </td> - </tr> - <tr> - <td style="text-align:left;"> </td> - <td style="text-align:left;"> ImereTI-Racha-Lechkhum-kv.SvaneTi </td> - <td style="text-align:right;"> 37 </td> - <td style="text-align:right;"> 3.7 </td> - </tr> - <tr> - <td style="text-align:left;"> </td> - <td style="text-align:left;"> KaxeTi </td> - <td style="text-align:right;"> 270 </td> - <td style="text-align:right;"> 27.0 </td> - </tr> - <tr> - <td 
style="text-align:left;"> </td> - <td style="text-align:left;"> Kvemo KarTli </td> - <td style="text-align:right;"> 9 </td> - <td style="text-align:right;"> 0.9 </td> - </tr> - <tr> - <td style="text-align:left;"> </td> - <td style="text-align:left;"> Samegrelo-Z.SvaneTi </td> - <td style="text-align:right;"> 28 </td> - <td style="text-align:right;"> 2.8 </td> - </tr> - <tr> - <td style="text-align:left;"> </td> - <td style="text-align:left;"> Samtskhe-Javakheti </td> - <td style="text-align:right;"> 7 </td> - <td style="text-align:right;"> 0.7 </td> - </tr> - <tr> - <td style="text-align:left;"> </td> - <td style="text-align:left;"> Shida KarTli </td> - <td style="text-align:right;"> 17 </td> - <td style="text-align:right;"> 1.7 </td> - </tr> - <tr> - <td style="text-align:left;"> </td> - <td style="text-align:left;"> Tbilisi </td> - <td style="text-align:right;"> 373 </td> - <td style="text-align:right;"> 37.3 </td> - </tr> - <tr> - <td style="text-align:left;"> group </td> - <td style="text-align:left;"> No notifications to be sent </td> - <td style="text-align:right;"> 286 </td> - <td style="text-align:right;"> 28.6 </td> - </tr> - <tr> - <td style="text-align:left;"> </td> - <td style="text-align:left;"> Notification sent on Day 13 and Day 15 </td> - <td style="text-align:right;"> 226 </td> - <td style="text-align:right;"> 22.6 </td> - </tr> - <tr> - <td style="text-align:left;"> </td> - <td style="text-align:left;"> Notification sent only on Day 13 </td> - <td style="text-align:right;"> 247 </td> - <td style="text-align:right;"> 24.7 </td> - </tr> - <tr> - <td style="text-align:left;"> </td> - <td style="text-align:left;"> Notification sent only on Day 15 </td> - <td style="text-align:right;"> 241 </td> - <td style="text-align:right;"> 24.1 </td> - </tr> -</tbody> -</table> +<img src="img/session3/datasummary_cat.png" width="55%" style="display: block; margin: auto;" /> --- @@ -775,144 +623,6 @@ - `datasummary_skim()` is convenient because it's fast, easy, 
and shows a lot of information -<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> - <thead> - <tr> - <th style="text-align:left;"> </th> - <th style="text-align:right;"> Unique (#) </th> - <th style="text-align:right;"> Missing (%) </th> - <th style="text-align:right;"> Mean </th> - <th style="text-align:right;"> SD </th> - <th style="text-align:right;"> Min </th> - <th style="text-align:right;"> Median </th> - <th style="text-align:right;"> Max </th> - <th style="text-align:right;"> </th> - </tr> - </thead> -<tbody> - <tr> - <td style="text-align:left;"> modified_id </td> - <td style="text-align:right;"> 984 </td> - <td style="text-align:right;"> 0 </td> - <td style="text-align:right;"> 5448915.1 </td> - <td style="text-align:right;"> 3758602.4 </td> - <td style="text-align:right;"> 19832.0 </td> - <td style="text-align:right;"> 5008712.0 </td> - <td style="text-align:right;"> 12296912.0 </td> - <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> - .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { - fill: none; - stroke: #000000; - stroke-linecap: round; - stroke-linejoin: round; - stroke-miterlimit: 10.00; - } - .svglite text { - white-space: pre; - } - </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> -</g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.71" y="3.22" width="3.62" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect 
x="5.33" y="7.10" width="3.62" height="4.56" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="8.95" y="7.52" width="3.62" height="4.14" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="12.57" y="7.52" width="3.62" height="4.14" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="16.19" y="6.84" width="3.62" height="4.83" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="19.81" y="7.05" width="3.62" height="4.62" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="23.43" y="7.94" width="3.62" height="3.72" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="27.05" y="9.04" width="3.62" height="2.62" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.67" y="9.72" width="3.62" height="1.94" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.29" y="9.20" width="3.62" height="2.47" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="37.91" y="4.95" width="3.62" height="6.71" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="41.53" y="8.04" width="3.62" height="3.62" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="45.15" y="11.03" width="3.62" height="0.63" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> -</td> - </tr> - <tr> - <td style="text-align:left;"> taxperiod </td> - <td style="text-align:right;"> 12 </td> - <td style="text-align:right;"> 0 </td> - <td style="text-align:right;"> 201906.7 </td> - <td style="text-align:right;"> 3.4 </td> - <td style="text-align:right;"> 201901.0 </td> - <td style="text-align:right;"> 201907.0 </td> - <td style="text-align:right;"> 201912.0 </td> - <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> - .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { - fill: none; - stroke: #000000; - stroke-linecap: round; - 
stroke-linejoin: round; - stroke-miterlimit: 10.00; - } - .svglite text { - white-space: pre; - } - </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> -</g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.78" y="3.22" width="4.04" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="5.82" y="7.44" width="4.04" height="4.22" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="9.86" y="8.19" width="4.04" height="3.47" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="13.90" y="6.64" width="4.04" height="5.02" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="17.94" y="7.17" width="4.04" height="4.49" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="21.98" y="8.03" width="4.04" height="3.63" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.02" y="6.85" width="4.04" height="4.81" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.06" y="7.12" width="4.04" height="4.54" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="34.10" y="6.00" width="4.04" height="5.67" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="38.14" y="6.91" width="4.04" height="4.76" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="42.18" y="7.28" width="4.04" height="4.38" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> -</td> - </tr> - <tr> - <td style="text-align:left;"> age </td> - <td style="text-align:right;"> 30 </td> - <td style="text-align:right;"> 0 </td> - <td style="text-align:right;"> 14.0 </td> - <td style="text-align:right;"> 8.4 </td> - <td style="text-align:right;"> 1.0 </td> - <td style="text-align:right;"> 
13.0 </td> - <td style="text-align:right;"> 30.0 </td> - <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> - .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { - fill: none; - stroke: #000000; - stroke-linecap: round; - stroke-linejoin: round; - stroke-miterlimit: 10.00; - } - .svglite text { - white-space: pre; - } - </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> -</g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="0.25" y="6.61" width="3.07" height="5.05" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="3.31" y="5.27" width="3.07" height="6.39" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="6.38" y="3.22" width="3.07" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="9.44" y="5.82" width="3.07" height="5.84" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="12.51" y="5.43" width="3.07" height="6.23" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="15.57" y="5.98" width="3.07" height="5.68" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="18.64" y="6.69" width="3.07" height="4.97" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="21.70" y="6.06" width="3.07" height="5.60" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="24.77" y="6.53" width="3.07" height="5.13" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="27.83" y="7.72" width="3.07" height="3.95" style="stroke-width: 0.38; fill: 
#000000;"></rect><rect x="30.90" y="7.08" width="3.07" height="4.58" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="33.96" y="6.77" width="3.07" height="4.89" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="37.03" y="6.61" width="3.07" height="5.05" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="40.09" y="7.24" width="3.07" height="4.42" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="43.16" y="8.98" width="3.07" height="2.68" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> -</td> - </tr> - <tr> - <td style="text-align:left;"> income </td> - <td style="text-align:right;"> 721 </td> - <td style="text-align:right;"> 0 </td> - <td style="text-align:right;"> 3283.9 </td> - <td style="text-align:right;"> 8242.4 </td> - <td style="text-align:right;"> 0.0 </td> - <td style="text-align:right;"> 906.8 </td> - <td style="text-align:right;"> 139394.5 </td> - <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> - .svglite line, .svglite polyline, .svglite polygon, .svglite path, .svglite rect, .svglite circle { - fill: none; - stroke: #000000; - stroke-linecap: round; - stroke-linejoin: round; - stroke-miterlimit: 10.00; - } - .svglite text { - white-space: pre; - } - </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> -</g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.78" y="3.22" width="3.19" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="4.97" y="11.32" width="3.19" 
height="0.34" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="8.15" y="11.52" width="3.19" height="0.15" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="11.34" y="11.60" width="3.19" height="0.063" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="14.53" y="11.64" width="3.19" height="0.027" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="17.72" y="11.65" width="3.19" height="0.0091" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="20.91" y="11.65" width="3.19" height="0.0091" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="24.10" y="11.65" width="3.19" height="0.0091" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="27.28" y="11.65" width="3.19" height="0.0091" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.47" y="11.66" width="3.19" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="33.66" y="11.66" width="3.19" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="36.85" y="11.66" width="3.19" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="40.04" y="11.66" width="3.19" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="43.23" y="11.65" width="3.19" height="0.0091" style="stroke-width: 0.38; fill: #000000;"></rect></g></svg> -</td> - </tr> - <tr> - <td style="text-align:left;"> vat_liability </td> - <td style="text-align:right;"> 721 </td> - <td style="text-align:right;"> 0 </td> - <td style="text-align:right;"> 591.1 </td> - <td style="text-align:right;"> 1483.6 </td> - <td style="text-align:right;"> 0.0 </td> - <td style="text-align:right;"> 163.2 </td> - <td style="text-align:right;"> 25091.0 </td> - <td style="text-align:right;"> <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" class="svglite" width="48.00pt" height="12.00pt" viewBox="0 0 48.00 12.00"><defs><style type="text/css"> - .svglite line, .svglite polyline, .svglite polygon, .svglite path, 
.svglite rect, .svglite circle { - fill: none; - stroke: #000000; - stroke-linecap: round; - stroke-linejoin: round; - stroke-miterlimit: 10.00; - } - .svglite text { - white-space: pre; - } - </style></defs><rect width="100%" height="100%" style="stroke: none; fill: none;"></rect><defs><clipPath id="cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw"><rect x="0.00" y="0.00" width="48.00" height="12.00"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwwLjAwfDEyLjAw)"> -</g><defs><clipPath id="cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw"><rect x="0.00" y="2.88" width="48.00" height="9.12"></rect></clipPath></defs><g clip-path="url(#cpMC4wMHw0OC4wMHwyLjg4fDEyLjAw)"><rect x="1.78" y="3.22" width="3.54" height="8.44" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="5.32" y="11.38" width="3.54" height="0.29" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="8.86" y="11.56" width="3.54" height="0.099" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="12.41" y="11.60" width="3.54" height="0.063" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="15.95" y="11.64" width="3.54" height="0.018" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="19.49" y="11.65" width="3.54" height="0.0090" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="23.03" y="11.64" width="3.54" height="0.018" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="26.58" y="11.65" width="3.54" height="0.0090" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="30.12" y="11.66" width="3.54" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="33.66" y="11.66" width="3.54" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="37.20" y="11.66" width="3.54" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="40.75" y="11.66" width="3.54" height="0.00" style="stroke-width: 0.38; fill: #000000;"></rect><rect x="44.29" y="11.65" width="3.54" height="0.0090" style="stroke-width: 0.38; fill: 
#000000;"></rect></g></svg> -</td> - </tr> -</tbody> -</table> - - But what if we wanted to customize what to show? that's when we use `datasummary()` instead, also from the library `modelsummary` --- @@ -920,7 +630,7 @@ class: inverse, center, middle name: customized-summary-stats -# Customized summary statistics // მორგებული შემაჯამებელი სტატისტიკა +# Customized summary statistics <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> @@ -941,7 +651,7 @@ ] -```r +``` r datasummary( var1 + var2 + var3 ~ stat1 + stat2 + stat3 + stat4, data = data @@ -954,70 +664,28 @@ ## Exercise 4: -Create a summary statistics table showing the nuber of observations, mean, standard deviation, minimum, and maximum for variables `age`, `income`, and `vat_liability` of the dataframe `small_business_2019_all` +Create a summary statistics table showing the number of observations, mean, standard deviation, minimum, and maximum for variables `years_of_service` of the dataframe `department_staff` 1. 
Use `datasummary()` for this: -```r +``` r datasummary( - age + income + vat_liability ~ N + Mean + SD + Min + Max, - small_business_2019_all + years_of_service ~ N + Mean + SD + Min + Max, + department_staff ) ``` ---- - -# Customized summary statistics - -<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> - <thead> - <tr> - <th style="text-align:left;"> </th> - <th style="text-align:right;"> N </th> - <th style="text-align:right;"> Mean </th> - <th style="text-align:right;"> SD </th> - <th style="text-align:right;"> Min </th> - <th style="text-align:right;"> Max </th> - </tr> - </thead> -<tbody> - <tr> - <td style="text-align:left;"> age </td> - <td style="text-align:right;"> 1000 </td> - <td style="text-align:right;"> 14.00 </td> - <td style="text-align:right;"> 8.37 </td> - <td style="text-align:right;"> 1.00 </td> - <td style="text-align:right;"> 30.00 </td> - </tr> - <tr> - <td style="text-align:left;"> income </td> - <td style="text-align:right;"> 1000 </td> - <td style="text-align:right;"> 3283.87 </td> - <td style="text-align:right;"> 8242.45 </td> - <td style="text-align:right;"> 0.00 </td> - <td style="text-align:right;"> 139394.52 </td> - </tr> - <tr> - <td style="text-align:left;"> vat_liability </td> - <td style="text-align:right;"> 1000 </td> - <td style="text-align:right;"> 591.10 </td> - <td style="text-align:right;"> 1483.64 </td> - <td style="text-align:right;"> 0.00 </td> - <td style="text-align:right;"> 25091.01 </td> - </tr> -</tbody> -</table> - +<img src="img/session3/custom.png" width="55%" style="display: block; margin: auto;" /> --- # Customized summary statistics -```r +``` r datasummary( - age + income + vat_liability ~ N + Mean + SD + Min + Max, # this is the formula - small_business_2019_all # this is the data + years_of_service ~ N + Mean + SD + Min + Max, # this is the formula + department_staff # this is the data ) ``` @@ -1028,15 +696,18 @@ - The formula should always be defined as: rows ~ 
columns - The rows and columns in the formula are separated by a plus (`+`) sign + +In Excel ❎ you would need to calculate each of the statistics in a new table, by selecting the data and using the appropriate formula. + --- # Customized summary statistics -```r +``` r datasummary( - age + income + vat_liability ~ N + Mean + SD + Min + Max, # this is the formula - small_business_2019_all # this is the data + years_of_service ~ N + Mean + SD + Min + Max, # this is the formula + department_staff # this is the data ) ``` @@ -1055,29 +726,31 @@ class: inverse, center, middle name: exporting-tables -# Exporting tables // მაგიდების ექსპორტი +# Exporting tables <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- -# Exporting tables // მაგიდების ექსპორტი +# Exporting tables Remember that both `datasummary_skim()` and `datasummary()` have an optional argument named *output*? We can use it to specify a file path for an output file. For example: -```r -datasummary_skim(small_business_2019_all, +``` r +datasummary_skim(department_staff, output = "quick_stats.docx") ``` -Will export the result to the `Documents` folder (in Windows) in a Word file named `quick_stats.docx` +Will export the result to the `Documents` folder (in Windows) in a Word file named `quick_stats.docx` + +*Note* for this code to work we would need to install an extra package `pandoc` --- -# Exporting tables // მაგიდების ექსპორტი +# Exporting tables The file type of the output is dictated by the file extension. For example: @@ -1093,7 +766,7 @@ --- -# Exporting tables // მაგიდების ექსპორტი +# Exporting tables ## That's because the functions of `modelsummary` can't export to Excel @@ -1105,18 +778,18 @@ --- -# Exporting tables // მაგიდების ექსპორტი +# Exporting tables ## Exercise 5: Export a table to Excel -1. Load `huxtable` with `library(huxtable)` +1. Load `huxtable` with `library(huxtable)` (we already did this at the beginning of the session) 1. 
Run the following code to export the result of `datasummary_skim()` to Excel: -```r +``` r # Store the table in a new object -stats_table <- datasummary_skim(small_business_2019_all, output = "huxtable") +stats_table <- datasummary_skim(department_staff, output = "huxtable") # Export this new object to Excel with quick_xlsx() quick_xlsx(stats_table, file = "quick_stats.xlsx") @@ -1124,7 +797,7 @@ --- -# Exporting tables // მაგიდების ექსპორტი +# Exporting tables Now the result will show in your `Documents` folder @@ -1132,20 +805,22 @@ --- -# Exporting tables // მაგიდების ექსპორტი +# Exporting tables -And you can open it with Excel for further customization if you want +And you can open it with Excel for further customization if you want... <img src="img/session3/quick-stats-excel.png" width="65%" style="display: block; margin: auto;" /> +- However... remember that any manual changes will be hard to track affecting the reproducibility of your work. + --- -# Exporting tables // მაგიდების ექსპორტი +# Exporting tables -```r +``` r # Store the table in a new object -stats_table <- datasummary_skim(small_business_2019_all, output = "huxtable") +stats_table <- datasummary_skim(department_staff, output = "huxtable") # Export this new object to Excel with quick_xlsx() quick_xlsx(stats_table, file = "quick_stats.xlsx") @@ -1162,7 +837,7 @@ class: inverse, center, middle name: customizing-table-outputs -# Customizing table outputs // ცხრილის შედეგების მორგება +# Customizing table outputs <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> @@ -1174,7 +849,7 @@ .pull-left[ -```r +``` r # We start with stats_table: stats_table %>% # Use first row as table header @@ -1186,25 +861,19 @@ # Center cells in first row set_align(1, everywhere, "center") %>% # Set a theme for quick formatting - theme_basic() + theme_blue() ``` ] .pull-right[ .small[ - +
-| | Unique (#) | Missing (%) | Mean | SD | Min | Median | Max |
-| modified_id | 984 | 0 | 5448915 | 3758602 | 19832 | 5008712 | 12296912 |
-| taxperiod | 12 | 0 | 201907 | 3 | 201901 | 201907 | 201912 |
+| | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
-| age | 30 | 0 | 14 | 8 | 1 | 13 | 30 |
+| years_of_service | 2916 | 0 | 15 | 10 | 0 | 14 | 51 |
-| income | 721 | 0 | 3284 | 8242 | 0 | 907 | 139395 |
-| vat_liability | 721 | 0 | 591 | 1484 | 0 | 163 | 25091 |
+| age | 5998 | 0 | 44 | 9 | 18 | 44 | 67 |
] @@ -1220,7 +889,7 @@ 1.- Customize `stats_table` in a new object called `stats_table_custom` -```r +``` r stats_table_custom <- stats_table %>% # Use first row as table header set_header_rows(1, TRUE) %>% @@ -1231,7 +900,7 @@ # Center cells in first row set_align(1, everywhere, "center") %>% # Set a theme for quick formatting - theme_basic() + theme_blue() ``` ] @@ -1239,7 +908,7 @@ 2.- Export `stats_table_custom` to a file named `stats-custom.xlsx` with `quick_xlsx()` -```r +``` r quick_xlsx( stats_table_custom, file = "stats-custom.xlsx" @@ -1260,19 +929,19 @@ Notice that here in the first part of the exercise we stored the result in a new object -```r +``` r stats_table_custom <- stats_table %>% # <---- here set_header_rows(1, TRUE) %>% set_header_cols(1, TRUE) %>% set_number_format(everywhere, 2:ncol(.), "%9.0f") %>% set_align(1, everywhere, "center") %>% - theme_basic() + theme_blue() ``` This is the object that we export later with `quick_xlsx()` -```r +``` r quick_xlsx( stats_table_custom, file = "stats-custom.xlsx" @@ -1298,22 +967,44 @@ # Customizing table outputs -We used `theme_basic()` to give a minimalistic, basic theme to the table. Other available themes are: +We used `theme_blue()`. Other available themes are: <img src="img/session3/themes.png" width="75%" style="display: block; margin: auto;" /> +--- + +# Use it in your work + +### Key Takeaways: +- This was a **basic example** with a few variables from your staff list, but the **possibilities are endless**. +- With this and the contents from yesterday's session, you can create **summaries** of **anything you can think of**. + +### Real-World Example: + **Annual Report: Programmed vs. Billings for 2023 (In GH¢)** + +<img src="img/session3/annual_report.png" width="40%" style="display: block; margin: auto;" /> + +- If you have the data, you can easily create summaries like this directly in R. +- Once written, the code can be re-used for the next year or quarter. + +--- + +# Questions? 
+ +![](https://media.giphy.com/media/XHVmD4RyXgSjd8aUMb/giphy.gif?cid=790b7611auicodssx8z4zly6u6z5k4pd0r1i13drege3yunc&ep=v1_gifs_search&rid=giphy.gif&ct=g) + + --- class: inverse, center, middle name: wrapping-up -# Wrapping up // შეფუთვა +# Wrapping up <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> - --- -# Wrapping up // შეფუთვა +# Wrapping up ## Save your work! @@ -1323,7 +1014,7 @@ --- -# Wrapping up // შეფუთვა +# Wrapping up ## What else is available? @@ -1343,7 +1034,7 @@ --- -# Wrapping up // შეფუთვა +# Wrapping up ## What else is available? @@ -1354,7 +1045,7 @@ --- -# Wrapping up // შეფუთვა +# Wrapping up ## This session @@ -1362,7 +1053,7 @@ --- -# Wrapping up // შეფუთვა +# Wrapping up ## Next session (last one) @@ -1372,9 +1063,10 @@ class: inverse, center, middle -# Thanks! // მადლობა! // ¡Gracias! // Obrigado! +# Thanks! // ¡Gracias! // Obrigado! <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> +<img src="img/session3/you-can-r.png" width="80%" style="display: block; margin: auto;" /> diff --git a/Presentations-Ghana/2024-10/3-descriptive-statistics.pdf b/Presentations-Ghana/2024-10/3-descriptive-statistics.pdf index 0cf0dc0..a67af9d 100644 Binary files a/Presentations-Ghana/2024-10/3-descriptive-statistics.pdf and b/Presentations-Ghana/2024-10/3-descriptive-statistics.pdf differ diff --git a/Presentations-Ghana/2024-10/4-data-visualization.Rmd b/Presentations-Ghana/2024-10/4-data-visualization.Rmd index 12f513c..e69fa09 100644 --- a/Presentations-Ghana/2024-10/4-data-visualization.Rmd +++ b/Presentations-Ghana/2024-10/4-data-visualization.Rmd @@ -2,7 +2,7 @@ title: "Session 4 - Data visualization" subtitle: "R training" author: "María Reyes Retana" -date: "The World Bank | December 2024" +date: "The World Bank | January 2025" output: xaringan::moon_reader: css: ["libs/remark-css/default.css", @@ -70,9 +70,8 @@ knitr::include_graphics("img/template.png") - 
[Introduction](#intro) - [The grammar of graphics](#grammar-of-graphics) - [Bar plots](#bar-plots) +- [Pie charts](#pie-charts) - [Line plots](#line-plots) -- [Text encodings](#text-encodings) -- [Scatter plots](#scatter-plots) - [Wrapping up](#wrapping-up) --- @@ -106,12 +105,49 @@ knitr::include_graphics("img/session4/data-work-data-vis.png") - The input for that code is a wrangled dataframe, dataframe that is ready to be used -```{r echo = FALSE, out.width="80%"} +```{r echo = FALSE, out.width="70%"} knitr::include_graphics("img/session4/data-vis.png") ``` --- +# Introduction + +## From Your Annual Reports + +In your annual reports, the most common visualizations include: + +1. **Bar Graphs**: Used to compare categories (e.g., spending across departments or revenue by month). +2. **Pie Charts**: Used to show proportions (e.g., product share, budget breakdown). + +.pull-left[ +**Bar Graph Example** +![Bar Graph](img/session4/bar-chart.png) +] + +.pull-right[ +**Pie Chart Example** +![Pie Chart](img/session4/pie-chart.png) +] + +--- + +# Introduction + +## What You Will Learn + +Today, I will teach you how to: + +1. **Recreate these common visualizations** (and more) using code. +2. Build clean and reproducible charts that you can easily include in your annual reports. +3. Explore additional visualization techniques to make your data more impactful. + +We will keep using the data we have been using from the beginning to keep this simple, but the code could be recycled to fit the data of your annual reports. + +Let’s get started! + +--- + # Introduction ## Data visualization in R @@ -134,6 +170,14 @@ knitr::include_graphics("img/session4/ggplot2.png") --- +# ggplot2 + +```{r echo = FALSE, out.width="70%"} +knitr::include_graphics("img/session4/ggplot2-dib.png") +``` + +--- + class: inverse, center, middle name: grammar-of-graphics @@ -143,11 +187,47 @@ name: grammar-of-graphics --- +# The grammar of graphics + +## What is ggplot2? 
+ +- **ggplot2** is a powerful and flexible tool for creating data visualizations in R. +- It combines **philosophy + functions** into a well-organized framework. + +### Things to Keep in Mind: +1. ggplot2 may feel like **a lot to learn**, but let's do it step by step. +2. Today, we’ll cover the basics, but you’ll also get resources to keep exploring on your own. + +--- + +# The grammar of graphics + +## The Structure of ggplot2 + +Creating a plot with ggplot2 requires **three basic components**: + +1. **Data**: The dataset you want to visualize. +2. **Aesthetics (aes)**: How you map your data to visual elements (e.g., x-axis, y-axis, color). +3. **Geometry (geom)**: The type of plot you want (e.g., bar graph, scatter plot). + +(and many more but we will keep it simple in this presentation) + +--- + # The grammar of graphics ## The grammar of graphics in `ggplot2` -```{r echo = FALSE, out.width="100%"} +I will use the table of employees by department we created in session two to start with this. + +You can also download it from here: https://osf.io/th6qk if you don't have it. 
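Before loading that table, the three components described earlier (data, aesthetics, geometry) can be sketched on a tiny example. This is only an illustration: the `toy_departments` dataframe and its values are made up and are not the training data, and the sketch assumes the `ggplot2` package is installed.

```r
library(ggplot2)

# Hypothetical toy data, invented for this illustration only
toy_departments <- data.frame(
  department = c("Finance", "Operations", "Legal"),
  number     = c(12, 30, 5)
)

# 1. Data: the dataframe passed to ggplot()
# 2. Aesthetics: aes() maps columns to the x-axis and y-axis
# 3. Geometry: geom_col() draws one bar per row
toy_plot <- ggplot(toy_departments) +
  aes(x = department, y = number) +
  geom_col()

# Printing the object displays the plot
toy_plot
```

Swapping `geom_col()` for another geom changes the type of plot while the data and aesthetics stay the same.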
+ +```{r} +employees_by_department <- read.csv("data/employees_by_department.csv") +``` + + +```{r echo = FALSE, out.width="50%"} knitr::include_graphics("img/session4/grammar-of-graphics.png") ``` @@ -157,7 +237,7 @@ knitr::include_graphics("img/session4/grammar-of-graphics.png") ## The grammar of graphics in `ggplot2` -```{r echo = FALSE, out.width="100%"} +```{r echo = FALSE, out.width="70%"} knitr::include_graphics("img/session4/grammar-of-graphics2.png") ``` @@ -167,7 +247,7 @@ knitr::include_graphics("img/session4/grammar-of-graphics2.png") ## The grammar of graphics in `ggplot2` -```{r echo = FALSE, out.width="100%"} +```{r echo = FALSE, out.width="70%"} knitr::include_graphics("img/session4/grammar-of-graphics3x.png") ``` @@ -177,7 +257,7 @@ knitr::include_graphics("img/session4/grammar-of-graphics3x.png") ## The grammar of graphics in `ggplot2` -```{r echo = FALSE, out.width="100%"} +```{r echo = FALSE, out.width="70%"} knitr::include_graphics("img/session4/grammar-of-graphics3y.png") ``` @@ -187,7 +267,7 @@ knitr::include_graphics("img/session4/grammar-of-graphics3y.png") ## The grammar of graphics in `ggplot2` -```{r echo = FALSE, out.width="100%"} +```{r echo = FALSE, out.width="70%"} knitr::include_graphics("img/session4/grammar-of-graphics4gc.png") ``` @@ -197,7 +277,7 @@ knitr::include_graphics("img/session4/grammar-of-graphics4gc.png") ## The grammar of graphics in `ggplot2` -```{r echo = FALSE, out.width="100%"} +```{r echo = FALSE, out.width="70%"} knitr::include_graphics("img/session4/grammar-of-graphics4l.png") ``` @@ -227,11 +307,11 @@ library(ggplot2) 1. 
Produce a basic bar plot with the following code: ```{r eval = FALSE} -ggplot(small_business_2019_all) + - aes(x = taxperiod, - y = vat_liability) + +ggplot(employees_by_department) + + aes(x = department, + y = number) + geom_col() + - labs(title = "Total VAT liability of small businesses in 2019 by month") + labs(title = "Number of employees by department, 2024") ``` --- @@ -239,40 +319,45 @@ ggplot(small_business_2019_all) + # Bar plots ```{r echo=FALSE} -small_business_2019_all <- read.csv("data/small_business_2019_all.csv") +employees_by_department <- read.csv("data/employees_by_department.csv") ``` This result should be displayed in the lower right panel of your RStudio window -```{r echo=FALSE, out.width="70%"} -ggplot(small_business_2019_all) + - aes(x = taxperiod, - y = income) + +```{r echo=FALSE, out.width="60%"} +ggplot(employees_by_department) + + aes(x = department, + y = number) + geom_col() + - labs(title = "Total VAT liability of small businesses in 2019 by month") + labs(title = "Number of employees by department, 2024") ``` +In Excel ❎ you would select the data and insert bar graph. + --- # Bar plots -This plot looks acceptable but it can be improved: +This plot doesn't look great yet! .pull-left[ -- `taxperiod` is a variable representing months but R doesn't know it and it's showing the x-axis with decimals. We need to tell R that those values shouldn't be changed +- `department` is too crowded as the names are long, but R does not know this. 
We need to tell R that those labels should be rotated - We can center the title -- We should add axis labels (instead of just "vat_liability" and "taxperiod") +- We should add axis labels (instead of just variable names) + +- We can add color ] .pull-right[ ```{r echo=FALSE, out.width="100%"} -ggplot(small_business_2019_all) + - aes(x = taxperiod, - y = vat_liability) + +ggplot(employees_by_department) + + aes(x = department, + y = number) + geom_col() + - labs(title = "Total VAT liability of small businesses in 2019 by month") + labs(title = "Number of employees by department, 2024") + ``` ] @@ -285,19 +370,20 @@ ggplot(small_business_2019_all) + 1. Use the following code to improve the aesthetics of your plot ```{r eval=FALSE} -ggplot(small_business_2019_all) + - aes(x = taxperiod, - y = vat_liability) + - geom_col() + - labs(title = "Total VAT liability of small businesses in 2019 by month", - # x-axis title - x = "Month", - # y-axis title - y = "Georgian Lari") + - # telling R not to break the x-axis - scale_x_continuous(breaks = 201901:201912) + - # centering plot title - theme(plot.title = element_text(hjust = 0.5)) +ggplot(employees_by_department) + + aes(x = department, + y = number) + + geom_col(fill = "#9370DB") + #<< + labs( + title = "Number of employees by department, 2024", # title + x = "Department", #<< + y = "Number" #<< + ) + + # Title and subtitles + theme( + plot.title = element_text(hjust = 0.5), #<< + axis.text.x = element_text(angle = 45, hjust = 1) #<< + ) ``` --- @@ -307,32 +393,91 @@ ggplot(small_business_2019_all) + Now this looks better: ```{r echo=FALSE, out.width="70%"} -ggplot(small_business_2019_all) + - # transforming taxperiod to categorical value with as.factor() - aes(x = as.factor(taxperiod), - y = vat_liability) + - geom_col() + - labs(title = "Total VAT liability of small businesses in 2019 by month", - # x-axis title - x = "Month", - # y-axis title - y = "VAT liability") + - # centering plot title - theme(plot.title = 
element_text(hjust = 0.5)) +ggplot(employees_by_department) + + aes(x = department, + y = number) + + geom_col(fill = "#9370DB") + + labs( + title = "Number of employees by department, 2024", + # x-axis title + x = "Department", + # y-axis title + y = "Number" + ) + + # Centering plot title + theme( + plot.title = element_text(hjust = 0.5), + axis.text.x = element_text(angle = 45, hjust = 1) # Rotating x-axis labels + ) +``` + +--- + +# Bar plots + +## Exercise 1c: Improve your bar plot + +But we can actually make this even better. I would reorder by amount of employees, and add labels. + +```{r, eval=FALSE} +# Create the bar plot +ggplot(employees_by_department) + + aes(x = reorder(department, -number), y = number) + #<< # Reorder bars by `number` + geom_col(fill = "#9370DB") + + geom_text( #<< + aes(label = number), #<< + angle = 90 #<< + ) + #<< + labs( + title = "Number of employees by department, 2024", + x = "Department", + y = "Number" + ) + + theme( + plot.title = element_text(hjust = 0.5), + axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x-axis labels + ) +``` + +--- + +# Bar plots + +Now this is ready to be in one of your reports. + +```{r, echo=FALSE} +# Create the bar plot +ggplot(employees_by_department) + + aes(x = reorder(department, -number), y = number) + #<< # Reorder bars by `number` + geom_col(fill = "#9370DB") + + geom_text( #<< + aes(label = number), #<< + angle = 90 #<< + ) + #<< + labs( + title = "Number of employees by department, 2024", + x = "Department", + y = "Number" + ) + + theme( + plot.title = element_text(hjust = 0.5), + axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x-axis labels + ) ``` + --- # Bar plots -## Exercise 1c: save your plot +## Exercise 1d: save your plot Now that your plot looks good, you can save it into an output with `ggsave()` 1. 
Use this code to save your plot: ```{r eval = FALSE} -ggsave("vat_liability_small_2019.png", +ggsave("employees_by_department.png", width = 20, height = 10, units = "cm") @@ -358,576 +503,213 @@ knitr::include_graphics("img/session4/ggsave-ex1.png") --- -# Bar plots +# You Did Your First Plot! 🎉 -## A note about syntax +## The Possibilities Are Endless -- Data visualization usually requires several iterations to add new elements to your initial code and improve your plot +- Congratulations on creating your **first plot**! +- From here, you can explore countless options to visualize your data: + - Bar graphs, line plots, scatterplots, pie charts, and more. +- Experimentation is key! Don’t be afraid to try new things or make mistakes. -- `ggplot2` adds new elements to a visualization with the symbol `+` +## You Have the Power 💪 -- More customization means that your code can easily become quite long. Using spaces and line breaks helps for clarity, but **there is just no way around it** +- Data visualization is a skill that grows with practice. +- Use online resources and communities like: + - [ggplot2 Documentation](https://ggplot2.tidyverse.org/) + - [ggplot presentation](https://pkg.garrickadenbuie.com/gentle-ggplot2) -- In programming this is known as **heavy syntax** +--- -.pull-left[ -.small[ -```{r eval = FALSE} -# Exercise 1: -ggplot(small_business_2019_all) + - aes(x = taxperiod, - y = vat_liability) + - geom_col() + - labs(title = "Total VAT liability of small businesses in 2019 by month") -``` -] -] +# Now the sky is the limit! 
🎉 -.pull-right[ -.small[ -```{r eval=FALSE} -# Exercise 2 -ggplot(small_business_2019_all) + - aes(x = as.factor(taxperiod), - y = vat_liability) + - geom_col() + - labs(title = "Total VAT liability of small businesses in 2019 by month", - x = "Month", - y = "Georgian Lari") + - theme(plot.title = element_text(hjust = 0.5)) -``` -] +.pull-center[ +![](https://media.giphy.com/media/5C472t1RGNuq4/giphy.gif?cid=790b76118bi2mzh4evmg1ohc9xiot3g9fdv7mi1u96xeb4s5&ep=v1_gifs_search&rid=giphy.gif&ct=g) ] --- class: inverse, center, middle -name: line-plots +name: pie-charts -# Line plots +# Pie charts

--- -# Line plots - -- In data visualization, we call **encodings** to the geometry selected to represent the data visually +# Pie charts -- In the previous examples, we did that when we used `geom_col()`. This tells R that our encoding to represent the data is **bars** (also called columns) +## From Bar Plots to Pie Charts -- `ggplot2` has several different options for encodings in data visualization, for example: **lines** +- We just created a **bar plot** to visualize the number of employees by department. +- Let’s now use the **same data** to create a **pie chart**, a common way to show proportions. -- The encoding for lines in `ggplot2` is `geom_lines()` +### Geoms and Graph Types +Remember: The **geom** controls the type of graph we create. +- **`geom_col()`**: Creates bar plots. +- **`geom_bar()`** with **coord_polar()**: Converts data into a pie chart. --- -# Line plots +# Pie Charts -- Line plots are a nice option to encode numeric values and include different categories of a second variable at the same time +## Exercise 2a: Create a Pie Chart -```{r echo=FALSE, warning=FALSE, message=FALSE} -df_group_month <- small_business_2019_all %>% - select(group, taxperiod, vat_liability) %>% - group_by(group, taxperiod) %>% - summarize(total = sum(vat_liability)) -``` +1. Use the following code to create a pie chart: -```{r echo=FALSE, out.width="65%"} -ggplot(df_group_month) + - aes(x = taxperiod, - y = total) + - geom_line(aes(color = group)) + - labs(title = "Total VAT liability of small businesses in 2019 by experiment group", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(legend.text = element_text(size = 7), # don't forget the comma! 
- axis.text.x=element_text(size=6)) +```{r, eval=FALSE} +ggplot(employees_by_department) + + aes(x = "", y = number, fill = department) + + geom_bar(stat = "identity", width = 1) + + coord_polar(theta = "y") + + labs( + title = "Proportion of employees by department, 2024" + ) + + theme_void() ``` - --- -# Line plots - -- However, they also usually require some additional data wrangling compared to bar plots: data should be collapsed at the level specified in the x-axis and grouping variable +# Pie chart -- This is going to be clearer in the next exercise. The plot we'll produce is below +You should see this on your Plots panel. -```{r echo=FALSE, out.width="65%"} -ggplot(df_group_month) + - aes(x = taxperiod, - y = total) + - geom_line(aes(color = group)) + - labs(title = "Total VAT liability of small businesses in 2019 by experiment group", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(legend.text = element_text(size = 7), # don't forget the comma! - axis.text.x=element_text(size=6)) +```{r, echo=FALSE, out.width="60%"} +ggplot(employees_by_department) + + aes(x = "", y = number, fill = department) + + geom_bar(stat = "identity", width = 1) + + coord_polar(theta = "y") + + labs( + title = "Proportion of employees by department, 2024" + ) + + theme_void() ``` --- -# Line plots +# Pie chart -## Exercise 2a: Collapse your data at the month-group level +## Exercise 2b -Use the following code to create a dataframe collapsed at the month-group level and calculate the total VAT liability for each group in each month +The chart is alright, but we can customize it further. What about changing the colors? -```{r eval = FALSE} -df_group_month <- small_business_2019_all %>% - select(group, taxperiod, vat_liability) %>% - group_by(group, taxperiod) %>% - summarize(total = sum(vat_liability)) -``` - ---- -# Line plots - -.pull-left[ -The result will look like this, you can explore it with `View(df_group_month)`. 
-] - -.pull-right[ -```{r echo = FALSE, out.width="85%"} -knitr::include_graphics("img/session4/df-group-month.png") -``` -] - ---- - -# Line plots - -## Exercise 2b: Create a line plot of VAT liability by month and group - -```{r eval=FALSE} -ggplot(df_group_month) + - aes(x = taxperiod, - y = total) + - geom_line(aes(color = group)) + - labs(title = "Total VAT liability of small businesses in 2019 by experiment group", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) -``` - ---- - -# Line plots - -Your result possibly looks like this: - -```{r echo=FALSE, warning=FALSE, message=FALSE, out.width="75%"} -ggplot(df_group_month) + - aes(x = taxperiod, - y = total) + - geom_line(aes(color = group)) + - labs(title = "Total VAT liability of small businesses in 2019 by experiment group", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) +```{r, eval=FALSE} +ggplot(employees_by_department) + + aes(x = "", y = number, fill = department) + + geom_bar(stat = "identity", width = 1) + + coord_polar(theta = "y") + + labs( + title = "Proportion of employees by department, 2024" + ) + + theme_void() + + scale_fill_viridis_d(option = "D") #<< ``` --- -# Line plots - -Something looks off, doesn't it? 
We need to **make the legend labels and x-axis texts smaller** for them without overlapping each other and to **remove the centering of the title** so it's not cropped +# Pie chart -```{r echo=FALSE, out.width="75%"} -ggplot(df_group_month) + - aes(x = taxperiod, - y = total) + - geom_line(aes(color = group)) + - labs(title = "Total VAT liability of small businesses in 2019 by experiment group", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) -``` - ---- +## Exercise 2c -# Line plots -## Exercise 2c: Continue improving your plot +Now this is ready to go into our report, so let's save it. -1. Add the argument `legend.text=element_text(size=7)` inside `theme()` to decrease the legend text size -1. Add the argument `axis.text.x=element_text(size=6)` inside `theme()` to decrease the x-axis text size - + note that both arguments need to be separated by a comma -1. Remove `plot.title = element_text(hjust = 0.5)` from `theme()` to remove the centering of the plot title - -The result should be this: - -```{eval=FALSE} -ggplot(df_group_month) + - aes(x = taxperiod, - y = total) + - geom_line(aes(color = group)) + - labs(title = "Total VAT liability of small businesses in 2019 by experiment group", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(legend.text = element_text(size = 7), # don't forget the comma! 
- axis.text.x=element_text(size=6)) +```{r, echo=FALSE, out.width="60%"} +ggplot(employees_by_department) + + aes(x = "", y = number, fill = department) + + geom_bar(stat = "identity", width = 1) + + coord_polar(theta = "y") + + labs( + title = "Proportion of employees by department, 2024" + ) + + theme_void() + + scale_fill_viridis_d(option = "D") #<< ``` ---- - -# Line plots - -```{r echo=FALSE, out.width="75%"} -ggplot(df_group_month) + - aes(x = taxperiod, - y = total) + - geom_line(aes(color = group)) + - labs(title = "Total VAT liability of small businesses in 2019 by experiment group", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(legend.text = element_text(size = 7), # don't forget the comma! - axis.text.x=element_text(size=6)) -``` --- -# Line plots +# Pie chart -## Exercise 2d: save your plot +## Exercise 2c -Use this code to save your plot: +Now this is ready to go into our report, so let's save it. -```{r eval=FALSE} -ggsave("vat_liability_small_2019_by_group.png", - width = 20, - height = 10, - units = "cm") +```{r, eval=FALSE} +ggsave("employees_by_department_pie.png") ``` ---- - -# Line plots - -```{r echo = FALSE, out.width="50%"} +```{r echo = FALSE, out.width="40%"} knitr::include_graphics("img/session4/ggsave-ex2.png") ``` ---- - -# Line plots - -Choosing the right encoding for your data can be tricky. 
It depends on what you want to show in your plot and how much information you want it to show - - + Bar plots show less information than line plots in general, but they are good for cases when you have only one numeric variable and one categorical variable to show - + Line plots can show more information and includes multiple groups to add an additional dimension to your data, but they are not visually appealing when your data varies a lot from one category to another - -.pull-left[ -```{r echo=FALSE, out.width="100%"} -ggplot(small_business_2019_all) + - aes(x = as.factor(taxperiod), - y = vat_liability) + - geom_col() + - labs(title = "Total VAT liability of small businesses in 2019 by month", - x = "Month", - y = "Georgian Lari") + - theme(plot.title = element_text(hjust = 0.5)) -``` -] - -.pull-right[ -```{r echo=FALSE, out.width="100%"} -ggplot(df_group_month) + - aes(x = taxperiod, - y = total) + - geom_line(aes(color = group)) + - labs(title = "Total VAT liability of small businesses in 2019 by experiment group", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(legend.text = element_text(size = 7), # don't forget the comma! - axis.text.x=element_text(size=6)) -``` -] --- class: inverse, center, middle -name: text-encodings +name: line-plots -# Text encodings +# Line plots

--- -# Text encodings - -- Geometric shapes such as bars and plots are not the only way to encode data in plots - -- We can also use text directly to represent information in the plot as in the example below. This is called **text encodings** - -- Text encodings can be combined with geometric encodings to highlight information or provide important additional details to your visualization - -```{r echo=FALSE} - df_month <- small_business_2019_all %>% - select(taxperiod, vat_liability) %>% - group_by(taxperiod) %>% - summarize(total = sum(vat_liability)) -``` - -```{r echo=FALSE, out.width="57%"} - ggplot(df_month) + - aes(x = taxperiod, - y = total) + - geom_col() + - geom_text(aes(label = round(total)), - position = position_dodge(width = 1), - vjust = -0.5, - size = 3) + - labs(title = "Total VAT liability of small businesses in 2019 by month", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) -``` - ---- - -# Text encodings - -- Using text encodings in a bar plot is nice - -- However, it also requires additional data wrangling: the data needs to be collapsed at the same level of the x-axis to be able to add encodings - -- We'll see this in the next exercise - -```{r echo=FALSE, out.width="57%"} - ggplot(df_month) + - aes(x = taxperiod, - y = total) + - geom_col() + - geom_text(aes(label = round(total)), - position = position_dodge(width = 1), - vjust = -0.5, - size = 3) + - labs(title = "Total VAT liability of small businesses in 2019 by month", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) -``` - ---- - -# Text encodings - -## Exercise 3a: Collapse your data at the month level - -This is similar to what we did for exercise 2a, except that this time the collapsing is at month level instead of the month-group level. 
Use the code below to store the collapsed dataframe in `month_df` - -```{r eval=FALSE} -df_month <- small_business_2019_all %>% - select(taxperiod, vat_liability) %>% - group_by(taxperiod) %>% - summarize(total = sum(vat_liability)) -``` - ---- - -# Text encodings - -This is the result, you can use `View(df_month)` to display it - -```{r echo = FALSE, out.width="55%"} -knitr::include_graphics("img/session4/df-month.png") -``` - ---- - -# Text encodings - -## Exercise 3b: Add encodings to your bar plot - -Use the collapsed dataframe `df_month` and the former code from exercise 1 to add encodings to your bar plot. The result is the code below: - -```{r eval=FALSE} - ggplot(df_month) + - aes(x = taxperiod, - y = total) + - geom_col() + - # Note that the text encodings are added here - geom_text(aes(label = total), - position = position_dodge(width = 1), - vjust = -0.5, - size = 3) + - labs(title = "Total VAT liability of small businesses in 2019 by month", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) -``` - ---- - -# Text encodings - -This is result is nice but the text encodings have several decimal places. We can improve it by rounding `total` with the function `round()` - -```{r echo=FALSE, out.width="75%"} - ggplot(df_month) + - aes(x = taxperiod, - y = total) + - geom_col() + - # Note that the text encodings are added here - geom_text(aes(label = total), - position = position_dodge(width = 1), - vjust = -0.5, - size = 3) + - labs(title = "Total VAT liability of small businesses in 2019 by month", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) -``` - ---- - -# Text encodings - -## Exercise 3c: Improve your plot once again - -Replace `total` with `round(total)` in `geom_text()`. 
The result should be: - -```{r eval=FALSE} - ggplot(df_month) + - aes(x = taxperiod, - y = total) + - geom_col() + - geom_text(aes(label = round(total)), # <--- the change goes here - position = position_dodge(width = 1), - vjust = -0.5, - size = 3) + - labs(title = "Total VAT liability of small businesses in 2019 by month", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) -``` - ---- - -# Text encodings +# Line plots -```{r echo=FALSE, out.width="75%"} - ggplot(df_month) + - aes(x = taxperiod, - y = total) + - geom_col() + - geom_text(aes(label = round(total)), # <--- the change goes here - position = position_dodge(width = 1), - vjust = -0.5, - size = 3) + - labs(title = "Total VAT liability of small businesses in 2019 by month", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) -``` +Another Common Plot: Line Plots ---- +- **Line plots** are commonly used to show **trends over time** or changes across a sequence. +- Examples include visualizing **monthly sales**, **yearly growth**, or **temperature changes**. -# Text encodings -## Don't forget to save your plot! +**Description** +- Each point represents a value at a specific time. +- Lines connect the points to show a continuous trend. -Use this code to save this last plot: -```{r eval=FALSE} -ggsave("vat_liability_small_2019_text.png", - width = 20, - height = 10, - units = "cm") -``` --- -class: inverse, center, middle -name: scatter-plots - -# Scatter plots - -

- ---- - -# Scatter plots - -- We're going to explore one more type of encoding for data visualizations: **scatter plots** - -- Scatter plots are useful when you have two continuous numeric variables and want to show that there might to a correlation between them or visualize outliers (values that stand out from the rest because they are very extreme) +# Line plots +.pull-left[ +**Code Example** -- We use the encoding `geom_line()` for scatterplots +```{r, eval=FALSE} +library(ggplot2) -```{r echo=FALSE, out.width="55%"} -ggplot(small_business_2019_all) + - aes(x = age, - y = vat_liability) + - geom_point() + - labs(title = "VAT liability versus age for small businesses in 2019", - x = "Age of firm (years)", - y = "VAT liability") + - theme(plot.title = element_text(hjust = 0.5)) -``` - ---- - -# Scatter plots - -## Exercise 4: Create a scatter plot - -Use the following code to reproduce the scatter plot of the last slide: - -```{r eval=FALSE} -ggplot(small_business_2019_all) + - aes(x = age, - y = vat_liability) + - geom_point() + - labs(title = "VAT liability versus age for small businesses in 2019", - x = "Age of firm (years)", - y = "VAT liability") + - theme(plot.title = element_text(hjust = 0.5)) +# Using built-in `economics` dataset from ggplot2 +ggplot(economics) + + aes(x = date, y = unemploy) + + geom_line(color = "#9370DB", size = 1) + + labs( + title = "Unemployment Over Time", + x = "Date", + y = "Number of Unemployed" + ) + + theme_minimal() ``` +] ---- +.pull-right[ +**Output Example** -# Scatter plots +```{r, echo=FALSE, warning=FALSE, message=FALSE} +library(ggplot2) -```{r echo=FALSE, out.width="80%"} -ggplot(small_business_2019_all) + - aes(x = age, - y = vat_liability) + - geom_point() + - labs(title = "VAT liability versus age for small businesses in 2019", - x = "Age of firm (years)", - y = "VAT liability") + - theme(plot.title = element_text(hjust = 0.5)) +# Using built-in `economics` dataset from ggplot2 +ggplot(economics) + + aes(x = date, 
y = unemploy) + + geom_line(color = "#9370DB", size = 1) + + labs( + title = "Unemployment Over Time", + x = "Date", + y = "Number of Unemployed" + ) + + theme_minimal() ``` +] ---- - -# Scatter plots - -Lastly, remember to save your scatter plot with: - -```{r eval=FALSE} -ggsave("scatter_age_vat.png", - width = 20, - height = 10, - units = "cm")) -``` --- @@ -942,9 +724,9 @@ name: wrapping-up # Wrapping up -## Other encodings in `ggplot2` +## More in `ggplot2` -This table lists several of the most popular encoding types in `ggplot2` +This table lists several of the most popular encoding types in `ggplot2`. Also see more [here](https://ggplot2.tidyverse.org/reference/) | Encoding | Function in `ggplot2` | | -------- | --------------------- | @@ -972,40 +754,48 @@ knitr::include_graphics("img/session4/save.png") --- -# Wrapping up +# Wrap-Up: Looking Ahead 🚀 -## Data work pipeline +## Key Takeaways -```{r echo = FALSE, out.width="95%"} -knitr::include_graphics("img/session4/data-work-final.png") -``` +- Today, you learned how **data + code** can create powerful visualizations and tables for your annual reports. +- **Why this matters:** + - Code is **reusable**: Use it for next quarters or years without starting from scratch. + - Code is **transparent**: Everyone can see and verify all the calculations. --- -# Wrapping up +# Wrap-Up: Looking Ahead 🚀 -## Looking ahead +## From Data to Annual Reports -```{r echo = FALSE, out.width="95%"} -knitr::include_graphics("img/session4/data-work-expanded.png") -``` +.pull-center[ +**Data + Code** → **Annual Report** +![](img/data_code_report.png) +*Reproducible workflows save time and improve accuracy.* +] + +## What’s Next? + +- **Tomorrow**: Bring your own data, graphs, or tables. + - We’ll have a **long hands-on session** where you can: + - Ask questions about how to code specific visualizations or tables. + - Work with your own data to make real progress. 
--- -# Wrapping up +# Wrap-Up: Looking Ahead 🚀 -## Looking ahead +## Final Thoughts 💡 -- [Connecting R with SQL databases](https://solutions.posit.co/connections/db/getting-started/connect-to-database/). Some libraries: - + `dbConnect` - + `dbplyr` - + `DBI` +- This is **new and challenging**, but it’s also incredibly **powerful** and **useful**. +- The only way to get better is to **keep practicing**. +- Experiment, google/chatgpt-it, ask questions, and remember—this will make your reports clearer, faster, and less error-prone! -- [More on data wrangling](https://raw.githack.com/worldbank/dime-r-training/main/Presentations/03-data-wrangling.html#1). More libraries: - + `tidyr` - + `janitor` +Keep going—you’ve got this! + +![](https://media.giphy.com/media/kF9C1IrWKrltu/giphy.gif?cid=ecf05e47wy5wj5rjlofflyz9z3p0xgl9a7r5asamv89oy299&ep=v1_gifs_search&rid=giphy.gif&ct=g) -- [More on data visualization](https://raw.githack.com/worldbank/dime-r-training/main/Presentations/04-data-visualization.html#1) --- @@ -1014,3 +804,24 @@ class: inverse, center, middle # Thanks! // ¡Gracias! // Obrigado!

+```{r echo = FALSE, out.width="60%"} +knitr::include_graphics("img/session4/welcome-r.png") +``` + +--- + +# Resources for Data Visualization 📚 + +## Learn More About ggplot2 + +- **ggplot2 Documentation** + [ggplot2.tidyverse.org](https://ggplot2.tidyverse.org) + +- **R Graphics Cookbook** (Online resource for practical examples) + [r-graphics.org](https://r-graphics.org) + +- **R for Data Science**: Chapter on Data Visualization + [r4ds.had.co.nz](https://r4ds.had.co.nz/data-visualisation.html) + +- **ggplot2 Cheatsheet** (Quick reference for all functions) + [Download the cheatsheet here](https://github.com/rstudio/cheatsheets/raw/main/data-visualization-2.1.pdf) \ No newline at end of file diff --git a/Presentations-Ghana/2024-10/4-data-visualization.html b/Presentations-Ghana/2024-10/4-data-visualization.html index cc542b1..90b0b00 100644 --- a/Presentations-Ghana/2024-10/4-data-visualization.html +++ b/Presentations-Ghana/2024-10/4-data-visualization.html @@ -3,7 +3,7 @@ Session 4 - Data visualization - + @@ -29,13 +29,13 @@ # Session 4 - Data visualization ] .subtitle[ -## R training - Georgia RS-WB DIME +## R training ] .author[ -### Luis Eduardo San Martin +### María Reyes Retana ] .date[ -### The World Bank | September 2023 +### The World Bank | January 2025 ] --- @@ -51,14 +51,15 @@ } </style> -# Table of contents // სარჩევი +<img src="img/template.png" width="100%" style="display: block; margin: auto;" /> + +# Table of contents - [Introduction](#intro) - [The grammar of graphics](#grammar-of-graphics) - [Bar plots](#bar-plots) +- [Pie charts](#pie-charts) - [Line plots](#line-plots) -- [Text encodings](#text-encodings) -- [Scatter plots](#scatter-plots) - [Wrapping up](#wrapping-up) --- @@ -66,13 +67,13 @@ class: inverse, center, middle name: intro -# Introduction // გაცნობა +# Introduction <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- -# Introduction // გაცნობა +# Introduction ## About this session @@ -80,7 
+81,7 @@ --- -# Introduction // გაცნობა +# Introduction ## Data visualization in the data work pipeline @@ -90,407 +91,317 @@ - The input for that code is a wrangled dataframe, dataframe that is ready to be used -<img src="img/session4/data-vis.png" width="80%" style="display: block; margin: auto;" /> +<img src="img/session4/data-vis.png" width="70%" style="display: block; margin: auto;" /> --- -# Introduction // გაცნობა +# Introduction -## Data visualization in R +## From Your Annual Reports -.pull-left[ -- We'll use the package `ggplot2` to create data visualizations +In your annual reports, the most common visualizations include: -- `ggplot2` greatly facilitates producing plots in R +1. **Bar Graphs**: Used to compare categories (e.g., spending across departments or revenue by month). +2. **Pie Charts**: Used to show proportions (e.g., product share, budget breakdown). - + It follows a syntax based on a description of the plot you want to obtain - - + This syntax is called **grammar of graphics**, a benchmark method of data visualization definition in statistical programming +.pull-left[ +**Bar Graph Example** +![Bar Graph](img/session4/bar-chart.png) ] .pull-right[ -<img src="img/session4/ggplot2.png" width="70%" style="display: block; margin: auto;" /> +**Pie Chart Example** +![Pie Chart](img/session4/pie-chart.png) ] --- -class: inverse, center, middle -name: grammar-of-graphics - -# The grammar of graphics // გრაფიკის გრამატიკა +# Introduction -<html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> +## What You Will Learn ---- - -# The grammar of graphics // გრაფიკის გრამატიკა - -## The grammar of graphics in `ggplot2` - -<img src="img/session4/grammar-of-graphics.png" width="100%" style="display: block; margin: auto;" /> - ---- - -# The grammar of graphics // გრაფიკის გრამატიკა - -## The grammar of graphics in `ggplot2` - -<img src="img/session4/grammar-of-graphics2.png" width="100%" style="display: block; margin: auto;" /> - 
---- +Today, I will teach you how to: -# The grammar of graphics // გრაფიკის გრამატიკა +1. **Recreate these common visualizations** (and more) using code. +2. Build clean and reproducible charts that you can easily include in your annual reports. +3. Explore additional visualization techniques to make your data more impactful. -## The grammar of graphics in `ggplot2` +We will keep using the data we have been using from the beginning to keep this simple, but the code could be recycled to fit the data of your annual reports. -<img src="img/session4/grammar-of-graphics3x.png" width="100%" style="display: block; margin: auto;" /> +Let’s get started! --- -# The grammar of graphics // გრაფიკის გრამატიკა - -## The grammar of graphics in `ggplot2` +# Introduction -<img src="img/session4/grammar-of-graphics3y.png" width="100%" style="display: block; margin: auto;" /> +## Data visualization in R ---- +.pull-left[ +- We'll use the package `ggplot2` to create data visualizations -# The grammar of graphics // გრაფიკის გრამატიკა +- `ggplot2` greatly facilitates producing plots in R -## The grammar of graphics in `ggplot2` + + It follows a syntax based on a description of the plot you want to obtain + + + This syntax is called **grammar of graphics**, a benchmark method of data visualization definition in statistical programming +] -<img src="img/session4/grammar-of-graphics4gc.png" width="100%" style="display: block; margin: auto;" /> +.pull-right[ +<img src="img/session4/ggplot2.png" width="70%" style="display: block; margin: auto;" /> +] --- -# The grammar of graphics // გრაფიკის გრამატიკა - -## The grammar of graphics in `ggplot2` +# ggplot2 -<img src="img/session4/grammar-of-graphics4l.png" width="100%" style="display: block; margin: auto;" /> +<img src="img/session4/ggplot2-dib.png" width="70%" style="display: block; margin: auto;" /> --- class: inverse, center, middle -name: bar-plots +name: grammar-of-graphics -# Bar plots // ბარის ნაკვეთები +# The grammar of graphics 
<html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- -# Bar plots // ბარის ნაკვეთები - -## Exercise 1a: Create a basic bar plot - -1. Open a new script for this session by clicking on `File` >> `New File` >> `R Script` - -1. Load `ggplot2` - +# The grammar of graphics -```r -library(ggplot2) -``` +## What is ggplot2? -1. Produce a basic bar plot with the following code: +- **ggplot2** is a powerful and flexible tool for creating data visualizations in R. +- It combines **philosophy + functions** into a well-organized framework. - -```r -ggplot(small_business_2019_all) + - aes(x = taxperiod, - y = vat_liability) + - geom_col() + - labs(title = "Total VAT liability of small businesses in 2019 by month") -``` +### Things to Keep in Mind: +1. ggplot2 may feel like **a lot to learn**, but let's do it step by step. +2. Today, we’ll cover the basics, but you’ll also get resources to keep exploring on your own. --- -# Bar plots // ბარის ნაკვეთები +# The grammar of graphics +## The Structure of ggplot2 +Creating a plot with ggplot2 requires **three basic components**: -This result should be displayed in the lower right panel of your RStudio window +1. **Data**: The dataset you want to visualize. +2. **Aesthetics (aes)**: How you map your data to visual elements (e.g., x-axis, y-axis, color). +3. **Geometry (geom)**: The type of plot you want (e.g., bar graph, scatter plot). -<img src="4-data-visualization_files/figure-html/unnamed-chunk-14-1.png" width="70%" style="display: block; margin: auto;" /> +(and many more but we will keep it simple in this presentation) --- -# Bar plots // ბარის ნაკვეთები - -This plot looks acceptable but it can be improved: - -.pull-left[ -- `taxperiod` is a variable representing months but R doesn't know it and it's showing the x-axis with decimals. 
We need to tell R that those values shouldn't be changed - -- We can center the title - -- We should add axis labels (instead of just "vat_liability" and "taxperiod") -] +# The grammar of graphics -.pull-right[ -<img src="4-data-visualization_files/figure-html/unnamed-chunk-15-1.png" width="100%" style="display: block; margin: auto;" /> -] +## The grammar of graphics in `ggplot2` ---- +I will use the table of employees by department we created in session two to start with this. -# Bar plots // ბარის ნაკვეთები +You can also download it from here: https://osf.io/th6qk if you don't have it. -## Exercise 1b: Improve your bar plot -1. Use the following code to improve the aesthetics of your plot +``` r +employees_by_department <- read.csv("data/employees_by_department.csv") +``` -```r -ggplot(small_business_2019_all) + - aes(x = taxperiod, - y = vat_liability) + - geom_col() + - labs(title = "Total VAT liability of small businesses in 2019 by month", - # x-axis title - x = "Month", - # y-axis title - y = "Georgian Lari") + - # telling R not to break the x-axis - scale_x_continuous(breaks = 201901:201912) + - # centering plot title - theme(plot.title = element_text(hjust = 0.5)) -``` +<img src="img/session4/grammar-of-graphics.png" width="50%" style="display: block; margin: auto;" /> --- -# Bar plots // ბარის ნაკვეთები +# The grammar of graphics -Now this looks better: +## The grammar of graphics in `ggplot2` -<img src="4-data-visualization_files/figure-html/unnamed-chunk-17-1.png" width="70%" style="display: block; margin: auto;" /> +<img src="img/session4/grammar-of-graphics2.png" width="70%" style="display: block; margin: auto;" /> --- -# Bar plots // ბარის ნაკვეთები - -## Exercise 1c: save your plot - -Now that your plot looks good, you can save it into an output with `ggsave()` - -1. 
Use this code to save your plot: +# The grammar of graphics +## The grammar of graphics in `ggplot2` -```r -ggsave("vat_liability_small_2019.png", - width = 20, - height = 10, - units = "cm") -``` +<img src="img/session4/grammar-of-graphics3x.png" width="70%" style="display: block; margin: auto;" /> --- -# Bar plots // ბარის ნაკვეთები - -.pull-left[ -- `ggsave()` by default saves the last plot your produced +# The grammar of graphics -- The first argument in `ggsave()` is the name of the file we save the plot into. We can also use file paths here - -- The rest are optional arguments that define the dimensions of the image you export, it's better to define them so the image has the correct proportions and text size -] +## The grammar of graphics in `ggplot2` -.pull-right[ -<img src="img/session4/ggsave-ex1.png" width="95%" style="display: block; margin: auto;" /> -] +<img src="img/session4/grammar-of-graphics3y.png" width="70%" style="display: block; margin: auto;" /> --- -# Bar plots // ბარის ნაკვეთები - -## A note about syntax - -- Data visualization usually requires several iterations to add new elements to your initial code and improve your plot +# The grammar of graphics -- `ggplot2` adds new elements to a visualization with the symbol `+` +## The grammar of graphics in `ggplot2` -- More customization means that your code can easily become quite long. 
Using spaces and line breaks helps for clarity, but **there is just no way around it** +<img src="img/session4/grammar-of-graphics4gc.png" width="70%" style="display: block; margin: auto;" /> -- In programming this is known as **heavy syntax** +--- -.pull-left[ -.small[ +# The grammar of graphics -```r -# Exercise 1: -ggplot(small_business_2019_all) + - aes(x = taxperiod, - y = vat_liability) + - geom_col() + - labs(title = "Total VAT liability of small businesses in 2019 by month") -``` -] -] - -.pull-right[ -.small[ +## The grammar of graphics in `ggplot2` -```r -# Exercise 2 -ggplot(small_business_2019_all) + - aes(x = as.factor(taxperiod), - y = vat_liability) + - geom_col() + - labs(title = "Total VAT liability of small businesses in 2019 by month", - x = "Month", - y = "Georgian Lari") + - theme(plot.title = element_text(hjust = 0.5)) -``` -] -] +<img src="img/session4/grammar-of-graphics4l.png" width="70%" style="display: block; margin: auto;" /> --- class: inverse, center, middle -name: line-plots +name: bar-plots -# Line plots // ხაზის ნაკვეთები +# Bar plots <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- -# Line plots // ხაზის ნაკვეთები - -- In data visualization, we call **encodings** to the geometry selected to represent the data visually +# Bar plots -- In the previous examples, we did that when we used `geom_col()`. This tells R that our encoding to represent the data is **bars** (also called columns) - -- `ggplot2` has several different options for encodings in data visualization, for example: **lines** +## Exercise 1a: Create a basic bar plot -- The encoding for lines in `ggplot2` is `geom_lines()` +1. Open a new script for this session by clicking on `File` >> `New File` >> `R Script` ---- +1. Load `ggplot2` -# Line plots // ხაზის ნაკვეთები -- Line plots are a nice option to encode numeric values and include different categories of a second variable at the same time +``` r +library(ggplot2) +``` +1. 
Produce a basic bar plot with the following code: -<img src="4-data-visualization_files/figure-html/unnamed-chunk-23-1.png" width="65%" style="display: block; margin: auto;" /> +``` r +ggplot(employees_by_department) + + aes(x = department, + y = number) + + geom_col() + + labs(title = "Number of employees by department, 2024") +``` --- -# Line plots // ხაზის ნაკვეთები +# Bar plots -- However, they also usually require some additional data wrangling compared to bar plots: data should be collapsed at the level specified in the x-axis and grouping variable -- This is going to be clearer in the next exercise. The plot we'll produce is below -<img src="4-data-visualization_files/figure-html/unnamed-chunk-24-1.png" width="65%" style="display: block; margin: auto;" /> +This result should be displayed in the lower right panel of your RStudio window ---- +<img src="4-data-visualization_files/figure-html/unnamed-chunk-17-1.png" width="60%" style="display: block; margin: auto;" /> -# Line plots // ხაზის ნაკვეთები +In Excel ❎ you would select the data and insert a bar graph. -## Exercise 2a: Collapse your data at the month-group level +--- -Use the following code to create a dataframe collapsed at the month-group level and calculate the total VAT liability for each group in each month +# Bar plots +This plot doesn't look great yet! -```r -df_group_month <- small_business_2019_all %>% - select(group, taxperiod, vat_liability) %>% - group_by(group, taxperiod) %>% - summarize(total = sum(vat_liability)) -``` +.pull-left[ +- `department` is too crowded as the names are long, but R does not know this. We need to tell R that those labels should be rotated ---- +- We can center the title -# Line plots // ხაზის ნაკვეთები +- We should add axis labels (instead of just variable names) -.pull-left[ -The result will look like this, you can explore it with `View(df_group_month)`.
+- We can add color ] .pull-right[ -<img src="img/session4/df-group-month.png" width="85%" style="display: block; margin: auto;" /> +<img src="4-data-visualization_files/figure-html/unnamed-chunk-18-1.png" width="100%" style="display: block; margin: auto;" /> ] --- -# Line plots // ხაზის ნაკვეთები - -## Exercise 2b: Create a line plot of VAT liability by month and group - +# Bar plots -```r -ggplot(df_group_month) + - aes(x = taxperiod, - y = total) + - geom_line(aes(color = group)) + - labs(title = "Total VAT liability of small businesses in 2019 by experiment group", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) -``` - ---- +## Exercise 1b: Improve your bar plot -# Line plots // ხაზის ნაკვეთები +1. Use the following code to improve the aesthetics of your plot -Your result possibly looks like this: -<img src="4-data-visualization_files/figure-html/unnamed-chunk-28-1.png" width="75%" style="display: block; margin: auto;" /> +``` r +ggplot(employees_by_department) + + aes(x = department, + y = number) + +* geom_col(fill = "#9370DB") + + labs( + title = "Number of employees by department, 2024", # title +* x = "Department", +* y = "Number" + ) + + # Title and subtitles + theme( +* plot.title = element_text(hjust = 0.5), +* axis.text.x = element_text(angle = 45, hjust = 1) + ) +``` --- -# Line plots // ხაზის ნაკვეთები +# Bar plots -Something looks off, doesn't it? 
We need to **make the legend labels and x-axis texts smaller** for them without overlapping each other and to **remove the centering of the title** so it's not cropped +Now this looks better: -<img src="4-data-visualization_files/figure-html/unnamed-chunk-29-1.png" width="75%" style="display: block; margin: auto;" /> +<img src="4-data-visualization_files/figure-html/unnamed-chunk-20-1.png" width="70%" style="display: block; margin: auto;" /> --- -# Line plots // ხაზის ნაკვეთები +# Bar plots -## Exercise 2c: Continue improving your plot +## Exercise 1c: Improve your bar plot -1. Add the argument `legend.text=element_text(size=7)` inside `theme()` to decrease the legend text size -1. Add the argument `axis.text.x=element_text(size=6)` inside `theme()` to decrease the x-axis text size - + note that both arguments need to be separated by a comma -1. Remove `plot.title = element_text(hjust = 0.5)` from `theme()` to remove the centering of the plot title +But we can actually make this even better. I would reorder by number of employees, and add labels. -The result should be this: -```{eval=FALSE} -ggplot(df_group_month) + - aes(x = taxperiod, - y = total) + - geom_line(aes(color = group)) + - labs(title = "Total VAT liability of small businesses in 2019 by experiment group", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(legend.text = element_text(size = 7), # don't forget the comma! 
- axis.text.x=element_text(size=6)) +``` r +# Create the bar plot +ggplot(employees_by_department) + + aes(x = reorder(department, -number), y = number) + #<< # Reorder bars by `number` + geom_col(fill = "#9370DB") + +* geom_text( +* aes(label = number), +* angle = 90 +* ) + + labs( + title = "Number of employees by department, 2024", + x = "Department", + y = "Number" + ) + + theme( + plot.title = element_text(hjust = 0.5), + axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x-axis labels + ) ``` --- -# Line plots // ხაზის ნაკვეთები +# Bar plots + +Now this is ready to be in one of your reports. + +<img src="4-data-visualization_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> -<img src="4-data-visualization_files/figure-html/unnamed-chunk-30-1.png" width="75%" style="display: block; margin: auto;" /> --- -# Line plots // ხაზის ნაკვეთები +# Bar plots -## Exercise 2d: save your plot +## Exercise 1d: save your plot -Use this code to save your plot: +Now that your plot looks good, you can save it into an output with `ggsave()` + +1. Use this code to save your plot: -```r -ggsave("vat_liability_small_2019_by_group.png", +``` r +ggsave("employees_by_department.png", width = 20, height = 10, units = "cm") @@ -498,244 +409,216 @@ --- -# Line plots // ხაზის ნაკვეთები +# Bar plots -<img src="img/session4/ggsave-ex2.png" width="50%" style="display: block; margin: auto;" /> - ---- - -# Line plots // ხაზის ნაკვეთები - -Choosing the right encoding for your data can be tricky. 
It depends on what you want to show in your plot and how much information you want it to show +.pull-left[ +- `ggsave()` by default saves the last plot your produced - + Bar plots show less information than line plots in general, but they are good for cases when you have only one numeric variable and one categorical variable to show - + Line plots can show more information and includes multiple groups to add an additional dimension to your data, but they are not visually appealing when your data varies a lot from one category to another +- The first argument in `ggsave()` is the name of the file we save the plot into. We can also use file paths here -.pull-left[ -<img src="4-data-visualization_files/figure-html/unnamed-chunk-33-1.png" width="100%" style="display: block; margin: auto;" /> +- The rest are optional arguments that define the dimensions of the image you export, it's better to define them so the image has the correct proportions and text size ] .pull-right[ -<img src="4-data-visualization_files/figure-html/unnamed-chunk-34-1.png" width="100%" style="display: block; margin: auto;" /> +<img src="img/session4/ggsave-ex1.png" width="95%" style="display: block; margin: auto;" /> ] --- -class: inverse, center, middle -name: text-encodings +# You Did Your First Plot! 🎉 -# Text encodings // ტექსტური დაშიფვრები +## The Possibilities Are Endless -<html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> +- Congratulations on creating your **first plot**! +- From here, you can explore countless options to visualize your data: + - Bar graphs, line plots, scatterplots, pie charts, and more. +- Experimentation is key! Don’t be afraid to try new things or make mistakes. ---- +## You Have the Power 💪 -# Text encodings // ტექსტური დაშიფვრები +- Data visualization is a skill that grows with practice. 
+- Use online resources and communities like: + - [ggplot2 Documentation](https://ggplot2.tidyverse.org/) + - [ggplot presentation](https://pkg.garrickadenbuie.com/gentle-ggplot2) -- Geometric shapes such as bars and plots are not the only way to encode data in plots +--- -- We can also use text directly to represent information in the plot as in the example below. This is called **text encodings** +# Now the sky is the limit! 🎉 -- Text encodings can be combined with geometric encodings to highlight information or provide important additional details to your visualization +.pull-center[ +![](https://media.giphy.com/media/5C472t1RGNuq4/giphy.gif?cid=790b76118bi2mzh4evmg1ohc9xiot3g9fdv7mi1u96xeb4s5&ep=v1_gifs_search&rid=giphy.gif&ct=g) +] +--- +class: inverse, center, middle +name: pie-charts -<img src="4-data-visualization_files/figure-html/unnamed-chunk-36-1.png" width="57%" style="display: block; margin: auto;" /> +# Pie charts ---- +<html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> -# Text encodings // ტექსტური დაშიფვრები +--- -- Using text encodings in a bar plot is nice +# Pie charts -- However, it also requires additional data wrangling: the data needs to be collapsed at the same level of the x-axis to be able to add encodings +## From Bar Plots to Pie Charts -- We'll see this in the next exercise +- We just created a **bar plot** to visualize the number of employees by department. +- Let’s now use the **same data** to create a **pie chart**, a common way to show proportions. -<img src="4-data-visualization_files/figure-html/unnamed-chunk-37-1.png" width="57%" style="display: block; margin: auto;" /> +### Geoms and Graph Types +Remember: The **geom** controls the type of graph we create. +- **`geom_col()`**: Creates bar plots. +- **`geom_bar()`** with **`coord_polar()`**: Converts data into a pie chart. 
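The geom list above can be sketched side by side. This is a minimal illustration rather than the training's own exercise code: the data frame below is a hypothetical stand-in (made-up departments and counts) for `employees_by_department`, and the object names `bar_plot`, `pie_chart`, and `shares` are introduced here only for the example.

```r
library(ggplot2)

# Hypothetical stand-in for the training data; the values are made up
employees_by_department <- data.frame(
  department = c("Finance", "Health", "Education"),
  number = c(120, 80, 200)
)

# geom_col() draws one bar per department, with `number` as the bar height
bar_plot <- ggplot(employees_by_department) +
  aes(x = department, y = number) +
  geom_col()

# The same values stacked into a single bar (x = ""), which
# coord_polar(theta = "y") then bends into a circle
pie_chart <- ggplot(employees_by_department) +
  aes(x = "", y = number, fill = department) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y")

# Share of the total that each slice represents
shares <- employees_by_department$number / sum(employees_by_department$number)
```

Printing `bar_plot` or `pie_chart` in the console draws the figure in the Plots panel. Both plots read the same two columns, so the only differences are the aesthetic mapping and the coordinate system.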
--- -# Text encodings // ტექსტური დაშიფვრები +# Pie Charts -## Exercise 3a: Collapse your data at the month level +## Exercise 2a: Create a Pie Chart -This is similar to what we did for exercise 2a, except that this time the collapsing is at month level instead of the month-group level. Use the code below to store the collapsed dataframe in `month_df` +1. Use the following code to create a pie chart: -```r -df_month <- small_business_2019_all %>% - select(taxperiod, vat_liability) %>% - group_by(taxperiod) %>% - summarize(total = sum(vat_liability)) +``` r +ggplot(employees_by_department) + + aes(x = "", y = number, fill = department) + + geom_bar(stat = "identity", width = 1) + + coord_polar(theta = "y") + + labs( + title = "Proportion of employees by department, 2024" + ) + + theme_void() ``` - --- -# Text encodings // ტექსტური დაშიფვრები +# Pie chart -This is the result, you can use `View(df_month)` to display it +You should see this on your Plots panel. -<img src="img/session4/df-month.png" width="55%" style="display: block; margin: auto;" /> +<img src="4-data-visualization_files/figure-html/unnamed-chunk-26-1.png" width="60%" style="display: block; margin: auto;" /> --- -# Text encodings // ტექსტური დაშიფვრები - -## Exercise 3b: Add encodings to your bar plot +# Pie chart -Use the collapsed dataframe `df_month` and the former code from exercise 1 to add encodings to your bar plot. The result is the code below: +## Exercise 2b +The chart is alright, but we can customize it further. What about changing the colors? 
-```r - ggplot(df_month) + - aes(x = taxperiod, - y = total) + - geom_col() + - # Note that the text encodings are added here - geom_text(aes(label = total), - position = position_dodge(width = 1), - vjust = -0.5, - size = 3) + - labs(title = "Total VAT liability of small businesses in 2019 by month", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) -``` - ---- -# Text encodings // ტექსტური დაშიფვრები -This is result is nice but the text encodings have several decimal places. We can improve it by rounding `total` with the function `round()` - -<img src="4-data-visualization_files/figure-html/unnamed-chunk-41-1.png" width="75%" style="display: block; margin: auto;" /> +``` r +ggplot(employees_by_department) + + aes(x = "", y = number, fill = department) + + geom_bar(stat = "identity", width = 1) + + coord_polar(theta = "y") + + labs( + title = "Proportion of employees by department, 2024" + ) + + theme_void() + +* scale_fill_viridis_d(option = "D") +``` --- -# Text encodings // ტექსტური დაშიფვრები +# Pie chart -## Exercise 3c: Improve your plot once again +## Exercise 2c -Replace `total` with `round(total)` in `geom_text()`. The result should be: +Now this is ready to go into our report, so let's save it. 
+<img src="4-data-visualization_files/figure-html/unnamed-chunk-28-1.png" width="60%" style="display: block; margin: auto;" /> -```r - ggplot(df_month) + - aes(x = taxperiod, - y = total) + - geom_col() + - geom_text(aes(label = round(total)), # <--- the change goes here - position = position_dodge(width = 1), - vjust = -0.5, - size = 3) + - labs(title = "Total VAT liability of small businesses in 2019 by month", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) -``` --- -# Text encodings // ტექსტური დაშიფვრები +# Pie chart -<img src="4-data-visualization_files/figure-html/unnamed-chunk-43-1.png" width="75%" style="display: block; margin: auto;" /> +## Exercise 2c ---- +Now this is ready to go into our report, so let's save it. -# Text encodings // ტექსტური დაშიფვრები -## Don't forget to save your plot! +``` r +ggsave("employees_by_department_pie.png") +``` -Use this code to save this last plot: +<img src="img/session4/ggsave-ex2.png" width="40%" style="display: block; margin: auto;" /> -```r -ggsave("vat_liability_small_2019_text.png", - width = 20, - height = 10, - units = "cm") -``` - --- class: inverse, center, middle -name: scatter-plots +name: line-plots -# Scatter plots // გაფანტავს ნაკვეთებს +# Line plots <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- -# Scatter plots // გაფანტავს ნაკვეთებს - -- We're going to explore one more type of encoding for data visualizations: **scatter plots** +# Line plots -- Scatter plots are useful when you have two continuous numeric variables and want to show that there might to a correlation between them or visualize outliers (values that stand out from the rest because they are very extreme) +Another Common Plot: Line Plots -- We use the encoding `geom_line()` for scatterplots +- **Line plots** are commonly used to show **trends over time** or changes across a sequence. 
+- Examples include visualizing **monthly sales**, **yearly growth**, or **temperature changes**. -<img src="4-data-visualization_files/figure-html/unnamed-chunk-45-1.png" width="55%" style="display: block; margin: auto;" /> ---- +**Description** +- Each point represents a value at a specific time. +- Lines connect the points to show a continuous trend. -# Scatter plots // გაფანტავს ნაკვეთებს -## Exercise 4: Create a scatter plot - -Use the following code to reproduce the scatter plot of the last slide: - - -```r -ggplot(small_business_2019_all) + - aes(x = age, - y = vat_liability) + - geom_point() + - labs(title = "VAT liability versus age for small businesses in 2019", - x = "Age of firm (years)", - y = "VAT liability") + - theme(plot.title = element_text(hjust = 0.5)) -``` --- -# Scatter plots // გაფანტავს ნაკვეთებს +# Line plots +.pull-left[ +**Code Example** -<img src="4-data-visualization_files/figure-html/unnamed-chunk-47-1.png" width="80%" style="display: block; margin: auto;" /> ---- +``` r +library(ggplot2) -# Scatter plots // გაფანტავს ნაკვეთებს +# Using built-in `economics` dataset from ggplot2 +ggplot(economics) + + aes(x = date, y = unemploy) + + geom_line(color = "#9370DB", linewidth = 1) + + labs( + title = "Unemployment Over Time", + x = "Date", + y = "Number of Unemployed (thousands)" + ) + + theme_minimal() +``` +] -Lastly, remember to save your scatter plot with: +.pull-right[ +**Output Example** +<img src="4-data-visualization_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" /> +] -```r -ggsave("scatter_age_vat.png", - width = 20, - height = 10, - units = "cm")) -``` --- class: inverse, center, middle name: wrapping-up -# Wrapping up // შეფუთვა +# Wrapping up <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> --- -# Wrapping up // შეფუთვა +# Wrapping up -## Other encodings in `ggplot2` +## More in `ggplot2` -This table +This table 
lists several of the most popular encoding types in `ggplot2`. Also see more [here](https://ggplot2.tidyverse.org/reference/) | Encoding | Function in `ggplot2` | | -------- | --------------------- | @@ -751,7 +634,7 @@ --- -# Wrapping up // შეფუთვა +# Wrapping up ## Save your code! @@ -761,44 +644,75 @@ --- -# Wrapping up // შეფუთვა +# Wrap-Up: Looking Ahead 🚀 -## Data work pipeline +## Key Takeaways -<img src="img/session4/data-work-final.png" width="95%" style="display: block; margin: auto;" /> +- Today, you learned how **data + code** can create powerful visualizations and tables for your annual reports. +- **Why this matters:** + - Code is **reusable**: Use it for next quarters or years without starting from scratch. + - Code is **transparent**: Everyone can see and verify all the calculations. --- -# Wrapping up // შეფუთვა +# Wrap-Up: Looking Ahead 🚀 + +## From Data to Annual Reports + +.pull-center[ +**Data + Code** → **Annual Report** +![](img/data_code_report.png) +*Reproducible workflows save time and improve accuracy.* +] -## Looking ahead +## What’s Next? -<img src="img/session4/data-work-expanded.png" width="95%" style="display: block; margin: auto;" /> +- **Tomorrow**: Bring your own data, graphs, or tables. + - We’ll have a **long hands-on session** where you can: + - Ask questions about how to code specific visualizations or tables. + - Work with your own data to make real progress. --- -# Wrapping up // შეფუთვა +# Wrap-Up: Looking Ahead 🚀 -## Looking ahead +## Final Thoughts 💡 -- [Connecting R with SQL databases](https://solutions.posit.co/connections/db/getting-started/connect-to-database/). Some libraries: - + `dbConnect` - + `dbplyr` - + `DBI` +- This is **new and challenging**, but it’s also incredibly **powerful** and **useful**. +- The only way to get better is to **keep practicing**. +- Experiment, google/chatgpt-it, ask questions, and remember—this will make your reports clearer, faster, and less error-prone! 
-- [More on data wrangling](https://raw.githack.com/worldbank/dime-r-training/main/Presentations/03-data-wrangling.html#1). More libraries: - + `tidyr` - + `janitor` +Keep going—you’ve got this! + +![](https://media.giphy.com/media/kF9C1IrWKrltu/giphy.gif?cid=ecf05e47wy5wj5rjlofflyz9z3p0xgl9a7r5asamv89oy299&ep=v1_gifs_search&rid=giphy.gif&ct=g) -- [More on data visualization](https://raw.githack.com/worldbank/dime-r-training/main/Presentations/04-data-visualization.html#1) --- class: inverse, center, middle -# Thanks! // მადლობა! // ¡Gracias! // Obrigado! +# Thanks! // ¡Gracias! // Obrigado! <html><div style='float:left'></div><hr color='#D38C28' size=1px width=1100px></html> +<img src="img/session4/welcome-r.png" width="60%" style="display: block; margin: auto;" /> + +--- + +# Resources for Data Visualization 📚 + +## Learn More About ggplot2 + +- **ggplot2 Documentation** + [ggplot2.tidyverse.org](https://ggplot2.tidyverse.org) + +- **R Graphics Cookbook** (Online resource for practical examples) + [r-graphics.org](https://r-graphics.org) + +- **R for Data Science**: Chapter on Data Visualization + [r4ds.had.co.nz](https://r4ds.had.co.nz/data-visualisation.html) + +- **ggplot2 Cheatsheet** (Quick reference for all functions) + [Download the cheatsheet here](https://github.com/rstudio/cheatsheets/raw/main/data-visualization-2.1.pdf) diff --git a/Presentations-Ghana/2024-10/4-data-visualization.pdf b/Presentations-Ghana/2024-10/4-data-visualization.pdf index e687856..80069c6 100644 Binary files a/Presentations-Ghana/2024-10/4-data-visualization.pdf and b/Presentations-Ghana/2024-10/4-data-visualization.pdf differ diff --git a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-17-1.png b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-17-1.png index a9f2b93..66249bb 100644 Binary files a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-17-1.png and 
b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-17-1.png differ diff --git a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-18-1.png b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-18-1.png new file mode 100644 index 0000000..66249bb Binary files /dev/null and b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-18-1.png differ diff --git a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-20-1.png b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-20-1.png index 1bebad3..6808393 100644 Binary files a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-20-1.png and b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-20-1.png differ diff --git a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-22-1.png b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-22-1.png index 71df132..1932c50 100644 Binary files a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-22-1.png and b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-22-1.png differ diff --git a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-23-1.png b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-23-1.png index 53a8ec1..756b361 100644 Binary files a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-23-1.png and b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-23-1.png differ diff --git a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-24-1.png b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-24-1.png index 53a8ec1..756b361 100644 Binary 
files a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-24-1.png and b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-24-1.png differ diff --git a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-26-1.png b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-26-1.png index 53a8ec1..756b361 100644 Binary files a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-26-1.png and b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-26-1.png differ diff --git a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-28-1.png b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-28-1.png index 1bebad3..d0c9f5d 100644 Binary files a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-28-1.png and b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-28-1.png differ diff --git a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-32-1.png b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-32-1.png index 2850e79..03d2804 100644 Binary files a/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-32-1.png and b/Presentations-Ghana/2024-10/4-data-visualization_files/figure-html/unnamed-chunk-32-1.png differ diff --git a/Presentations-Ghana/2024-10/employees_by_department.png b/Presentations-Ghana/2024-10/employees_by_department.png new file mode 100644 index 0000000..6f7c143 Binary files /dev/null and b/Presentations-Ghana/2024-10/employees_by_department.png differ diff --git a/Presentations-Ghana/2024-10/employees_by_department_pie.png b/Presentations-Ghana/2024-10/employees_by_department_pie.png new file mode 100644 index 0000000..0ed1884 Binary files /dev/null and 
b/Presentations-Ghana/2024-10/employees_by_department_pie.png differ diff --git a/Presentations-Ghana/2024-10/exercises-session3.R b/Presentations-Ghana/2024-10/exercises-session3.R index 74a1d32..c7c1667 100644 --- a/Presentations-Ghana/2024-10/exercises-session3.R +++ b/Presentations-Ghana/2024-10/exercises-session3.R @@ -1,30 +1,28 @@ # Data -small_business_2019_all <- read.csv("data/small_business_2019_all.csv") -View(small_business_2019_all) +department_staff <- read.csv("data/department_staff_final.csv") # Exercise 1 #install.packages("modelsummary") #install.packages("huxtable") # Exercise 2 -df_tbilisi_50 <- filter(small_business_2019, - region == "Tbilisi") %>% - arrange(-income) %>% - filter(row_number() <= 50) +department_female <- filter(department_staff, + sex == "Female") %>% + arrange(years_of_service) # Exercise 3 library(modelsummary) -datasummary_skim(small_business_2019_all) +datasummary_skim(department_staff) # Exercise 4 datasummary( - age + income + vat_liability ~ N + Mean + SD + Min + Max, - small_business_2019_all + years_of_service ~ N + Mean + SD + Min + Max, + department_staff ) # Exercise 5 library(huxtable) -stats_table <- datasummary_skim(small_business_2019_all, output = "huxtable") +stats_table <- datasummary_skim(department_staff, output = "huxtable") quick_xlsx(stats_table, file = "quick_stats.xlsx") # Exercise 6 @@ -33,8 +31,8 @@ stats_table_custom <- stats_table %>% set_header_cols(1, TRUE) %>% set_number_format(everywhere, 2:ncol(.), "%9.0f") %>% set_align(1, everywhere, "center") %>% - theme_basic() + theme_blue() quick_xlsx( stats_table_custom, file = "stats-custom.xlsx" -) \ No newline at end of file +) diff --git a/Presentations-Ghana/2024-10/exercises-session4.R b/Presentations-Ghana/2024-10/exercises-session4.R index 8d0ddf7..53f3e20 100644 --- a/Presentations-Ghana/2024-10/exercises-session4.R +++ b/Presentations-Ghana/2024-10/exercises-session4.R @@ -1,125 +1,81 @@ # Data -small_business_2019_all <- 
read.csv("data/small_business_2019_all.csv") +employees_by_department <- read.csv("data/employees_by_department.csv") # Exercise 1a library(ggplot2) -ggplot(small_business_2019_all) + - aes(x = taxperiod, - y = vat_liability) + - geom_col() + - labs(title = "Total VAT liability of small businesses in 2019 by month") -# Exercise 1b -ggplot(small_business_2019_all) + - aes(x = taxperiod, - y = vat_liability) + +ggplot(employees_by_department) + + aes(x = department, + y = number) + geom_col() + - labs(title = "Total VAT liability of small businesses in 2019 by month", - # x-axis title - x = "Month", - # y-axis title - y = "Georgian Lari") + - # telling R not to break the x-axis - scale_x_continuous(breaks = 201901:201912) + - # centering plot title - theme(plot.title = element_text(hjust = 0.5)) + labs(title = "Number of employees by department, 2024") +# Exercise 1b +ggplot(employees_by_department) + + aes(x = department, + y = number) + + geom_col(fill = "#9370DB") + + labs( + title = "Number of employees by department, 2024", + # x-axis title + x = "Department", + # y-axis title + y = "Number" + ) + + # Centering plot title + theme( + plot.title = element_text(hjust = 0.5), + axis.text.x = element_text(angle = 45, hjust = 1) # Rotating x-axis labels + ) # Exercise 1c -ggsave("vat_liability_small_2019.png", + +ggplot(employees_by_department) + + aes(x = reorder(department, -number), y = number) + #<< # Reorder bars by `number` + geom_col(fill = "#9370DB") + + geom_text( #<< + aes(label = number), #<< + angle = 90 #<< + ) + #<< + labs( + title = "Number of employees by department, 2024", + x = "Department", + y = "Number" + ) + + theme( + plot.title = element_text(hjust = 0.5), + axis.text.x = element_text(angle = 45, hjust = 1) # Rotate x-axis labels + ) + +# Exercise 1d +ggsave("employees_by_department.png", width = 20, height = 10, units = "cm") # Exercise 2a -df_group_month <- small_business_2019_all %>% - select(group, taxperiod, vat_liability) %>% - 
group_by(group, taxperiod) %>% - summarize(total = sum(vat_liability)) -# Exercise 2b -ggplot(df_group_month) + - aes(x = taxperiod, - y = total) + - geom_line(aes(color = group)) + - labs(title = "Total VAT liability of small businesses in 2019 by experiment group", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) +ggplot(employees_by_department) + + aes(x = "", y = number, fill = department) + + geom_bar(stat = "identity", width = 1) + + coord_polar(theta = "y") + + labs( + title = "Proportion of employees by department, 2024" + ) + + theme_void() -# Exercise 2c - ggplot(df_group_month) + - aes(x = taxperiod, - y = total) + - geom_line(aes(color = group)) + - labs(title = "Total VAT liability of small businesses in 2019 by experiment group", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(legend.text = element_text(size = 7), - axis.text.x = element_text(size = 6)) - -# Exercise 2d - ggsave("vat_liability_small_2019_by_group.png", - width = 20, - height = 10, - units = "cm") +# Exercise 2b -# Exercise 3a -df_month <- small_business_2019_all %>% - select(taxperiod, vat_liability) %>% - group_by(taxperiod) %>% - summarize(total = sum(vat_liability)) +ggplot(employees_by_department) + + aes(x = "", y = number, fill = department) + + geom_bar(stat = "identity", width = 1) + + coord_polar(theta = "y") + + labs( + title = "Proportion of employees by department, 2024" + ) + + theme_void() + + scale_fill_viridis_d(option = "D") -# Exercise 3b -ggplot(df_month) + - aes(x = taxperiod, - y = total) + - geom_col() + - geom_text(aes(label = total), - position = position_dodge(width = 1), - vjust = -0.5, - size = 3) + - labs(title = "Total VAT liability of small businesses in 2019 by month", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) - -# Exercise 3c - 
ggplot(df_month) + - aes(x = taxperiod, - y = total) + - geom_col() + - geom_text(aes(label = round(total)), - position = position_dodge(width = 1), - vjust = -0.5, - size = 3) + - labs(title = "Total VAT liability of small businesses in 2019 by month", - x = "Month", - y = "Georgian Lari") + - scale_x_continuous(breaks = 201901:201912) + - theme(plot.title = element_text(hjust = 0.5)) - -# Saving -ggsave("vat_liability_small_2019_text.png", - width = 20, - height = 10, - units = "cm") +#Exercise 2c -# Exercise 4 - ggplot(small_business_2019_all) + - aes(x = age, - y = vat_liability) + - geom_point() + - labs(title = "VAT liability versus age for small businesses in 2019", - x = "Age of firm (years)", - y = "VAT liability") + - theme(plot.title = element_text(hjust = 0.5)) +ggsave("employees_by_department_pie.png") -# Saving - ggsave("scatter_age_vat.png", - width = 20, - height = 10, - units = "cm") - \ No newline at end of file diff --git a/Presentations-Ghana/2024-10/img/session1/code-workflow.png b/Presentations-Ghana/2024-10/img/session1/code-workflow.png index 1fdf8c0..e0c07bf 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/code-workflow.png and b/Presentations-Ghana/2024-10/img/session1/code-workflow.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/data-viewer.png b/Presentations-Ghana/2024-10/img/session1/data-viewer.png index ba9ba80..03b574b 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/data-viewer.png and b/Presentations-Ghana/2024-10/img/session1/data-viewer.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/data-viewer2.png b/Presentations-Ghana/2024-10/img/session1/data-viewer2.png index ddcd1ff..02a81d6 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/data-viewer2.png and b/Presentations-Ghana/2024-10/img/session1/data-viewer2.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/data-work-and-cooking.png 
b/Presentations-Ghana/2024-10/img/session1/data-work-and-cooking.png index ee0b56a..f0ab722 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/data-work-and-cooking.png and b/Presentations-Ghana/2024-10/img/session1/data-work-and-cooking.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/data-work-excel-r.png b/Presentations-Ghana/2024-10/img/session1/data-work-excel-r.png index b43a69e..323121d 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/data-work-excel-r.png and b/Presentations-Ghana/2024-10/img/session1/data-work-excel-r.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/data-work-script.png b/Presentations-Ghana/2024-10/img/session1/data-work-script.png index 8ea92c0..bd4e801 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/data-work-script.png and b/Presentations-Ghana/2024-10/img/session1/data-work-script.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/data-work-with-instructions.png b/Presentations-Ghana/2024-10/img/session1/data-work-with-instructions.png index c081947..8f8a437 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/data-work-with-instructions.png and b/Presentations-Ghana/2024-10/img/session1/data-work-with-instructions.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/data-work.png b/Presentations-Ghana/2024-10/img/session1/data-work.png index af7011d..935d445 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/data-work.png and b/Presentations-Ghana/2024-10/img/session1/data-work.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/day1.png b/Presentations-Ghana/2024-10/img/session1/day1.png index febf750..1a31622 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/day1.png and b/Presentations-Ghana/2024-10/img/session1/day1.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/downloads.png b/Presentations-Ghana/2024-10/img/session1/downloads.png index e06b857..76350fc 100644 Binary 
files a/Presentations-Ghana/2024-10/img/session1/downloads.png and b/Presentations-Ghana/2024-10/img/session1/downloads.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/environment2.png b/Presentations-Ghana/2024-10/img/session1/environment2.png index 1121759..2d4a08e 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/environment2.png and b/Presentations-Ghana/2024-10/img/session1/environment2.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/import3.png b/Presentations-Ghana/2024-10/img/session1/import3.png index 999f177..9d64a99 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/import3.png and b/Presentations-Ghana/2024-10/img/session1/import3.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/import_data1.png b/Presentations-Ghana/2024-10/img/session1/import_data1.png index 8a8df3a..23baed0 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/import_data1.png and b/Presentations-Ghana/2024-10/img/session1/import_data1.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/osf-screenshot.png b/Presentations-Ghana/2024-10/img/session1/osf-screenshot.png index f151678..9e68041 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/osf-screenshot.png and b/Presentations-Ghana/2024-10/img/session1/osf-screenshot.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/session1.png b/Presentations-Ghana/2024-10/img/session1/session1.png index 879171b..1a31622 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/session1.png and b/Presentations-Ghana/2024-10/img/session1/session1.png differ diff --git a/Presentations-Ghana/2024-10/img/session1/session2.png b/Presentations-Ghana/2024-10/img/session1/session2.png index 6de6db5..95a782c 100644 Binary files a/Presentations-Ghana/2024-10/img/session1/session2.png and b/Presentations-Ghana/2024-10/img/session1/session2.png differ diff --git a/Presentations-Ghana/2024-10/img/session2/data-work-day-2.png 
b/Presentations-Ghana/2024-10/img/session2/data-work-day-2.png index 160848d..f7d45b7 100644 Binary files a/Presentations-Ghana/2024-10/img/session2/data-work-day-2.png and b/Presentations-Ghana/2024-10/img/session2/data-work-day-2.png differ diff --git a/Presentations-Ghana/2024-10/img/session2/data-work-expanded.png b/Presentations-Ghana/2024-10/img/session2/data-work-expanded.png deleted file mode 100644 index a5f0380..0000000 Binary files a/Presentations-Ghana/2024-10/img/session2/data-work-expanded.png and /dev/null differ diff --git a/Presentations-Ghana/2024-10/img/session2/data-work-final-table.png b/Presentations-Ghana/2024-10/img/session2/data-work-final-table.png index ef329dc..ddd2bf6 100644 Binary files a/Presentations-Ghana/2024-10/img/session2/data-work-final-table.png and b/Presentations-Ghana/2024-10/img/session2/data-work-final-table.png differ diff --git a/Presentations-Ghana/2024-10/img/session2/data-work-final.png b/Presentations-Ghana/2024-10/img/session2/data-work-final.png index a4e14e1..4608fa1 100644 Binary files a/Presentations-Ghana/2024-10/img/session2/data-work-final.png and b/Presentations-Ghana/2024-10/img/session2/data-work-final.png differ diff --git a/Presentations-Ghana/2024-10/img/session2/data-work-progress.png b/Presentations-Ghana/2024-10/img/session2/data-work-progress.png index 366b28e..7dead56 100644 Binary files a/Presentations-Ghana/2024-10/img/session2/data-work-progress.png and b/Presentations-Ghana/2024-10/img/session2/data-work-progress.png differ diff --git a/Presentations-Ghana/2024-10/img/session2/data-work-script.png b/Presentations-Ghana/2024-10/img/session2/data-work-script.png index 8ea92c0..bd4e801 100644 Binary files a/Presentations-Ghana/2024-10/img/session2/data-work-script.png and b/Presentations-Ghana/2024-10/img/session2/data-work-script.png differ diff --git a/Presentations-Ghana/2024-10/img/session2/data-wrangling-reasoning.png b/Presentations-Ghana/2024-10/img/session2/data-wrangling-reasoning.png 
index 63b34c5..5fecc28 100644 Binary files a/Presentations-Ghana/2024-10/img/session2/data-wrangling-reasoning.png and b/Presentations-Ghana/2024-10/img/session2/data-wrangling-reasoning.png differ diff --git a/Presentations-Ghana/2024-10/img/session2/data-wrangling.png b/Presentations-Ghana/2024-10/img/session2/data-wrangling.png index 57feb43..9f464e4 100644 Binary files a/Presentations-Ghana/2024-10/img/session2/data-wrangling.png and b/Presentations-Ghana/2024-10/img/session2/data-wrangling.png differ diff --git a/Presentations-Ghana/2024-10/img/session2/day1.png b/Presentations-Ghana/2024-10/img/session2/day1.png index febf750..1a31622 100644 Binary files a/Presentations-Ghana/2024-10/img/session2/day1.png and b/Presentations-Ghana/2024-10/img/session2/day1.png differ diff --git a/Presentations-Ghana/2024-10/img/session2/ex5.png b/Presentations-Ghana/2024-10/img/session2/ex5.png index e91bc81..d24c31e 100644 Binary files a/Presentations-Ghana/2024-10/img/session2/ex5.png and b/Presentations-Ghana/2024-10/img/session2/ex5.png differ diff --git a/Presentations-Ghana/2024-10/img/session2/exported-csv.png b/Presentations-Ghana/2024-10/img/session2/exported-csv.png index fc2297f..a8556b0 100644 Binary files a/Presentations-Ghana/2024-10/img/session2/exported-csv.png and b/Presentations-Ghana/2024-10/img/session2/exported-csv.png differ diff --git a/Presentations-Ghana/2024-10/img/session2/import3.png b/Presentations-Ghana/2024-10/img/session2/import3.png index e9bcc77..9d64a99 100644 Binary files a/Presentations-Ghana/2024-10/img/session2/import3.png and b/Presentations-Ghana/2024-10/img/session2/import3.png differ diff --git a/Presentations-Ghana/2024-10/img/session2/path.png b/Presentations-Ghana/2024-10/img/session2/path.png new file mode 100644 index 0000000..257b967 Binary files /dev/null and b/Presentations-Ghana/2024-10/img/session2/path.png differ diff --git a/Presentations-Ghana/2024-10/img/session2/session1.png 
b/Presentations-Ghana/2024-10/img/session2/session1.png new file mode 100644 index 0000000..1a31622 Binary files /dev/null and b/Presentations-Ghana/2024-10/img/session2/session1.png differ diff --git a/Presentations-Ghana/2024-10/img/session2/session2.png b/Presentations-Ghana/2024-10/img/session2/session2.png index 51e98cc..1fe84ef 100644 Binary files a/Presentations-Ghana/2024-10/img/session2/session2.png and b/Presentations-Ghana/2024-10/img/session2/session2.png differ diff --git a/Presentations-Ghana/2024-10/img/session3/annual_report.png b/Presentations-Ghana/2024-10/img/session3/annual_report.png new file mode 100644 index 0000000..da439d5 Binary files /dev/null and b/Presentations-Ghana/2024-10/img/session3/annual_report.png differ diff --git a/Presentations-Ghana/2024-10/img/session3/custom.png b/Presentations-Ghana/2024-10/img/session3/custom.png new file mode 100644 index 0000000..8506f10 Binary files /dev/null and b/Presentations-Ghana/2024-10/img/session3/custom.png differ diff --git a/Presentations-Ghana/2024-10/img/session3/data-work-descriptive-stats.png b/Presentations-Ghana/2024-10/img/session3/data-work-descriptive-stats.png index 4fa40cb..ad5204c 100644 Binary files a/Presentations-Ghana/2024-10/img/session3/data-work-descriptive-stats.png and b/Presentations-Ghana/2024-10/img/session3/data-work-descriptive-stats.png differ diff --git a/Presentations-Ghana/2024-10/img/session3/data-work-final-table.png b/Presentations-Ghana/2024-10/img/session3/data-work-final-table.png new file mode 100644 index 0000000..ddd2bf6 Binary files /dev/null and b/Presentations-Ghana/2024-10/img/session3/data-work-final-table.png differ diff --git a/Presentations-Ghana/2024-10/img/session3/datasummary_cat.png b/Presentations-Ghana/2024-10/img/session3/datasummary_cat.png new file mode 100644 index 0000000..339a635 Binary files /dev/null and b/Presentations-Ghana/2024-10/img/session3/datasummary_cat.png differ diff --git 
a/Presentations-Ghana/2024-10/img/session3/datasummary_skim.png b/Presentations-Ghana/2024-10/img/session3/datasummary_skim.png new file mode 100644 index 0000000..9e7dce0 Binary files /dev/null and b/Presentations-Ghana/2024-10/img/session3/datasummary_skim.png differ diff --git a/Presentations-Ghana/2024-10/img/session3/environment.png b/Presentations-Ghana/2024-10/img/session3/environment.png index cf2589c..9988e55 100644 Binary files a/Presentations-Ghana/2024-10/img/session3/environment.png and b/Presentations-Ghana/2024-10/img/session3/environment.png differ diff --git a/Presentations-Ghana/2024-10/img/session3/import.png b/Presentations-Ghana/2024-10/img/session3/import.png index e081439..0c563dd 100644 Binary files a/Presentations-Ghana/2024-10/img/session3/import.png and b/Presentations-Ghana/2024-10/img/session3/import.png differ diff --git a/Presentations-Ghana/2024-10/img/session3/programmed_psrl_2023.png b/Presentations-Ghana/2024-10/img/session3/programmed_psrl_2023.png new file mode 100644 index 0000000..17977ae Binary files /dev/null and b/Presentations-Ghana/2024-10/img/session3/programmed_psrl_2023.png differ diff --git a/Presentations-Ghana/2024-10/img/session3/quick-stats-excel.png b/Presentations-Ghana/2024-10/img/session3/quick-stats-excel.png index d5e0a15..a71b606 100644 Binary files a/Presentations-Ghana/2024-10/img/session3/quick-stats-excel.png and b/Presentations-Ghana/2024-10/img/session3/quick-stats-excel.png differ diff --git a/Presentations-Ghana/2024-10/img/session3/quick-stats-output.png b/Presentations-Ghana/2024-10/img/session3/quick-stats-output.png index 5cf3ebd..75fb441 100644 Binary files a/Presentations-Ghana/2024-10/img/session3/quick-stats-output.png and b/Presentations-Ghana/2024-10/img/session3/quick-stats-output.png differ diff --git a/Presentations-Ghana/2024-10/img/session3/r-script.png b/Presentations-Ghana/2024-10/img/session3/r-script.png new file mode 100644 index 0000000..3254a01 Binary files /dev/null and 
b/Presentations-Ghana/2024-10/img/session3/r-script.png differ diff --git a/Presentations-Ghana/2024-10/img/session3/stats-custom.png b/Presentations-Ghana/2024-10/img/session3/stats-custom.png index fc78a50..1069ecd 100644 Binary files a/Presentations-Ghana/2024-10/img/session3/stats-custom.png and b/Presentations-Ghana/2024-10/img/session3/stats-custom.png differ diff --git a/Presentations-Ghana/2024-10/img/session3/temp_dfs.png b/Presentations-Ghana/2024-10/img/session3/temp_dfs.png index 6e6eb95..5db4d51 100644 Binary files a/Presentations-Ghana/2024-10/img/session3/temp_dfs.png and b/Presentations-Ghana/2024-10/img/session3/temp_dfs.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/bar-chart.png b/Presentations-Ghana/2024-10/img/session4/bar-chart.png new file mode 100644 index 0000000..0c862cf Binary files /dev/null and b/Presentations-Ghana/2024-10/img/session4/bar-chart.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/data-vis.png b/Presentations-Ghana/2024-10/img/session4/data-vis.png index f652ff9..c8532dc 100644 Binary files a/Presentations-Ghana/2024-10/img/session4/data-vis.png and b/Presentations-Ghana/2024-10/img/session4/data-vis.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/data-work-data-vis.png b/Presentations-Ghana/2024-10/img/session4/data-work-data-vis.png index ee59e7e..4a3a6d4 100644 Binary files a/Presentations-Ghana/2024-10/img/session4/data-work-data-vis.png and b/Presentations-Ghana/2024-10/img/session4/data-work-data-vis.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/geom-type.png b/Presentations-Ghana/2024-10/img/session4/geom-type.png new file mode 100644 index 0000000..f0ed5e3 Binary files /dev/null and b/Presentations-Ghana/2024-10/img/session4/geom-type.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/ggsave-ex1.png b/Presentations-Ghana/2024-10/img/session4/ggsave-ex1.png index 439cf06..af0029b 100644 Binary files 
a/Presentations-Ghana/2024-10/img/session4/ggsave-ex1.png and b/Presentations-Ghana/2024-10/img/session4/ggsave-ex1.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/ggsave-ex2.png b/Presentations-Ghana/2024-10/img/session4/ggsave-ex2.png index 5e77fec..77cd1d6 100644 Binary files a/Presentations-Ghana/2024-10/img/session4/ggsave-ex2.png and b/Presentations-Ghana/2024-10/img/session4/ggsave-ex2.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics.png b/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics.png index 0160192..7003dbe 100644 Binary files a/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics.png and b/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics2.png b/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics2.png index 5694be8..e7f6cbe 100644 Binary files a/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics2.png and b/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics2.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics3x.png b/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics3x.png index 41bfd05..72c5c8d 100644 Binary files a/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics3x.png and b/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics3x.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics3y.png b/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics3y.png index 919fe30..7cd4260 100644 Binary files a/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics3y.png and b/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics3y.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics4gc.png b/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics4gc.png index e2dbad4..a1c8666 100644 Binary files 
a/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics4gc.png and b/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics4gc.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics4l.png b/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics4l.png index f04c745..b5a9941 100644 Binary files a/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics4l.png and b/Presentations-Ghana/2024-10/img/session4/grammar-of-graphics4l.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/pie-chart.png b/Presentations-Ghana/2024-10/img/session4/pie-chart.png new file mode 100644 index 0000000..2e0e5e0 Binary files /dev/null and b/Presentations-Ghana/2024-10/img/session4/pie-chart.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/save.png b/Presentations-Ghana/2024-10/img/session4/save.png index 0858c79..97f31c5 100644 Binary files a/Presentations-Ghana/2024-10/img/session4/save.png and b/Presentations-Ghana/2024-10/img/session4/save.png differ diff --git a/Presentations-Ghana/2024-10/img/session4/welcome-r.png b/Presentations-Ghana/2024-10/img/session4/welcome-r.png new file mode 100644 index 0000000..f3de5a8 Binary files /dev/null and b/Presentations-Ghana/2024-10/img/session4/welcome-r.png differ diff --git a/Presentations-Ghana/2024-10/img/template.png b/Presentations-Ghana/2024-10/img/template.png index 71f1467..36bb8fc 100644 Binary files a/Presentations-Ghana/2024-10/img/template.png and b/Presentations-Ghana/2024-10/img/template.png differ diff --git a/Presentations-Ghana/2024-10/scatter_age_vat.png b/Presentations-Ghana/2024-10/scatter_age_vat.png deleted file mode 100644 index d47dfa0..0000000 Binary files a/Presentations-Ghana/2024-10/scatter_age_vat.png and /dev/null differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/id0z2nzyunmbl2vk6tfcbw.png b/Presentations-Ghana/2024-10/tinytable_assets/id0z2nzyunmbl2vk6tfcbw.png new file mode 100644 index 0000000..d5f9269 Binary files 
/dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/id0z2nzyunmbl2vk6tfcbw.png differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/id1m1xoe955ddfyta5m19b.png b/Presentations-Ghana/2024-10/tinytable_assets/id1m1xoe955ddfyta5m19b.png new file mode 100644 index 0000000..d5f9269 Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/id1m1xoe955ddfyta5m19b.png differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/id3626d8tsf81hsdrtn6na.png b/Presentations-Ghana/2024-10/tinytable_assets/id3626d8tsf81hsdrtn6na.png new file mode 100644 index 0000000..0589e82 Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/id3626d8tsf81hsdrtn6na.png differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/id6bn0ooh0ll5b34u3bt0q.png b/Presentations-Ghana/2024-10/tinytable_assets/id6bn0ooh0ll5b34u3bt0q.png new file mode 100644 index 0000000..0589e82 Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/id6bn0ooh0ll5b34u3bt0q.png differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/id6fnke38umrfqzxh5a1u9.png b/Presentations-Ghana/2024-10/tinytable_assets/id6fnke38umrfqzxh5a1u9.png new file mode 100644 index 0000000..ef339ed Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/id6fnke38umrfqzxh5a1u9.png differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/id7nojlg5r7m70g3yz6ur8.png b/Presentations-Ghana/2024-10/tinytable_assets/id7nojlg5r7m70g3yz6ur8.png new file mode 100644 index 0000000..d5f9269 Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/id7nojlg5r7m70g3yz6ur8.png differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/idgveritdvsswjod13rjqi.png b/Presentations-Ghana/2024-10/tinytable_assets/idgveritdvsswjod13rjqi.png new file mode 100644 index 0000000..d5f9269 Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/idgveritdvsswjod13rjqi.png differ diff --git 
a/Presentations-Ghana/2024-10/tinytable_assets/idicu1pt3fddc0z0f1covp.png b/Presentations-Ghana/2024-10/tinytable_assets/idicu1pt3fddc0z0f1covp.png new file mode 100644 index 0000000..ef339ed Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/idicu1pt3fddc0z0f1covp.png differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/idjxrnee3iopsdgegacthg.png b/Presentations-Ghana/2024-10/tinytable_assets/idjxrnee3iopsdgegacthg.png new file mode 100644 index 0000000..ef339ed Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/idjxrnee3iopsdgegacthg.png differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/idnnjhquqbm0enzssnupsr.png b/Presentations-Ghana/2024-10/tinytable_assets/idnnjhquqbm0enzssnupsr.png new file mode 100644 index 0000000..fe01de9 Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/idnnjhquqbm0enzssnupsr.png differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/idpspcqpt9xr4lq3z9mdv6.png b/Presentations-Ghana/2024-10/tinytable_assets/idpspcqpt9xr4lq3z9mdv6.png new file mode 100644 index 0000000..fe01de9 Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/idpspcqpt9xr4lq3z9mdv6.png differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/idu31zr1xdhf7k3jhyaf9i.png b/Presentations-Ghana/2024-10/tinytable_assets/idu31zr1xdhf7k3jhyaf9i.png new file mode 100644 index 0000000..0589e82 Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/idu31zr1xdhf7k3jhyaf9i.png differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/idufzeyjirminhvc1gyj4q.png b/Presentations-Ghana/2024-10/tinytable_assets/idufzeyjirminhvc1gyj4q.png new file mode 100644 index 0000000..fe01de9 Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/idufzeyjirminhvc1gyj4q.png differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/idwbsxevpi5ccgtig979da.png 
b/Presentations-Ghana/2024-10/tinytable_assets/idwbsxevpi5ccgtig979da.png new file mode 100644 index 0000000..fe01de9 Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/idwbsxevpi5ccgtig979da.png differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/idxmxx868xtp1e0jsgyg51.png b/Presentations-Ghana/2024-10/tinytable_assets/idxmxx868xtp1e0jsgyg51.png new file mode 100644 index 0000000..ef339ed Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/idxmxx868xtp1e0jsgyg51.png differ diff --git a/Presentations-Ghana/2024-10/tinytable_assets/idz75cqb4e5fsroxv2pakw.png b/Presentations-Ghana/2024-10/tinytable_assets/idz75cqb4e5fsroxv2pakw.png new file mode 100644 index 0000000..0589e82 Binary files /dev/null and b/Presentations-Ghana/2024-10/tinytable_assets/idz75cqb4e5fsroxv2pakw.png differ diff --git a/Presentations-Ghana/2024-10/vat_liability_small_2019.png b/Presentations-Ghana/2024-10/vat_liability_small_2019.png deleted file mode 100644 index b255dae..0000000 Binary files a/Presentations-Ghana/2024-10/vat_liability_small_2019.png and /dev/null differ diff --git a/Presentations-Ghana/2024-10/vat_liability_small_2019_by_group.png b/Presentations-Ghana/2024-10/vat_liability_small_2019_by_group.png deleted file mode 100644 index 9b7d6a9..0000000 Binary files a/Presentations-Ghana/2024-10/vat_liability_small_2019_by_group.png and /dev/null differ diff --git a/Presentations-Ghana/2024-10/vat_liability_small_2019_text.png b/Presentations-Ghana/2024-10/vat_liability_small_2019_text.png deleted file mode 100644 index 5a04c86..0000000 Binary files a/Presentations-Ghana/2024-10/vat_liability_small_2019_text.png and /dev/null differ diff --git a/Screenshot 2024-12-14 155227.png b/Screenshot 2024-12-14 155227.png new file mode 100644 index 0000000..9e7dce0 Binary files /dev/null and b/Screenshot 2024-12-14 155227.png differ diff --git a/data_cleaning_ghana.R b/data_cleaning_ghana.R index e94ebb5..74a4ee3 100644 --- 
a/data_cleaning_ghana.R
+++ b/data_cleaning_ghana.R
@@ -87,9 +87,11 @@ department_staff_clean <- department_staff_clean %>%
   )
 
 department_staff <- department_staff_clean %>%
-  select(ID, sex, current_grade, senior_junior_staff, department, years_of_service)
+  select(ID, sex, current_grade, senior_junior_staff, department, years_of_service) %>%
+  filter(years_of_service >0)
 
 department_staff_age <- department_staff_clean %>%
+  filter(years_of_service >0) %>%
   select(ID, age)
 
 write.xlsx(department_staff, "DataWork/DataSets/Raw/Ghana/department_staff_list.xlsx")