# (PART) Standalone Technical Considerations {.unnumbered}
# Load microdata and Auxiliary data {#load}
Make sure you have all the necessary packages installed and loaded into memory. Since they are hosted on GitHub, the code below ensures that any package in the PIP workflow is installed correctly before it is loaded.
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

# Helper used in inline text: collapses a character vector into a readable
# enumeration, e.g., c("a", "b", "c") becomes "a, b, and c".
last_item <- function(x, word = "and") {
  if (!is.character(x)) {
    warning("`x` must be character. Coercing to character.")
    x <- as.character(x)
  }
  lx <- length(x)
  if (lx == 1) {
    y <- x
  } else if (lx == 2) {
    y <- paste(x[1], word, x[2])
  } else {
    y <- c(x[1:(lx - 1)], paste(word, x[lx]))
    y <- paste(y, collapse = ", ")
  }
  return(y)
}
```
```{r load}
## First, specify the packages of interest
packages <- c("pipaux", "pipload")

## Now load each package, installing it from GitHub first if necessary
package.check <- lapply(
  packages,
  FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      pck_name <- paste0("PIP-Technical-Team/", x)
      devtools::install_github(pck_name)
      library(x, character.only = TRUE)
    }
  }
)
```
## Auxiliary data
```{r numb-func, include = FALSE}
lf <- lsf.str("package:pipaux", pattern = "^pip_")
lf <- as.character(lf)
num_functions <- length(lf)
```
Even though `pipaux` has more than `r num_functions` functions, most of its features can be accessed by using only the `pipaux::load_aux()` and `pipaux::update_aux()` functions.
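If you want to see the full list of `pip_*` functions for yourself, base R's `lsf.str()` enumerates them:
```{r list-pip-functions}
# List all functions in pipaux whose names start with "pip_"
lsf.str("package:pipaux", pattern = "^pip_")
```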
### Update data
```{r functions-av, include = FALSE}
lf <- lsf.str("package:pipaux", pattern = "^pip_[a-z]{3}$")
lf <- as.character(lf)
lf <- gsub("pip_", "", lf)
```
The main function of the `pipaux` package is `update_aux()`. Its first argument, `measure`, refers to the auxiliary measure to be updated. The measures available are **`r last_item(lf)`**.
```{r update-aux}
pipaux::update_aux(measure = "cpi")
```
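If several measures need refreshing, the same function can be called in a loop. A minimal sketch (not evaluated here), using `"cpi"` as the only measure name shown in this chapter; replace or extend the vector as needed:
```{r update-aux-loop, eval = FALSE}
# Update a set of auxiliary measures one at a time
measures <- c("cpi")
invisible(lapply(measures, function(m) pipaux::update_aux(measure = m)))
```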
### Load data
Loading auxiliary data is the job of the `pipload` package, through the function `pipload::pip_load_aux()`, although `pipaux` also provides `pipaux::load_aux()` for the same purpose. Notice that, even though both functions do exactly the same thing, the loading function in `pipload` carries the prefix `pip_` to distinguish it from the one in `pipaux`. However, we are going to limit the work of `pipaux` to updating auxiliary data and the work of `pipload` to loading data. Thus, all the examples below use `pipload` for loading either microdata or auxiliary data.
```{r load-aux}
df <- pipload::pip_load_aux(measure = "cpi")
head(df)
```
## Microdata
Loading PIP microdata is the most common task performed with the `pipload` package. Before doing so, however, it is important to understand how the microdata is organized.
PIP microdata has several characteristics:
* There could be more than one survey for each country/year. This happens when more than one welfare variable is available, such as income and consumption.
* Some countries, like Mexico, have two different welfare types in the same survey for the same country/year. This adds a layer of complexity when the objective is to know which one is the default.
* There are multiple versions of the same harmonized survey. These versions are organized in a two-type vintage control. A new version of the data may exist either because the raw data (the one provided by the official NSO) has been updated, or because there has been an update in the harmonization process.
* Each survey could be used for more than one analytic tool in PIP (e.g., Poverty Calculator, Table Maker, or SOL). Thus, the data to be loaded depends on the tool in which it is going to be used.
Thus, in order to make the process of finding and loading data efficient, `pipload` works as a three-step process: keep the inventory file up to date, find the data that meets the user's criteria, and load it. The sketch below previews the full sequence; each step is described in the subsections that follow.
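As a preview, the three steps map onto three `pipload` calls. The sketch below (not evaluated here) simply strings together the same example calls that appear in the subsections that follow:
```{r three-step-sketch, eval = FALSE}
# Step 1: refresh the inventory (here, for a single country)
pip_update_inventory("MEX")

# Step 2: find the files that meet the search criteria
pip_find_data(country = "PRY", year = 2012)[["filename"]]

# Step 3: load the default and/or most recent version for a given tool
df <- pip_load_data(country = "PRY",
                    year    = c(2014, 2015),
                    tool    = "PC")
```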
### Inventory file
The inventory file resides in `y:/PIP-Data/_inventory/inventory.fst`. This file is a data frame with all the microdata available in the PIP structure. It has two main variables, `orig` and `filename`. The former refers to the full directory path of the database, whereas the latter is only the file name. The other variables in this data frame are derived from these two.
The inventory file is used to speed up the file-searching process in `pipload`. In previous packages, each time the user wanted to find a particular database, it was necessary to walk the folder structure and extract the names of all the files that met a particular criterion. This is time-consuming and inefficient. The advantage of that method, though, is that, by construction, it finds all the data available. By contrast, the inventory-file method is much faster, as it only requires loading a light file with all the data available, filtering it, and returning the required information. The drawback, however, is that it needs to be kept up to date, as the data changes constantly.
To update the inventory file, use the function `pip_update_inventory()`. If you don't provide any argument, it will update the whole inventory, which may take around 10 to 15 minutes (the function will warn you about this). By providing the country or countries you want to update, the process is much faster.
```{r update-inventory}
# update one country
pip_update_inventory("MEX")
# Load inventory file
df <- pip_load_inventory()
head(df[, "filename"])
```
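Since the inventory is a plain data frame, you can also subset it yourself before handing the task over to the dedicated finder introduced below. A minimal sketch (not evaluated here), which assumes the three-letter country code is part of the file name:
```{r inventory-filter, eval = FALSE}
# Keep only the rows whose file name mentions Paraguay; this assumes the
# country code appears in the file name
pry_files <- df[grepl("PRY", df$filename), ]
head(pry_files[, "filename"])
```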
### Finding data
Every dataset in the PIP microdata repository is identified by seven variables: country code, survey year, survey acronym, master version, alternative version, tool, and source. Expecting the user to know all the different combinations for each file would be a heavy burden. Thus, the data finder, `pip_find_data()`, returns the names of all the files available that meet the criteria in the arguments provided by the user. For instance, if the user wants to know all the files available for Paraguay, she could type,
```{r inventory-pry}
pip_find_data(country = "PRY")[["filename"]]
```
Yet, if the user needs to be more precise in her request, she can add information to the different arguments of the function. For example, this is the data available in 2012:
```{r inventory-pry2012}
pip_find_data(country = "PRY",
              year    = 2012)[["filename"]]
```
### Loading data
The function `pip_load_data()` takes care of loading the data. The very first instruction within `pip_load_data()` is to find the data available in the repository by using `pip_load_inventory()`. The difference with respect to the finder, however, is two-fold. First, `pip_load_data()` loads the default and/or most recent version of the country/year combination available. Second, it gives the user the possibility of loading several datasets in either list or data frame form. For instance, if the user wants to load the Paraguay data for 2014 and 2015 used in the Poverty Calculator tool, she may type,
```{r load-data, cache=TRUE}
df <- pip_load_data(country = "PRY",
                    year    = c(2014, 2015),
                    tool    = "PC")
janitor::tabyl(df, survey_id)
```
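The combined data frame stacks both surveys. If you prefer one table per survey, you can split the result with base R; this is just a convenience step on top of the object returned above, using the `survey_id` column tabulated in the previous chunk:
```{r load-data-split}
# Split the stacked data frame into a named list, one element per survey
df_list <- split(df, df$survey_id)
names(df_list)
```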