# Joining Data
Since the PIP project comprises several databases at very different domain
levels, this chapter provides guidelines for joining data frames correctly.
The first thing to understand is that the reporting measures in PIP (e.g.,
poverty and inequality) are uniquely identified by four variables: *country*,
*year*, *domain*, and *welfare type*.

- *Country* refers to independent economies that conduct independent household
surveys. For instance, China and Taiwan are treated as **two different
economies** by the World Bank, and hence by PIP, even though under some
criteria some people think that Taiwan is part of China.
- *Year* refers to the reporting year rather than the actual calendar years over
which the survey was conducted. Some household surveys, like India 2011/2012,
are conducted over the course of two calendar years, but the welfare aggregate
is deflated to the reporting year, 2011.
- *Domain* refers to the smallest geographical disaggregation for which it is
possible to deflate the welfare aggregate of a household survey and line it up
to PPP values. The criterion to determine the reporting domain of a
household survey is still under consideration, but ideally it is the level for
which there is CPI, PPP, and population auxiliary data, as well as a household
survey representative at that level. There are some exceptions to this
criterion, like China or the Philippines, but these cases are explained in
detail in Section \@ref(special-data-cases). As of today
(`r format(Sys.Date(), "%B %d, %Y")`), most country/years are reported at
the national domain and few are reported at the urban/rural domain. However,
the PIP technical infrastructure has been designed to incorporate other
domain levels if that becomes necessary.
- Finally, the *welfare type* specifies whether the welfare aggregate is based
on income or on consumption. For the latter case, though some household
surveys capture expenditure instead, they are still considered
consumption-based surveys.

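
The uniqueness claim above can be checked mechanically. The following is a
minimal sketch in base R with made-up data; the column names (`country_code`,
`year`, `domain`, `welfare_type`, `headcount`) and all values are assumptions
for illustration and do not come from any PIP file.

```r
# Toy data (hypothetical values): one row per reporting measure
est <- data.frame(
  country_code = c("AAA", "AAA", "AAA", "BBB"),
  year         = c(2011L, 2011L, 2012L, 2011L),
  domain       = c("urban", "rural", "national", "national"),
  welfare_type = c("consumption", "consumption", "consumption", "income"),
  headcount    = c(0.10, 0.25, 0.18, 0.05),
  stringsAsFactors = FALSE
)

keys <- c("country_code", "year", "domain", "welfare_type")

# TRUE when no combination of the four key variables is repeated
is_unique <- anyDuplicated(est[, keys]) == 0
is_unique

# Dropping one key (here, domain) is enough to break uniqueness in this toy data
dup_without_domain <- anyDuplicated(est[, setdiff(keys, "domain")]) > 0
dup_without_domain
```

In this toy case the urban and rural rows of the same country and year are only
distinguishable through `domain`, which is why all four variables are needed.
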
The challenge of joining different data frames in PIP is that these four
variables that uniquely identify the reporting measures [**are
not**]{style="color:red"} available in any of the PIP data files, with the
exception of the cache files that we discuss below. This challenge is easily
addressed by having a clear understanding of the Price FrameWork (pfw) data
frame. This file not only contains valuable metadata, but it can also be
considered the anchor among all PIP data.
## The Price FrameWork (pfw) data {#pfw-join}
As always, this file can be loaded by typing,
```{r}
library(data.table)
pfw <- pipload::pip_load_aux("pfw")
joyn::is_id(pfw, by = c("country_code", "surveyid_year", "survey_acronym"))
```
First of all, notice that `pfw` is uniquely identified by country code, survey
year, and survey acronym. The reason is that pfw aims to provide a link between
every single household survey and all the other auxiliary data. Welfare data
follows the naming convention of the
[International Household Survey Network (IHSN)](http://ihsn.org/), so it is
stored according to country, survey year, acronym of the survey, and vintage
control of master and alternative versions. The vintage control of the master
and alternative versions of the data is not relevant for joining data because
PIP uses, by default, the most recent version.
Keep in mind that PIP estimates are reported at the country, year, domain, and
welfare type level, but the last two of these are found neither in the survey
ID nor among the unique identifiers of the pfw. To solve this problem, the pfw
data makes use of the variables `welfare_type`, `aaa_domain`, and
`aaa_domain_var`. As the name suggests, `welfare_type` indicates the **main**
welfare aggregate type (i.e., income or consumption) of the variable `welfare`
in the GMD datasets that correspond to the survey ID formed by concatenating
variables `country_code`, `surveyid_year`, and `survey_acronym`. For example,
the `welfare_type` of the `welfare` variable in the datasets of
***COL_2018_GEIH**\_V01_M\_V03_A\_GMD* is income.
```{r}
pfw[ country_code == "COL"
& surveyid_year == 2018
& survey_acronym == "GEIH", # Not necessary since it is the only one
unique(welfare_type)]
```
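
To make the concatenation explicit, here is a minimal sketch in base R that
builds the survey ID prefix from the three identifying variables and looks up
the welfare type. The data frame `pfw_toy` and the column `survey_prefix` are
hypothetical names, and the second row is invented; only the COL 2018 GEIH row
reflects the example given in the text.

```r
# Toy stand-in for pfw (the "XXX" row is made up for illustration)
pfw_toy <- data.frame(
  country_code   = c("COL", "XXX"),
  surveyid_year  = c(2018L, 2015L),
  survey_acronym = c("GEIH", "HHS"),
  welfare_type   = c("income", "consumption"),
  stringsAsFactors = FALSE
)

# Concatenate the three identifiers into the leading part of the survey ID
pfw_toy$survey_prefix <- paste(
  pfw_toy$country_code,
  pfw_toy$surveyid_year,
  pfw_toy$survey_acronym,
  sep = "_"
)

# Look up the main welfare type of one survey by its prefix
col_type <- pfw_toy$welfare_type[pfw_toy$survey_prefix == "COL_2018_GEIH"]
col_type
```

The vintage part of the ID (`V01_M_V03_A` in the example above) is deliberately
left out, since it is not needed as a join key.
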
The prefix `aaa` in variables `aaa_domain` and `aaa_domain_var` refers to the
identification code of any of the auxiliary data. Thus, you will find a
`gdp_domain`, `cpi_domain`, `ppp_domain` and several others. All `aaa_domain`
variables contain the *lowest* level of geographical disaggregation of the
corresponding `aaa` auxiliary data. There are only three possible levels of
disaggregation:
```{r, echo=FALSE}
dd <- tibble::tribble(
~domain.value, ~meaning,
1L, "national",
2L, "urban/rural",
3L, "subnational"
)
knitr::kable(dd,
format = "html",
col.names = gsub("[.]", " ", names(dd)))
```
As of now, no survey or auxiliary data is broken down at level 3 (i.e.,
subnational), but it is important to know that the PIP internal code takes that
possibility into account for future cases.
Depending on the country, the domain level of each auxiliary data might be
different. In Indonesia, for instance, the CPI domain is national, whereas the
PPP domain is "urban/rural."
```{r}
pfw[ country_code == "IDN" & surveyid_year == 2018,
.(cpi = unique(cpi_domain),
ppp = unique(ppp_domain))]
```
Finally, [and this is really important]{style="color:red"}, each
`aaa_domain_var` variable contains the name of the variable in the GMD dataset
that uniquely identifies the household survey *in the corresponding `aaa`*
*auxiliary data*. In other words, `aaa_domain_var` contains *the name* of the
variable in GMD that must be used as the *key* *to join* GMD to `aaa`. You may
ask: does the key variable in the `aaa` auxiliary data have the same name as
the GMD variable specified in `aaa_domain_var`? No, it does not. Since the
domain level to identify observations in the `aaa` auxiliary data is unique,
there is only one variable in the auxiliary data used to merge any welfare
data, `aaa_data_level`.
Since all this process is a little cumbersome, the
{[pipdp](https://github.com/PIP-Technical-Team/pipdp)} Stata package, during the
process of cleaning GMD databases to PIP databases, creates as many
`aaa_data_level` variables as needed in order to make the join of welfare data
and auxiliary data simpler. You can see the lines of code that create these
variables in [this
section](https://github.com/PIP-Technical-Team/pipdp/blob/9c4e32636dd1c71954816f8a7c8ad743349959a6/pipdp_md_clean.ado#L225-L294)
of the file "pipdp_md_clean.ado."[^joining_data-1]
[^joining_data-1]: You can find more information about the conversion from GMD
to PIP databases in Section \@ref(welfare-data)
### Joining data example
Let's see the case of Indonesia above. The pfw says that the CPI domain is
"national" and the PPP domain is "urban/rural." That means that the welfare data
joins each of these auxiliary datasets with a different variable:
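```{r}
# Names of the GMD variables used as keys for the CPI and PPP joins
domains <-
  pfw[country_code == "IDN" & surveyid_year == 2018,
      .(cpi = unique(cpi_domain_var),
        ppp = unique(ppp_domain_var))]

# Print explicitly: an assignment alone does not display the result
domains
```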
This says that the variable in the welfare data used to join the PPP data is
called `urban`, but there does not seem to be a variable name in GMD to join the
CPI data. When the name of the variable is missing, it indicates that the
welfare data is not split by any variable to merge CPI data. That is, it is at
the national level.
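
The rule just described (a missing domain variable means the national level)
can be sketched as a small helper. This is only an illustration in base R, not
the pipdp implementation: the function `add_data_level()`, the toy data, and the
assumption that the domain variable already holds the labels "urban" and
"rural" are all hypothetical, and the real coding in GMD may differ.

```r
# Hypothetical helper: create an `aaa_data_level` column in the welfare data
# from the variable named in `aaa_domain_var`
add_data_level <- function(welfare, domain_var, new_name) {
  if (is.na(domain_var) || domain_var == "") {
    # No split variable: every observation is at the national level
    welfare[[new_name]] <- "national"
  } else {
    # Otherwise, copy the values of the split variable
    welfare[[new_name]] <- as.character(welfare[[domain_var]])
  }
  welfare
}

# Toy welfare data (made-up values)
welfare_toy <- data.frame(
  hhid    = 1:4,
  urban   = c("urban", "rural", "urban", "rural"),
  welfare = c(10, 4, 12, 5),
  stringsAsFactors = FALSE
)

# CPI has no domain variable (NA); PPP is split by `urban`
welfare_toy <- add_data_level(welfare_toy, NA_character_, "cpi_data_level")
welfare_toy <- add_data_level(welfare_toy, "urban", "ppp_data_level")
welfare_toy
```

With both `*_data_level` columns in place, the welfare data carries one ready
key per auxiliary dataset, which is the simplification the text attributes to
pipdp.
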
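```{r}
ccode <- "CHN"

# Load the auxiliary data
cpi <- pipload::pip_load_aux("cpi")
ppp <- pipload::pip_load_aux("ppp")

# Load the welfare data for one country/year
CHN <- pipload::pip_load_data(country = ccode,
                              year    = 2015)

# Many-to-one join: many welfare observations per CPI row,
# keeping every row of the welfare data
dt <- joyn::merge(CHN, cpi,
                  by = c("country_code", "survey_year",
                         "survey_acronym", "cpi_data_level"),
                  match_type = "m:1",
                  keep = "left")
```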
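
For readers without access to the PIP data, the same many-to-one logic can be
reproduced with base R. This is a sketch under stated assumptions: `svy_toy` and
`cpi_toy` are made-up data frames, and base `merge()` stands in for the joyn
call, so the uniqueness of the keys in the auxiliary table is checked by hand.

```r
# Toy welfare data: several households per data level (made-up values)
svy_toy <- data.frame(
  country_code   = "AAA",
  survey_year    = 2015,
  survey_acronym = "HHS",
  cpi_data_level = c("urban", "rural", "urban"),
  welfare        = c(10, 4, 12),
  stringsAsFactors = FALSE
)

# Toy CPI data: exactly one row per data level (made-up values)
cpi_toy <- data.frame(
  country_code   = "AAA",
  survey_year    = 2015,
  survey_acronym = "HHS",
  cpi_data_level = c("urban", "rural"),
  cpi            = c(1.10, 1.05),
  stringsAsFactors = FALSE
)

keys <- c("country_code", "survey_year", "survey_acronym", "cpi_data_level")

# A many-to-one join requires the keys to be unique in the right-hand table
stopifnot(anyDuplicated(cpi_toy[, keys]) == 0)

# Left join: keep every welfare observation
joined <- merge(svy_toy, cpi_toy, by = keys, all.x = TRUE)

# The join must neither add nor drop welfare observations
same_rows <- nrow(joined) == nrow(svy_toy)
same_rows
```

Note that base `merge()` sorts the result by the key columns, so row order may
differ from the input even though the number of rows is unchanged.
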
## Special data cases
[NOTE: Andres. Add this section]{style="color:red"}