---
title: "Auxiliary Files"
format: html
editor: visual
---
```{r setup, include=FALSE}
library(data.table)
library(dplyr)
```
## List of Auxiliary Files
Below is the list of auxiliary files, along with the variables in each. In each variable list, the key variable(s) of the underlying auxiliary file are shown in bold. Variables that can be used interchangeably with, or as a replacement for, any of the key variables are shown in square brackets.
1. pfw: wb_region_code, **country_code**, pcn_region_code, ctryname, **year**, \[surveyid_year\], timewp, fieldwork, **survey_acronym**, link, altname, survey_time, wbint_link, wbext_link, alt_link, pip_meta, surv_title, surv_producer, survey_coverage, \[welfare_type\], use_imputed, use_microdata, use_bin, use_groupdata, reporting_year, survey_comparability, comp_note, preferable, display_cp, fieldwork_range, \[survey_year\], ref_year_des, wf_baseprice, wf_baseprice_note, wf_baseprice_des, wf_spatial_des, wf_spatial_var, cpi_replication, cpi_domain, cpi_domain_var, wf_currency_des, ppp_replication, ppp_domain, ppp_domain_var, wf_add_temp_des, wf_add_temp_var, wf_add_spatial_des, wf_add_spatial_var, tosplit, tosplit_var, inpovcal, oth_welfare1_type, oth_welfare1_var, gdp_domain, pce_domain, pop_domain, Note, pfw_id
Note: *welfare_type* and *survey_acronym* can be used interchangeably, as can *year* and *surveyid_year*, as keys to merge pfw data with other auxiliary files.
2. cpi: **country_code**, **cpi_year**, \[survey_year\], cpi, ccf, **survey_acronym**, change_cpi2011, cpi_domain, cpi_domain_value, cpi2017_unadj, cpi2011_unadj, cpi2011, cpi2017, cpi2011_SM22, cpi2017_SM22, cpi2005, **cpi_data_level**, cpi2011_SM23, cpi2017_SM23, cpi_id
Note: *cpi_year* and *survey_year* can be used interchangeably as part of the key variables.
3. gdp: **country_code**, **year**, gdp, **gdp_data_level**, gdp_domain
4. gdm: survey_id, **country_code**, **surveyid_year**, \[survey_year\], welfare_type, survey_mean_lcu, distribution_type, gd_type, **pop_data_level**, pcn_source_file, pcn_survey_id
5. pce: **country_code**, **year**, pce, **pce_data_level**, pce_domain
6. pop: **country_code**, **year**, **pop_data_level**, pop, pop_domain
7. ppp: **country_code**, **ppp_year**, **release_version**, **adaptation_version**, ppp, ppp_default, ppp_default_by_year, ppp_domain, *ppp_data_level*
Note: ppp auxiliary data may need to be reshaped to merge it with other auxiliary files.
8. maddison: **country_code**, **year**, mpd_gdp
9. weo: **country_code**, **year**, weo_gdp
10. npl: **country_code**, **reporting_year**, nat_headcount, comparability, footnote
11. countries: **country_code**, country_name, africa_split, africa_split_code, region, region_code, world, world_code
12. regions: region, **region_code**, grouping_type
13. income_groups: **country_code**, **year_data**, incgroup_historical, fcv_historical, ssa_subregion_code
14. metadata: **country_code**, country_name, reporting_year, \[survey_year\], **surveyid_year**, survey_title, survey_conductor, survey_coverage, **welfare_type**, distribution_type, metadata
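Before relying on the bolded variables as a merge key, it can help to confirm that they uniquely identify each row. The following is a minimal sketch under stated assumptions: `gdp_toy` is a made-up stand-in for the gdp file (its values are invented), and `is_unique_key()` is a hypothetical helper, not part of pipload.

```r
library(data.table)

# Hypothetical toy stand-in for the gdp auxiliary file (values are made up)
gdp_toy <- data.table(
  country_code   = c("AAA", "AAA", "BBB", "BBB"),
  year           = c(2000, 2001, 2000, 2000),
  gdp_data_level = c("national", "national", "rural", "urban"),
  gdp            = c(10, 11, 5, 7)
)

# Hypothetical helper: TRUE when `keys` uniquely identify every row of `dt`
is_unique_key <- function(dt, keys) {
  anyDuplicated(dt, by = keys) == 0L
}

# The full key (country, year, data level) identifies each row
full_key_ok <- is_unique_key(gdp_toy, c("country_code", "year", "gdp_data_level"))

# Dropping the data-level variable leaves duplicates for BBB in 2000
partial_key_ok <- is_unique_key(gdp_toy, c("country_code", "year"))
```

The same check can be applied to any of the real auxiliary files by passing its bolded variables as `keys`.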
## Relationship between auxiliary files
Understanding the relationships among the auxiliary files makes them easier to merge. The auxiliary datasets fall into three groups based on their unique level of observation. Datasets within each group have a one-to-one relationship.
- Group one: maddison, weo
Note: Group one files can be merged with any of the auxiliary files using the **country_code** and **year** variables.
- Group two: gdp, pop, gdm, pce, *ppp*
Note: ppp requires reshaping to map one-to-one within its group members.
- Group three: metadata, cpi, pfw
### One-to-one relationship
Datasets within each group have a one-to-one relationship.
### Many-to-one relationship
Datasets in Group two and Group three have a many-to-one relationship with datasets in Group one.
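To illustrate the many-to-one case, the sketch below joins a toy Group two table to a toy Group one table on **country_code** and **year**. Both tables are hypothetical and their values are made up; only the column names follow the variable lists above.

```r
library(data.table)

# Hypothetical Group one table: one row per country and year
weo_toy <- data.table(
  country_code = c("AAA", "BBB"),
  year         = c(2000, 2000),
  weo_gdp      = c(100, 200)
)

# Hypothetical Group two table: one row per country, year and data level
pop_toy <- data.table(
  country_code   = c("AAA", "BBB", "BBB"),
  year           = c(2000, 2000, 2000),
  pop_data_level = c("national", "rural", "urban"),
  pop            = c(50, 20, 30)
)

# For each row of pop_toy, look up the matching row of weo_toy
pop_weo <- weo_toy[pop_toy, on = .(country_code, year)]

# The result keeps one row per pop_toy row; the single weo_gdp value for
# BBB is repeated on both its rural and urban rows
nrow(pop_weo) == nrow(pop_toy)
```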
## Merge auxiliary files
To merge any two auxiliary datasets, we may need to rename some of the key variables. For example, the reporting-level variables pop_data_level, cpi_data_level, gdp_data_level, pce_data_level and ppp_data_level hold the same content and can all be renamed to *reporting_level*; likewise, **year** can be renamed to **surveyid_year**.
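As a minimal sketch of this renaming step, the example below harmonises the key names of two toy tables and then joins them. The tables `pop_toy` and `gdm_toy` are hypothetical stand-ins with made-up values; `copy()` is used so that `setnames()` does not modify the originals by reference.

```r
library(data.table)

# Hypothetical stand-ins for the pop and gdm auxiliary files
pop_toy <- data.table(
  country_code   = "AAA",
  year           = 2000,
  pop_data_level = "national",
  pop            = 50
)
gdm_toy <- data.table(
  country_code    = "AAA",
  surveyid_year   = 2000,
  pop_data_level  = "national",
  survey_mean_lcu = 3.5
)

# Rename the key variables to a common set of names
pop_renamed <- setnames(copy(pop_toy),
                        c("year", "pop_data_level"),
                        c("surveyid_year", "reporting_level"))
gdm_renamed <- setnames(copy(gdm_toy), "pop_data_level", "reporting_level")

# With matching key names, the join is straightforward
merged <- gdm_renamed[pop_renamed,
                      on = .(country_code, surveyid_year, reporting_level)]
```

Whether **year** and **surveyid_year** really refer to the same period depends on the datasets involved, so this renaming should be checked case by case.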
We may also need to generate a dataset that contains the full set of key variables. For example, to merge pfw data with any of the datasets at the reporting level (pop, pce, gdp, gdm, ppp), we need to add a reporting-level variable to the pfw data.
Let's generate a dataset that can be used to merge pfw data with datasets at the reporting level. It is built from variables in the pfw and cpi datasets.
```{r, generate key variables for pfw data}
# pfw ---------------------------------------------------
pfw <- pipload::pip_load_aux("pfw")
# keep the candidate key variables from pfw
pfw_key_options <- pfw[, .(country_code,
                           year,
                           surveyid_year,
                           survey_acronym,
                           survey_coverage,
                           welfare_type,
                           survey_year,
                           cpi_domain,
                           cpi_domain_var)]
# inspect the values taken by cpi_domain_var
pfw_key_options |> count(cpi_domain_var)
# code cpi_domain_value as 0 when cpi_domain_var is "urban" and 1 otherwise
pfw_key_options <- pfw_key_options[, cpi_domain_value :=
                                     fifelse(cpi_domain_var == "urban", 0, 1)]
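# cpi ---------------------------------------------------
cpi <- pipload::pip_load_aux("cpi")
# keep the candidate key variables from cpi and
# rename cpi_data_level to reporting_level
cpi_key <- cpi[, .(country_code,
                   cpi_year,
                   survey_year,
                   survey_acronym,
                   cpi_domain,
                   cpi_domain_value,
                   cpi_data_level)] |>
  setnames("cpi_data_level", "reporting_level")
# inspect the values taken by cpi_domain_value
cpi_key |> count(cpi_domain_value)
# recode cpi_domain as 1 for "National" and 2 otherwise
cpi_key <- cpi_key[, cpi_domain :=
                     fifelse(cpi_domain == "National", 1, 2)]
# make sure cpi_domain is stored as numeric before the join
cpi_key$cpi_domain <- as.numeric(cpi_key$cpi_domain)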
# join the cpi keys onto the pfw keys
pfw_cpi_key <- cpi_key[pfw_key_options,
                       on = .(country_code, survey_year, survey_acronym,
                              cpi_domain, cpi_domain_value)]
# fill missing year and surveyid_year values within each survey,
# then keep only the key variables
pfw_cpi_key <- pfw_cpi_key |>
  group_by(country_code, survey_year, survey_acronym) |>
  mutate(year_          = mean(year, na.rm = TRUE),
         surveyid_year_ = mean(surveyid_year, na.rm = TRUE)) |>
  ungroup() |>
  mutate(year_          = ifelse(is.na(year_), cpi_year, year_),
         surveyid_year_ = ifelse(is.na(surveyid_year_), cpi_year, surveyid_year_),
         year           = ifelse(is.na(year), year_, year),
         surveyid_year  = ifelse(is.na(surveyid_year), surveyid_year_, surveyid_year)) |>
  select(country_code, survey_year, survey_acronym, reporting_level, cpi_domain) |>
  setDT()
# check for duplicated keys (FALSE means the key variables are unique)
any(duplicated(pfw_cpi_key,
               by = c("country_code", "survey_year", "survey_acronym", "cpi_domain")))
```
To merge pfw data with any Group two dataset (gdp, pop, gdm, pce, *ppp*) or with the Group three dataset cpi:
1. Merge the *pfw_cpi_key* data with the pfw dataset.
2. Rename the ???\_data_level variable in the other dataset to reporting_level, where ??? is pop, cpi, gdp, pce, or ppp, depending on the auxiliary file. For example:
```{r}
# merging pfw, cpi, pop, pce, gdp, gdm and ppp
# first, merge pfw data with the pfw_cpi_key dataset
pfw <- pipload::pip_load_aux("pfw")
pfw <- pfw_cpi_key[pfw, on = .(country_code, survey_year,
                               survey_acronym, cpi_domain)]
# merge pfw with cpi
cpi <- pipload::pip_load_aux("cpi")
cpi <- cpi[, -c("cpi_domain")] |> # since it is available in pfw
  setnames("cpi_data_level", "reporting_level")
pfw_cpi <- cpi[pfw, on = .(country_code, survey_year,
                           survey_acronym, reporting_level)]
# merge pfw_cpi with pop
pop <- pipload::pip_load_aux("pop")
pop <- pop[, -c("pop_domain")] |> # since it is available in pfw
  setnames("pop_data_level", "reporting_level")
pfw_cpi_pop <- pop[pfw_cpi, on = .(country_code, year, reporting_level)]
# merge pfw_cpi_pop with pce
pce <- pipload::pip_load_aux("pce")
pce <- pce[, -c("pce_domain")] |> # since it is available in pfw
  setnames("pce_data_level", "reporting_level")
pfw_cpi_pop_pce <- pce[pfw_cpi_pop, on = .(country_code, year, reporting_level)]
# merge pfw_cpi_pop_pce with gdp
gdp <- pipload::pip_load_aux("gdp")
gdp <- gdp[, -c("gdp_domain")] |> # since it is available in pfw
  setnames("gdp_data_level", "reporting_level")
pfw_cpi_pop_pce_gdp <- gdp[pfw_cpi_pop_pce,
                           on = .(country_code, year, reporting_level)]
# merge pfw_cpi_pop_pce_gdp with ppp
# taking the default ppp year
ppp <- pipload::pip_load_aux("ppp")
ppp <- ppp[ppp_default == TRUE,
           .(country_code, ppp_year, ppp, ppp_data_level)] |>
  setnames("ppp_data_level", "reporting_level")
# check for fully duplicated rows before merging
any(duplicated(ppp))
pfw_cpi_pop_pce_gdp_ppp <- ppp[pfw_cpi_pop_pce_gdp,
                               on = .(country_code, reporting_level)]
# alternative: reshape ppp to wide format, one column per ppp_year
#dcast(ppp, formula = country_code + reporting_level ~ ppp_year, value.var = "ppp")
```
As the example above demonstrates, the pfw_cpi_key data is the key to merging pfw data with any of the other auxiliary files.
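The reshaping of ppp mentioned earlier can be sketched with `dcast()` instead of filtering on the default PPP year. This is only an illustration under stated assumptions: `ppp_toy` is a hypothetical table with made-up values and a single version per `ppp_year`, whereas the real ppp file also carries release_version and adaptation_version, which would need to be filtered or included in the formula.

```r
library(data.table)

# Hypothetical stand-in for the ppp auxiliary file (values are made up)
ppp_toy <- data.table(
  country_code   = c("AAA", "AAA", "BBB", "BBB"),
  ppp_year       = c(2011, 2017, 2011, 2017),
  ppp_data_level = c("national", "national", "national", "national"),
  ppp            = c(1.5, 1.8, 20, 25)
)

# Reshape to wide: one row per country and data level,
# one column per ppp_year
ppp_wide <- dcast(ppp_toy,
                  country_code + ppp_data_level ~ ppp_year,
                  value.var = "ppp")

# Give the year columns descriptive names and harmonise the level variable
setnames(ppp_wide,
         c("2011", "2017", "ppp_data_level"),
         c("ppp2011", "ppp2017", "reporting_level"))
```

After reshaping, `country_code` and `reporting_level` uniquely identify each row, so the wide table maps one-to-one onto the other Group two datasets for a given year.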