joining_data.Rmd

# Joining Data

Since the PIP project is comprised of several databases at very different domain
levels, this chapter provides all the guidelines to join data frames correctly.
The first thing to understand is that the reporting measures in PIP (e.g.,
poverty and inequality) are uniquely identified by four variables: *Country*,
*year*, *domain*, and *welfare type*.

-   *Country* refers to independent economies that conduct independent household
    surveys. For instance, China and Taiwan are treated as **two different
    economies** by the World Bank, and hence by PIP, even though under some
    criteria some people think that Taiwan is part of China.

-   *Year* refers to reporting year rather than the actual calendar years over
    which the survey was conducted. Some household surveys like India 2011/2012
    are conducted over the of two calendar year, but the welfare aggregate is
    deflated to the reporting year, 2011.

-   *Domain* refers to the smallest geographical disaggregation for which it is
    possible to deflate and line up to PPP values the welfare aggregate of a
    household survey. The criteria to determine the reporting domain of a
    household survey is still under consideration, but ideally it such for which
    there is CPI, PPP, and population auxiliary data, as well as a household
    survey representative at that level. There are some exceptions to this
    criterion like China or the Philippines, but this cases explained in
    detailed in Section \@ref(special-data-cases). As of today
    (`r format(Sys.Date(), "%B %d, %Y")`), most country/years are reported at
    the national domain and few are reported at the urban/rural domain. However,
    the PIP technical infrastructure has been designed to incorporate other
    domain levels if, at some point in time, it is the case.

-   Finally, the *welfare type* specifies whether the welfare aggregate is based
    on income or in consumption. For the latter case, though some households
    surveys capture expenditure instead, they are still considered
    consumption-based surveys.

The challenge of joining different data frames in PIP is that these four
variables that uniquely identify the reporting measures [**are
not**]{style="color:red"} available on any of the PIP data files---with the
exception of the cache files that we discuss below. This challenge is easily
addressed by having a clear understanding of the Price FrameWork (pfw) data
frame. This file does not only contain valuable metadata, but it could also be
considered as the anchor among all PIP data.

## The Price FrameWork (pfw) data {#pfw-join}

As always, this file can be loaded by typing,

```{r}
library(data.table)
pfw <- pipload::pip_load_aux("pfw")
joyn::is_id(pfw, by = c("country_code", "surveyid_year", "survey_acronym"))
```

First of all, notice that `pfw` is uniquely identified by country code, survey
year, and survey acronym. The reason for this is that pfw aims at providing a
link between every single household survey and all the other auxiliary data.
Since welfare data is stored following the naming convention of the
[International Household Survey Network (IHSN)](http://ihsn.org/), data is
stored according to country, survey year, acronym of the survey, and vintage
control of master and alternative versions. The vintage control of the master
and alternative version of the data is not relevant for joining data because PIP
uses, by default, the most recent version.

Keep in mind that PIP estimates are reported at the country, year, domain, and
welfare type level, but the last two of these are not found either in the survey
ID nor as unique identifiers of the pfw. To solve this problem, the pfw data
makes use of the variables `welfare_type,` `aaa_domain`, and `aaa_domain_var`.

As the name suggests, `welfare_type` indicates the **main** welfare aggregate
type (i.e, income or consumption ) of the variable `welfare` in the GMD datasets
that correspond to the survey ID formed by concatenating variables
`country_code`, `surveyid_year`, and `survey_acronym`. For example, the
`welfare_type` of the `welfare` variable in the datasets of
***COL_2018_GEIH**\_V01_M\_V03_A\_GMD* is income.

```{r}
pfw[ country_code    == "COL"
    & surveyid_year  == 2018
    & survey_acronym == "GEIH", # Not necessary since it is the only one
    unique(welfare_type)]
```

The prefix `aaa` in variables `aaa_domain` and `aaa_domain_var` refers to the
identification code of any of the auxiliary data. Thus, you will find a
`gdp_domain`, `cpi_domain`, `ppp_domain` and several others. All `aaa_domain`
variables contain the *lower* level of geographical disaggregation of the
corresponding `aaa` auxiliary data. There are only three possible levels of
disaagregation,

```{r, echo=FALSE}
dd <- tibble::tribble(
        ~domain.value,      ~meaning,
                   1L,    "national",
                   2L, "urban/rural",
                   3L, "subnational"
        )
knitr::kable(dd, 
             format =  "html", 
             col.names = gsub("[.]", " ", names(dd)))
```

As of now, no survey or auxiliary data is broken down at level 3 (i.e.,
subnational), but it is important to know that the PIP internal code takes that
possibility into account for future cases.

Depending on the country, the domain level of each auxiliary data might be
different. In Indonesia, for instance, the CPI domain is national, whereas the
PPP domain is "urban/rural."

```{r}
pfw[ country_code == "IDN" & surveyid_year == 2018, 
    .(cpi = unique(cpi_domain), 
      ppp = unique(ppp_domain))]
```

Finally, [and this is really important]{style="color:red"}, variables
`aaa_domain_var` contains the name of variable in the GMD dataset that uniquely
identify the household survey *in the corresponding `aaa`* *auxiliary data*. In
other words, `aaa_domain_var` contains *the name* of the variable in GMD that
must be used as *key* *to join* GMD to `aaa`. You may ask, does the name of the
variable in the `aaa` auxiliary data have the same variable name in the GMD data
specified in `aaa_domain_var`? No, it does not. Since the domain level to
identify observations in the `aaa` auxiliary data is unique, there is only one
variable in auxiliary data used to merge any welfare data, `aaa_data_level`.
Since all this process is a little cumbersome, the
{[pipdp](https://github.com/PIP-Technical-Team/pipdp)} Stata package, during the
process of cleaning GMD databases to PIP databases, creates as many
`aaa_data_level` variables as needed in order to make the join of welfare data
and auxiliary data simpler. You can see the lines of code that create these
variables in [this
section](https://github.com/PIP-Technical-Team/pipdp/blob/9c4e32636dd1c71954816f8a7c8ad743349959a6/pipdp_md_clean.ado#L225-L294)
of the file "pipdp_md_clean.ado."[^joining_data-1]

[^joining_data-1]: You can find more information about the conversion from GMD
    to PIP databases in Section \@ref(welfare-data)

### Joining data example

Let's see the case of Indonesia above. The pfw says that the CPI domain is
"national" and the PPP domain is "urban/rural." That means that the welfare data
join to each of these auxiliary data with two different variables,

```{r}
domains <- 
  pfw[ country_code == "IDN" & surveyid_year == 2018, 
     .(cpi = unique(cpi_domain_var), 
       ppp = unique(ppp_domain_var))][]
```

This says that the name of the variable in the welfare data to join PPP data is
called `uban`, but there is not seem to be a variable name in GMD to join the
CPI data. When the name of the variable is missing, it indicates that the
welfare data is not split by any variable to merge CPI data. That is, it is at
the national level.

```{r}
ccode  <- "CHN"
cpi <- pipload::pip_load_aux("cpi")
ppp <- pipload::pip_load_aux("ppp")

CHN <-  pipload::pip_load_data(country = ccode, 
                              year    = 2015)

dt <- joyn::merge(CHN, cpi,
                  by = c("country_code", "survey_year",
                         "survey_acronym", "cpi_data_level"),
                  match_type = "m:1", 
                  keep = "left")
```

## Special data cases

[NOTE: Andres. Add this section]{style="color:red"}