auxiliary_data.Rmd

# Auxiliary data

```{r, echo=FALSE, include=FALSE}
library(data.table)
library(pipaux)
```

As it was said in Section \@ref(data-acquisition), auxiliary data is the data
used to temporally deflate and line up welfare data, with the objective of
getting poverty estimates comparable over time, across countries, and, more
importantly, being able to estimate regional and global estimates. Yet,
auxiliary data also refers to metadata with functional and qualitative
information. Functional information is such that is used in internal
calculations such as time comparability or surveys availability. Qualitative
information is just useful information that does not affect, neither depend on,
quantitative data. It is primary collected and made available for the end user.

As explain in Chapter \@ref(folder-structure), all auxiliary data is stored in
`"y:/PIP-Data/_aux/"`.

```{r aux-dir, echo=FALSE}
fs::dir_tree("y:/PIP-Data/_aux/", 
             recurse = FALSE, 
             regex = ".*(?<!\\.txt)$", 
             perl = TRUE)
```

The naming convention of subfolders inside the \_aux directory is useful because
auxiliary data is commonly referred to in all technical processes by its
convention rather than by it actual name. For instance Gross Domestic Product or
Purchasing Power Parity are better known as gdp and ppp, respectively. Yet,
other measures such as national population or consumption also make use of
conventions.

In this chapter you will learn everything related to each of the files that
store auxiliary data. Notice that the chapter is structured by files rather than
by measures or types of auxiliary data because you may find more than one
measure in one file.

The R package that manages auxiliary data is `{pipaux}`.

As explained in Chapter \@ref(folder-structure), within the folder of each
auxiliary file, you will find, at a minimum, a `_vintage` folder, one `xxx.fst`,
one `xxx.dta` file, and one `xxx_datasignature.txt` , where `xxx` stands for the
name of the file.

## Population

### Original data

Everything related to population data should be placed in the folder
`y:\PIP-Data\_aux\pop\`. hereafter (`./`).

The population data come from one of two different sources. WDI or an internal
file provided by a member of the DECDG team. Ideally, population data should be
downloaded from WDI, but sometimes the most recent data available has not been
uploaded yet, so it needs to be collected internally in DECDG. As of now
(`r format(Sys.time(), "%B %d, %Y")`), the DECDG focal point to provide the
population data is [Emi Suzuki](mailto:esuzuki1@worldbank.org). You just need to
send her an email, and she will provide the data to you.

If the data is provided by DECDG, it should be stored in the folder
`./raw_data`. The original excel file must be placed without modification in the
folder `./raw_data/original`. Then, the file is copied again one level up into
the folder `./raw_data` with the name `population_country_yyyy-mm-dd.xlsx` where
`yyyy-mm-dd` refers to the official release date of the population data. Notice
that for countries PSE, KWT and SXM, some years of population data are missing
in the DECDG main file and hence in WDI. Here we complement the main file with
an additional file shared by Emi to assure complete coverage. This file contains
historical data and will not need to be updated every year. This additional file
has the name convention `population_missing_yyyy-mm-dd.xlsx` and should follow
the same process as the `population_country` file. Once all the files and their
corresponding place, you can update the `./pop.fst` file by typing
`pipaux::pip_pop_update(src = "decdg")`.

If the data comes directly from WDI, you just need to update the file
`./pop.fst` by typing `pipaux::pip_pop_update(src = "wdi")`. It is worth
mentioning that the population codes used in WDI are "SP.POP.TOTL",
"SP.RUR.TOTL", and "SP.URB.TOTL", which are total population, rural population,
and urban population, respectively. If it is the case that PIP begins using
subnational population, a new set of WDI codes should be added to the R script
in
[pipaux::pip_pop_update()](https://github.com/PIP-Technical-Team/pipaux/blob/cd7738f98ec5373db8333c3573700b4991776c8d/R/pip_pop_update.R#L17).

### Data structure

Population data is loaded by typing either `pipload::pip_load_aux("pop")` or
`pipaux::pip_pop("load")`. We highly recommend the former, as `{pipload}` is the
intended R package for loading any PIP data.

```{r}
pop <- pipload::pip_load_aux("pop")
head(pop)
```

## National Accounts

National accounts account for the economic development of a country at an
aggregate or macroeconomic level. These measure are thus useful to interpolate
or extrapolate microeconomic measures mean welfare aggregate or poverty
headcount when household surveys are not available. National accounts work as a
proxy of the economic development that would have been present if household
surveys were available.

There are two main types of national accounts, Household Final Consumption
Expenditure (HFCE) and Gross Domestic Product (GDP)---both in real per capita
terms. Please refer to [Section
5.3](https://povcalnet-team.github.io/Methodology/lineupestimates.html#nationalaccounts)
of [@worldbankPovertyInequalityPlatform2021] to understand the usage of national
accounts data.

### GDP

As explained in [Section
5.3](https://povcalnet-team.github.io/Methodology/lineupestimates.html#nationalaccounts)
of [@worldbankPovertyInequalityPlatform2021], there are three sources of GDP
data, and one more for a few particular cases. The integration of all the
sources of GDP data is performed by `pipaux::pip_gdp_update()`, you'll need to
manually download and store the data from WEO and the data for the special
cases. The national accounts series from WDI are GDP per capita  [series code:
**NY.GDP.PCAP.KD**]. These series are in constant 2010 US\$.

The most recent version of the WEO data most be downloaded from the [World
Economic Outlook
Databases](https://www.imf.org/en/Publications/SPROLLS/world-economic-outlook-databases)
of the IMF.org website and saved as an .xls file in `<maindir>/_aux/weo/`. The
filename should be in the following structure `WEO_<YYYY-DD-MM>.xls`. Due to
potential file corruption the file must be opened and re-saved before it can be
updated with `pip_gdp_weo()`, which is an internal function fo
`pipaux::pip_gdp_update()`. Hopefully in the future IMF will stop using an
\`.xls\` file that's not really xls.

### Consumption (PCE)

Private Consumption Expenditure (pce) is gathered from WDI, with the exception
of a few special cases. As in the case of GDP, the special cases are treated in
the same way with PCE. You only need to execute the function
`pipaux::pip_pce_update()` to update the PCE data. HFCE per capita [series code:
**NE.CON.PRVT.PC.KD**] [@prydzNationalAccountsData2019]. These series are in
constant 2010 US\$.

### National Accounts, Special Cases

Special national accounts are used for lining up poverty estimates in the
following cases[^auxiliary_data-1]:

[^auxiliary_data-1]: The examples of special cases mentioned in this document
    are based on the March 2021 PovcalNet update.

1.  National accounts data are *unavailable* in the latest version of WDI.

    In such cases, national accounts data are obtained, in order of preference,
    from the latest version of WEO, or the latest version of MPD. For example,
    the entire series of GDP per capita for Taiwan, China and Somalia are
    missing in WDI, so WEO series are used instead.

2.  National accounts data are *incomplete* in the latest version of WDI.

    These are the cases where national accounts data are not available in WDI
    for some historical or recent years. In such cases, national accounts data
    in WDI are chained on backward or forward using growth rates from WEO or
    MPD, in that order. For example, GDP per capita for South Sudan (2016-2019)
    are based on the growth rate of GDP per capita from WEO. GDP per capita data
    for Liberia up to 1999 are based on the growth rate in GDP per capita from
    MPD.

3.  The available national accounts data from official sources (e.g. WDI, WEO,
    MPD) are considered to have quality issues.

    This is the case for Syria. Supplementary national accounts data are
    obtained from other sources, including research papers or national
    statistical offices. GDP per capita series for Syria (for 2010 through 2019)
    are from research
    [papers---\@kostialSyriaConflictEconomy2016Gobat](mailto:papers---@kostialSyriaConflictEconomy2016Gobat){.email}
    (for 2011-2015) and @devadasGrowthWarSyria2019 (for 2016-2019)---and are
    chained on backward with growth rates in GDP per capita from WEO. See
    \*y:/PIP-Data/\_aux/sna/\* for more details on how this is implemented.

4.  National accounts data need to be adjusted for the purposes of global
    poverty monitoring.

    This is the case for India. Growth rates in national accounts data for rural
    and urban India after 2014, precisely HFCE (or formerly PCE) per capita from
    WDI, are adjusted with a pass-through rate of 67%, as described in Section 5
    of @castanedaaguilarSeptember2020PovcalNet2020. See
    \*y:/PIP-Data/\_aux/sna/NAS special_2021-01-14.csv\* for more details on how
    this is implemented.

## CPI

### Raw data

General documents on the CPI source and the CPI frameworks are posted here. Yet,
for more details, please refer to [@laknerConsumerPriceIndices2018b;
@azevedoPricesUsedGlobal2018a].

There are three sources of CPI: IFS, WEO, and country team.

In general, the CPI data will be taken from the IMF International Financial
Statistics (IFS). For the incoming update, about 2-3 months to the upload we
initial the request to DECDG CPI team for the three series: annually, quarterly,
monthly from IFS CPI database. The purpose of requesting DECDG CPI team is to
ensure the same vintage will be updated in WDI later in the next update cycle.
We would need the three series as for some countries we only have annually, and
for other countries we could have up to monthly. This is also a check for us in
checking the consistency of annual and monthly series. The monthly series will
be used to construct annual and quarterly series, as that there are some
inconsistences between the annual and monthly series in the IFS. For the
exceptional countries, Poverty GP replaces data series based on previous
consultations when there is no update or better information.

Some countries also use National series which are not available from IFS -- in
this case we check with country poverty TTLs to provide the updated information,
especially when we have a new survey for that country.

In some cases where the CPI value are missing for some sources but not other
sources. This is especially true for the very old year where the data is
available for one source, or very recent year where WEO has a projection on the
CPI while it is not available in IFS. In those cases, we will follow the logic
and method described in the "CPI source document" to ensure we have all CPI
values for all data points.

Global D4G team will prepare and send the CPI raw series as well as the weighted
numbers for the current data points in the system. The data will be query by
datalibweb. The following files will be added to the system each round:

| File ane                     | Description                                                 |
|------------------------------|-------------------------------------------------------------|
| Final_CPI_PPP_to_be_used.dta | final weighted CPI for poverty calculation                  |
| Yearly_CPI_Final.dta         | annual CPI -- combined from different sources using chained |
| methods Yearly_CPI.dta       | annual CPI constructed from the monthly CPI                 |
| Yearly_CPI_Annual.dta        | annual CPI from the annual series                           |
| Quarterly_CPI.dta            | quarterly CPI                                               |
| Monthly_CPI.dta              | monthly CPI series                                          |
| WEO_Yearly_CPI.dta           | annual CPI from WEO                                         |
| Special_CPI_series.dta       | Special case of CPI (national source, imputation)           |

### Vintage control

```{r, echo=FALSE, include=FALSE}
cpi <- pipload::pip_load_aux("cpi")
cpi_id <- cpi[, unique(cpi_id)]
```

Vintage control of the CPI data comes in a similar fashion as welfare data,
`CPI_vXX_M_vXX_A`, where `vXX_M` refers to the version of the master or raw
data, and `vXX_A` refers to the alternative version.

Every year, around November-December, PIP CPI data is updated with the most
recent version of the IMF CPI data, which comes with information for the most
recent year available and with changes/fixes/additions of previous years for
each country. When this happens, the master version of the CPI ID is increased
in one unit before the data is saved. As of today, the current ID is `r cpi_id`.
If data is modified during the rolling of the year, then the alternative version
of the CPI ID is increased in one unit.

### Data structure

When you load CPI data using `pipload::pip_load_aux("cpi")`, the data you get
has already been cleaned for being use in the PIP workflow, and it is slightly
different from the original CPI data stored in datalibweb servers. That is, the
way CPI data is used and referred to datalibweb is different from the way it is
used in PIP even though they both achive the same purpose.

The most important variable in CPI data is, no surprisingly, `cpi`. This
variable however, is not available in the original CPI data from dlw. The
original name of this variable comes in the form `cpiYYYY`, where `YYYY` refers
to the base year of the CPI, which in turn depends on the collection year of the
PPP. Today, this variable is thus "`r getOption("pipaux.cpivar")`." The name of
this variable is stored in the `pipaux.cpivar` object in the `zzz.R` file of the
`{pipaux}` package. This will supdate the option `getOption("pipaux.cpivar")`,
guaranteing that `pipaux::pip_cpi_update()` uses the right variable when
updating the CPI data.

Another important variable in CPI dataframe is `change_cpiYYYY`, where `YYYY`
stands for the base year of the CPI. Since it version control of the CPI data
does not depend on the individual changes in the CPI series of each country but
on the release of new data by the IMF or by additional modifications by the
Poverty GP, variable `change_cpiYYYY` tracks changes in the CPI at the
country/year/survey with respect to the previous version. This is very useful
when you need to identify changes in output measures like poverty rates that
depend on deflation. One possible source of difference is the CPI and this
variable will help you identify whether the number of interest has change
because the CPI has changed.

## PPP

### Raw data

The PPP data is downloaded from ICP website for most of the countries (Ask ICP
team for the link, outlier, countries with changes in currency). Often there is
a GPWG working group to assess the PPP and its impacts on poverty. In this case,
the team would determine the countries for which there is a need to impute the
PPP value, either from the ICP model or the team model.

After the validation and adjustment process, the PPP values are stored in the
data file for all PPP rounds with vintage controls for each round.

The name of the variables in the wide-format file will follow the structure
**ppp_YYYY_vX_vY**. Where, *YYYY* refers to the ICP round. *vX* refers to the
version of the release, and *vY* refers to the adaptation of the release. So, v1
will be the original data, whereas v2 would be the first adaptation or estimates
of the release.

-   YYYY: refers to the ICP round.

-   vX: refers to the version of the release.

-   vY: refers to the adaptation of the release. So, v1 will be the original
    data, whereas v2 would be the first adaptation or estimates of the release.

```{r, echo=FALSE, include=FALSE}
ppp <- pipload::pip_load_aux("ppp")
```

### Data structure

PPP data is available by typing, `pipload::pip_load_aux("ppp")`. As expected,
the data you get has already been cleaned for being use in the PIP workflow, and
it is slightly different from the original PPP data stored in datalibweb
servers. The most important difference between the PIP data frame and the
datalibweb data frame is its rectangular structure. PIP data is in long format,
whereas datalibweb data in wide format.

The reason for having PPP data in long format in PIP is that some countries,
very few, use a different PPP year than the rest of the countries. Instead of
using a different variable for the calculations of those specific countries, we
use the same variable for all the countries but filter the corresponding
observations for each country using metadata from the Price Framework database.

The PPP data is at the country/ppp year/data_level/release version/adapation
version level. Yet, several filters most always be applied before this data can
be used. Ultimately, the data frame should be at the country/data_level level to
used properly. As a general rule, the filter must be done by selecting the most
recent `release_version` and the most recent `adaptation_version` in each year.
Then you can just filter by the PPP year you want to work with. In order to make
this process even easier we have created variables `ppp_default` and
`ppp_default_by_year`, which dummy variables to filter data. If you keep all
observations that `ppp_default == 1` you will get the current PPP used for all
PIP calculations. If you use `ppp_default_by_year == 1`, you get the default
version used in each PPP year. This is useful in case you want to make
comparisons between PPP releases. This two variables are created in function
`pipaux::pip_ppp_clean()` , in particular in [these
lines](https://github.com/PIP-Technical-Team/pipaux/blob/cd7738f98ec5373db8333c3573700b4991776c8d/R/pip_ppp_clean.R#L44-L66).

## Price FrameWork (PFW)

blah

### Original data

asds

## Abbreviations

**HFCE** -- final consumption expenditure

**MDP** -- Maddison Project Database

**PCE** -- private consumption expenditure

**WDI** -- World Development Indicators

**WEO** -- World Economic Outlook