Add wrapper for csg #286
Merged · 28 commits · Mar 27, 2025
Commits
f29dbb5
Add tests for wrapper and set_priority_coords
JGuetschow Nov 4, 2024
02981e6
Add metadata to wrapper test
JGuetschow Nov 4, 2024
1a5c3fa
Fix metadata in wrapper test
JGuetschow Nov 4, 2024
e719590
Added docs for csg wrapper and link filling strategies in csg docs
JGuetschow Nov 5, 2024
2e94b5c
Add 'FX' to letter codes for data reading
JGuetschow Jan 14, 2025
47ecc41
Added footnote marker removal to string data processing in data readi…
JGuetschow Jan 24, 2025
9ba0a31
Merge branch 'main' into CRT_reading_fixes
JGuetschow Jan 24, 2025
0263931
Added NE(1) to special codes for data reading
JGuetschow Jan 24, 2025
70a05bc
add some special codes for data reading (for Venezuela BTR1)
JGuetschow Mar 5, 2025
9f4067b
Merge branch 'main' into CRT_reading_fixes
JGuetschow Mar 11, 2025
b755a96
Change treatment of FX code. It is now considered before other codes.
JGuetschow Mar 11, 2025
857f5e8
skipna=True in sum for min_count to be effective
JGuetschow Mar 11, 2025
b1c396d
Merge branch 'nansum_convert' into CRT_reading_fixes
JGuetschow Mar 11, 2025
afb64f1
Merge branch 'main' into CRT_reading_fixes
JGuetschow Mar 19, 2025
4c5b6cf
Merge branch 'main' into CRT_reading_fixes
JGuetschow Mar 25, 2025
e4e8abd
Added changelog files and updated docs
JGuetschow Mar 25, 2025
8680fdc
merge main into csg_regression_test
JGuetschow Mar 25, 2025
e5b541e
Merge branch 'CRT_reading_fixes' into csg_regression_test
JGuetschow Mar 25, 2025
cf9dcb5
* Added option to pass pd.DatetimeIndex to the csg wrapper instead of…
JGuetschow Mar 25, 2025
453591d
* Updated wrapper docstring
JGuetschow Mar 25, 2025
5ff7f42
* Added more flexibility in input data types for the time range in th…
JGuetschow Mar 25, 2025
a521e50
Accept suggestions from review
JGuetschow Mar 26, 2025
972f1b9
style: update docs
mikapfl Mar 26, 2025
104f107
Merge branch 'csg_regression_test' into csg_regression_test_mp
mikapfl Mar 26, 2025
5f5c4d3
adapt docs
JGuetschow Mar 26, 2025
73df2e4
docs: regenerate docs
mikapfl Mar 26, 2025
3813dae
Merge branch 'csg_regression_test' into csg_regression_test_mp
mikapfl Mar 26, 2025
bd02ecf
Merge pull request #327 from primap-community/csg_regression_test_mp
JGuetschow Mar 26, 2025
1 change: 1 addition & 0 deletions changelog/286.improvement.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Added a wrapper for the csg `compose` function that handles input data preparation (removing data not needed in the process) and output data handling (setting coords and metadata).
2 changes: 2 additions & 0 deletions changelog/323.fix.md
@@ -0,0 +1,2 @@
* Fix a pandas stack issue in GHG_inventory_reading
* Fix `skipna` in conversions
1 change: 1 addition & 0 deletions changelog/323.improvement.md
@@ -0,0 +1 @@
Added additional non-numerical codes in data reading functions.
1 change: 1 addition & 0 deletions docs/source/api/index.rst
@@ -63,6 +63,7 @@ source priorities and matching algorithms.
csg.StrategyUnableToProcess
csg.SubstitutionStrategy
csg.compose
csg.create_composite_source


.. currentmodule:: xarray
9 changes: 7 additions & 2 deletions docs/source/data_reading/index.md
@@ -74,21 +74,26 @@ NaN. For example "IE" stands for "included elsewhere" and thus it has to be
mapped to 0 to show that emissions in this timeseries are 0 and not missing.

As a default, we use easy rules combined with defined mappings for special cases.
The rules are
The rules are as follows; each data point is tested against them in the order given below.

- If the code contains `FX` it is mapped to `np.nan`
- If the code contains `IE` and/or `NO` it is mapped to 0
- If the code contains `NE` and/or `NA` but neither `IE` nor `NO`, it is mapped to NaN.
- If the code contains `NE` and/or `NA` but neither `IE` nor `NO`, it is mapped to `np.nan`.

The special cases are

```python
_special_codes = {
"C": np.nan,
"CC": np.nan,
"CH4": np.nan, # TODO: move to user passed codes in CRT reading
"nan": np.nan,
"NaN": np.nan,
"-": 0,
"NE0": np.nan,
"NE(1)": np.nan,
"": np.nan,
"FX": np.nan,
}
```
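The rule order described above (with `FX` checked before the other letter codes, as changed in this PR) can be sketched as a small mapping helper. This is an illustrative sketch, not the actual primap2 implementation; `map_code` and the trimmed-down special codes dict are hypothetical names:

```python
import numpy as np

# Hypothetical sketch of the default code mapping (not the actual
# primap2 implementation). Exact special-case matches are checked first,
# then the letter-code rules in the documented order.
_special_codes = {"C": np.nan, "-": 0.0, "NE(1)": np.nan, "": np.nan}

def map_code(code: str) -> float:
    """Map a non-numerical reporting code to a numerical value."""
    code = code.strip()
    if code in _special_codes:
        return _special_codes[code]
    if "FX" in code:  # FX is considered before the other codes
        return np.nan
    if "IE" in code or "NO" in code:  # included elsewhere / not occurring -> 0
        return 0.0
    if "NE" in code or "NA" in code:  # not estimated / not applicable -> NaN
        return np.nan
    raise ValueError(f"unknown code: {code!r}")
```

Note how the ordering encodes the "but neither `IE` nor `NO`" condition of the last rule: by the time the `NE`/`NA` check runs, codes containing `IE` or `NO` have already been mapped to 0.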

3 changes: 0 additions & 3 deletions docs/source/data_reading/test_csv_data_sec_cat_if.yaml
@@ -2,9 +2,6 @@ attrs:
area: area (ISO3)
cat: category (IPCC2006)
scen: scenario (general)
sec_cats:
- Class (class)
- Type (type)
data_file: test_csv_data_sec_cat_if.csv
dimensions:
'*':
64 changes: 62 additions & 2 deletions docs/source/usage/csg.md
@@ -30,6 +30,8 @@ When no missing information is left in the result timeseries, the algorithm term
It also terminates if all source timeseries are used, even if missing information is
left.

## The `compose` function

The core function to use is the {py:func}`primap2.csg.compose` function.
It needs the following input:

@@ -111,8 +113,7 @@ priority_definition = primap2.csg.PriorityDefinition(
```

```{code-cell} ipython3
# Currently, there is only one strategy implemented, so we use
# the empty selector {}, which matches everything, to configure
# We use the empty selector {}, which matches everything, to configure
# to use the substitution strategy for all timeseries.
strategy_definition = primap2.csg.StrategyDefinition(
strategies=[({}, primap2.csg.SubstitutionStrategy())]
@@ -125,6 +126,7 @@ result_ds = primap2.csg.compose(
priority_definition=priority_definition,
strategy_definition=strategy_definition,
progress_bar=None, # The animated progress bar is useless in the generated documentation
)

result_ds
@@ -162,3 +164,61 @@ category 1 "lowpop" was preferred.
For category 0, the initial timeseries did not contain NaNs, so no filling was needed.
For category 1, there was information missing in the initial timeseries, so the
lower-priority timeseries was used to fill the holes.

## The `create_composite_source` wrapper function

The {py:func}`primap2.csg.compose` function creates a composite time series according to
the given priorities and strategies, but it does not take care of pre- and postprocessing
of the data. It carries along unnecessary data, and the resulting dataset will lack the
priority coordinates. The {py:func}`primap2.csg.create_composite_source` function takes care
of these steps: it prepares the input data and completes the output into a primap2 dataset
with all desired dimensions and metadata.

The function takes the same inputs as {py:func}`primap2.csg.compose` with additional input to
define pre- and postprocessing:

* **result_prio_coords** Defines the values for the priority coordinates in the output dataset. As the
priority coordinates differ between the input sources, there is no canonical value
for the result, so it has to be defined explicitly.
* **metadata** Set metadata values such as title and references.

```{code-cell} ipython3
result_prio_coords = {
"source": {"value": "PRIMAP-test"},
"scenario": {"value": "HISTORY", "terminology": "PRIMAP"},
}
metadata = {"references": "test-data", "contact": "[email protected]"}

```

* **limit_coords** Optional parameter to remove data for coordinate values not needed for the
composition from the input data. The time coordinate is treated separately.
* **time_range** Optional parameter to limit the time coverage of the input data. It can either be a pandas `DatetimeIndex` or a tuple of `str` or datetime-like values in the form `(year_from, year_to)`, where both boundaries are included in the range. Only the overlap of the supplied index (or the index created from the tuple) with the time coordinate of the input dataset is used.


```{code-cell} ipython3
limit_coords = {'area (ISO3)': ['COL', 'ARG', 'MEX']}
time_range = ("2000", "2010")

```

```{code-cell} ipython3
complete_result_ds = primap2.csg.create_composite_source(
input_ds,
priority_definition=priority_definition,
strategy_definition=strategy_definition,
result_prio_coords=result_prio_coords,
limit_coords=limit_coords,
time_range=time_range,
metadata=metadata,
progress_bar=None,
)

complete_result_ds
```


## Filling strategies
Currently, the following filling strategies are implemented:
* Global least square matching: {py:func}`primap2.csg.GlobalLSStrategy`
* Straight substitution: {py:func}`primap2.csg.SubstitutionStrategy`
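The two strategies can be combined in one `StrategyDefinition`. This is an illustrative sketch assuming default constructor arguments for `GlobalLSStrategy`:

```python
# Sketch: try global least square matching for CO2, and fall back to
# plain substitution for all other timeseries. Selectors are applied
# from left to right, so the empty selector {} acts as the default.
strategy_definition = primap2.csg.StrategyDefinition(
    strategies=[
        ({"entity": "CO2"}, primap2.csg.GlobalLSStrategy()),
        ({}, primap2.csg.SubstitutionStrategy()),
    ]
)
```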
6 changes: 4 additions & 2 deletions primap2/_convert.py
@@ -179,7 +179,9 @@ def _fill_category(
continue

# the left-hand side of the conversion formula summed up
lhs = (input_factors * self._da.loc[input_selection]).sum(dim=dim, min_count=1)
lhs = (input_factors * self._da.loc[input_selection]).sum(
dim=dim, min_count=1, skipna=True
)
# the right-hand side of the conversion formula split up
rhs = lhs / output_factors

@@ -190,7 +192,7 @@
da = da.reindex({"category (IPCC2006)": new_categories}, fill_value=np.nan)
new_output_selection = output_selection.copy()
new_output_selection[new_dim] = new_category
da.loc[new_output_selection] = rhs.sum(dim=new_dim, min_count=1)
da.loc[new_output_selection] = rhs.sum(dim=new_dim, min_count=1, skipna=True)
return output_selection[new_dim], da
else:
da.loc[output_selection] = rhs
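The effect of adding `skipna=True` so that `min_count` takes effect (per the commit "skipna=True in sum for min_count to be effective") can be illustrated with plain pandas, which follows the same `sum` semantics as the xarray call in the diff above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

# With skipna=True, NaNs are ignored in the sum, and min_count=1 turns
# rows without any valid value into NaN instead of 0.0.
res = df.sum(axis=1, min_count=1, skipna=True)
# row 0 has one valid value -> 1.0; row 1 is all-NaN -> NaN
```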
2 changes: 2 additions & 0 deletions primap2/csg/__init__.py
@@ -13,6 +13,7 @@
from ._strategies.exceptions import StrategyUnableToProcess
from ._strategies.global_least_squares import GlobalLSStrategy
from ._strategies.substitution import SubstitutionStrategy
from ._wrapper import create_composite_source

__all__ = [
"GlobalLSStrategy",
@@ -21,4 +22,5 @@
"StrategyUnableToProcess",
"SubstitutionStrategy",
"compose",
"create_composite_source",
]
168 changes: 168 additions & 0 deletions primap2/csg/_wrapper.py
@@ -0,0 +1,168 @@
from datetime import datetime

import numpy as np
import pandas as pd
import tqdm
import xarray as xr

from ._compose import compose
from ._models import PriorityDefinition, StrategyDefinition


def set_priority_coords(
ds: xr.Dataset,
dims: dict[str, dict[str, str]],
) -> xr.Dataset:
"""Set values for priority coordinates in output dataset.

Parameters
----------
ds
Input dataset.
dims
Values to be set for priority coordinates. The format is
{"name": {"value": value, "terminology": terminology}}, where the
terminology is optional.
"""
for dim in dims:
if "terminology" in dims[dim]:
terminology = dims[dim]["terminology"]
else:
terminology = None
ds = ds.pr.expand_dims(dim=dim, coord_value=dims[dim]["value"], terminology=terminology)
return ds


def create_composite_source(
input_ds: xr.Dataset,
priority_definition: PriorityDefinition,
strategy_definition: StrategyDefinition,
result_prio_coords: dict[str, dict[str, str]],
limit_coords: dict[str, str | list[str]] | None = None,
time_range: tuple[str | np.datetime64, str | np.datetime64] | pd.DatetimeIndex | None = None,
metadata: dict[str, str] | None = None,
progress_bar: type[tqdm.tqdm] | None = tqdm.tqdm,
) -> xr.Dataset:
"""Create a composite data source

This is a wrapper around `primap2.csg.compose` that prepares the input data and sets result
values for the priority coordinates.

Parameters
----------
input_ds
Dataset containing all input data
priority_definition
Defines the priorities to select timeseries from the input data. Priorities
are formed by a list of selections and are used "from left to right", where the
first matching selection has the highest priority. Each selection has to specify
values for all priority dimensions (so that exactly one timeseries is selected
from the input data), but can also specify other dimensions. That way it is,
e.g., possible to define a different priority for a specific country by listing
it early (i.e. with high priority) before the more general rules which should
be applied for all other countries.
You can also specify the "entity" or "variable" in the selection, which will
limit the rule to a specific entity or variable, respectively. For each
DataArray in the input_data Dataset, the variable is its name, the entity is
the value of the key `entity` in its attrs.
strategy_definition
Defines the filling strategies to be used when filling timeseries with other
timeseries. Again, the priority is defined by a list of selections and
corresponding strategies which are used "from left to right". Selections can use
any dimension and don't have to apply to only one timeseries. For example, to
define a default strategy which should be used for all timeseries unless
something else is configured, configure an empty selection as the last
(rightmost) entry.
You can also specify the "entity" or "variable" in the selection, which will
limit the rule to a specific entity or variable, respectively. For each
DataArray in the input_data Dataset, the variable is its name, the entity is
the value of the key `entity` in its attrs.
result_prio_coords
Defines the values for the priority coordinates in the output dataset. As the
priority coordinates differ between the input sources, there is no canonical value
for the result, so it has to be defined explicitly.
limit_coords
Optional parameter to remove data for coordinate values not needed for the
composition from the input data. The time coordinate is treated separately.
time_range
Optional parameter to limit the time coverage of the input data.
Can either be a pandas `DatetimeIndex` or a tuple of `str` or `np.datetime64` in
the form (year_from, year_to) where both boundaries are included in the range.
Only the overlap of the supplied index or index created from the tuple with
the time coordinate of the input dataset will be used.
metadata
Set metadata values such as title and references.
progress_bar
By default, show progress bars using the tqdm package during the
operation. If None, don't show any progress bars. You can supply a class
compatible to tqdm.tqdm's protocol if you want to customize the progress bar.

Returns
-------
xr.Dataset with composed data according to the given priority and strategy
definitions
"""
# limit input data to these values
if limit_coords is not None:
if "variable" in limit_coords:
variable = limit_coords.pop("variable")
input_ds = input_ds[variable].pr.loc[limit_coords]
else:
input_ds = input_ds.pr.loc[limit_coords]

# set time range according to input
if time_range is not None:
time_index = create_time_index(time_range)
time_index = time_index.intersection(input_ds.coords["time"])
input_ds = input_ds.pr.loc[{"time": time_index}]

# run compose
result_ds = compose(
input_data=input_ds,
priority_definition=priority_definition,
strategy_definition=strategy_definition,
progress_bar=progress_bar,
)

# set priority coordinates
result_ds = set_priority_coords(result_ds, result_prio_coords)

if metadata is not None:
for key in metadata.keys():
result_ds.attrs[key] = metadata[key]

result_ds.pr.ensure_valid()
return result_ds


def create_time_index(
time_range: tuple[
str | np.datetime64 | datetime | pd.Timestamp, str | np.datetime64 | datetime | pd.Timestamp
]
| pd.DatetimeIndex
| None = None,
) -> pd.DatetimeIndex:
"""
Unify different input options for a time range to a `pd.DatetimeIndex`.

Parameters
----------
time_range
Can either be pandas `DatetimeIndex` or a tuple of `str` or datetime-like in
the form (year_from, year_to) where both boundaries are included in the range.
Only the overlap of the supplied index or index created from the tuple with
the time coordinate of the input dataset will be used.

Returns
-------
Pandas DatetimeIndex according to the time range input
"""

if isinstance(time_range, pd.DatetimeIndex):
time_index = time_range
elif isinstance(time_range, tuple):
time_index = pd.date_range(time_range[0], time_range[1], freq="YS", inclusive="both")
else:
raise ValueError("time_range must be a datetime index or a tuple")

return time_index
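The tuple branch of `create_time_index` can be reproduced standalone with pandas; a sketch mirroring the yearly (`YS`) frequency used in the code above:

```python
import pandas as pd

# A (year_from, year_to) tuple becomes a year-start DatetimeIndex with
# both boundaries included, matching the pd.date_range call above.
idx = pd.date_range("2000", "2010", freq="YS", inclusive="both")
# 11 yearly timestamps: 2000-01-01 through 2010-01-01
```

The wrapper then intersects this index with the dataset's `time` coordinate, so timestamps outside the data's coverage are silently dropped.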
2 changes: 1 addition & 1 deletion primap2/pm2io/_GHG_inventory_reading.py
@@ -163,7 +163,7 @@
if header_long is None:
header_long = ["category", "orig_cat_name", "entity", "unit", "time", "data"]

df_stacked = df_nir.stack([0, 1], dropna=False).to_frame()
df_stacked = df_nir.stack([0, 1], future_stack=True).to_frame()
df_stacked.insert(0, "year", str(year))
df_stacked = df_stacked.reset_index()
df_stacked.columns = header_long