Add notes about data structures to documentation #329

Open
wants to merge 5 commits into
base: main

crdanielbusch
Collaborator

@crdanielbusch crdanielbusch commented Apr 8, 2025

Pull request

Please confirm that this pull request has done the following:

  • Tests added
  • Documentation added (where applicable)
  • Description in a {pr}.thing.md file in the directory changelog added - see changelog/README.md for details

Description

Add the notes on data structure research so they don't get lost

codecov bot commented Apr 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.26%. Comparing base (e8e2a4f) to head (c55395c).

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #329   +/-   ##
=======================================
  Coverage   97.26%   97.26%           
=======================================
  Files          54       54           
  Lines        5299     5299           
=======================================
  Hits         5154     5154           
  Misses        145      145           

@crdanielbusch crdanielbusch requested a review from mikapfl April 8, 2025 13:41
@crdanielbusch
Collaborator Author

@mikapfl I think these are the essential aspects of our research. Let me know if you think of anything else.

Member

@mikapfl mikapfl left a comment

Nice, I have some small suggestions, but generally this should be useful if we later decide to pick it up again.

Comment on lines +48 to +70
<table>
<tr>
<th>Pro</th>
<th>Con</th>
</tr>
<tr>
<td>
<ul>
<li>Supports conversion from/to NetCDF and Zarr.</li>
<li>Allows us to stay within the xarray ecosystem (as opposed to SQL or pandas).</li>
<li>All pre-processing steps can be covered in DataTree.</li>
</ul>
</td>
<td>
<ul>
<li>Limited examples and documentation available (as of April 2025).</li>
<li>Requires making our primap2 functions compatible with DataTree.</li>
<li>Processing DataTrees may involve looping through sub-datasets. Is it faster than merging once?</li>
<li>The hierarchical concept does not align well with our use case. All country datasets would be "siblings," but this is not necessarily a problem.</li>
</ul>
</td>
</tr>
</table>
Member

Suggested change
| Pro | Con |
|---|---|
| Supports conversion from/to NetCDF and Zarr. | Limited examples and documentation available (as of April 2025). |
| Allows us to stay within the xarray ecosystem (as opposed to SQL or pandas). | Requires making our primap2 functions compatible with DataTree. |
| All pre-processing steps can be covered in DataTree. | Processing DataTrees may involve looping through sub-datasets. Is it faster than merging once? |
| | The hierarchical concept does not align well with our use case. All country datasets would be "siblings," but this is not necessarily a problem. |

The result is not exactly the same, but if possible, avoid HTML and use Markdown instead.

Example: A script like `src/unfccc_ghg_data/unfccc_crf_reader/crf_raw_for_year.py` compiles all CRF (or CRT) datasets for a submission year into one dataset. Since countries report at different levels of detail, categories unique to one dataset result in NaNs in others, making datasets sparse and memory-intensive. We aim to significantly reduce memory usage.
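
To make the sparsity problem concrete, here is a minimal sketch using plain xarray. The helper `make_country_ds`, the country codes, and the category codes are hypothetical placeholders and are not taken from the CRF reader. It shows only how combining countries that report different categories pads the result with NaNs.

```python
import numpy as np
import xarray as xr


def make_country_ds(country, categories):
    """Build a toy single-country dataset (hypothetical helper, placeholder values)."""
    data = np.ones((1, len(categories)))
    return xr.Dataset(
        {"CO2": (("area", "category"), data)},
        coords={"area": [country], "category": categories},
    )


# Two countries reporting at different levels of detail (made-up codes).
ds_a = make_country_ds("AAA", ["1", "1.A", "1.A.1"])
ds_b = make_country_ds("BBB", ["1", "2", "2.B"])

# An outer join keeps the union of all categories, so every category that
# only one country reports becomes NaN for the other country.
combined = xr.concat([ds_a, ds_b], dim="area", join="outer")

total_cells = combined["CO2"].size
filled_cells = int(combined["CO2"].notnull().sum())
print(f"{filled_cells} of {total_cells} cells hold data")
```

In this toy case, 4 of the 10 cells are NaN padding. With many countries and deep category trees the padded share grows accordingly, which is the memory cost described above.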

* **Select**
Filtering datasets by dimensions such as country, sector, or gas is essential. The `primap2` `.loc` function already supports this.
Member

Suggested change
Filtering datasets by dimensions such as country, sector, or gas is essential. The {py:meth}`xarray.Dataset.pr.set` function already supports this, but not for DataTrees.

We can add a proper link using the {py:meth} syntax.

@@ -0,0 +1,138 @@
### Data structures for primap2

Member

Suggested change
Here, we document an unfinished proposal for more advanced data structures for sparse data within primap2.

@mikapfl
Member

mikapfl commented Apr 9, 2025

Also, maybe "Data Structures for primap" is a bit too general as a title. Maybe "Sparse data ideas" instead, to make clear in the title already that this is not currently implemented functionality?
