Add notes about data structures to documentation #329
Codecov Report

All modified and coverable lines are covered by tests. All tests successful. No failed tests found.

Coverage Diff

| | main | #329 | +/- |
|---|---|---|---|
| Coverage | 97.26% | 97.26% | |
| Files | 54 | 54 | |
| Lines | 5299 | 5299 | |
| Hits | 5154 | 5154 | |
| Misses | 145 | 145 | |
@mikapfl I think these are the essential aspects of our research. Let me know if you think of anything else.
Nice, I have some small suggestions, but generally this should be useful if we later decide to pick it up again.
<table>
<tr>
<th>Pro</th>
<th>Con</th>
</tr>
<tr>
<td>
<ul>
<li>Supports conversion from/to NetCDF and Zarr.</li>
<li>Allows us to stay within the xarray ecosystem (as opposed to SQL or pandas).</li>
<li>All pre-processing steps can be covered in DataTree.</li>
</ul>
</td>
<td>
<ul>
<li>Limited examples and documentation available (as of April 2025).</li>
<li>Requires making our primap2 functions compatible with DataTree.</li>
<li>Processing DataTrees may involve looping through sub-datasets. Is it faster than merging once?</li>
<li>The hierarchical concept does not align well with our use case. All country datasets would be "siblings," but this is not necessarily a problem.</li>
</ul>
</td>
</tr>
</table>
| Pro | Con |
|---|---|
| Supports conversion from/to NetCDF and Zarr. | Limited examples and documentation available (as of April 2025). |
| Allows us to stay within the xarray ecosystem (as opposed to SQL or pandas). | Requires making our primap2 functions compatible with DataTree. |
| All pre-processing steps can be covered in DataTree. | Processing DataTrees may involve looping through sub-datasets. Is it faster than merging once? |
| | The hierarchical concept does not align well with our use case. All country datasets would be "siblings," but this is not necessarily a problem. |
It's not exactly the same result, but if possible avoid html and use markdown instead.
Example: A script like `src/unfccc_ghg_data/unfccc_crf_reader/crf_raw_for_year.py` compiles all CRF (or CRT) datasets for a submission year into one dataset. Since countries report at different levels of detail, categories unique to one dataset result in NaNs in others, making datasets sparse and memory-intensive. We aim to significantly reduce memory usage.
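The sparsity effect described in this example can be reproduced with plain xarray. The following is only a sketch under stated assumptions: the country codes, category codes, the `emissions` variable and the `country_dataset` helper are made up for illustration and are not taken from the CRF reader.

```python
import numpy as np
import xarray as xr


def country_dataset(area, categories):
    """Toy single-country dataset (hypothetical layout, not the primap2 format)."""
    return xr.Dataset(
        {"emissions": (("area", "category"), np.ones((1, len(categories))))},
        coords={"area": [area], "category": categories},
    )


# Two countries reporting at different levels of detail.
coarse = country_dataset("AAA", ["1", "2"])
detailed = country_dataset("BBB", ["1", "1.A", "1.A.1", "1.B", "2", "2.A"])

# An outer join aligns both datasets on the union of all categories,
# so every category that only one country reports becomes NaN for the other.
merged = xr.merge([coarse, detailed], join="outer", compat="no_conflicts")

total = merged["emissions"].size
missing = int(merged["emissions"].isnull().sum())
print(f"{missing} of {total} cells are NaN")
```

In this toy case a third of the merged array is padding; the more the reporting detail differs between countries, the larger that share becomes.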

* **Select**
Filtering datasets by dimensions such as country, sector, or gas is essential. The `primap2` `.loc` function already supports this.
Filtering datasets by dimensions such as country, sector, or gas is essential. The {py:meth}`xarray.Dataset.pr.set` function already supports this, but not for DataTrees.
We can add a proper link using the {py:meth} syntax.
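For readers unfamiliar with the kind of filtering discussed here, this is a small sketch of label-based selection on a single dataset. It uses only the plain xarray `.loc` indexer, not the primap2 `pr` accessor, and the dimension names, labels and the `CO2` variable are hypothetical.

```python
import numpy as np
import xarray as xr

# Toy dataset with two countries and three categories (made-up labels).
ds = xr.Dataset(
    {"CO2": (("area", "category"), np.arange(6.0).reshape(2, 3))},
    coords={"area": ["AAA", "BBB"], "category": ["1", "2", "3"]},
)

# Dataset.loc takes a dict mapping dimension names to the labels to keep.
subset = ds.loc[{"area": ["BBB"], "category": ["1", "3"]}]
values = subset["CO2"].values.tolist()
print(values)
```

Applying the same selection to a DataTree would require handling every node separately, which is the gap the suggestion points out.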
@@ -0,0 +1,138 @@
### Data structures for primap2
Here, we document an unfinished proposal for more advanced data structures for sparse data within primap2.
Also, maybe "Data Structures for primap" is a bit too general as a title. Maybe "Sparse data ideas" instead, to make clear in the title already that this is not currently implemented functionality?
Pull request
Please confirm that this pull request has done the following:
* `{pr}.thing.md` file in the directory `changelog` added - see changelog/README.md for details

Description
Add the notes on data structure research so they don't get lost