Add notes about data structures to documentation #329

crdanielbusch · 2025-04-08T13:15:08Z

Pull request

Please confirm that this pull request has done the following:

- Tests added
- Documentation added (where applicable)
- Description in a `{pr}.thing.md` file in the directory `changelog` added - see `changelog/README.md` for details

Description

Add the notes on data structure research so they don't get lost

codecov · 2025-04-08T13:16:50Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.26%. Comparing base (e8e2a4f) to head (c55395c).

✅ All tests successful. No failed tests found.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #329   +/-   ##
=======================================
  Coverage   97.26%   97.26%           
=======================================
  Files          54       54           
  Lines        5299     5299           
=======================================
  Hits         5154     5154           
  Misses        145      145


crdanielbusch · 2025-04-08T13:45:50Z

@mikapfl I think these are the essential aspects of our research. Let me know if you think of anything else.

mikapfl

Nice, I have some small suggestions, but generally this should be useful if we later decide to pick it up again.

mikapfl · 2025-04-09T09:36:51Z

docs/source/data_structures.md

+<table>
+  <tr>
+    <th>Pro</th>
+    <th>Con</th>
+  </tr>
+  <tr>
+    <td>
+      <ul>
+        <li>Supports conversion from/to NetCDF and Zarr.</li>
+        <li>Allows us to stay within the xarray ecosystem (as opposed to SQL or pandas).</li>
+        <li>All pre-processing steps can be covered in DataTree.</li>
+      </ul>
+    </td>
+    <td>
+      <ul>
+        <li>Limited examples and documentation available (as of April 2025).</li>
+        <li>Requires making our primap2 functions compatible with DataTree.</li>
+        <li>Processing DataTrees may involve looping through sub-datasets. Is it faster than merging once?</li>
+        <li>The hierarchical concept does not align well with our use case. All country datasets would be "siblings," but this is not necessarily a problem.</li>
+      </ul>
+    </td>
+  </tr>
+</table>


Suggested change

| Pro | Con |
|---|---|
| Supports conversion from/to NetCDF and Zarr. | Limited examples and documentation available (as of April 2025). |
| Allows us to stay within the xarray ecosystem (as opposed to SQL or pandas). | Requires making our primap2 functions compatible with DataTree. |
| All pre-processing steps can be covered in DataTree. | Processing DataTrees may involve looping through sub-datasets. Is it faster than merging once? |
| | The hierarchical concept does not align well with our use case. All country datasets would be "siblings," but this is not necessarily a problem. |

It's not exactly the same result, but if possible avoid html and use markdown instead.

mikapfl · 2025-04-09T09:38:52Z

docs/source/data_structures.md

+  Example: A script like `src/unfccc_ghg_data/unfccc_crf_reader/crf_raw_for_year.py` compiles all CRF (or CRT) datasets for a submission year into one dataset. Since countries report at different levels of detail, categories unique to one dataset result in NaNs in others, making datasets sparse and memory-intensive. We aim to significantly reduce memory usage.
+
+* **Select**
+  Filtering datasets by dimensions such as country, sector, or gas is essential. The `primap2` `.loc` function already supports this.


Suggested change

Filtering datasets by dimensions such as country, sector, or gas is essential. The {py:meth}`xarray.Dataset.pr.set` function already supports this, but not for DataTrees.

We can add a proper link using the {py:meth} syntax.
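
The sparsity problem and the selection requirement described in the diff above can be sketched with plain xarray. This is only an illustration under assumptions: the helper `country_array`, the dimension names `country` and `category`, and all values are hypothetical, and the selection uses xarray's built-in `.loc` rather than the primap2 accessor.

```python
import numpy as np
import xarray as xr


def country_array(country, categories, values):
    """One country's emissions over its own reported categories (made-up numbers)."""
    return xr.DataArray(
        np.array([values], dtype=float),
        dims=("country", "category"),
        coords={"country": [country], "category": categories},
        name="CO2",
    )


deu = country_array("DEU", ["1", "1.A", "2"], [10.0, 7.0, 3.0])
fra = country_array("FRA", ["1", "3"], [8.0, 2.0])

# Merging aligns both arrays on the union of all coordinates;
# cells that a country did not report are filled with NaN.
combined = xr.merge([deu, fra], join="outer", compat="no_conflicts")["CO2"]

# Share of cells that hold no data after merging.
nan_share = float(combined.isnull().mean())

# Label-based selection by dimension with plain xarray.
fra_cat1 = float(combined.loc[{"country": "FRA", "category": "1"}])

print(combined.size, nan_share, fra_cat1)
```

Even in this two-country toy case, 3 of the 8 cells in the merged grid are NaN; with many countries reporting at different levels of detail, the empty share grows accordingly.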

mikapfl · 2025-04-09T09:42:38Z

docs/source/data_structures.md

@@ -0,0 +1,138 @@
+### Data structures for primap2
+


Suggested change

Here, we document an unfinished proposal for more advanced data structures for sparse data within primap2.

mikapfl · 2025-04-09T09:44:49Z

Also, maybe "Data structures for primap2" is a bit too general as a title. Perhaps "Sparse data ideas" instead, to make clear in the title already that this is not currently implemented functionality?

crdanielbusch added 2 commits April 8, 2025 14:52

more text

10d6295

clean up

1adaabc

crdanielbusch added 3 commits April 8, 2025 15:24

add links

77c4851

Merge remote-tracking branch 'origin/main' into datatree-docs

5cad608

add to index

c55395c

crdanielbusch requested a review from mikapfl April 8, 2025 13:41

mikapfl approved these changes Apr 9, 2025
