This repository was archived by the owner on Oct 24, 2024. It is now read-only.

Commit 133f60c

Data structures docs (#103)
* sketching out changes needed to integrate variables into DataTree
* fixed some other basic conflicts
* fix mypy errors
* can create basic datatree node objects again
* child-variable name collisions detected correctly
* in-progress
* add _replace method
* updated tests to assert identical instead of check .ds is expected_ds
* refactor .ds setter to use _replace
* refactor init to use _replace
* refactor test tree to avoid init
* attempt at copy methods
* rewrote implementation of .copy method
* xfailing test for deepcopying
* pseudocode implementation of DatasetView
* Revert "pseudocode implementation of DatasetView"
  This reverts commit 52ef23b.
* removed duplicated implementation of copy
* reorganise API docs
* expose data_vars, coords etc. properties
* try except with calculate_dimensions private import
* add keys/values/items methods
* don't use has_data when .variables would do
* explanation of basic properties
* add data structures page to index
* revert adding documentation in favour of that going in a different PR
* explanation of basic properties
* add data structures page to index
* create tree node-by-node
* create tree from dict
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* dict-like interface
* correct deepcopy tests
* use .data_vars in copy tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* black
* whatsnew
* data contents
* dictionary-like access
* TODOs
* test assigning int
* allow assigning coercible values
* simplify example using #115
* add note about fully qualified names

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent b7cdaa0 commit 133f60c

4 files changed: +229 -3 lines changed

datatree/datatree.py

+13 -3
@@ -360,6 +360,10 @@ def ds(self) -> DatasetView:
         An immutable Dataset-like view onto the data in this node.
 
         For a mutable Dataset containing the same data as in this node, use `.to_dataset()` instead.
+
+        See Also
+        --------
+        DataTree.to_dataset
         """
         return DatasetView._from_node(self)
 

@@ -393,7 +397,13 @@ def _pre_attach(self: DataTree, parent: DataTree) -> None:
             )
 
     def to_dataset(self) -> Dataset:
-        """Return the data in this node as a new xarray.Dataset object."""
+        """
+        Return the data in this node as a new xarray.Dataset object.
+
+        See Also
+        --------
+        DataTree.ds
+        """
         return Dataset._construct_direct(
             self._variables,
             self._coord_names,
@@ -432,7 +442,7 @@ def variables(self) -> Mapping[Hashable, Variable]:
 
     @property
     def attrs(self) -> Dict[Hashable, Any]:
-        """Dictionary of global attributes on this node"""
+        """Dictionary of global attributes on this node object."""
         if self._attrs is None:
             self._attrs = {}
         return self._attrs
@@ -443,7 +453,7 @@ def attrs(self, value: Mapping[Any, Any]) -> None:
 
     @property
     def encoding(self) -> Dict:
-        """Dictionary of global encoding attributes on this node"""
+        """Dictionary of global encoding attributes on this node object."""
         if self._encoding is None:
             self._encoding = {}
         return self._encoding

docs/source/data-structures.rst

+212
@@ -0,0 +1,212 @@
+.. _data structures:
+
+Data Structures
+===============
+
+.. ipython:: python
+    :suppress:
+
+    import numpy as np
+    import pandas as pd
+    import xarray as xr
+    import datatree
+
+    np.random.seed(123456)
+    np.set_printoptions(threshold=10)
+
+.. note::
+
+    This page builds on the information given in xarray's main page on
+    `data structures <https://docs.xarray.dev/en/stable/user-guide/data-structures.html>`_, so we suggest that you
+    become familiar with that page first.
+
+DataTree
+--------
+
+:py:class:`DataTree` is xarray's highest-level data structure, able to organise heterogeneous data which
+could not be stored inside a single ``Dataset`` object. This includes representing the recursive structure of multiple
+`groups`_ within a netCDF file or `Zarr Store`_.
+
+.. _groups: https://www.unidata.ucar.edu/software/netcdf/workshops/2011/groups-types/GroupsIntro.html
+.. _Zarr Store: https://zarr.readthedocs.io/en/stable/tutorial.html#groups
+
+Each ``DataTree`` object (or "node") contains the same data that a single ``xarray.Dataset`` would (i.e. ``DataArray`` objects
+stored under hashable keys), and so has the same key properties:
+
+- ``dims``: a dictionary mapping dimension names to lengths, for the variables in this node,
+- ``data_vars``: a dict-like container of DataArrays corresponding to variables in this node,
+- ``coords``: another dict-like container of DataArrays, corresponding to coordinate variables in this node,
+- ``attrs``: a dict holding arbitrary metadata relevant to the data in this node.
+
+A single ``DataTree`` object acts much like a single ``Dataset`` object, and has a similar set of dict-like methods
+defined upon it. However, ``DataTree`` objects can also contain other ``DataTree`` objects, so they can be thought of as nested dict-like
+containers of both ``xarray.DataArray`` and ``DataTree`` objects.
+
+A single datatree object is known as a "node", and its position relative to other nodes is defined by two more key
+properties:
+
+- ``children``: An ordered dictionary mapping from names to other ``DataTree`` objects, known as its "child nodes".
+- ``parent``: The single ``DataTree`` object whose children this datatree is a member of, known as its "parent node".
+
+Each child automatically knows about its parent node, and a node without a parent is known as a "root" node
+(represented by the ``parent`` attribute pointing to ``None``).
+Nodes can have multiple children, but as each child node has at most one parent, there can only ever be one root node in a given tree.
+
+The overall structure is technically a `connected acyclic undirected rooted graph`, otherwise known as a
+`"Tree" <https://en.wikipedia.org/wiki/Tree_(graph_theory)>`_.
+
+.. note::
+
+    Technically a ``DataTree`` with more than one child node forms an `"Ordered Tree" <https://en.wikipedia.org/wiki/Tree_(graph_theory)#Ordered_tree>`_,
+    because the children are stored in an Ordered Dictionary. However, this distinction only really matters for a few
+    edge cases involving operations on multiple trees simultaneously, and can safely be ignored by most users.
+
+
+``DataTree`` objects can also optionally have a ``name`` as well as ``attrs``, just like a ``DataArray``.
+Again these are not normally used unless explicitly accessed by the user.
+
+
+Creating a DataTree
+~~~~~~~~~~~~~~~~~~~
+
+There are two ways to create a ``DataTree`` from scratch. The first is to create each node individually,
+specifying the nodes' relationship to one another as you create each one.
+
+The ``DataTree`` constructor takes:
+
+- ``data``: The data that will be stored in this node, represented by a single ``xarray.Dataset``, or a named ``xarray.DataArray``.
+- ``parent``: The parent node (if there is one), given as a ``DataTree`` object.
+- ``children``: The various child nodes (if there are any), given as a mapping from string keys to ``DataTree`` objects.
+- ``name``: A string to use as the name of this node.
+
+Let's make a datatree node without anything in it:
+
+.. ipython:: python
+
+    from datatree import DataTree
+
+    # create root node
+    node1 = DataTree(name="Oak")
+
+    node1
+
+At this point our node is also the root node, as every tree has a root node.
+
+We can add a second node to this tree either by referring to the first node in the constructor of the second:
+
+.. ipython:: python
+
+    # add a child by referring to the parent node
+    node2 = DataTree(name="Bonsai", parent=node1)
+
+or by dynamically updating the attributes of one node to refer to another:
+
+.. ipython:: python
+
+    # add a grandparent by updating the .parent property of an existing node
+    node0 = DataTree(name="General Sherman")
+    node1.parent = node0
+
+Our tree now has three nodes within it, and one of the two new nodes has become the new root:
+
+.. ipython:: python
+
+    node0
+
+It is at tree construction time that consistency checks are enforced. For instance, if we try to create a `cycle`, an error will be raised:
+
+.. ipython:: python
+    :okexcept:
+
+    node0.parent = node2
+
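The parent/child bookkeeping and the cycle check described above can be illustrated with a small standalone sketch. This is not the datatree implementation: ``ToyNode`` and its attributes are hypothetical names used only for illustration, and the real library may perform its checks differently.

```python
from __future__ import annotations


class ToyNode:
    """Hypothetical stand-in for a tree node; not the datatree implementation."""

    def __init__(self, name: str, parent: ToyNode | None = None) -> None:
        self.name = name
        self.children: dict[str, ToyNode] = {}
        self._parent: ToyNode | None = None
        if parent is not None:
            self.parent = parent

    @property
    def parent(self) -> ToyNode | None:
        return self._parent

    @parent.setter
    def parent(self, new_parent: ToyNode | None) -> None:
        # Walk upwards from the proposed parent: reaching `self` means a cycle.
        ancestor = new_parent
        while ancestor is not None:
            if ancestor is self:
                raise ValueError(
                    f"making {new_parent.name!r} the parent of {self.name!r} would create a cycle"
                )
            ancestor = ancestor.parent
        # Detach from the old parent (if any), then register with the new one,
        # so that each child always knows its parent and vice versa.
        if self._parent is not None:
            del self._parent.children[self.name]
        self._parent = new_parent
        if new_parent is not None:
            new_parent.children[self.name] = self

    @property
    def root(self) -> ToyNode:
        # The root is the only node whose parent is None.
        node = self
        while node.parent is not None:
            node = node.parent
        return node


oak = ToyNode("Oak")
bonsai = ToyNode("Bonsai", parent=oak)
sherman = ToyNode("General Sherman")
oak.parent = sherman  # the new node becomes the root

try:
    sherman.parent = bonsai  # the root cannot become a descendant of itself
    cycle_rejected = False
except ValueError:
    cycle_rejected = True
```

Checking for cycles when the parent is assigned, rather than lazily, means every tree that exists is already consistent, which matches the "checks at construction time" behaviour described in the text.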
+The second way is to build the tree from a dictionary of filesystem-like paths and corresponding ``xarray.Dataset`` objects.
+
+This relies on a syntax inspired by unix-like filesystems, where the "path" to a node is specified by the keys of each intermediate node in sequence,
+separated by forward slashes. The root node is referred to by ``"/"``, so the path from our current root node to its grand-child would be ``"/Oak/Bonsai"``.
+A path specified from the root (as opposed to being specified relative to an arbitrary node in the tree) is sometimes also referred to as a
+`"fully qualified name" <https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf-zarr-data-model-specification#nczarr_fqn>`_.
+
+If we have a dictionary where each key is a valid path, and each value is either valid data or ``None``,
+we can construct a complex tree quickly using the alternative constructor :py:func:`DataTree.from_dict`:
+
+.. ipython:: python
+
+    d = {
+        "/": xr.Dataset({"foo": "orange"}),
+        "/a": xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])}),
+        "/a/b": xr.Dataset({"zed": np.nan}),
+        "a/c/d": None,
+    }
+    dt = DataTree.from_dict(d)
+    dt
+
+Notice that this method will also create any intermediate empty node necessary to reach the end of the specified path
+(i.e. the node labelled ``"c"`` in this case).
+
+Finally, if you have a file containing data on disk (such as a netCDF file or a Zarr Store), you can also create a datatree by opening the
+file using :py:func:`~datatree.open_datatree`.
+
+
DataTree Contents
152+
~~~~~~~~~~~~~~~~~
153+
154+
Like ``xarray.Dataset``, ``DataTree`` implements the python mapping interface, but with values given by either ``xarray.DataArray`` objects or other ``DataTree`` objects.
155+
156+
.. ipython:: python
157+
158+
dt["a"]
159+
dt["foo"]
160+
161+
Iterating over keys will iterate over both the names of variables and child nodes.
162+
163+
We can also access all the data in a single node through a dataset-like view
164+
165+
.. ipython:: python
166+
167+
dt["a"].ds
168+
169+
This demonstrates the fact that the data in any one node is equivalent to the contents of a single ``xarray.Dataset`` object.
170+
The ``DataTree.ds`` property returns an immutable view, but we can instead extract the node's data contents as a new (and mutable)
171+
``xarray.Dataset`` object via ``.to_dataset()``:
172+
173+
.. ipython:: python
174+
175+
dt["a"].to_dataset()
176+
177+
Like with ``Dataset``, you can access the data and coordinate variables of a node separately via the ``data_vars`` and ``coords`` attributes:
178+
179+
.. ipython:: python
180+
181+
dt["a"].data_vars
182+
dt["a"].coords
183+
184+
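The idea of one mapping whose values are either variables or child nodes can be sketched with the standard library alone. ``ToyNodeMapping`` is a hypothetical class, not part of datatree; in particular, the iteration order (variables first, then children) and the use of ``KeyError`` for name collisions are assumptions made for this illustration.

```python
from collections.abc import Mapping
from itertools import chain


class ToyNodeMapping(Mapping):
    """Hypothetical read-only node exposing variables and children under one set of keys."""

    def __init__(self, variables: dict, children: dict) -> None:
        # A key must identify exactly one thing, so a variable and a child
        # cannot share a name.
        overlap = variables.keys() & children.keys()
        if overlap:
            raise KeyError(f"names used for both a variable and a child: {sorted(overlap)}")
        self._variables = dict(variables)
        self._children = dict(children)

    def __getitem__(self, key):
        if key in self._variables:
            return self._variables[key]
        return self._children[key]  # raises KeyError for unknown names

    def __iter__(self):
        # Assumed order: variable names first, then child names.
        return chain(self._variables, self._children)

    def __len__(self) -> int:
        return len(self._variables) + len(self._children)


child = ToyNodeMapping({"bar": 0}, {})
node = ToyNodeMapping({"foo": "orange"}, {"a": child})
```

Because ``Mapping`` supplies ``keys``, ``values``, ``items``, ``get`` and ``in`` once ``__getitem__``, ``__iter__`` and ``__len__`` are defined, a class like this gets the read-only dictionary interface for free.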
+Dictionary-like methods
+~~~~~~~~~~~~~~~~~~~~~~~
+
+We can update the contents of the tree in-place using a dictionary-like syntax.
+
+We can update a datatree in-place using Python's standard dictionary syntax, similar to how we can for Dataset objects.
+For example, to create this example datatree from scratch, we could have written:
+
+# TODO update this example using ``.coords`` and ``.data_vars`` as setters,
+
+.. ipython:: python
+
+    dt = DataTree()
+    dt["foo"] = "orange"
+    dt["a"] = DataTree(data=xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])}))
+    dt["a/b/zed"] = np.nan
+    dt["a/c/d"] = DataTree()
+    dt
+
+To change the variables in a node of a ``DataTree``, you can use all the standard dictionary
+methods, including ``values``, ``items``, ``__delitem__``, ``get`` and
+:py:meth:`~xarray.DataTree.update`.
+Note that assigning a ``DataArray`` object to a ``DataTree`` variable using ``__setitem__`` or ``update`` will
+:ref:`automatically align<update>` the array(s) to the original node's indexes.
+
+If you copy a ``DataTree`` using the :py:func:`copy` function or the :py:meth:`~xarray.DataTree.copy` method, it will copy the entire tree,
+including all parents and children.
+As with ``Dataset``, this copy is shallow by default, but you can copy all the data by calling ``dt.copy(deep=True)``.
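
The difference between a shallow and a deep copy can be seen with plain Python containers. This is only an analogy using the standard library ``copy`` module and a hypothetical ``node_data`` dict; it does not call ``DataTree.copy`` and says nothing about how that method is implemented.

```python
import copy

# A dict standing in for one node's contents: the list plays the role of the data.
node_data = {"bar": [0, 1, 2]}

shallow = copy.copy(node_data)   # new container, same underlying list object
deep = copy.deepcopy(node_data)  # new container and a new list

# Mutating the original data in place is visible through the shallow copy only.
node_data["bar"].append(3)
```

After the append, ``shallow["bar"]`` reflects the change while ``deep["bar"]`` does not, which is the practical reason to request a deep copy before modifying data that should stay independent.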

docs/source/index.rst

+1
@@ -11,6 +11,7 @@ Datatree
    Installation <installation>
    Quick Overview <quick-overview>
    Tutorial <tutorial>
+   Data Model <data-structures>
    API Reference <api>
    How do I ... <howdoi>
    Contributing Guide <contributing>

docs/source/whats-new.rst

+3
@@ -48,6 +48,9 @@ Bug fixes
 Documentation
 ~~~~~~~~~~~~~
 
+- Added ``Data Structures`` page describing the internal structure of a ``DataTree`` object, and its relation to
+  ``xarray.Dataset`` objects. (:pull:`103`)
+  By `Tom Nicholas <https://github.com/TomNicholas>`_.
 - API page updated with all the methods that are copied from ``xarray.Dataset``. (:pull:`41`)
   By `Tom Nicholas <https://github.com/TomNicholas>`_.
 
