|
| 1 | +.. _data structures: |
| 2 | + |
| 3 | +Data Structures |
| 4 | +=============== |
| 5 | + |
| 6 | +.. ipython:: python |
| 7 | + :suppress: |
| 8 | +
|
| 9 | + import numpy as np |
| 10 | + import pandas as pd |
| 11 | + import xarray as xr |
| 12 | + import datatree |
| 13 | +
|
| 14 | + np.random.seed(123456) |
| 15 | + np.set_printoptions(threshold=10) |
| 16 | +
|
| 17 | +.. note:: |
| 18 | + |
| 19 | + This page builds on the information given in xarray's main page on |
| 20 | + `data structures <https://docs.xarray.dev/en/stable/user-guide/data-structures.html>`_, so it is suggested that you |
| 21 | + are familiar with those first. |
| 22 | + |
| 23 | +DataTree |
| 24 | +-------- |
| 25 | + |
| 26 | +:py:class:`DataTree` is xarray's highest-level data structure, able to organise heterogeneous data which
| 27 | +could not be stored inside a single ``Dataset`` object. This includes representing the recursive structure of multiple |
| 28 | +`groups`_ within a netCDF file or `Zarr Store`_. |
| 29 | + |
| 30 | +.. _groups: https://www.unidata.ucar.edu/software/netcdf/workshops/2011/groups-types/GroupsIntro.html |
| 31 | +.. _Zarr Store: https://zarr.readthedocs.io/en/stable/tutorial.html#groups |
| 32 | + |
| 33 | +Each ``DataTree`` object (or "node") contains the same data that a single ``xarray.Dataset`` would (i.e. ``DataArray`` objects |
| 34 | +stored under hashable keys), and so has the same key properties: |
| 35 | + |
| 36 | +- ``dims``: a dictionary mapping of dimension names to lengths, for the variables in this node, |
| 37 | +- ``data_vars``: a dict-like container of DataArrays corresponding to variables in this node, |
| 38 | +- ``coords``: another dict-like container of DataArrays, corresponding to coordinate variables in this node, |
| 39 | +- ``attrs``: dict to hold arbitrary metadata relevant to data in this node.
| 40 | + |
| 41 | +A single ``DataTree`` object acts much like a single ``Dataset`` object, and has a similar set of dict-like methods |
| 42 | +defined upon it. However, ``DataTree`` objects can also contain other ``DataTree`` objects, so they can be thought of as nested dict-like
| 43 | +containers of both ``xarray.DataArray`` and ``DataTree`` objects.
| 44 | + |
| 45 | +A single datatree object is known as a "node", and its position relative to other nodes is defined by two more key |
| 46 | +properties: |
| 47 | + |
| 48 | +- ``children``: An ordered dictionary mapping from names to other ``DataTree`` objects, known as its "child nodes".
| 49 | +- ``parent``: The single ``DataTree`` object whose children this datatree is a member of, known as its "parent node".
| 50 | + |
| 51 | +Each child automatically knows about its parent node, and a node without a parent is known as a "root" node |
| 52 | +(represented by the ``parent`` attribute pointing to ``None``). |
| 53 | +Nodes can have multiple children, but as each child node has at most one parent, there can only ever be one root node in a given tree. |
| 54 | + |
| 55 | +The overall structure is technically a `connected acyclic undirected rooted graph`, otherwise known as a |
| 56 | +`"Tree" <https://en.wikipedia.org/wiki/Tree_(graph_theory)>`_. |
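The parent/child bookkeeping described above can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions: the ``Node`` class and its ``add_child`` method are hypothetical and are not how ``DataTree`` is implemented. They merely show how each child can hold a reference to its single parent, and how the root can be found by following ``parent`` links until ``None`` is reached.

```python
class Node:
    """Hypothetical stand-in for a tree node (not the real DataTree class)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = None
        self.children = {}  # maps child names to Node objects, in insertion order
        if parent is not None:
            parent.add_child(self)

    def add_child(self, child):
        # Each child has at most one parent, so detach it from any old parent first.
        if child.parent is not None:
            del child.parent.children[child.name]
        child.parent = self
        self.children[child.name] = child

    @property
    def root(self):
        # Walk upwards until we reach the node whose parent is None.
        node = self
        while node.parent is not None:
            node = node.parent
        return node


top = Node("top")
middle = Node("middle", parent=top)
leaf = Node("leaf", parent=middle)
```

Because every node has at most one parent, walking upwards from any node in the same tree always ends at the same root.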
| 57 | + |
| 58 | +.. note:: |
| 59 | + |
| 60 | + Technically a ``DataTree`` with more than one child node forms an `"Ordered Tree" <https://en.wikipedia.org/wiki/Tree_(graph_theory)#Ordered_tree>`_, |
| 61 | + because the children are stored in an Ordered Dictionary. However, this distinction only really matters for a few |
| 62 | + edge cases involving operations on multiple trees simultaneously, and can safely be ignored by most users. |
| 63 | + |
| 64 | + |
| 65 | +``DataTree`` objects can also optionally have a ``name`` as well as ``attrs``, just like a ``DataArray``. |
| 66 | +Again these are not normally used unless explicitly accessed by the user. |
| 67 | + |
| 68 | + |
| 69 | +Creating a DataTree |
| 70 | +~~~~~~~~~~~~~~~~~~~ |
| 71 | + |
| 72 | +There are two ways to create a ``DataTree`` from scratch. The first is to create each node individually, |
| 73 | +specifying the nodes' relationship to one another as you create each one. |
| 74 | + |
| 75 | +The ``DataTree`` constructor takes: |
| 76 | + |
| 77 | +- ``data``: The data that will be stored in this node, represented by a single ``xarray.Dataset``, or a named ``xarray.DataArray``. |
| 78 | +- ``parent``: The parent node (if there is one), given as a ``DataTree`` object. |
| 79 | +- ``children``: The various child nodes (if there are any), given as a mapping from string keys to ``DataTree`` objects. |
| 80 | +- ``name``: A string to use as the name of this node. |
| 81 | + |
| 82 | +Let's make a datatree node without anything in it: |
| 83 | + |
| 84 | +.. ipython:: python |
| 85 | +
|
| 86 | + from datatree import DataTree |
| 87 | +
|
| 88 | + # create root node |
| 89 | + node1 = DataTree(name="Oak") |
| 90 | +
|
| 91 | + node1 |
| 92 | +
|
| 93 | +At this point our node is also the root node, as every tree has a root node. |
| 94 | + |
| 95 | +We can add a second node to this tree either by referring to the first node in the constructor of the second: |
| 96 | + |
| 97 | +.. ipython:: python |
| 98 | +
|
| 99 | + # add a child by referring to the parent node |
| 100 | + node2 = DataTree(name="Bonsai", parent=node1) |
| 101 | +
|
| 102 | +or by dynamically updating the attributes of one node to refer to another: |
| 103 | + |
| 104 | +.. ipython:: python |
| 105 | +
|
| 106 | + # add a grandparent by updating the .parent property of an existing node |
| 107 | + node0 = DataTree(name="General Sherman") |
| 108 | + node1.parent = node0 |
| 109 | +
|
| 110 | +Our tree now has three nodes within it, and one of the two new nodes has become the new root: |
| 111 | + |
| 112 | +.. ipython:: python |
| 113 | +
|
| 114 | + node0 |
| 115 | +
|
| 116 | +It is at tree construction time that consistency checks are enforced. For instance, if we try to create a `cycle`, an error will be raised:
| 117 | + |
| 118 | +.. ipython:: python |
| 119 | + :okexcept: |
| 120 | +
|
| 121 | + node0.parent = node2 |
| 122 | +
|
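One way such a cycle check could work is sketched below in plain Python. The ``Node`` class, the ``parent`` setter and the choice of ``ValueError`` are all assumptions made for illustration; the real library may use a different mechanism and a different exception type.

```python
class Node:
    """Hypothetical minimal node used only to illustrate a cycle check."""

    def __init__(self, name):
        self.name = name
        self._parent = None

    @property
    def parent(self):
        return self._parent

    @parent.setter
    def parent(self, new_parent):
        # Walk up from the proposed parent; if we meet ourselves, the
        # assignment would make this node its own ancestor.
        ancestor = new_parent
        while ancestor is not None:
            if ancestor is self:
                raise ValueError(
                    f"cannot set {new_parent.name!r} as parent of {self.name!r}: "
                    "this would create a cycle"
                )
            ancestor = ancestor.parent
        self._parent = new_parent


node0 = Node("General Sherman")
node1 = Node("Oak")
node2 = Node("Bonsai")
node1.parent = node0
node2.parent = node1

try:
    node0.parent = node2  # node2 is a descendant of node0
    cycle_rejected = False
except ValueError:
    cycle_rejected = True
```

The check runs before the assignment takes effect, so a rejected update leaves the existing tree unchanged.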
| 123 | +The second way is to build the tree from a dictionary of filesystem-like paths and corresponding ``xarray.Dataset`` objects. |
| 124 | + |
| 125 | +This relies on a syntax inspired by unix-like filesystems, where the "path" to a node is specified by the keys of each intermediate node in sequence, |
| 126 | +separated by forward slashes. The root node is referred to by ``"/"``, so the path from our current root node to its grand-child would be ``"/Oak/Bonsai"``. |
| 127 | +A path specified from the root (as opposed to being specified relative to an arbitrary node in the tree) is sometimes also referred to as a |
| 128 | +`"fully qualified name" <https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf-zarr-data-model-specification#nczarr_fqn>`_. |
| 129 | + |
| 130 | +If we have a dictionary where each key is a valid path, and each value is either valid data or ``None``, |
| 131 | +we can construct a complex tree quickly using the alternative constructor :py:func:`DataTree.from_dict`:
| 132 | + |
| 133 | +.. ipython:: python |
| 134 | +
|
| 135 | + d = { |
| 136 | + "/": xr.Dataset({"foo": "orange"}), |
| 137 | + "/a": xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])}), |
| 138 | + "/a/b": xr.Dataset({"zed": np.nan}),
| 139 | + "a/c/d": None, |
| 140 | + } |
| 141 | + dt = DataTree.from_dict(d) |
| 142 | + dt |
| 143 | +
|
| 144 | +Notice that this method will also create any intermediate empty node necessary to reach the end of the specified path |
| 145 | +(i.e. the node labelled ``"c"`` in this case).
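The way a dictionary of paths can be turned into a tree, including the automatic creation of intermediate nodes, can be sketched in plain Python. The ``build_tree`` helper and the dict-based node layout below are hypothetical and stand in for real tree nodes; this is not the implementation of ``DataTree.from_dict``.

```python
def build_tree(path_to_data):
    """Build a nested dict of nodes from filesystem-like paths.

    Hypothetical helper: each node is a plain dict with "data" and "children"
    keys, standing in for a real tree node.
    """
    root = {"data": None, "children": {}}
    for path, data in path_to_data.items():
        node = root
        # "/a/b" and "a/b" both split into ["a", "b"]; "/" gives [].
        for part in [p for p in path.split("/") if p]:
            # Create any missing intermediate node as an empty node.
            node = node["children"].setdefault(
                part, {"data": None, "children": {}}
            )
        node["data"] = data
    return root


tree = build_tree(
    {"/": "root data", "/a": "a data", "/a/b": "b data", "a/c/d": None}
)
```

Here the node ``"c"`` is never listed as a key of its own, yet it exists in the result because the path ``"a/c/d"`` passes through it.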
| 146 | + |
| 147 | +Finally if you have a file containing data on disk (such as a netCDF file or a Zarr Store), you can also create a datatree by opening the |
| 148 | +file using :py:func:`~datatree.open_datatree`.
| 149 | + |
| 150 | + |
| 151 | +DataTree Contents |
| 152 | +~~~~~~~~~~~~~~~~~ |
| 153 | + |
| 154 | +Like ``xarray.Dataset``, ``DataTree`` implements the Python mapping interface, but with values given by either ``xarray.DataArray`` objects or other ``DataTree`` objects.
| 155 | + |
| 156 | +.. ipython:: python |
| 157 | +
|
| 158 | + dt["a"] |
| 159 | + dt["foo"] |
| 160 | +
|
| 161 | +Iterating over keys will iterate over both the names of variables and child nodes. |
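A mapping whose keys cover both variables and child nodes can be modelled with ``collections.abc.Mapping`` from the standard library. The ``Node`` class below is a hypothetical illustration, and the iteration order it uses (variables first, then children) is an assumption rather than documented behaviour of ``DataTree``.

```python
from collections.abc import Mapping


class Node(Mapping):
    """Hypothetical node whose keys cover both variables and child nodes."""

    def __init__(self, variables=None, children=None):
        self.variables = dict(variables or {})
        self.children = dict(children or {})

    def __getitem__(self, key):
        if key in self.variables:
            return self.variables[key]
        return self.children[key]  # raises KeyError if absent

    def __iter__(self):
        # Variable names first, then child names (an assumed ordering).
        yield from self.variables
        yield from self.children

    def __len__(self):
        return len(self.variables) + len(self.children)


node = Node(variables={"foo": "orange"}, children={"a": Node(variables={"bar": 0})})
keys = list(node)
```

Defining only ``__getitem__``, ``__iter__`` and ``__len__`` is enough for ``Mapping`` to supply ``keys``, ``values``, ``items``, ``get`` and ``in`` checks.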
| 162 | + |
| 163 | +We can also access all the data in a single node through a dataset-like view:
| 164 | + |
| 165 | +.. ipython:: python |
| 166 | +
|
| 167 | + dt["a"].ds |
| 168 | +
|
| 169 | +This demonstrates the fact that the data in any one node is equivalent to the contents of a single ``xarray.Dataset`` object. |
| 170 | +The ``DataTree.ds`` property returns an immutable view, but we can instead extract the node's data contents as a new (and mutable) |
| 171 | +``xarray.Dataset`` object via ``.to_dataset()``: |
| 172 | + |
| 173 | +.. ipython:: python |
| 174 | +
|
| 175 | + dt["a"].to_dataset() |
| 176 | +
|
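The difference between an immutable view and an extracted mutable copy can be shown with a standard-library analogy. This sketch uses ``types.MappingProxyType`` and a plain ``dict`` as stand-ins; it does not use or describe the internals of ``DataTree.ds`` or ``.to_dataset()``.

```python
from types import MappingProxyType

node_data = {"bar": 0}

view = MappingProxyType(node_data)  # read-only view of the same data
extracted = dict(node_data)  # independent, mutable copy

try:
    view["bar"] = 1
    view_is_writable = True
except TypeError:
    view_is_writable = False

extracted["bar"] = 1  # allowed, and does not touch the original
```

Writing through the view fails, while changes to the extracted copy leave the original data untouched.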
| 177 | +As with ``Dataset``, you can access the data and coordinate variables of a node separately via the ``data_vars`` and ``coords`` attributes:
| 178 | + |
| 179 | +.. ipython:: python |
| 180 | +
|
| 181 | + dt["a"].data_vars |
| 182 | + dt["a"].coords |
| 183 | +
|
| 184 | +
|
| 185 | +Dictionary-like methods |
| 186 | +~~~~~~~~~~~~~~~~~~~~~~~ |
| 187 | + |
| 190 | +We can update a datatree in-place using Python's standard dictionary syntax, similar to how we can for ``Dataset`` objects.
| 191 | +For example, to create this example datatree from scratch, we could have written: |
| 192 | + |
| 193 | +.. TODO: update this example using ``.coords`` and ``.data_vars`` as setters.
| 194 | + |
| 195 | +.. ipython:: python |
| 196 | +
|
| 197 | + dt = DataTree() |
| 198 | + dt["foo"] = "orange" |
| 199 | + dt["a"] = DataTree(data=xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])})) |
| 200 | + dt["a/b/zed"] = np.nan
| 201 | + dt["a/c/d"] = DataTree() |
| 202 | + dt |
| 203 | +
|
| 204 | +To change the variables in a node of a ``DataTree``, you can use all the standard dictionary |
| 205 | +methods, including ``values``, ``items``, ``__delitem__``, ``get`` and |
| 206 | +:py:meth:`~xarray.DataTree.update`. |
| 207 | +Note that assigning a ``DataArray`` object to a ``DataTree`` variable using ``__setitem__`` or ``update`` will |
| 208 | +:ref:`automatically align<update>` the array(s) to the original node's indexes. |
| 209 | + |
| 210 | +If you copy a ``DataTree`` using the :py:func:`copy` function or the :py:meth:`~xarray.DataTree.copy` method, it will copy the entire tree,
| 211 | +including all parents and children.
| 212 | +As with ``Dataset``, this copy is shallow by default, but you can copy all the data by calling ``dt.copy(deep=True)``.
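The general distinction between shallow and deep copies can be seen with Python's standard ``copy`` module. The nested dict below is only a loose stand-in for a tree holding data; it is an analogy and does not reproduce exactly what ``DataTree.copy`` shares or duplicates.

```python
import copy

# A nested dict standing in for a small tree with data at each node.
tree = {"data": [1, 2, 3], "children": {"a": {"data": [4, 5], "children": {}}}}

shallow = copy.copy(tree)  # new outer container, same underlying data
deep = copy.deepcopy(tree)  # everything duplicated recursively

tree["data"].append(99)  # modify the original's data in place
```

After the in-place change, ``shallow["data"]`` reflects the new value because it is the same list object, while ``deep["data"]`` is unaffected.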