diff --git a/docs/source/data-structures.rst b/docs/source/data-structures.rst index 67e0e608..4417e099 100644 --- a/docs/source/data-structures.rst +++ b/docs/source/data-structures.rst @@ -71,7 +71,7 @@ Again these are not normally used unless explicitly accessed by the user. Creating a DataTree ~~~~~~~~~~~~~~~~~~~ -There are three ways to create a ``DataTree`` from scratch. The first is to create each node individually, +One way to create a ``DataTree`` from scratch is to create each node individually, specifying the nodes' relationship to one another as you create each one. The ``DataTree`` constructor takes: @@ -81,16 +81,16 @@ The ``DataTree`` constructor takes: - ``children``: The various child nodes (if there are any), given as a mapping from string keys to ``DataTree`` objects. - ``name``: A string to use as the name of this node. -Let's make a datatree node without anything in it: +Let's make a single datatree node with some example data in it: .. ipython:: python from datatree import DataTree - # create root node - node1 = DataTree(name="Oak") + ds1 = xr.Dataset({"foo": "orange"}) + dt = DataTree(name="root", data=ds1) # create root node - node1 + dt At this point our node is also the root node, as every tree has a root node. @@ -98,56 +98,38 @@ We can add a second node to this tree either by referring to the first node in t .. ipython:: python + ds2 = xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])}) # add a child by referring to the parent node - node2 = DataTree(name="Bonsai", parent=node1) + node2 = DataTree(name="a", parent=dt, data=ds2) or by dynamically updating the attributes of one node to refer to another: .. ipython:: python - # add a grandparent by updating the .parent property of an existing node - node0 = DataTree(name="General Sherman") - node1.parent = node0 + # add a second child by first creating a new node ... + ds3 = xr.Dataset({"zed": np.NaN}) + node3 = DataTree(name="b", data=ds3) + # ... 
then updating its .parent property + node3.parent = dt -Our tree now has three nodes within it, and one of the two new nodes has become the new root: +Our tree now has three nodes within it: .. ipython:: python - node0 + dt -Is is at tree construction time that consistency checks are enforced. For instance, if we try to create a `cycle` the constructor will raise an error: +It is at tree construction time that consistency checks are enforced. For instance, if we try to create a `cycle` the constructor will raise an error: .. ipython:: python :okexcept: - node0.parent = node2 - -The second way is to build the tree from a dictionary of filesystem-like paths and corresponding ``xarray.Dataset`` objects. - -This relies on a syntax inspired by unix-like filesystems, where the "path" to a node is specified by the keys of each intermediate node in sequence, -separated by forward slashes. The root node is referred to by ``"/"``, so the path from our current root node to its grand-child would be ``"/Oak/Bonsai"``. -A path specified from the root (as opposed to being specified relative to an arbitrary node in the tree) is sometimes also referred to as a -`"fully qualified name" <https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf-zarr-data-model-specification#nczarr_fqn>`_. - -If we have a dictionary where each key is a valid path, and each value is either valid data or ``None``, -we can construct a complex tree quickly using the alternative constructor ``:py:func::DataTree.from_dict``: + dt.parent = node3 -.. ipython:: python - - d = { - "/": xr.Dataset({"foo": "orange"}), - "/a": xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])}), - "/a/b": xr.Dataset({"zed": np.NaN}), - "a/c/d": None, - } - dt = DataTree.from_dict(d) - dt +Alternatively you can also create a ``DataTree`` object from -Notice that this method will also create any intermediate empty node necessary to reach the end of the specified path -(i.e. the node labelled `"c"` in this case.) 
- -Finally the third way is from a file. if you have a file containing data on disk (such as a netCDF file or a Zarr Store), you can also create a datatree by opening the -file using ``:py:func::~datatree.open_datatree``. See the page on :ref:`reading and writing files <io>` for more details. +- An ``xarray.Dataset`` using ``Dataset.to_node()`` (not yet implemented), +- A dictionary mapping directory-like paths to either ``DataTree`` nodes or data, using ``DataTree.from_dict()``, +- A netCDF or Zarr file on disk with ``open_datatree()``. See :ref:`reading and writing files <io>`. DataTree Contents @@ -187,8 +169,6 @@ Like with ``Dataset``, you can access the data and coordinate variables of a nod Dictionary-like methods ~~~~~~~~~~~~~~~~~~~~~~~ -We can update the contents of the tree in-place using a dictionary-like syntax. - We can update a datatree in-place using Python's standard dictionary syntax, similar to how we can for Dataset objects. For example, to create this example datatree from scratch, we could have written: @@ -196,11 +176,10 @@ For example, to create this example datatree from scratch, we could have written .. ipython:: python - dt = DataTree() + dt = DataTree(name="root") dt["foo"] = "orange" dt["a"] = DataTree(data=xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])})) dt["a/b/zed"] = np.NaN - dt["a/c/d"] = DataTree() dt To change the variables in a node of a ``DataTree``, you can use all the standard dictionary @@ -209,6 +188,6 @@ methods, including ``values``, ``items``, ``__delitem__``, ``get`` and Note that assigning a ``DataArray`` object to a ``DataTree`` variable using ``__setitem__`` or ``update`` will :ref:`automatically align<update>` the array(s) to the original node's indexes. -If you copy a ``DataTree`` using the ``:py:func::copy`` function or the :py:meth:`~xarray.DataTree.copy` it will copy the entire tree, -including all parents and children. 
-Like for ``Dataset``, this copy is shallow by default, but you can copy all the data by calling ``dt.copy(deep=True)``. +If you copy a ``DataTree`` using the :py:func:`copy` function or the :py:meth:`~xarray.DataTree.copy` method it will copy the subtree, +meaning that node and the children below it, but no parents above it. +Like for ``Dataset``, this copy is shallow by default, but you can copy all the underlying data arrays by calling ``dt.copy(deep=True)``. diff --git a/docs/source/hierarchical-data.rst b/docs/source/hierarchical-data.rst new file mode 100644 index 00000000..daf4b10f --- /dev/null +++ b/docs/source/hierarchical-data.rst @@ -0,0 +1,332 @@ +.. _hierarchical-data: + +Working With Hierarchical Data +============================== + +.. ipython:: python + :suppress: + + import numpy as np + import pandas as pd + import xarray as xr + from datatree import DataTree + + np.random.seed(123456) + np.set_printoptions(threshold=10) + +Why Hierarchical Data? +---------------------- + +Many real-world datasets are composed of multiple differing components, +and it can often be useful to think of these in terms of a hierarchy of related groups of data. +Examples of data which one might want to organise in a grouped or hierarchical manner include: + +- Simulation data at multiple resolutions, +- Observational data about the same system but from multiple different types of sensors, +- Mixed experimental and theoretical data, +- A systematic study recording the same experiment but with different parameters, +- Heterogeneous data, such as demographic and meteorological data, + +or even any combination of the above. + +Often datasets like this cannot easily fit into a single ``xarray.Dataset`` object, +or are more usefully thought of as groups of related ``xarray.Dataset`` objects. +For this purpose we provide the :py:class:`DataTree` class. 
+ +This page explains in detail how to understand and use the different features of the :py:class:`DataTree` class for your own hierarchical data needs. + +.. _node relationships: + +Node Relationships +------------------ + +.. _creating a family tree: + +Creating a Family Tree +~~~~~~~~~~~~~~~~~~~~~~ + +The three main ways of creating a ``DataTree`` object are described briefly in :ref:`creating a datatree`. +Here we go into more detail about how to create a tree node-by-node, using a famous family tree from the Simpsons cartoon as an example. + +Let's start by defining nodes representing the two siblings, Bart and Lisa Simpson: + +.. ipython:: python + + bart = DataTree(name="Bart") + lisa = DataTree(name="Lisa") + +Each of these node objects knows its own :py:class:`~DataTree.name`, but they currently have no relationship to one another. +We can connect them by creating another node representing a common parent, Homer Simpson: + +.. ipython:: python + + homer = DataTree(name="Homer", children={"Bart": bart, "Lisa": lisa}) + +Here we set the children of Homer in the node's constructor. +We now have a small family tree + +.. ipython:: python + + homer + +where we can see how these individual Simpson family members are related to one another. +The nodes representing Bart and Lisa are now connected - we can confirm their sibling rivalry by examining the :py:class:`~DataTree.siblings` property: + +.. ipython:: python + + list(bart.siblings) + +But oops, we forgot Homer's third child, Maggie! Let's add her by updating Homer's :py:class:`~DataTree.children` property to include her: + +.. ipython:: python + + maggie = DataTree(name="Maggie") + homer.children = {"Bart": bart, "Lisa": lisa, "Maggie": maggie} + homer + +Let's check that Maggie knows who her Dad is: + +.. 
ipython:: python + + maggie.parent.name + +That's good - updating the properties of our nodes does not break the internal consistency of our tree, as changes of parentage are automatically reflected on both nodes. + + These children obviously have another parent, Marge Simpson, but ``DataTree`` nodes can only have a maximum of one parent. + Genealogical `family trees are not even technically trees <https://en.wikipedia.org/wiki/Family_tree#Graph_theory>`_ in the mathematical sense - + the fact that distant relatives can mate makes it a directed acyclic graph. + Trees of ``DataTree`` objects cannot represent this. + +Homer is currently listed as having no parent (the so-called "root node" of this tree), but we can update his :py:class:`~DataTree.parent` property: + +.. ipython:: python + + abe = DataTree(name="Abe") + homer.parent = abe + +Abe is now the "root" of this tree, which we can see by examining the :py:class:`~DataTree.root` property of any node in the tree + +.. ipython:: python + + maggie.root.name + +We can see the whole tree by printing Abe's node or just part of the tree by printing Homer's node: + +.. ipython:: python + + abe + homer + +We can see that Homer is aware of his parentage, and we say that Homer and his children form a "subtree" of the larger Simpson family tree. + +In episode 28, Abe Simpson reveals that he had another son, Herbert "Herb" Simpson. +We can add Herbert to the family tree without displacing Homer by :py:meth:`~DataTree.assign`-ing another child to Abe: + +.. ipython:: python + + herbert = DataTree(name="Herb") + abe.assign({"Herbert": herbert}) + +.. note:: + This example shows a minor subtlety - the returned tree has Homer's brother listed as ``"Herbert"``, + but the original node was named ``"Herb"``. Not only are names overridden when stored as keys like this, + but the new node is a copy, so that the original node that was referenced is unchanged (i.e. ``herbert.name == "Herb"`` still). 
+ In other words, nodes are copied into trees, not inserted into them. + This is intentional, and mirrors the behaviour when storing named ``xarray.DataArray`` objects inside datasets. + +Certain manipulations of our tree are forbidden if they would create an inconsistent result. +In episode 51 of the show Futurama, Philip J. Fry travels back in time and accidentally becomes his own grandfather. +If we try similar time-travelling hijinks with Homer, we get a :py:class:`InvalidTreeError` raised: + +.. ipython:: python + :okexcept: + + abe.parent = homer + +.. _evolutionary tree: + +Ancestry in an Evolutionary Tree +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Let's use a different example of a tree to discuss more complex relationships between nodes - the phylogenetic tree, or tree of life. + +.. ipython:: python + + vertebrates = DataTree.from_dict( + name="Vertebrae", + d={ + "/Sharks": None, + "/Bony Skeleton/Ray-finned Fish": None, + "/Bony Skeleton/Four Limbs/Amphibians": None, + "/Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Primates": None, + "/Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Rodents & Rabbits": None, + "/Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Dinosaurs": None, + "/Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Birds": None, + }, + ) + + primates = vertebrates["/Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Primates"] + dinosaurs = vertebrates[ + "/Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Dinosaurs" + ] + +We have used the :py:meth:`~DataTree.from_dict` constructor method as an alternate way to quickly create a whole tree, +and :ref:`filesystem-like syntax <filesystem paths>` (to be explained shortly) to select two nodes of interest. + +.. ipython:: python + + vertebrates + +This tree shows various families of species, grouped by their common features (making it technically a `"Cladogram" <https://en.wikipedia.org/wiki/Cladogram>`_, +rather than an evolutionary tree). 
+ +Here both the species and the features used to group them are represented by ``DataTree`` node objects - there is no distinction in types of node. +We can however get a list of only the nodes we used to represent species by using the fact that all those nodes have no children - they are "leaf nodes". +We can check if a node is a leaf with :py:meth:`~DataTree.is_leaf`, and get a list of all leaves with the :py:class:`~DataTree.leaves` property: + +.. ipython:: python + + primates.is_leaf + [node.name for node in vertebrates.leaves] + +Pretending that this is a true evolutionary tree for a moment, we can find the features of the evolutionary ancestors (so-called "ancestor" nodes), +the distinguishing feature of the common ancestor of all vertebrate life (the root node), +and even the distinguishing feature of the common ancestor of any two species (the common ancestor of two nodes): + +.. ipython:: python + + [node.name for node in primates.ancestors] + primates.root.name + primates.find_common_ancestor(dinosaurs).name + +We can only find a common ancestor between two nodes that lie in the same tree. +If we try to find the common evolutionary ancestor between primates and an Alien species that has no relationship to Earth's evolutionary tree, +an error will be raised. + +.. ipython:: python + :okexcept: + + alien = DataTree(name="Xenomorph") + primates.find_common_ancestor(alien) + + +.. _navigating trees: + +Navigating Trees +---------------- + +There are various ways to access the different nodes in a tree. + +Properties +~~~~~~~~~~ + +We can navigate trees using the :py:class:`~DataTree.parent` and :py:class:`~DataTree.children` properties of each node, for example: + +.. ipython:: python + + lisa.parent.children["Bart"].name + +but there are also more convenient ways to access nodes. + +Dictionary-like interface +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Children are stored on each node as a key-value mapping from name to child node. 
+They can be accessed and altered via the :py:class:`~DataTree.__getitem__` and :py:class:`~DataTree.__setitem__` syntax. +In general :py:class:`~DataTree.DataTree` objects support almost the entire set of dict-like methods, +including :py:meth:`~DataTree.keys`, :py:class:`~DataTree.values`, :py:class:`~DataTree.items`, +:py:meth:`~DataTree.__delitem__` and :py:meth:`~DataTree.update`. + +.. ipython:: python + + vertebrates["Bony Skeleton"]["Ray-finned Fish"] + +Note that the dict-like interface combines access to child ``DataTree`` nodes and stored ``DataArrays``, +so if we have a node that contains both children and data, calling :py:meth:`~DataTree.keys` will list both names of child nodes and +names of data variables: + +.. ipython:: python + + dt = DataTree( + data=xr.Dataset({"foo": 0, "bar": 1}), + children={"a": DataTree(), "b": DataTree()}, + ) + print(dt) + list(dt.keys()) + +This also means that the names of variables and of child nodes must be different to one another. + +Attribute-like access +~~~~~~~~~~~~~~~~~~~~~ + +# TODO attribute-like access is not yet implemented, see issue #98 + +.. _filesystem paths: + +Filesystem-like Paths +~~~~~~~~~~~~~~~~~~~~~ + +Hierarchical trees can be thought of as analogous to file systems. +Each node is like a directory, and each directory can contain both more sub-directories and data. + +.. note:: + + You can even make the filesystem analogy concrete by using :py:func:`~DataTree.open_mfdatatree` or :py:func:`~DataTree.save_mfdatatree` # TODO not yet implemented - see GH issue 51 + +Datatree objects support a syntax inspired by unix-like filesystems, +where the "path" to a node is specified by the keys of each intermediate node in sequence, +separated by forward slashes. +This is an extension of the conventional dictionary ``__getitem__`` syntax to allow navigation across multiple levels of the tree. + +Like with filepaths, paths within the tree can either be relative to the current node, e.g. + +.. 
ipython:: python + + abe["Homer/Bart"].name + abe["./Homer/Bart"].name # alternative syntax + +or relative to the root node. +A path specified from the root (as opposed to being specified relative to an arbitrary node in the tree) is sometimes also referred to as a +`"fully qualified name" <https://www.unidata.ucar.edu/blogs/developer/en/entry/netcdf-zarr-data-model-specification#nczarr_fqn>`_, +or as an "absolute path". +The root node is referred to by ``"/"``, so the path from the root node to its grand-child would be ``"/child/grandchild"``, e.g. + +.. ipython:: python + + # absolute path will start from root node + lisa["/Homer/Bart"].name + +Relative paths between nodes also support the ``"../"`` syntax to mean the parent of the current node. +We can use this with ``__setitem__`` to add a missing entry to our evolutionary tree, but add it relative to a more familiar node of interest: + +.. ipython:: python + + primates["../../Two Fenestrae/Crocodiles"] = DataTree() + print(vertebrates) + +Given two nodes in a tree, we can also find their relative path: + +.. ipython:: python + + bart.relative_to(lisa) + +You can use this filepath feature to build a nested tree from a dictionary of filesystem-like paths and corresponding ``xarray.Dataset`` objects in a single step. +If we have a dictionary where each key is a valid path, and each value is either valid data or ``None``, +we can construct a complex tree quickly using the alternative constructor :py:meth:`DataTree.from_dict()`: + +.. ipython:: python + + d = { + "/": xr.Dataset({"foo": "orange"}), + "/a": xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])}), + "/a/b": xr.Dataset({"zed": np.NaN}), + "a/c/d": None, + } + dt = DataTree.from_dict(d) + dt + +.. note:: + + Notice that using the path-like syntax will also create any intermediate empty nodes necessary to reach the end of the specified path + (i.e. the node labelled `"c"` in this case.) 
+ This is to help avoid lots of redundant entries when creating deeply-nested trees using :py:meth:`DataTree.from_dict`. diff --git a/docs/source/index.rst b/docs/source/index.rst index 9448e232..e0e39de7 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -12,6 +12,7 @@ Datatree Quick Overview <quick-overview> Tutorial <tutorial> Data Model <data-structures> + Hierarchical Data <hierarchical-data> Reading and Writing Files <io> API Reference <api> Terminology <terminology> diff --git a/docs/source/whats-new.rst b/docs/source/whats-new.rst index e57e31e4..0d59e0e7 100644 --- a/docs/source/whats-new.rst +++ b/docs/source/whats-new.rst @@ -58,6 +58,8 @@ Documentation By `Tom Nicholas <https://github.com/TomNicholas>`_. - Added ``Terminology`` page. (:pull:`174`) By `Tom Nicholas <https://github.com/TomNicholas>`_. +- Added page on ``Working with Hierarchical Data`` (:pull:`179`) + By `Tom Nicholas <https://github.com/TomNicholas>`_. Internal Changes ~~~~~~~~~~~~~~~~