Skip to content
This repository was archived by the owner on Oct 24, 2024. It is now read-only.

Data structures docs #103

Merged
merged 54 commits into from
Jun 26, 2022
Merged
Show file tree
Hide file tree
Changes from 48 commits
Commits
Show all changes
54 commits
Select commit Hold shift + click to select a range
22004e4
sketching out changes needed to integrate variables into DataTree
TomNicholas Sep 22, 2021
8f0518c
Merge branch 'main' into initial_integration_refactor
TomNicholas Apr 29, 2022
3a4f874
fixed some other basic conflicts
TomNicholas Apr 29, 2022
8c6a68a
fix mypy errors
TomNicholas Apr 29, 2022
b503b06
can create basic datatree node objects again
TomNicholas Apr 29, 2022
1efd7f2
child-variable name collisions dectected correctly
TomNicholas Apr 29, 2022
438d73a
in-progres
TomNicholas May 4, 2022
904eae3
merge in main
TomNicholas May 20, 2022
2ca1c1a
add _replace method
TomNicholas May 21, 2022
547d1ac
updated tests to assert identical instead of check .ds is expected_ds
TomNicholas May 21, 2022
6f78fcd
refactor .ds setter to use _replace
TomNicholas May 21, 2022
715ce49
refactor init to use _replace
TomNicholas May 21, 2022
edd2f67
refactor test tree to avoid init
TomNicholas May 21, 2022
b2c51aa
attempt at copy methods
TomNicholas May 22, 2022
a20e85f
rewrote implementation of .copy method
TomNicholas May 23, 2022
8387a1c
xfailing test for deepcopying
TomNicholas May 23, 2022
52ef23b
pseudocode implementation of DatasetView
TomNicholas May 24, 2022
4a5317e
Revert "pseudocode implementation of DatasetView"
TomNicholas May 24, 2022
b60a4af
removed duplicated implementation of copy
TomNicholas May 25, 2022
3077bf7
reorganise API docs
TomNicholas May 25, 2022
5368f8b
expose data_vars, coords etc. properties
TomNicholas May 26, 2022
cae0a4e
try except with calculate_dimensions private import
TomNicholas May 26, 2022
72af61c
add keys/values/items methods
TomNicholas May 27, 2022
ec11072
don't use has_data when .variables would do
TomNicholas May 27, 2022
7c2c4f8
explanation of basic properties
TomNicholas May 31, 2022
66b7adf
add data structures page to index
TomNicholas May 31, 2022
b61e940
revert adding documentation in favour of that going in a different PR
TomNicholas May 31, 2022
163e54d
explanation of basic properties
TomNicholas May 31, 2022
ab0dfe1
add data structures page to index
TomNicholas May 31, 2022
5c36b18
create tree node-by-node
TomNicholas Jun 1, 2022
c75fb0b
create tree from dict
TomNicholas Jun 1, 2022
a59ff54
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 1, 2022
91c7afd
dict-like interface
TomNicholas Jun 1, 2022
0e9b384
correct deepcopy tests
TomNicholas Jun 1, 2022
d56f89b
use .data_vars in copy tests
TomNicholas Jun 1, 2022
cf5051c
Merge branch 'main' into initial_integration_refactor
TomNicholas Jun 2, 2022
78e7faa
Merge branch 'data_structures_docs2' of https://github.com/TomNichola…
TomNicholas Jun 2, 2022
0910d79
Merge branch 'initial_integration_refactor' into data_structures_docs2
TomNicholas Jun 2, 2022
e191660
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 2, 2022
0c1dd29
black
TomNicholas Jun 2, 2022
d3ebd12
Merge branch 'data_structures_docs2' of https://github.com/TomNichola…
TomNicholas Jun 2, 2022
7bd3b47
Merge branch 'main' into data_structures_docs2
TomNicholas Jun 16, 2022
438a090
Merge branch 'main' into data_structures_docs2
TomNicholas Jun 16, 2022
3ce120b
whatsnew
TomNicholas Jun 16, 2022
fcd94a6
Merge branch 'data_structures_docs2' of https://github.com/TomNichola…
TomNicholas Jun 16, 2022
f87ef2f
data contents
TomNicholas Jun 16, 2022
44b8db5
dictionary-like access
TomNicholas Jun 16, 2022
02f63a2
TODOs
TomNicholas Jun 16, 2022
b74b94f
test assigning int
TomNicholas Jun 17, 2022
86f218b
allow assigning coercible values
TomNicholas Jun 17, 2022
94fd6c7
Merge branch 'main' of https://github.com/xarray-contrib/datatree
TomNicholas Jun 17, 2022
c947539
Merge branch 'main' into data_structures_docs2
TomNicholas Jun 17, 2022
657d5c9
simplify example using #115
TomNicholas Jun 17, 2022
566ca1a
add note about fully qualified names
TomNicholas Jun 26, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 13 additions & 3 deletions datatree/datatree.py
Original file line number Diff line number Diff line change
Expand Up @@ -352,6 +352,10 @@ def ds(self) -> DatasetView:
An immutable Dataset-like view onto the data in this node.

For a mutable Dataset containing the same data as in this node, use `.to_dataset()` instead.

See Also
--------
DataTree.to_dataset
"""
return DatasetView._from_node(self)

Expand Down Expand Up @@ -385,7 +389,13 @@ def _pre_attach(self: DataTree, parent: DataTree) -> None:
)

def to_dataset(self) -> Dataset:
"""Return the data in this node as a new xarray.Dataset object."""
"""
Return the data in this node as a new xarray.Dataset object.

See Also
--------
DataTree.ds
"""
return Dataset._construct_direct(
self._variables,
self._coord_names,
Expand Down Expand Up @@ -424,7 +434,7 @@ def variables(self) -> Mapping[Hashable, Variable]:

@property
def attrs(self) -> Dict[Hashable, Any]:
"""Dictionary of global attributes on this node"""
"""Dictionary of global attributes on this node object."""
if self._attrs is None:
self._attrs = {}
return self._attrs
Expand All @@ -435,7 +445,7 @@ def attrs(self, value: Mapping[Any, Any]) -> None:

@property
def encoding(self) -> Dict:
"""Dictionary of global encoding attributes on this node"""
"""Dictionary of global encoding attributes on this node object."""
if self._encoding is None:
self._encoding = {}
return self._encoding
Expand Down
211 changes: 211 additions & 0 deletions docs/source/data-structures.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,211 @@
.. _data structures:

Data Structures
===============

.. ipython:: python
:suppress:

import numpy as np
import pandas as pd
import xarray as xr
import datatree

np.random.seed(123456)
np.set_printoptions(threshold=10)

.. note::

This page builds on the information given in xarray's main page on
`data structures <https://docs.xarray.dev/en/stable/user-guide/data-structures.html>`_, so it is suggested that you
are familiar with those first.

DataTree
--------

:py:class:``DataTree`` is xarray's highest-level data structure, able to organise heterogeneous data which
could not be stored inside a single ``Dataset`` object. This includes representing the recursive structure of multiple
`groups`_ within a netCDF file or `Zarr Store`_.

.. _groups: https://www.unidata.ucar.edu/software/netcdf/workshops/2011/groups-types/GroupsIntro.html
.. _Zarr Store: https://zarr.readthedocs.io/en/stable/tutorial.html#groups

Each ``DataTree`` object (or "node") contains the same data that a single ``xarray.Dataset`` would (i.e. ``DataArray`` objects
stored under hashable keys), and so has the same key properties:

- ``dims``: a dictionary mapping of dimension names to lengths, for the variables in this node,
- ``data_vars``: a dict-like container of DataArrays corresponding to variables in this node,
- ``coords``: another dict-like container of DataArrays, corresponding to coordinate variables in this node,
- ``attrs``: dict to hold arbitary metadata relevant to data in this node.

A single ``DataTree`` object acts much like a single ``Dataset`` object, and has a similar set of dict-like methods
defined upon it. However, ``DataTree``'s can also contain other ``DataTree`` objects, so they can be thought of as nested dict-like
containers of both ``xarray.DataArray``'s and ``DataTree``'s.

A single datatree object is known as a "node", and its position relative to other nodes is defined by two more key
properties:

- ``children``: An ordered dictionary mapping from names to other ``DataTree`` objects, known as its' "child nodes".
- ``parent``: The single ``DataTree`` object whose children this datatree is a member of, known as its' "parent node".

Each child automatically knows about its parent node, and a node without a parent is known as a "root" node
(represented by the ``parent`` attribute pointing to ``None``).
Nodes can have multiple children, but as each child node has at most one parent, there can only ever be one root node in a given tree.

The overall structure is technically a `connected acyclic undirected rooted graph`, otherwise known as a
`"Tree" <https://en.wikipedia.org/wiki/Tree_(graph_theory)>`_.

.. note::

Technically a ``DataTree`` with more than one child node forms an `"Ordered Tree" <https://en.wikipedia.org/wiki/Tree_(graph_theory)#Ordered_tree>`_,
because the children are stored in an Ordered Dictionary. However, this distinction only really matters for a few
edge cases involving operations on multiple trees simultaneously, and can safely be ignored by most users.


``DataTree`` objects can also optionally have a ``name`` as well as ``attrs``, just like a ``DataArray``.
Again these are not normally used unless explicitly accessed by the user.


Creating a DataTree
~~~~~~~~~~~~~~~~~~~

There are two ways to create a ``DataTree`` from scratch. The first is to create each node individually,
specifying the nodes' relationship to one another as you create each one.

The ``DataTree`` constructor takes:

- ``data``: The data that will be stored in this node, represented by a single ``xarray.Dataset``, or a named ``xarray.DataArray``.
- ``parent``: The parent node (if there is one), given as a ``DataTree`` object.
- ``children``: The various child nodes (if there are any), given as a mapping from string keys to ``DataTree`` objects.
- ``name``: A string to use as the name of this node.

Let's make a datatree node without anything in it:

.. ipython:: python

from datatree import DataTree

# create root node
node1 = DataTree(name="Oak")

node1

At this point our node is also the root node, as every tree has a root node.

We can add a second node to this tree either by referring to the first node in the constructor of the second:

.. ipython:: python

# add a child by referring to the parent node
node2 = DataTree(name="Bonsai", parent=node1)

or by dynamically updating the attributes of one node to refer to another:

.. ipython:: python

# add a grandparent by updating the .parent property of an existing node
node0 = DataTree(name="General Sherman")
node1.parent = node0

Our tree now has three nodes within it, and one of the two new nodes has become the new root:

.. ipython:: python

node0

Is is at tree construction time that consistency checks are enforced. For instance, if we try to create a `cycle` the constructor will raise an error:

.. ipython:: python
:okexcept:

node0.parent = node2

The second way is to build the tree from a dictionary of filesystem-like paths and corresponding ``xarray.Dataset`` objects.

This relies on a syntax inspired by unix-like filesystems, where the "path" to a node is specified by the keys of each intermediate node in sequence,
separated by forward slashes. The root node is referred to by ``"/"``, so the path from our current root node to its grand-child would be ``"/Oak/Bonsai"``.

If we have a dictionary where each key is a valid path, and each value is either valid data or ``None``,
we can construct a complex tree quickly using the alternative constructor ``:py:func::DataTree.from_dict``:

.. ipython:: python

d = {
"/": xr.Dataset({"foo": "orange"}),
"/a": xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])}),
"/a/b": xr.Dataset({"zed": np.NaN}),
"a/c/d": None,
}
dt = DataTree.from_dict(d)
dt

Notice that this method will also create any intermediate empty node necessary to reach the end of the specified path
(i.e. the node labelled `"c"` in this case.)

Finally if you have a file containing data on disk (such as a netCDF file or a Zarr Store), you can also create a datatree by opening the
file using ``:py:func::~datatree.open_datatree``.


DataTree Contents
~~~~~~~~~~~~~~~~~

Like ``xarray.Dataset``, ``DataTree`` implements the python mapping interface, but with values given by either ``xarray.DataArray`` objects or other ``DataTree`` objects.

.. ipython:: python

dt["a"]
dt["foo"]

Iterating over keys will iterate over both the names of variables and child nodes.

We can also access all the data in a single node through a dataset-like view

.. ipython:: python

dt["a"].ds

This demonstrates the fact that the data in any one node is equivalent to the contents of a single ``xarray.Dataset`` object.
The ``DataTree.ds`` property returns an immutable view, but we can instead extract the node's data contents as a new (and mutable)
``xarray.Dataset`` object via ``.to_dataset()``:

.. ipython:: python

dt["a"].to_dataset()

Like with ``Dataset``, you can access the data and coordinate variables of a node separately via the ``data_vars`` and ``coords`` attributes:

.. ipython:: python

dt["a"].data_vars
dt["a"].coords


Dictionary-like methods
~~~~~~~~~~~~~~~~~~~~~~~

We can update the contents of the tree in-place using a dictionary-like syntax.

We can update a datatree in-place using Python's standard dictionary syntax, similar to how we can for Dataset objects.
For example, to create this example datatree from scratch, we could have written:

# TODO update this example allowing non-dataarray values to ``__setitem__``
# TODO update this example using ``.coords`` and ``.data_vars`` as setters,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you forget to do this, or are you leaving it for another PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some undocumented bugs I noticed with trying to use .coords since #41 I think, so I was going to leave that problem to a future PR.


.. ipython:: python

dt = DataTree()
dt["foo"] = xr.DataArray("orange")
dt["a"] = DataTree(data=xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])}))
dt["a/b/zed"] = xr.DataArray(np.NaN)
dt["a/c/d"] = DataTree()
dt

To change the variables in a node of a ``DataTree``, you can use all the standard dictionary
methods, including ``values``, ``items``, ``__delitem__``, ``get`` and
:py:meth:`~xarray.DataTree.update`.
Note that assigning a ``DataArray`` object to a ``DataTree`` variable using ``__setitem__`` or ``update`` will
:ref:`automatically align<update>` the array(s) to the original node's indexes.

If you copy a ``DataTree`` using the ``:py:func::copy`` function or the :py:meth:`~xarray.DataTree.copy` it will copy the entire tree,
including all parents and children.
Like for ``Dataset``, this copy is shallow by default, but you can copy all the data by calling ``dt.copy(deep=True)``.
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ Datatree
Installation <installation>
Quick Overview <quick-overview>
Tutorial <tutorial>
Data Model <data-structures>
API Reference <api>
How do I ... <howdoi>
Contributing Guide <contributing>
Expand Down
3 changes: 3 additions & 0 deletions docs/source/whats-new.rst
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,9 @@ Bug fixes
Documentation
~~~~~~~~~~~~~

- Added ``Data Structures`` page describing the internal structure of a ``DataTree`` object, and its relation to
``xarray.Dataset`` objects. (:pull:`103`)
By `Tom Nicholas <https://github.com/TomNicholas>`_.
- API page updated with all the methods that are copied from ``xarray.Dataset``. (:pull:`41`)
By `Tom Nicholas <https://github.com/TomNicholas>`_.

Expand Down