Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Dataset internals to store data variables and coordinate variables as separate dicts #9203

Open
TomNicholas opened this issue Jul 1, 2024 · 7 comments
Labels
topic-DataTree Related to the implementation of a DataTree class topic-internals

Comments

@TomNicholas
Copy link
Member

TomNicholas commented Jul 1, 2024

There's a lot of other discussion in #9063, but I wanted to pull out this suggestion for independent discussion:

  • Our list of internal attributes on a DataTree node is still not just those on a Dataset plus the inherited coordinates

This is indeed a bit of a con, but in my mind the right fix is probably to adjust the Dataset data model to using dictionaries of data_variables and coord_variables, rather than the current solution of a dict of variables and a set of coord_names. Using a separate dictionary for coord_variables would also be more aligned with how DataArray is implemented. The internal Dataset data model is a hold-over from the very early days of Xarray, before we had a notion of coordinate variables that are not indexes.

Originally posted by @shoyer in #9063 (comment)

@shoyer shoyer changed the title Refactor internals to store data variables and coordinate variables as separate dicts Refactor Dataset internals to store data variables and coordinate variables as separate dicts Jul 1, 2024
@TomNicholas
Copy link
Member Author

In fact could the coord_variables dict just be an actual xr.Coordinates object?

(idea from #9204 (comment))

@shoyer
Copy link
Member

shoyer commented Jul 2, 2024

In fact could the coord_variables dict just be an actual xr.Coordinates object?

I think Coordinates should remain a thin wrapper object. It needs access to both coordinate Variable objects and associated indexes.

@headtr1ck
Copy link
Collaborator

I was wondering about that already quite some time, haha.

Does this mean that handling of coordinates can be handled in the parent class DataWithCoords?

@TomNicholas
Copy link
Member Author

Does this mean that handling of coordinates can be handled in the parent class DataWithCoords?

You mean have all of Dataset, DataArray, and DataTree inherit from DataWithCoords, and add things like the .coords property to DataWithCoords? Relates to #9204 (comment).

@benbovy
Copy link
Member

benbovy commented Mar 11, 2025

The refactor of Dataset internals may (should?) be an opportunity to also refactor coordinates and indexes proxies. The current situation is quite messy and fragile:

  • Coordinates is a virtual "stand-alone" container of coordinate variables and their indexes that actually wraps a Dataset object (kind of weird but it was to avoid a bigger refactor)
  • DatasetCoordinates and DataArrayCoordinates (returned by Dataset.coords and DataArray.coords respectively) both inherit from Coordinates but are container proxies, which also implement in-place mutation of coordinate variables via access to their underlying Dataset or DataArray.
  • Indexes (returned by Dataset.indexes / Dataset.xindexes) is a container proxy of indexes but also of their associated coordinate variables.

A cleaner model might be (from #7368 (comment)):

  • Have one true stand-alone Coordinates container that stores data (coordinate variables, indexes and dimension sizes) and that is encapsulated in Dataset and DataArray (instead of having _coord_names, _coords, _indexes, etc. internal dictionaries)
  • Remove DatasetCoordinates and DataArrayCoordinates, which are not needed as dataset.coords and dataarray.coords would directly return their encapsulated Coordinates instance
  • Make Dataset.xindexes an alias of Dataset.coords.xindexes for convenience
  • Get rid of Indexes? Dataset.coords.xindexes could probably just return a frozen mapping instead and specific methods like Indexes.group_by_index() (unfortunately public API) could be progressively moved into Coordinates.

This would represent a much bigger refactor, though. One difficulty with this model is also that updating coordinates cannot be done totally independently of data variables (alignment).

@TomNicholas
Copy link
Member Author

A cleaner model might be (from #7368 (comment)):

  • Have one true stand-alone Coordinates container that stores data (coordinate variables, indexes and dimension sizes) and that is encapsulated in Dataset and DataArray (instead of having _coord_names, _coords, _indexes, etc. internal dictionaries)

This sounds great, but bear in mind that that plan was conceived pre-datatree-in-xarray. The DataTree class also has these attributes, but to implement the "coordinate inheritance" feature they often actually use ChainMap to dynamically construct them from coordinates stored on parent tree nodes. Any refactor design should play nicely with this too. See this part of the code:

def _indexes(self) -> ChainMap[Hashable, Index]:

@shoyer
Copy link
Member

shoyer commented Mar 11, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class topic-internals
Projects
None yet
Development

No branches or pull requests

4 participants