From d5072dce7282c755b202f67ad2700876d82ece3a Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Wed, 4 Dec 2024 16:32:03 -0500 Subject: [PATCH 01/24] add draft datatree blog post --- src/posts/datatree/index.md | 85 +++++++++++++++++++++++++++++++++++++ 1 file changed, 85 insertions(+) create mode 100644 src/posts/datatree/index.md diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md new file mode 100644 index 00000000..181bf977 --- /dev/null +++ b/src/posts/datatree/index.md @@ -0,0 +1,85 @@ +--- +title: 'Xarray x NASA: New xarray.DataTree for hierarchical data structures' +date: '2024-12-04' +authors: + - name: Tom Nicholas + github: tomnicholas + - name: Owen Littlejohns + github: owenlittlejohns + - name: Matt Savoie + github: flamingbear + - name: Eni Awowale + github: eni-awowale + - name: Alfonso Ladino + github: aladinor + - name: Justus Magin + github: keewis + - name: Stephan Hoyer + github: shoyer +summary: "The new xarray.DataTree class allows working with netCDF/Zarr groups, brought to you via collaboration with NASA!" +--- + +## TL;DR + +``xarray.DataTree`` has been released, and the prototype [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree) archived, after collaboration between the xarray team and [NASA ESDIS](https://www.earthdata.nasa.gov/about/esdis). + +ESDIS OR EOSDIS?? + +## Why trees? + +- Motivate why users wanted a hierarchical structure (e.g. https://github.com/pydata/xarray/issues/4118 and https://github.com/pydata/xarray/issues/1092) + +## What is a DataTree? + +- Very brief explanation of the solution we have ended up with + - Doesn't need to explain much about actually using datatree - that should be covered by pointing people to the docs. + +## A Big Addition! + +- Emphasise that this is a big deal + - Arguably the single largest feature added to xarray in 10 years? 
(I think it is by LoC) + - Metrics for # commits, LoC, contributors + - Some really gnarly design questions (link to issues about inheritance) + - For a decade there have been 3 public xarray data structures, now there are 4 (`Variable`, `DataArray`, `Dataset`, and now `DataTree`). LINKS +- Mention the prototype in xarray-contrib/datatree repo + - Explain how old repository is now archived + - And link to migration guide https://github.com/pydata/xarray/issues/8807#issuecomment-2338869819 + +## How did this happen? + +DataTree didn't get implemented overnight - it was a multi-year effort that took place in a number of steps. + +Initially, the xarray team applied for funding from the [Chan-Zuckerberg initiative]() LINK to develop something like datatree, citing bioscience use cases (e.g. microscopy image pyramids LINK). Unfortunately whilst we've been lucky to receive CZI funding before (LINK), on this occasion we didn't win money to work on the datatree idea. + +Without direct funding, Tom then used some time whilst at Columbia University to take a initial stab at the design in August 2021 - writing the first implementation on an overnight Amtrak! This simple prototype was released as a separate package in the [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree), and steadily gained a small community of intrepid users. + +A separate repository was chosen for speed of iteration, and to [avoid](https://github.com/xarray-contrib/datatree/blob/7ba05880c37f2371b5174f6e8dcfae31248fe19f/README.md#development-roadmap) giving the impression that these early experiments would have the same level of [long-term support promised](https://github.com/pydata/xarray/issues/9854) for code in xarray's main repo. However this meant that the prototype datatree was not fully integrated with xarray's main codebase, limiting possible features and requiring fragile dependencies on private xarray internals. 
+ +The prototype then sat there for 2 years, until NASA ESDIS approached the xarray core team in August 2023. ESDIS devs wanted the ability to work with entire hierarchical files, and had experimented with the prototype version of datatree, but they wanted datatree functionality to be migrated upstream into xarray `main` so there would be more guarantees of long-term API stability and support. + +Amazingly the NASA team were able to offer engineer time, so starting in early 2024 Owen, Matt, and Eni (NASA) then worked on migrating datatree into xarray upstream, with supervision from Tom, Justus, and Stephan (existing xarray core devs). + +This second stage of development allowed us to reduce the bus factor on the datatree code, sanity check the original approach, and it gave us a chance to make some signficant changes to the design without worrying too much about backwards-incompatibility (for example enabling the [new "coordinate inheritance" feature](https://docs.xarray.dev/en/stable/user-guide/hierarchical-data.html#alignment-and-coordinate-inheritance)). + +## Lessons for future collaborations + +This gradual process of moving from idea to prototype to robust implementation is arguably a better model of + + - Took a bit longer than anticipated but otherwise worked out quite well + - Got 3 new xarray core developers now - so NASA has more explicit representation + - Was a lot easier for xarray team not to have to write a proposal to get developer time + - Ideal in the sense of literally zero overhead + - Also core dev spending 10% time spent directing someone with more time is efficient use of relative expertise + - Less ideal that Tom/Justus/Stephan didn't get paid for the work + - In future better to have one of the paid people at the contributing org already be a core dev + - This approach could work again in future! + +- Officially added Owen, Matt and Eni to the xarray core team, which also gives NASA some direct representation. + +## Go try it out! 
+ +- Implore people to try datatree out, but also to report bugs / suggestions as it's still being built up to its full potential. + +## Thanks + +A number of other people also [contributed to datatree](https://github.com/xarray-contrib/datatree/graphs/contributors) in various ways - particular shoutout to [Alfonso Ladino](https://github.com/aladinor) and [Etinenne Schalk](https://github.com/etienneschalk) for their dedicated attendance at many of the weekly migration meetings! \ No newline at end of file From 9a94ebcf426e4ffb06adcc7e3ce603b3274e4185 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Wed, 4 Dec 2024 16:43:51 -0500 Subject: [PATCH 02/24] add funding acknowledgements --- src/posts/datatree/index.md | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 181bf977..405f54bf 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -49,9 +49,9 @@ ESDIS OR EOSDIS?? DataTree didn't get implemented overnight - it was a multi-year effort that took place in a number of steps. -Initially, the xarray team applied for funding from the [Chan-Zuckerberg initiative]() LINK to develop something like datatree, citing bioscience use cases (e.g. microscopy image pyramids LINK). Unfortunately whilst we've been lucky to receive CZI funding before (LINK), on this occasion we didn't win money to work on the datatree idea. +Initially, the xarray team applied for funding from the [Chan-Zuckerberg initiative]() (LINK?) in March 2021 to develop something like datatree, citing bioscience use cases (e.g. [microscopy image pyramids](https://spatialdata.scverse.org/en/latest/design_doc.html)). Unfortunately whilst we've been lucky to [receive CZI funding before](https://chanzuckerberg.com/eoss/proposals/xarray-multidimensional-labeled-arrays-and-datasets-in-python/), on this occasion we didn't win money to work on the datatree idea. 
-Without direct funding, Tom then used some time whilst at Columbia University to take a initial stab at the design in August 2021 - writing the first implementation on an overnight Amtrak! This simple prototype was released as a separate package in the [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree), and steadily gained a small community of intrepid users. +In the abcense of dedicated funding for datatree, Tom then used some time whilst at the [Climate Data Science Lab](https://ocean-transport.github.io/cds_lab.html) at Columbia University to take a initial stab at the design in August 2021 - writing the first implementation on an overnight Amtrak! This simple prototype was released as a separate package in the [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree), and steadily gained a small community of intrepid users. A separate repository was chosen for speed of iteration, and to [avoid](https://github.com/xarray-contrib/datatree/blob/7ba05880c37f2371b5174f6e8dcfae31248fe19f/README.md#development-roadmap) giving the impression that these early experiments would have the same level of [long-term support promised](https://github.com/pydata/xarray/issues/9854) for code in xarray's main repo. However this meant that the prototype datatree was not fully integrated with xarray's main codebase, limiting possible features and requiring fragile dependencies on private xarray internals. @@ -82,4 +82,9 @@ This gradual process of moving from idea to prototype to robust implementation i ## Thanks -A number of other people also [contributed to datatree](https://github.com/xarray-contrib/datatree/graphs/contributors) in various ways - particular shoutout to [Alfonso Ladino](https://github.com/aladinor) and [Etinenne Schalk](https://github.com/etienneschalk) for their dedicated attendance at many of the weekly migration meetings! 
\ No newline at end of file +A number of other people also [contributed to datatree](https://github.com/xarray-contrib/datatree/graphs/contributors) in various ways - particular shoutout to [Alfonso Ladino](https://github.com/aladinor) and [Etinenne Schalk](https://github.com/etienneschalk) for their dedicated attendance at many of the weekly migration meetings! + +## Funding Acknowledgements + +- Owen, Eni, and Matt were able to contribute development time thanks to NASA ESDIS. +- Tom was supported first by the Gordon and Betty Moore foundation as part of Ryan Abernathey's [Climate Data Science Lab](https://ocean-transport.github.io/cds_lab.html) at Columbia University, then later by various funders for a fraction of his time through [[C]Worthy](https://www.cworthy.org/). \ No newline at end of file From d1bdad25de8f5fc085c5623fafce627bd78e8afa Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Wed, 4 Dec 2024 17:27:00 -0500 Subject: [PATCH 03/24] reflections on collaboration --- src/posts/datatree/index.md | 31 ++++++++++++++++++++----------- 1 file changed, 20 insertions(+), 11 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 405f54bf..eb49fa9c 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -63,20 +63,29 @@ This second stage of development allowed us to reduce the bus factor on the data ## Lessons for future collaborations -This gradual process of moving from idea to prototype to robust implementation is arguably a better model of +This development story is different from the more typical scientific grant funding model - how did that work out for us? 
- - Took a bit longer than anticipated but otherwise worked out quite well - - Got 3 new xarray core developers now - so NASA has more explicit representation - - Was a lot easier for xarray team not to have to write a proposal to get developer time - - Ideal in the sense of literally zero overhead - - Also core dev spending 10% time spent directing someone with more time is efficient use of relative expertise - - Less ideal that Tom/Justus/Stephan didn't get paid for the work - - In future better to have one of the paid people at the contributing org already be a core dev - - This approach could work again in future! +The scientific grant model for funding software expects you to present a full idea in a proposal, wait 6-12 months to hopefully get funding for it, then implement the whole thing during the grant period. In contrast datatree evolved over a gradual process of moving from ideas to hacky prototype to robust implementation, with big time gaps for user feedback and experimentation. The migration was completed by developer-users who actually wanted the feature, rather than grant awardees working in service of a separate and maybe-theoretical userbase. -- Officially added Owen, Matt and Eni to the xarray core team, which also gives NASA some direct representation. +Overall while the migration effort took longer than anticipated we found it worked out quite well! -## Go try it out! +### Pros: +- **Literally zero overhead** - the existing xarray team did not to have to write a proposal to get developer time, and there was literally zero paperwork inflicted (on them at least). +- **Certainty of funding** - writing grant proposals is a lottery, so the time invested up front doesn't even come with any certainty of funding. Collaborating with another org has a much higher chance of actually leading to more money being available for developer time. 
+- **Time efficient** - a xarray core dev spending 10% of their time directing someone who is less familiar with the codebase but has more time is an efficient use of relative expertise. +- **Bus factor** - the new contributors reduce the bus factor on the datatree code dramatically. +- **User-driven Development** - it makes sense to have actual interested user communities involved in development. +- **Stakeholder representation** - after officially adding Owen, Matt and Eni to the [xarray core team](https://xarray.dev/team), NASA ESDIS has some direct representation in, insider understanding of, and stake in continuing to support the xarray project. + +### Cons: +- **Not everyone got direct funding** - it's less ideal that Tom, Justus, and Stephan didn't get direct funding for their supervisory work. In future it might be better to have one of the paid people at the contributing org already be a core xarray team member. +- **Tricky to accurately scope** out the duration of required work in advance, and hard to "just ship it". We hold the xarray project to high standards and backwards compatibility promises so we want to ensure that any publicly released features don't compromise on quality. + +This contributing model is more similar to how OSS has historically been supported by industry, but perhaps because xarray is primarily developed and used by the scientific community we tend to default to more grant-based funding models. + +Overall this could work again in future! So if there is an xarray or xarray-adjacent feature your organisation would like to see, **please reach out to us**. + +## Go try out `DataTree`! - Implore people to try datatree out, but also to report bugs / suggestions as it's still being built up to its full potential. 
From 04b5b022752b709558610106d4725d5f2f3c6cf8 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Wed, 4 Dec 2024 18:04:18 -0500 Subject: [PATCH 04/24] emphasise size of feature addition --- src/posts/datatree/index.md | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index eb49fa9c..ddc71f72 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -34,13 +34,16 @@ ESDIS OR EOSDIS?? - Very brief explanation of the solution we have ended up with - Doesn't need to explain much about actually using datatree - that should be covered by pointing people to the docs. -## A Big Addition! +## Big moves + +This was a big feature addition! For a [decade](https://github.com/pydata/xarray/discussions/8462) there have been 3 core public xarray data structures, now there are 4: [`Variable`](https://docs.xarray.dev/en/stable/generated/xarray.Variable.html#xarray.Variable), [`DataArray`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html#xarray.DataArray), [`Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html#xarray.Dataset), and now [`DataTree`](https://docs.xarray.dev/en/stable/generated/xarray.DataTree.html#xarray.DataTree). + +Datatree represents arguably the single largest feature added to xarray in 10 years - the migration alone added >10k lines of code across [80 PRs](https://github.com/pydata/xarray/pulls?q=is%3Apr+label%3Atopic-DataTree+is%3Aclosed), and the datatree code now contains contributions from at least 25 people. + +We also had to resolve some really [gnarly design questions](https://github.com/pydata/xarray/pull/9063) to make it work in a way we were happy with. + +## Deprecation -- Emphasise that this is a big deal - - Arguably the single largest feature added to xarray in 10 years? 
(I think it is by LoC) - - Metrics for # commits, LoC, contributors - - Some really gnarly design questions (link to issues about inheritance) - - For a decade there have been 3 public xarray data structures, now there are 4 (`Variable`, `DataArray`, `Dataset`, and now `DataTree`). LINKS - Mention the prototype in xarray-contrib/datatree repo - Explain how old repository is now archived - And link to migration guide https://github.com/pydata/xarray/issues/8807#issuecomment-2338869819 @@ -55,9 +58,9 @@ In the abcense of dedicated funding for datatree, Tom then used some time whilst A separate repository was chosen for speed of iteration, and to [avoid](https://github.com/xarray-contrib/datatree/blob/7ba05880c37f2371b5174f6e8dcfae31248fe19f/README.md#development-roadmap) giving the impression that these early experiments would have the same level of [long-term support promised](https://github.com/pydata/xarray/issues/9854) for code in xarray's main repo. However this meant that the prototype datatree was not fully integrated with xarray's main codebase, limiting possible features and requiring fragile dependencies on private xarray internals. -The prototype then sat there for 2 years, until NASA ESDIS approached the xarray core team in August 2023. ESDIS devs wanted the ability to work with entire hierarchical files, and had experimented with the prototype version of datatree, but they wanted datatree functionality to be migrated upstream into xarray `main` so there would be more guarantees of long-term API stability and support. +The prototype then sat there for 2 years, until NASA ESDIS approached the xarray core team in August 2023. ESDIS devs wanted the ability to work with entire hierarchical files, and had experimented with the prototype version of datatree, but they wanted datatree functionality to be migrated upstream into xarray's main repository so there would be more guarantees of long-term API stability and support. 
-Amazingly the NASA team were able to offer engineer time, so starting in early 2024 Owen, Matt, and Eni (NASA) then worked on migrating datatree into xarray upstream, with supervision from Tom, Justus, and Stephan (existing xarray core devs). +Amazingly the NASA team were able to offer engineer time, so starting in early 2024 Owen, Matt, and Eni (NASA) worked on migrating datatree into xarray upstream, with regular supervision from Tom, Justus, and Stephan (existing xarray core devs). This second stage of development allowed us to reduce the bus factor on the datatree code, sanity check the original approach, and it gave us a chance to make some signficant changes to the design without worrying too much about backwards-incompatibility (for example enabling the [new "coordinate inheritance" feature](https://docs.xarray.dev/en/stable/user-guide/hierarchical-data.html#alignment-and-coordinate-inheritance)). From fffff6f2c4235aed41b7670fce37b0b6592037a3 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Wed, 4 Dec 2024 18:13:44 -0500 Subject: [PATCH 05/24] but why trees? --- src/posts/datatree/index.md | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index ddc71f72..a76c64c9 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -16,18 +16,16 @@ authors: github: keewis - name: Stephan Hoyer github: shoyer -summary: "The new xarray.DataTree class allows working with netCDF/Zarr groups, brought to you via collaboration with NASA!" +summary: "The new xarray.DataTree class allows working with netCDF/Zarr groups, brought to you in collaboration with NASA!" --- ## TL;DR ``xarray.DataTree`` has been released, and the prototype [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree) archived, after collaboration between the xarray team and [NASA ESDIS](https://www.earthdata.nasa.gov/about/esdis). -ESDIS OR EOSDIS?? - ## Why trees? 
-- Motivate why users wanted a hierarchical structure (e.g. https://github.com/pydata/xarray/issues/4118 and https://github.com/pydata/xarray/issues/1092) +Xarray users have been [asking](https://github.com/pydata/xarray/issues/4118) for a way to handle multiple netCDF4 groups [since at least 2016](https://github.com/pydata/xarray/issues/1092). Such netCDF4/Zarr groups are the on-disk representation of a general problem of handling hierarchies of related but non-alignable array data. Real-world datasets often fall into this category, and users want a way to work with such hierarchical data in-memory and a way to interact with it on disk. ## What is a DataTree? @@ -54,7 +52,7 @@ DataTree didn't get implemented overnight - it was a multi-year effort that took Initially, the xarray team applied for funding from the [Chan-Zuckerberg initiative]() (LINK?) in March 2021 to develop something like datatree, citing bioscience use cases (e.g. [microscopy image pyramids](https://spatialdata.scverse.org/en/latest/design_doc.html)). Unfortunately whilst we've been lucky to [receive CZI funding before](https://chanzuckerberg.com/eoss/proposals/xarray-multidimensional-labeled-arrays-and-datasets-in-python/), on this occasion we didn't win money to work on the datatree idea. -In the abcense of dedicated funding for datatree, Tom then used some time whilst at the [Climate Data Science Lab](https://ocean-transport.github.io/cds_lab.html) at Columbia University to take a initial stab at the design in August 2021 - writing the first implementation on an overnight Amtrak! This simple prototype was released as a separate package in the [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree), and steadily gained a small community of intrepid users. 
+In the absence of dedicated funding for datatree, Tom then used some time whilst at the [Climate Data Science Lab](https://ocean-transport.github.io/cds_lab.html) at Columbia University to take a initial stab at the design in August 2021 - writing the first implementation on an overnight Amtrak! This simple prototype was released as a separate package in the [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree), and steadily gained a small community of intrepid users. It was driven partly by the use case of [climate model intercomparison datasets](https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114). A separate repository was chosen for speed of iteration, and to [avoid](https://github.com/xarray-contrib/datatree/blob/7ba05880c37f2371b5174f6e8dcfae31248fe19f/README.md#development-roadmap) giving the impression that these early experiments would have the same level of [long-term support promised](https://github.com/pydata/xarray/issues/9854) for code in xarray's main repo. However this meant that the prototype datatree was not fully integrated with xarray's main codebase, limiting possible features and requiring fragile dependencies on private xarray internals. @@ -86,7 +84,7 @@ Overall while the migration effort took longer than anticipated we found it work This contributing model is more similar to how OSS has historically been supported by industry, but perhaps because xarray is primarily developed and used by the scientific community we tend to default to more grant-based funding models. -Overall this could work again in future! So if there is an xarray or xarray-adjacent feature your organisation would like to see, **please reach out to us**. +Overall this type of collaboration could work again in future! So if there is an xarray or xarray-adjacent feature your organisation would like to see, **please reach out to us**. ## Go try out `DataTree`! 
From 8948f0187665cbb71ab0f13ea91bc86f17058b40 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Wed, 4 Dec 2024 18:29:15 -0500 Subject: [PATCH 06/24] what is a datatree --- src/posts/datatree/index.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index a76c64c9..7790246d 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -25,18 +25,20 @@ summary: "The new xarray.DataTree class allows working with netCDF/Zarr groups, ## Why trees? -Xarray users have been [asking](https://github.com/pydata/xarray/issues/4118) for a way to handle multiple netCDF4 groups [since at least 2016](https://github.com/pydata/xarray/issues/1092). Such netCDF4/Zarr groups are the on-disk representation of a general problem of handling hierarchies of related but non-alignable array data. Real-world datasets often fall into this category, and users want a way to work with such hierarchical data in-memory and a way to interact with it on disk. +Xarray users have been [asking](https://github.com/pydata/xarray/issues/4118) for a way to handle multiple netCDF4 groups [since at least 2016](https://github.com/pydata/xarray/issues/1092). Such netCDF4/Zarr groups are the on-disk representation of a general problem of handling hierarchies of related but non-alignable array data. Real-world datasets often fall into this category, and users wanted a way to work with such hierarchical data in-memory and a way to interact with it on disk. ## What is a DataTree? -- Very brief explanation of the solution we have ended up with - - Doesn't need to explain much about actually using datatree - that should be covered by pointing people to the docs. +Our solution is the new high-level container class `xarray.DataTree`. +It acts like a tree of linked `xarray.Dataset` objects, with alignment enforced between variables in sibling nodes but not between parents and children. 
It can be written to and opened from formats containing multiple groups, such as netCDF4 files and Zarr stores. + +For more details please see the [high-level description](https://docs.xarray.dev/en/stable/user-guide/data-structures.html#datatree) and the [dedicated page on hierarchical data](https://docs.xarray.dev/en/stable/user-guide/hierarchical-data.html), and the [section on IO with groups](https://docs.xarray.dev/en/stable/user-guide/io.html#groups) in the xarray documentation. ## Big moves This was a big feature addition! For a [decade](https://github.com/pydata/xarray/discussions/8462) there have been 3 core public xarray data structures, now there are 4: [`Variable`](https://docs.xarray.dev/en/stable/generated/xarray.Variable.html#xarray.Variable), [`DataArray`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html#xarray.DataArray), [`Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html#xarray.Dataset), and now [`DataTree`](https://docs.xarray.dev/en/stable/generated/xarray.DataTree.html#xarray.DataTree). -Datatree represents arguably the single largest feature added to xarray in 10 years - the migration alone added >10k lines of code across [80 PRs](https://github.com/pydata/xarray/pulls?q=is%3Apr+label%3Atopic-DataTree+is%3Aclosed), and the datatree code now contains contributions from at least 25 people. +Datatree represents arguably one of the single largest new features added to xarray in 10 years - the migration of the existing prototype alone added >10k lines of code across [80 PRs](https://github.com/pydata/xarray/pulls?q=is%3Apr+label%3Atopic-DataTree+is%3Aclosed), and the resulting datatree code now contains contributions from at least 25 people. We also had to resolve some really [gnarly design questions](https://github.com/pydata/xarray/pull/9063) to make it work in a way we were happy with. 
From b145ccf029982b4f66a4d47dd5aa2ae8d235cf25 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Wed, 4 Dec 2024 18:33:05 -0500 Subject: [PATCH 07/24] deprecation of the old repo --- src/posts/datatree/index.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 7790246d..5e0fa896 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -34,6 +34,10 @@ It acts like a tree of linked `xarray.Dataset` objects, with alignment enforced For more details please see the [high-level description](https://docs.xarray.dev/en/stable/user-guide/data-structures.html#datatree) and the [dedicated page on hierarchical data](https://docs.xarray.dev/en/stable/user-guide/hierarchical-data.html), and the [section on IO with groups](https://docs.xarray.dev/en/stable/user-guide/io.html#groups) in the xarray documentation. +## Deprecation + +If you previously had used the `DataTree` prototype in the [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree), that has now been archived and will no longer be supported. Instead we encourage you to migrate to the implementation of `DataTree` that you can import from xarray, following the [migration guide](https://github.com/pydata/xarray/blob/main/DATATREE_MIGRATION_GUIDE.md). + ## Big moves This was a big feature addition! For a [decade](https://github.com/pydata/xarray/discussions/8462) there have been 3 core public xarray data structures, now there are 4: [`Variable`](https://docs.xarray.dev/en/stable/generated/xarray.Variable.html#xarray.Variable), [`DataArray`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html#xarray.DataArray), [`Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html#xarray.Dataset), and now [`DataTree`](https://docs.xarray.dev/en/stable/generated/xarray.DataTree.html#xarray.DataTree). 
@@ -42,11 +46,7 @@ Datatree represents arguably one of the single largest new features added to xar We also had to resolve some really [gnarly design questions](https://github.com/pydata/xarray/pull/9063) to make it work in a way we were happy with. -## Deprecation -- Mention the prototype in xarray-contrib/datatree repo - - Explain how old repository is now archived - - And link to migration guide https://github.com/pydata/xarray/issues/8807#issuecomment-2338869819 ## How did this happen? From 1e4735cb88615c1b0d001c8114b18364c5ecfdea Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Wed, 4 Dec 2024 18:37:05 -0500 Subject: [PATCH 08/24] try it out --- src/posts/datatree/index.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 5e0fa896..1601b809 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -21,7 +21,7 @@ summary: "The new xarray.DataTree class allows working with netCDF/Zarr groups, ## TL;DR -``xarray.DataTree`` has been released, and the prototype [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree) archived, after collaboration between the xarray team and [NASA ESDIS](https://www.earthdata.nasa.gov/about/esdis). +``xarray.DataTree`` has been released in [v2024.10.0](https://github.com/pydata/xarray/releases/tag/v2024.10.0), and the prototype [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree) archived, after collaboration between the xarray team and [NASA ESDIS](https://www.earthdata.nasa.gov/about/esdis). ## Why trees? @@ -46,8 +46,6 @@ Datatree represents arguably one of the single largest new features added to xar We also had to resolve some really [gnarly design questions](https://github.com/pydata/xarray/pull/9063) to make it work in a way we were happy with. - - ## How did this happen? 
DataTree didn't get implemented overnight - it was a multi-year effort that took place in a number of steps. @@ -90,7 +88,9 @@ Overall this type of collaboration could work again in future! So if there is an ## Go try out `DataTree`! -- Implore people to try datatree out, but also to report bugs / suggestions as it's still being built up to its full potential. +Please try datatree out! The hierarchical structure is potentially useful to any xarray users who work with more than one dataset at a time. + +Be aware that as `xarray.DataTree` is still new there will likely be some bugs lurking, as well as as-yet [unimplemented features](https://github.com/pydata/xarray/issues?q=is%3Aissue+is%3Aopen+label%3Atopic-DataTree) (as there always are)! ## Thanks From 697310ea48b4edfbf652f3249594ba933e1e1eac Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Wed, 4 Dec 2024 23:47:19 +0000 Subject: [PATCH 09/24] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- src/posts/datatree/index.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 1601b809..89b59062 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -16,12 +16,12 @@ authors: github: keewis - name: Stephan Hoyer github: shoyer -summary: "The new xarray.DataTree class allows working with netCDF/Zarr groups, brought to you in collaboration with NASA!" +summary: 'The new xarray.DataTree class allows working with netCDF/Zarr groups, brought to you in collaboration with NASA!' 
--- ## TL;DR -``xarray.DataTree`` has been released in [v2024.10.0](https://github.com/pydata/xarray/releases/tag/v2024.10.0), and the prototype [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree) archived, after collaboration between the xarray team and [NASA ESDIS](https://www.earthdata.nasa.gov/about/esdis). +`xarray.DataTree` has been released in [v2024.10.0](https://github.com/pydata/xarray/releases/tag/v2024.10.0), and the prototype [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree) archived, after collaboration between the xarray team and [NASA ESDIS](https://www.earthdata.nasa.gov/about/esdis). ## Why trees? @@ -56,7 +56,7 @@ In the absence of dedicated funding for datatree, Tom then used some time whilst A separate repository was chosen for speed of iteration, and to [avoid](https://github.com/xarray-contrib/datatree/blob/7ba05880c37f2371b5174f6e8dcfae31248fe19f/README.md#development-roadmap) giving the impression that these early experiments would have the same level of [long-term support promised](https://github.com/pydata/xarray/issues/9854) for code in xarray's main repo. However this meant that the prototype datatree was not fully integrated with xarray's main codebase, limiting possible features and requiring fragile dependencies on private xarray internals. -The prototype then sat there for 2 years, until NASA ESDIS approached the xarray core team in August 2023. ESDIS devs wanted the ability to work with entire hierarchical files, and had experimented with the prototype version of datatree, but they wanted datatree functionality to be migrated upstream into xarray's main repository so there would be more guarantees of long-term API stability and support. +The prototype then sat there for 2 years, until NASA ESDIS approached the xarray core team in August 2023. 
ESDIS devs wanted the ability to work with entire hierarchical files, and had experimented with the prototype version of datatree, but they wanted datatree functionality to be migrated upstream into xarray's main repository so there would be more guarantees of long-term API stability and support. Amazingly the NASA team were able to offer engineer time, so starting in early 2024 Owen, Matt, and Eni (NASA) worked on migrating datatree into xarray upstream, with regular supervision from Tom, Justus, and Stephan (existing xarray core devs). @@ -71,6 +71,7 @@ The scientific grant model for funding software expects you to present a full id Overall while the migration effort took longer than anticipated we found it worked out quite well! ### Pros: + - **Literally zero overhead** - the existing xarray team did not to have to write a proposal to get developer time, and there was literally zero paperwork inflicted (on them at least). - **Certainty of funding** - writing grant proposals is a lottery, so the time invested up front doesn't even come with any certainty of funding. Collaborating with another org has a much higher chance of actually leading to more money being available for developer time. - **Time efficient** - a xarray core dev spending 10% of their time directing someone who is less familiar with the codebase but has more time is an efficient use of relative expertise. @@ -79,6 +80,7 @@ Overall while the migration effort took longer than anticipated we found it work - **Stakeholder representation** - after officially adding Owen, Matt and Eni to the [xarray core team](https://xarray.dev/team), NASA ESDIS has some direct representation in, insider understanding of, and stake in continuing to support the xarray project. ### Cons: + - **Not everyone got direct funding** - it's less ideal that Tom, Justus, and Stephan didn't get direct funding for their supervisory work. 
In future it might be better to have one of the paid people at the contributing org already be a core xarray team member. - **Tricky to accurately scope** out the duration of required work in advance, and hard to "just ship it". We hold the xarray project to high standards and backwards compatibility promises so we want to ensure that any publicly released features don't compromise on quality. @@ -99,4 +101,4 @@ A number of other people also [contributed to datatree](https://github.com/xarra ## Funding Acknowledgements - Owen, Eni, and Matt were able to contribute development time thanks to NASA ESDIS. -- Tom was supported first by the Gordon and Betty Moore foundation as part of Ryan Abernathey's [Climate Data Science Lab](https://ocean-transport.github.io/cds_lab.html) at Columbia University, then later by various funders for a fraction of his time through [[C]Worthy](https://www.cworthy.org/). \ No newline at end of file +- Tom was supported first by the Gordon and Betty Moore foundation as part of Ryan Abernathey's [Climate Data Science Lab](https://ocean-transport.github.io/cds_lab.html) at Columbia University, then later by various funders for a fraction of his time through [[C]Worthy](https://www.cworthy.org/). From 5148990fc7b3dbe61a1a51d8f7f3ef4f2c6f9847 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Thu, 5 Dec 2024 00:47:50 -0500 Subject: [PATCH 10/24] link to CZI proposal --- src/posts/datatree/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 89b59062..c866d7fb 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -50,11 +50,11 @@ We also had to resolve some really [gnarly design questions](https://github.com/ DataTree didn't get implemented overnight - it was a multi-year effort that took place in a number of steps. -Initially, the xarray team applied for funding from the [Chan-Zuckerberg initiative]() (LINK?) 
in March 2021 to develop something like datatree, citing bioscience use cases (e.g. [microscopy image pyramids](https://spatialdata.scverse.org/en/latest/design_doc.html)). Unfortunately whilst we've been lucky to [receive CZI funding before](https://chanzuckerberg.com/eoss/proposals/xarray-multidimensional-labeled-arrays-and-datasets-in-python/), on this occasion we didn't win money to work on the datatree idea. +In March 2021, the xarray team submitted a [funding proposal](https://zenodo.org/records/5484176) to the [Chan-Zuckerberg Initiative](https://chanzuckerberg.com/eoss/) to develop "TreeDataset", citing bioscience use cases such as [microscopy image pyramids](https://spatialdata.scverse.org/en/latest/design_doc.html). Unfortunately whilst we've been lucky to [receive CZI funding before](https://chanzuckerberg.com/eoss/proposals/xarray-multidimensional-labeled-arrays-and-datasets-in-python/), on this occasion we didn't win money to work on the datatree idea. In the absence of dedicated funding for datatree, Tom then used some time whilst at the [Climate Data Science Lab](https://ocean-transport.github.io/cds_lab.html) at Columbia University to take a initial stab at the design in August 2021 - writing the first implementation on an overnight Amtrak! This simple prototype was released as a separate package in the [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree), and steadily gained a small community of intrepid users. It was driven partly by the use case of [climate model intercomparison datasets](https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114). 
-A separate repository was chosen for speed of iteration, and to [avoid](https://github.com/xarray-contrib/datatree/blob/7ba05880c37f2371b5174f6e8dcfae31248fe19f/README.md#development-roadmap) giving the impression that these early experiments would have the same level of [long-term support promised](https://github.com/pydata/xarray/issues/9854) for code in xarray's main repo. However this meant that the prototype datatree was not fully integrated with xarray's main codebase, limiting possible features and requiring fragile dependencies on private xarray internals. +A separate repository was chosen for speed of iteration, and to [avoid](https://github.com/xarray-contrib/datatree/blob/7ba05880c37f2371b5174f6e8dcfae31248fe19f/README.md#development-roadmap) giving the impression that these early experiments would have the same level of [long-term support promised](https://github.com/pydata/xarray/issues/9854) for code in xarray's main repo. However this meant that the prototype `datatree` library was not fully integrated with xarray's main codebase, limiting possible features and requiring fragile dependencies on private xarray internals. The prototype then sat there for 2 years, until NASA ESDIS approached the xarray core team in August 2023. ESDIS devs wanted the ability to work with entire hierarchical files, and had experimented with the prototype version of datatree, but they wanted datatree functionality to be migrated upstream into xarray's main repository so there would be more guarantees of long-term API stability and support. 
From 80011a6367fe0fd9b88bbea7e157c2756f64ea43 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Thu, 5 Dec 2024 10:17:07 -0500 Subject: [PATCH 11/24] various small fixes after viewing rendered build --- src/posts/datatree/index.md | 51 ++++++++++++++++++++++--------------- 1 file changed, 31 insertions(+), 20 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index c866d7fb..5cd43017 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -1,5 +1,5 @@ --- -title: 'Xarray x NASA: New xarray.DataTree for hierarchical data structures' +title: 'Xarray x NASA: xarray.DataTree for hierarchical data structures' date: '2024-12-04' authors: - name: Tom Nicholas @@ -19,13 +19,13 @@ authors: summary: 'The new xarray.DataTree class allows working with netCDF/Zarr groups, brought to you in collaboration with NASA!' --- -## TL;DR +## tl;dr -`xarray.DataTree` has been released in [v2024.10.0](https://github.com/pydata/xarray/releases/tag/v2024.10.0), and the prototype [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree) archived, after collaboration between the xarray team and [NASA ESDIS](https://www.earthdata.nasa.gov/about/esdis). +[`xarray.DataTree`](https://docs.xarray.dev/en/stable/user-guide/data-structures.html#datatree) has been [released](https://github.com/pydata/xarray/discussions/9680) in [v2024.10.0](https://github.com/pydata/xarray/releases/tag/v2024.10.0), and the prototype [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree) archived, after collaboration between the xarray team and [NASA ESDIS](https://www.earthdata.nasa.gov/about/esdis). 🤝 ## Why trees? -Xarray users have been [asking](https://github.com/pydata/xarray/issues/4118) for a way to handle multiple netCDF4 groups [since at least 2016](https://github.com/pydata/xarray/issues/1092). 
Such netCDF4/Zarr groups are the on-disk representation of a general problem of handling hierarchies of related but non-alignable array data. Real-world datasets often fall into this category, and users wanted a way to work with such hierarchical data in-memory and a way to interact with it on disk. +Xarray users have been [asking](https://github.com/pydata/xarray/issues/4118) for a way to handle multiple netCDF4 groups [since at least 2016](https://github.com/pydata/xarray/issues/1092). Such netCDF4/Zarr groups are the on-disk representation of a general problem of handling hierarchies of related but non-alignable array data. Real-world datasets (such as [climate model intercomparisons]https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114)) often fall into this category, and users wanted a way to work with such hierarchical data in-memory and a way to interact with it on disk. ## What is a DataTree? @@ -42,61 +42,72 @@ If you previously had used the `DataTree` prototype in the [`xarray-contrib/data This was a big feature addition! For a [decade](https://github.com/pydata/xarray/discussions/8462) there have been 3 core public xarray data structures, now there are 4: [`Variable`](https://docs.xarray.dev/en/stable/generated/xarray.Variable.html#xarray.Variable), [`DataArray`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html#xarray.DataArray), [`Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html#xarray.Dataset), and now [`DataTree`](https://docs.xarray.dev/en/stable/generated/xarray.DataTree.html#xarray.DataTree). -Datatree represents arguably one of the single largest new features added to xarray in 10 years - the migration of the existing prototype alone added >10k lines of code across [80 PRs](https://github.com/pydata/xarray/pulls?q=is%3Apr+label%3Atopic-DataTree+is%3Aclosed), and the resulting datatree code now contains contributions from at least 25 people. 
+Datatree represents arguably one of the single largest new features added to xarray in 10 years - the migration of the existing prototype alone added >10k lines of code across [80 pull requests](https://github.com/pydata/xarray/pulls?q=is%3Apr+label%3Atopic-DataTree+is%3Aclosed), and the resulting datatree implementation now contains contributions from at least 25 people. We also had to resolve some really [gnarly design questions](https://github.com/pydata/xarray/pull/9063) to make it work in a way we were happy with. ## How did this happen? -DataTree didn't get implemented overnight - it was a multi-year effort that took place in a number of steps. +DataTree didn't get implemented overnight - it was a multi-year effort that took place in a number of steps, and there are some lessons to be learned from the story. In March 2021, the xarray team submitted a [funding proposal](https://zenodo.org/records/5484176) to the [Chan-Zuckerberg Initiative](https://chanzuckerberg.com/eoss/) to develop "TreeDataset", citing bioscience use cases such as [microscopy image pyramids](https://spatialdata.scverse.org/en/latest/design_doc.html). Unfortunately whilst we've been lucky to [receive CZI funding before](https://chanzuckerberg.com/eoss/proposals/xarray-multidimensional-labeled-arrays-and-datasets-in-python/), on this occasion we didn't win money to work on the datatree idea. In the absence of dedicated funding for datatree, Tom then used some time whilst at the [Climate Data Science Lab](https://ocean-transport.github.io/cds_lab.html) at Columbia University to take a initial stab at the design in August 2021 - writing the first implementation on an overnight Amtrak! This simple prototype was released as a separate package in the [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree), and steadily gained a small community of intrepid users. 
It was driven partly by the use case of [climate model intercomparison datasets](https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114). -A separate repository was chosen for speed of iteration, and to [avoid](https://github.com/xarray-contrib/datatree/blob/7ba05880c37f2371b5174f6e8dcfae31248fe19f/README.md#development-roadmap) giving the impression that these early experiments would have the same level of [long-term support promised](https://github.com/pydata/xarray/issues/9854) for code in xarray's main repo. However this meant that the prototype `datatree` library was not fully integrated with xarray's main codebase, limiting possible features and requiring fragile dependencies on private xarray internals. +A separate repository was chosen for speed of iteration, and to be able to more easily [make changes](https://github.com/xarray-contrib/datatree/blob/7ba05880c37f2371b5174f6e8dcfae31248fe19f/README.md#development-roadmap) without worrying as much about [backwards compatibility](https://github.com/pydata/xarray/issues/9854) as code in xarray's main repo does. However the separate repo meant that the prototype `datatree` library was not fully integrated with xarray's main codebase, limiting possible features and requiring fragile dependencies on private xarray internals. The prototype then sat there for 2 years, until NASA ESDIS approached the xarray core team in August 2023. ESDIS devs wanted the ability to work with entire hierarchical files, and had experimented with the prototype version of datatree, but they wanted datatree functionality to be migrated upstream into xarray's main repository so there would be more guarantees of long-term API stability and support. -Amazingly the NASA team were able to offer engineer time, so starting in early 2024 Owen, Matt, and Eni (NASA) worked on migrating datatree into xarray upstream, with regular supervision from Tom, Justus, and Stephan (existing xarray core devs). 
+Amazingly the NASA team were able to offer engineer time, so starting in late 2023 Owen, Matt, and Eni (NASA) worked on migrating datatree into xarray upstream, with regular supervision from Tom, Justus, and Stephan (existing xarray core devs). -This second stage of development allowed us to reduce the bus factor on the datatree code, sanity check the original approach, and it gave us a chance to make some signficant changes to the design without worrying too much about backwards-incompatibility (for example enabling the [new "coordinate inheritance" feature](https://docs.xarray.dev/en/stable/user-guide/hierarchical-data.html#alignment-and-coordinate-inheritance)). +This second stage of development allowed us to reduce the bus factor on the datatree code, sanity check the original approach, and it gave us a chance to make some significant improvements to the design without backwards-compatibility concerns (for example enabling the [new "coordinate inheritance" feature](https://docs.xarray.dev/en/stable/user-guide/hierarchical-data.html#alignment-and-coordinate-inheritance)). ## Lessons for future collaborations This development story is different from the more typical scientific grant funding model - how did that work out for us? -The scientific grant model for funding software expects you to present a full idea in a proposal, wait 6-12 months to hopefully get funding for it, then implement the whole thing during the grant period. In contrast datatree evolved over a gradual process of moving from ideas to hacky prototype to robust implementation, with big time gaps for user feedback and experimentation. The migration was completed by developer-users who actually wanted the feature, rather than grant awardees working in service of a separate and maybe-theoretical userbase. 
+The scientific grant model for funding software expects you to present a full idea in a proposal, wait 6-12 months to hopefully get funding for it, then implement the whole thing during the grant period. In contrast datatree evolved over a gradual process of moving from ideas to hacky prototype to robust implementation, with big time gaps for user feedback and experimentation. The migration was completed by developer-users who actually wanted the feature, rather than grant awardees working in service of a separate and maybe-only-theoretical userbase. -Overall while the migration effort took longer than anticipated we found it worked out quite well! +Overall while the migration effort took longer than anticipated we think it worked out quite well! ### Pros: -- **Literally zero overhead** - the existing xarray team did not to have to write a proposal to get developer time, and there was literally zero paperwork inflicted (on them at least). +- **Zero overhead** - the existing xarray team did not to have to write a proposal to get developer time, and there was literally zero paperwork inflicted (on them at least). - **Certainty of funding** - writing grant proposals is a lottery, so the time invested up front doesn't even come with any certainty of funding. Collaborating with another org has a much higher chance of actually leading to more money being available for developer time. - **Time efficient** - a xarray core dev spending 10% of their time directing someone who is less familiar with the codebase but has more time is an efficient use of relative expertise. -- **Bus factor** - the new contributors reduce the bus factor on the datatree code dramatically. +- **Bus factor** - the new contributors reduced the bus factor on the datatree code dramatically. - **User-driven Development** - it makes sense to have actual interested user communities involved in development. 
- **Stakeholder representation** - after officially adding Owen, Matt and Eni to the [xarray core team](https://xarray.dev/team), NASA ESDIS has some direct representation in, insider understanding of, and stake in continuing to support the xarray project. ### Cons: -- **Not everyone got direct funding** - it's less ideal that Tom, Justus, and Stephan didn't get direct funding for their supervisory work. In future it might be better to have one of the paid people at the contributing org already be a core xarray team member. +- **Not everyone got direct funding** - it's less ideal that Tom, Justus, and Stephan didn't get direct funding for their supervisory work. In future it might be better to have one of the paid people at the contributing org already be a core xarray team member, or perhaps find some way to pay them as a "consultant". - **Tricky to accurately scope** out the duration of required work in advance, and hard to "just ship it". We hold the xarray project to high standards and backwards compatibility promises so we want to ensure that any publicly released features don't compromise on quality. -This contributing model is more similar to how OSS has historically been supported by industry, but perhaps because xarray is primarily developed and used by the scientific community we tend to default to more grant-based funding models. +This contributing model is more similar to how open-source software has historically been supported by industry, but perhaps because xarray is primarily developed and used by the scientific community we tend to default to more grant-based funding models. -Overall this type of collaboration could work again in future! So if there is an xarray or xarray-adjacent feature your organisation would like to see, **please reach out to us**. +Overall we think this type of collaboration could work again in future! So if there is an xarray or xarray-adjacent feature your organisation would like to see, **please reach out to us**. 
-## Go try out `DataTree`! +## Go try out DataTree! -Please try datatree out! The hierarchical structure is potentially useful to any xarray users who work with more than one dataset at a time. +Please try datatree out! The hierarchical structure is potentially useful to any xarray users who work with more than one dataset at a time. Simply do -Be aware that as `xarray.DataTree` is still new there will likely be some bugs lurking, as well as as-yet [unimplemented features](https://github.com/pydata/xarray/issues?q=is%3Aissue+is%3Aopen+label%3Atopic-DataTree) (as there always are)! +```python +from xarray import DataTree +``` + +or + +```python +open_datatree(...) +``` +on a netCDF4 file / Zarr store containing multiple groups. + +Be aware that as `xarray.DataTree` is still new there will likely be some bugs lurking or places that performance could be improved, as well as as-yet [unimplemented features](https://github.com/pydata/xarray/issues?q=is%3Aissue+is%3Aopen+label%3Atopic-DataTree) (as there always are)! ## Thanks -A number of other people also [contributed to datatree](https://github.com/xarray-contrib/datatree/graphs/contributors) in various ways - particular shoutout to [Alfonso Ladino](https://github.com/aladinor) and [Etinenne Schalk](https://github.com/etienneschalk) for their dedicated attendance at many of the weekly migration meetings! +A number of other people also [contributed to datatree](https://github.com/xarray-contrib/datatree/graphs/contributors) in various ways - particular shoutout to [Alfonso Ladino](https://github.com/aladinor) and [Etinenne Schalk](https://github.com/etienneschalk) for their dedicated attendance at many of the [weekly migration meetings](https://github.com/pydata/xarray/issues/8747)! 
## Funding Acknowledgements From f771b81583ae52291658fae58c8dfa34e1084b9b Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Thu, 5 Dec 2024 15:17:23 +0000 Subject: [PATCH 12/24] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- src/posts/datatree/index.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 5cd43017..687668c7 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -90,17 +90,18 @@ Overall we think this type of collaboration could work again in future! So if th ## Go try out DataTree! -Please try datatree out! The hierarchical structure is potentially useful to any xarray users who work with more than one dataset at a time. Simply do +Please try datatree out! The hierarchical structure is potentially useful to any xarray users who work with more than one dataset at a time. Simply do ```python from xarray import DataTree ``` -or +or ```python open_datatree(...) ``` + on a netCDF4 file / Zarr store containing multiple groups. Be aware that as `xarray.DataTree` is still new there will likely be some bugs lurking or places that performance could be improved, as well as as-yet [unimplemented features](https://github.com/pydata/xarray/issues?q=is%3Aissue+is%3Aopen+label%3Atopic-DataTree) (as there always are)! 
From 95ce3e4b4134fd9565057c2190ce75a17ff7c69c Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Thu, 5 Dec 2024 10:21:40 -0500 Subject: [PATCH 13/24] fix bad link formatting --- src/posts/datatree/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 5cd43017..09f010a9 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -25,7 +25,7 @@ summary: 'The new xarray.DataTree class allows working with netCDF/Zarr groups, ## Why trees? -Xarray users have been [asking](https://github.com/pydata/xarray/issues/4118) for a way to handle multiple netCDF4 groups [since at least 2016](https://github.com/pydata/xarray/issues/1092). Such netCDF4/Zarr groups are the on-disk representation of a general problem of handling hierarchies of related but non-alignable array data. Real-world datasets (such as [climate model intercomparisons]https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114)) often fall into this category, and users wanted a way to work with such hierarchical data in-memory and a way to interact with it on disk. +Xarray users have been [asking](https://github.com/pydata/xarray/issues/4118) for a way to handle multiple netCDF4 groups [since at least 2016](https://github.com/pydata/xarray/issues/1092). Such netCDF4/Zarr groups are the on-disk representation of a general problem of handling hierarchies of related but non-alignable array data. Real-world datasets (such as [climate model intercomparisons](https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114)) often fall into this category, and users wanted a way to work with such hierarchical data in-memory and a way to interact with it on disk. ## What is a DataTree? 
From 9077fd9fbfd89c63d285bc1d2483cd5219429916 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Thu, 5 Dec 2024 10:22:54 -0500 Subject: [PATCH 14/24] comma --- src/posts/datatree/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 09f010a9..b8fef3a8 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -32,7 +32,7 @@ Xarray users have been [asking](https://github.com/pydata/xarray/issues/4118) fo Our solution is the new high-level container class `xarray.DataTree`. It acts like a tree of linked `xarray.Dataset` objects, with alignment enforced between variables in sibling nodes but not between parents and children. It can be written to and opened from formats containing multiple groups, such as netCDF4 files and Zarr stores. -For more details please see the [high-level description](https://docs.xarray.dev/en/stable/user-guide/data-structures.html#datatree) and the [dedicated page on hierarchical data](https://docs.xarray.dev/en/stable/user-guide/hierarchical-data.html), and the [section on IO with groups](https://docs.xarray.dev/en/stable/user-guide/io.html#groups) in the xarray documentation. +For more details please see the [high-level description](https://docs.xarray.dev/en/stable/user-guide/data-structures.html#datatree), the [dedicated page on hierarchical data](https://docs.xarray.dev/en/stable/user-guide/hierarchical-data.html), and the [section on IO with groups](https://docs.xarray.dev/en/stable/user-guide/io.html#groups) in the xarray documentation. 
## Deprecation From df9d8e711293d191bcabaa184261a41604d691d6 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Thu, 5 Dec 2024 10:24:00 -0500 Subject: [PATCH 15/24] clarify outdated module --- src/posts/datatree/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 74c66b87..ec2b0aeb 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -36,7 +36,7 @@ For more details please see the [high-level description](https://docs.xarray.dev ## Deprecation -If you previously had used the `DataTree` prototype in the [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree), that has now been archived and will no longer be supported. Instead we encourage you to migrate to the implementation of `DataTree` that you can import from xarray, following the [migration guide](https://github.com/pydata/xarray/blob/main/DATATREE_MIGRATION_GUIDE.md). +If you previously had used the `datatree.DataTree` prototype in the [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree), that has now been archived and will no longer be supported. Instead we encourage you to migrate to the implementation of `DataTree` that you can import from xarray, following the [migration guide](https://github.com/pydata/xarray/blob/main/DATATREE_MIGRATION_GUIDE.md). 
## Big moves From 1996ba832cd59be3bbc2a5f5a7144ac03dc6ea05 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Thu, 5 Dec 2024 10:25:44 -0500 Subject: [PATCH 16/24] advising --- src/posts/datatree/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index ec2b0aeb..35997892 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -74,7 +74,7 @@ Overall while the migration effort took longer than anticipated we think it work - **Zero overhead** - the existing xarray team did not have to write a proposal to get developer time, and there was literally zero paperwork inflicted (on them at least). - **Certainty of funding** - writing grant proposals is a lottery, so the time invested up front doesn't even come with any certainty of funding. Collaborating with another org has a much higher chance of actually leading to more money being available for developer time. -- **Time efficient** - a xarray core dev spending 10% of their time directing someone who is less familiar with the codebase but has more time is an efficient use of relative expertise. +- **Time efficient** - an xarray core dev spending 10% of their time advising someone who is less familiar with the codebase but has more time is an efficient use of relative expertise. - **Bus factor** - the new contributors reduced the bus factor on the datatree code dramatically. - **User-driven Development** - it makes sense to have actual interested user communities involved in development. - **Stakeholder representation** - after officially adding Owen, Matt and Eni to the [xarray core team](https://xarray.dev/team), NASA ESDIS has some direct representation in, insider understanding of, and stake in continuing to support the xarray project.
From 15c2030ef26d75621b77fd1833e8d392905ecf49 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Thu, 5 Dec 2024 10:27:44 -0500 Subject: [PATCH 17/24] demote code blocks to be inline --- src/posts/datatree/index.md | 14 +------------- 1 file changed, 1 insertion(+), 13 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 35997892..96e85223 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -90,19 +90,7 @@ Overall we think this type of collaboration could work again in future! So if th ## Go try out DataTree! -Please try datatree out! The hierarchical structure is potentially useful to any xarray users who work with more than one dataset at a time. Simply do - -```python -from xarray import DataTree -``` - -or - -```python -open_datatree(...) -``` - -on a netCDF4 file / Zarr store containing multiple groups. +Please try datatree out! The hierarchical structure is potentially useful to any xarray users who work with more than one dataset at a time. Simply do `from xarray import DataTree` or call [`open_datatree(...)`](https://docs.xarray.dev/en/stable/generated/xarray.open_datatree.html) on a netCDF4 file / Zarr store containing multiple groups. Be aware that as `xarray.DataTree` is still new there will likely be some bugs lurking or places that performance could be improved, as well as as-yet [unimplemented features](https://github.com/pydata/xarray/issues?q=is%3Aissue+is%3Aopen+label%3Atopic-DataTree) (as there always are)! 
From d9fbad4470f4ecac7f7972faf91adf875d398c71 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Thu, 5 Dec 2024 10:28:41 -0500 Subject: [PATCH 18/24] typo in Etienne's name --- src/posts/datatree/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 96e85223..9c76a259 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -96,7 +96,7 @@ Be aware that as `xarray.DataTree` is still new there will likely be some bugs l ## Thanks -A number of other people also [contributed to datatree](https://github.com/xarray-contrib/datatree/graphs/contributors) in various ways - particular shoutout to [Alfonso Ladino](https://github.com/aladinor) and [Etinenne Schalk](https://github.com/etienneschalk) for their dedicated attendance at many of the [weekly migration meetings](https://github.com/pydata/xarray/issues/8747)! +A number of other people also [contributed to datatree](https://github.com/xarray-contrib/datatree/graphs/contributors) in various ways - particular shoutout to [Alfonso Ladino](https://github.com/aladinor) and [Etienne Schalk](https://github.com/etienneschalk) for their dedicated attendance at many of the [weekly migration meetings](https://github.com/pydata/xarray/issues/8747)! 
## Funding Acknowledgements From 80a10d209158db9aa81ab66fb1127b31a0e4e252 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Thu, 5 Dec 2024 10:35:53 -0500 Subject: [PATCH 19/24] the ESDIS project --- src/posts/datatree/index.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 9c76a259..82d19dc2 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -21,7 +21,7 @@ summary: 'The new xarray.DataTree class allows working with netCDF/Zarr groups, ## tl;dr -[`xarray.DataTree`](https://docs.xarray.dev/en/stable/user-guide/data-structures.html#datatree) has been [released](https://github.com/pydata/xarray/discussions/9680) in [v2024.10.0](https://github.com/pydata/xarray/releases/tag/v2024.10.0), and the prototype [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree) archived, after collaboration between the xarray team and [NASA ESDIS](https://www.earthdata.nasa.gov/about/esdis). 🤝 +[`xarray.DataTree`](https://docs.xarray.dev/en/stable/user-guide/data-structures.html#datatree) has been [released](https://github.com/pydata/xarray/discussions/9680) in [v2024.10.0](https://github.com/pydata/xarray/releases/tag/v2024.10.0), and the prototype [`xarray-contrib/datatree` repository](https://github.com/xarray-contrib/datatree) archived, after collaboration between the xarray team and the [NASA ESDIS project](https://www.earthdata.nasa.gov/about/esdis). 🤝 ## Why trees? @@ -56,9 +56,9 @@ In the absence of dedicated funding for datatree, Tom then used some time whilst A separate repository was chosen for speed of iteration, and to be able to more easily [make changes](https://github.com/xarray-contrib/datatree/blob/7ba05880c37f2371b5174f6e8dcfae31248fe19f/README.md#development-roadmap) without worrying as much about [backwards compatibility](https://github.com/pydata/xarray/issues/9854) as code in xarray's main repo does. 
However the separate repo meant that the prototype `datatree` library was not fully integrated with xarray's main codebase, limiting possible features and requiring fragile dependencies on private xarray internals. -The prototype then sat there for 2 years, until NASA ESDIS approached the xarray core team in August 2023. ESDIS devs wanted the ability to work with entire hierarchical files, and had experimented with the prototype version of datatree, but they wanted datatree functionality to be migrated upstream into xarray's main repository so there would be more guarantees of long-term API stability and support. +The prototype then sat there for 2 years, until the NASA ESDIS team approached the xarray core team in August 2023. ESDIS devs wanted the ability to work with entire hierarchical files, and had experimented with the prototype version of datatree, but they wanted datatree functionality to be migrated upstream into xarray's main repository so there would be more guarantees of long-term API stability and support. -Amazingly the NASA team were able to offer engineer time, so starting in late 2023 Owen, Matt, and Eni (NASA) worked on migrating datatree into xarray upstream, with regular supervision from Tom, Justus, and Stephan (existing xarray core devs). +Amazingly the NASA team were able to offer engineer time, so starting in late 2023 Owen, Matt, and Eni (NASA) worked on migrating the prototype datatree into xarray upstream, with regular supervision from Tom, Justus, and Stephan (existing xarray team). This second stage of development allowed us to reduce the bus factor on the datatree code, sanity check the original approach, and it gave us a chance to make some significant improvements to the design without backwards-compatibility concerns (for example enabling the [new "coordinate inheritance" feature](https://docs.xarray.dev/en/stable/user-guide/hierarchical-data.html#alignment-and-coordinate-inheritance)). 
@@ -77,7 +77,7 @@ Overall while the migration effort took longer than anticipated we think it work - **Time efficient** - an xarray core dev spending 10% of their time advising someone who is less familiar with the codebase but has more time is an efficient use of relative expertise. - **Bus factor** - the new contributors reduced the bus factor on the datatree code dramatically. - **User-driven Development** - it makes sense to have actual interested user communities involved in development. -- **Stakeholder representation** - after officially adding Owen, Matt and Eni to the [xarray core team](https://xarray.dev/team), NASA ESDIS has some direct representation in, insider understanding of, and stake in continuing to support the xarray project. +- **Stakeholder representation** - after officially adding Owen, Matt and Eni to the [xarray core team](https://xarray.dev/team), the NASA ESDIS project has some direct representation in, insider understanding of, and stake in continuing to support the xarray project. ### Cons: @@ -100,5 +100,5 @@ A number of other people also [contributed to datatree](https://github.com/xarra ## Funding Acknowledgements -- Owen, Eni, and Matt were able to contribute development time thanks to NASA ESDIS. +- Owen, Eni, and Matt were able to contribute development time thanks to the [NASA ESDIS project](https://www.earthdata.nasa.gov/about/esdis). - Tom was supported first by the Gordon and Betty Moore foundation as part of Ryan Abernathey's [Climate Data Science Lab](https://ocean-transport.github.io/cds_lab.html) at Columbia University, then later by various funders for a fraction of his time through [[C]Worthy](https://www.cworthy.org/). 
From aec4c4cce8766973e350e091abb37680d095344d Mon Sep 17 00:00:00 2001 From: Tom Nicholas Date: Thu, 5 Dec 2024 12:42:01 -0700 Subject: [PATCH 20/24] Correct alignment model description Co-authored-by: Stephan Hoyer --- src/posts/datatree/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 82d19dc2..c1540f27 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -30,7 +30,7 @@ Xarray users have been [asking](https://github.com/pydata/xarray/issues/4118) fo ## What is a DataTree? Our solution is the new high-level container class `xarray.DataTree`. -It acts like a tree of linked `xarray.Dataset` objects, with alignment enforced between variables in sibling nodes but not between parents and children. It can be written to and opened from formats containing multiple groups, such as netCDF4 files and Zarr stores. +It acts like a tree of linked `xarray.Dataset` objects, with alignment enforced between parent and child nodes, but not between siblings. It can be written to and opened from formats containing multiple groups, such as netCDF4 files and Zarr stores. For more details please see the [high-level description](https://docs.xarray.dev/en/stable/user-guide/data-structures.html#datatree), the [dedicated page on hierarchical data](https://docs.xarray.dev/en/stable/user-guide/hierarchical-data.html), and the [section on IO with groups](https://docs.xarray.dev/en/stable/user-guide/io.html#groups) in the xarray documentation. 
From 9dc4aafacf9d331775fd8ac8cbc08feeec996580 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Wed, 18 Dec 2024 18:24:01 -0500 Subject: [PATCH 21/24] full affiliations --- src/posts/datatree/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index 82d19dc2..bf6a0d5f 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -58,7 +58,7 @@ A separate repository was chosen for speed of iteration, and to be able to more The prototype then sat there for 2 years, until the NASA ESDIS team approached the xarray core team in August 2023. ESDIS devs wanted the ability to work with entire hierarchical files, and had experimented with the prototype version of datatree, but they wanted datatree functionality to be migrated upstream into xarray's main repository so there would be more guarantees of long-term API stability and support. -Amazingly the NASA team were able to offer engineer time, so starting in late 2023 Owen, Matt, and Eni (NASA) worked on migrating the prototype datatree into xarray upstream, with regular supervision from Tom, Justus, and Stephan (existing xarray team). +Amazingly NASA were able to offer the time of 3 engineers: Owen (NASA EOSDIS Evolution and Development 3 (EED-3) contract), Matt (NASA National Snow and Ice Data Center Distributed Active Archive Center (NSIDC)), and Eni (Goddard Earth Sciences Data and Information Services Center (GES DISC)). So starting in late 2023 the NASA trio worked on migrating the prototype datatree into xarray upstream, with regular supervision from Tom, Justus, and Stephan (existing xarray core team). 
This second stage of development allowed us to reduce the bus factor on the datatree code, sanity check the original approach, and it gave us a chance to make some significant improvements to the design without backwards-compatibility concerns (for example enabling the [new "coordinate inheritance" feature](https://docs.xarray.dev/en/stable/user-guide/hierarchical-data.html#alignment-and-coordinate-inheritance)). @@ -100,5 +100,5 @@ A number of other people also [contributed to datatree](https://github.com/xarra ## Funding Acknowledgements -- Owen, Eni, and Matt were able to contribute development time thanks to the [NASA ESDIS project](https://www.earthdata.nasa.gov/about/esdis). +- Owen, Eni, and Matt were able to contribute development time as part of the [NASA ESDIS project](https://www.earthdata.nasa.gov/about/esdis). - Tom was supported first by the Gordon and Betty Moore foundation as part of Ryan Abernathey's [Climate Data Science Lab](https://ocean-transport.github.io/cds_lab.html) at Columbia University, then later by various funders for a fraction of his time through [[C]Worthy](https://www.cworthy.org/). From e09381c93aec62e754081b1005ce8c13a4e33524 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Wed, 18 Dec 2024 18:25:30 -0500 Subject: [PATCH 22/24] minor improvements to wording --- src/posts/datatree/index.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index bf6a0d5f..e38befbb 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -42,7 +42,7 @@ If you previously had used the `datatree.DataTree` prototype in the [`xarray-con This was a big feature addition! 
For a [decade](https://github.com/pydata/xarray/discussions/8462) there have been 3 core public xarray data structures, now there are 4: [`Variable`](https://docs.xarray.dev/en/stable/generated/xarray.Variable.html#xarray.Variable), [`DataArray`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html#xarray.DataArray), [`Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html#xarray.Dataset), and now [`DataTree`](https://docs.xarray.dev/en/stable/generated/xarray.DataTree.html#xarray.DataTree). -Datatree represents arguably one of the single largest new features added to xarray in 10 years - the migration of the existing prototype alone added >10k lines of code across [80 pull requests](https://github.com/pydata/xarray/pulls?q=is%3Apr+label%3Atopic-DataTree+is%3Aclosed), and the resulting datatree implementation now contains contributions from at least 25 people. +Datatree represents arguably one of the largest new features added to xarray in 10 years - the migration of the existing prototype alone added >10k lines of code across [80 pull requests](https://github.com/pydata/xarray/pulls?q=is%3Apr+label%3Atopic-DataTree+is%3Aclosed), and the resulting datatree implementation now contains contributions from at least 25 people. We also had to resolve some really [gnarly design questions](https://github.com/pydata/xarray/pull/9063) to make it work in a way we were happy with. @@ -81,8 +81,8 @@ Overall while the migration effort took longer than anticipated we think it work ### Cons: -- **Not everyone got direct funding** - it's less ideal that Tom, Justus, and Stephan didn't get direct funding for their supervisory work. In future it might be better to have one of the paid people at the contributing org already be a core xarray team member, or perhaps find some way to pay them as a "consultant". -- **Tricky to accurately scope** out the duration of required work in advance, and hard to "just ship it". 
We hold the xarray project to high standards and backwards compatibility promises so we want to ensure that any publicly released features don't compromise on quality. +- **Not everyone got direct funding** - It's less ideal that Tom, Justus, and Stephan didn't get direct funding for their supervisory work. In future it might be better to have one of the paid people at the contributing org already be a core xarray team member, or perhaps find some way to pay them as a consultant. +- **Tricky to accurately scope** - The duration of required work was tricky to estimate in advance, and we didn't want to "just ship it". We hold the xarray project to high standards and backwards compatibility promises so we want to ensure that any publicly released features don't compromise on quality. This contributing model is more similar to how open-source software has historically been supported by industry, but perhaps because xarray is primarily developed and used by the scientific community we tend to default to more grant-based funding models. From 22264dc4917a9e6e1aaffd50e67cc081087f138c Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Wed, 18 Dec 2024 18:42:11 -0500 Subject: [PATCH 23/24] update motivation as per Stephan's comments --- src/posts/datatree/index.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index e38befbb..d37457a1 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -25,7 +25,15 @@ summary: 'The new xarray.DataTree class allows working with netCDF/Zarr groups, ## Why trees? -Xarray users have been [asking](https://github.com/pydata/xarray/issues/4118) for a way to handle multiple netCDF4 groups [since at least 2016](https://github.com/pydata/xarray/issues/1092). Such netCDF4/Zarr groups are the on-disk representation of a general problem of handling hierarchies of related but non-alignable array data. 
Real-world datasets (such as [climate model intercomparisons](https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114)) often fall into this category, and users wanted a way to work with such hierarchical data in-memory and a way to interact with it on disk. +DataTree allows for organizing heterogeneous collections of scientific data in the same way that a nested directory structure facilitates organizing large numbers of files on disk. It does so in a way that preserves common structure between data in the collections, such as aligned arrays and common coordinates. + +For those familiar with netCDF4/Zarr groups, DataTree can also be thought of as an in-memory representation of a file's group structure. Xarray users have been [asking](https://github.com/pydata/xarray/issues/4118) for a way to handle multiple netCDF4 groups [since at least 2016](https://github.com/pydata/xarray/issues/1092)! + +DataTree enables xarray to be used for various new use cases, including: + +- [Climate model intercomparisons](https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114), +- Multi-scale image pyramids, e.g. in [genomics](https://spatialdata.scverse.org/en/latest/design_doc.html), +- Organising heterogeneous data, such as satellite observations and model simulations. ## What is a DataTree? From cbdcd335998c1269545ea2337da51e51bdc27553 Mon Sep 17 00:00:00 2001 From: TomNicholas Date: Wed, 18 Dec 2024 18:47:18 -0500 Subject: [PATCH 24/24] make first two sections flow into one another better --- src/posts/datatree/index.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/src/posts/datatree/index.md b/src/posts/datatree/index.md index b6ead33c..5707cc59 100644 --- a/src/posts/datatree/index.md +++ b/src/posts/datatree/index.md @@ -25,20 +25,20 @@ summary: 'The new xarray.DataTree class allows working with netCDF/Zarr groups, ## Why trees? 
-DataTree allows for organizing heterogeneous collections of scientific data in the same way that a nested directory structure facilitates organizing large numbers of files on disk. It does so in a way that preserves common structure between data in the collections, such as aligned arrays and common coordinates. +The DataTree concept allows for organizing heterogeneous collections of scientific data in the same way that a nested directory structure facilitates organizing large numbers of files on disk. It does so in a way that preserves common structure between data in the collections, such as aligned arrays and common coordinates. -For those familiar with netCDF4/Zarr groups, DataTree can also be thought of as an in-memory representation of a file's group structure. Xarray users have been [asking](https://github.com/pydata/xarray/issues/4118) for a way to handle multiple netCDF4 groups [since at least 2016](https://github.com/pydata/xarray/issues/1092)! +For those familiar with netCDF4/Zarr groups, a DataTree can also be thought of as an in-memory representation of a file's group structure. Xarray users have been [asking](https://github.com/pydata/xarray/issues/4118) for a way to handle multiple netCDF4 groups [since at least 2016](https://github.com/pydata/xarray/issues/1092)! DataTree enables xarray to be used for various new use cases, including: - [Climate model intercomparisons](https://medium.com/pangeo/easy-ipcc-part-1-multi-model-datatree-469b87cf9114), - Multi-scale image pyramids, e.g. in [genomics](https://spatialdata.scverse.org/en/latest/design_doc.html), - Organising heterogeneous data, such as satellite observations and model simulations. +- Simple and convenient access to entire hierarchical files. -## What is a DataTree? +## What is a DataTree exactly? -Our solution is the new high-level container class `xarray.DataTree`. 
-It acts like a tree of linked `xarray.Dataset` objects, with alignment enforced between parent and child nodes, but not between siblings. It can be written to and opened from formats containing multiple groups, such as netCDF4 files and Zarr stores. +The new high-level container class `xarray.DataTree` acts like a tree of linked `xarray.Dataset` objects, with alignment enforced between arrays in parent and child nodes, but not between those in sibling nodes. It can be written to and opened from formats containing multiple groups, such as netCDF4 files and Zarr stores. For more details please see the [high-level description](https://docs.xarray.dev/en/stable/user-guide/data-structures.html#datatree), the [dedicated page on hierarchical data](https://docs.xarray.dev/en/stable/user-guide/hierarchical-data.html), and the [section on IO with groups](https://docs.xarray.dev/en/stable/user-guide/io.html#groups) in the xarray documentation.