
Repo layout proposal #10089

Open
max-sixty opened this issue Mar 2, 2025 · 16 comments

Comments

@max-sixty
Collaborator

max-sixty commented Mar 2, 2025

What is your issue?

As part of the efforts described in #10039, I added #10088, and noticed the repo layout has arguably not kept up with the code growth over the past decade. This isn't the most pressing issue, but it does lower the returns on refactoring, since we're moving lines from 11K-LOC files to 1K-LOC files, rather than anything smaller.

(Even if you think LLMs aren't that useful / aren't going to get better / etc; these changes would still make the repo easier for people to navigate...)

In particular, about 2/3 of our code is in xarray/core — 66,873 LOC out of the 97,118 LOC in xarray.

I can imagine splitting this up into a few categories:

  • compat — dask_array_*, npcompat, pdcompat, array_api_compat
  • compute / computation — computation, arithmetic, nanops, weighted, the curvefit that's currently in dataset, rolling, rolling_exp, maybe missing
  • reshape / align / merge (need a better name) — merge, alignment, concat

I'd propose having each of those be paths within xarray/. Then there's more freedom to make new files within those paths, compared with the current state, where adding a new file means appending to an already very long list of files in xarray/core.

I'm not sure how much disruption this would cause to existing PRs. I think if we land these as commits which mostly just move the files, then git will mostly handle merges well. We can start slowly and see how it goes...

@max-sixty max-sixty added the needs triage Issue that has not been reviewed by xarray team member label Mar 2, 2025
@dcherian dcherian removed the needs triage Issue that has not been reviewed by xarray team member label Mar 3, 2025
@dcherian
Contributor

dcherian commented Mar 3, 2025

I agree. One specific suggestion: breaking apply_ufunc.py out of computation.py is easy, and shouldn't be disruptive.
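To see why such a move needn't be disruptive internally, here is a self-contained toy of the re-export idea: the old module keeps a one-line alias to the moved function, so imports via the old path still resolve. The module names (`computation.apply_ufunc`, `core.computation`) mirror the discussion but are stand-ins, not the real xarray modules.

```python
import sys
import types

# Stand-in for the new, smaller module after the split (hypothetical name).
apply_ufunc_mod = types.ModuleType("computation.apply_ufunc")

def apply_ufunc(func, *args):
    """Toy stand-in for the real apply_ufunc."""
    return func(*args)

apply_ufunc_mod.apply_ufunc = apply_ufunc
sys.modules["computation.apply_ufunc"] = apply_ufunc_mod

# The old module keeps a one-line re-export, so any existing internal
# imports through the old path continue to work unchanged.
old_computation = types.ModuleType("core.computation")
old_computation.apply_ufunc = apply_ufunc
sys.modules["core.computation"] = old_computation
```

Both paths then expose the same function object, so nothing downstream needs to change immediately.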

@TomNicholas
Member

I also support this.

reshape / align / merge (need a better name)

manipulation? restructuring?

There's another opportunity for refactoring that could split up large files in #9203.

Related is the general issue of scope creep within the main repository. I think at some point we should revisit the idea of splitting out as much non-core functionality as possible into a separate package (there was a very old issue about this that I'm struggling to find right now — proposing "xr-scipy"). Distinguishing between crucial things such as apply_ufunc and other things that currently live in computation.py, such as curvefit, would be an important step towards that.

@benbovy
Member

benbovy commented Mar 11, 2025

(Even if you think LLMs aren't that useful / aren't going to get better / etc; these changes would still make the repo easier for people to navigate...)

This would also improve the performance of real-time developer tools like LSP servers, linters, auto-formatters, tree-sitter, etc. (EDIT: it's not the reworked repo layout that helps here, but the reduced per-file LOC).

One small suggestion: xarray/core/indexes.py could probably be moved into xarray/indexes, which already exists.

max-sixty added a commit to max-sixty/xarray that referenced this issue Mar 15, 2025
Start of pydata#10089

Move compatibility-related modules from xarray.core to new xarray.compat package:
- array_api_compat.py
- dask_array_compat.py
- dask_array_ops.py
- npcompat.py
- pdcompat.py
max-sixty added a commit that referenced this issue Mar 15, 2025
Start of #10089

Move compatibility-related modules from xarray.core to new xarray.compat package:
- array_api_compat.py
- dask_array_compat.py
- dask_array_ops.py
- npcompat.py
- pdcompat.py
@max-sixty
Collaborator Author

To what extent do we need to maintain backward-compat? (I would obv prefer to just move files and not build stub layers, but if there are other libs relying on these file locations, rather than just importing from the top level, then we can add them...)

@TomNicholas
Member

TomNicholas commented Mar 15, 2025

Any library not importing from the top level is using private API, which has no backwards compatibility guarantees. That's what our docs say.

We have actually made significant changes to the locations of code multiple times in the past couple of years (especially as part of the NamedArray refactor, DataTree integration, and adding the ChunkManager entrypoint), with no real complaints.

@max-sixty
Collaborator Author

So far, I've:

  • Moved a bunch of files to xarray/computation
  • Moved a bunch of files to xarray/structure
  • Moved a bunch of files to xarray/compat
  • Moved apply_ufunc to its own file in xarray/computation

I'll probably let that sit for a while and see how it goes.

Some things we could do next:

  • Take some computation code out of dataset.py; likely make a computation/curvefit.py
  • Refine some of the groupings — for example, there's a unify_chunks function in computation/computation.py, but also a core/chunks.py. Maybe we want those in the same place... More generally I'm sure there are other things, either caused by me or previously existing, which could be improved.
  • Start breaking up the test files — this is fairly easy; we just need to make an xarray/tests/{dataset,dataarray} path and create some groupings within the files there

The more difficult thing is going to be breaking up dataset.py; it's currently 10K lines long, and is basically just Dataset now. We can probably thin it by 2K lines by extracting some of the code in methods. But to make real progress, given that classes can't be split across multiple files in Python, we would need to do something like DatasetSel / DatasetReduce, and then have Dataset inherit from them. (Unless there's a cleverer approach?)
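The DatasetSel / DatasetReduce idea could be sketched like this — toy classes with hypothetical names from the paragraph above, not xarray's actual internals; each mixin would live in its own module:

```python
class DatasetSel:
    """Selection methods (would live in e.g. a dataset/sel.py)."""

    def sel(self, *names):
        # Toy selection: pick out the named variables.
        return {k: self._vars[k] for k in names}


class DatasetReduce:
    """Reduction methods (would live in e.g. a dataset/reduce.py)."""

    def mean(self):
        # Toy reduction over all variables.
        vals = list(self._vars.values())
        return sum(vals) / len(vals)


class Dataset(DatasetSel, DatasetReduce):
    """The public class just composes the per-file mixins."""

    def __init__(self, variables):
        self._vars = dict(variables)
```

The trade-off is that method definitions move away from the class they belong to, and the mixins silently depend on attributes (here `_vars`) that only `Dataset` defines.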

The alternative is to do nothing and wait for LLMs to get better with big files (Claude Code can already grab segments of files, for example). Many of the annoyances of working with huge files, like black taking a few seconds on every save, have gone away with tools like ruff, so maybe waiting is OK.

@TomNicholas
Member

That all sounds great @max-sixty . I think you can be relatively aggressive about this, and I'm happy to review things.

The more difficult thing is going to be breaking up dataset.py; currently 10K lines long, and is basically just Dataset now.
...
(Unless there's a cleverer approach?)

I asked Claude for ideas but several of them require doing things dynamically upon import.

@benbovy
Member

benbovy commented Mar 17, 2025

Results from a quick search for `from xarray.core` across all of GitHub, just in case.

@max-sixty
Collaborator Author

great idea @benbovy !

I don't see anything too concerning in the first few results — mostly hits for dataarray & dataset...

@TomNicholas
Member

Wow this is interesting!

So many people importing classes from xarray.core that are available in the top-level xarray namespace 🤦

Quite a few hits for things in the formatting/formatting_html modules.

A lot of the rest seem to be for the purposes of type hinting (e.g. NestedSequence, which is only ever used to type hint a public argument to xr.combine_nested), or importing error classes that they presumably want to catch.

I agree that overall this doesn't look too concerning.

@mathause
Collaborator

  • It would maybe be nice to have a public xr.typing?
  • Slightly related: moving generated files (e.g. _typed_ops.py) into a separate folder could make it more obvious that they are generated.

@TomNicholas
Member

The tests need to be reorganized too. A particular offender is test_backends.py, which is >6600 lines long and tests multiple different backends with different dependencies. It also doesn't include the datatree IO tests, which are in a different file. Refactoring this might require more than just copy-pasting, though.

cc @dcherian

@TomNicholas
Member

TomNicholas commented Mar 20, 2025

Also @max-sixty what do you think of having some kind of data_structures grouping? I would like to have the code that defines the main xarray classes (i.e. Variable, Coordinates, DataArray, Dataset, DataTree) all live together, separate from all the details of how they are indexed or missing data or dtypes or anything like that.

But I'm not sure how to name it - would it be core/data_structures/dataset.py? But in this model core is then not really core at all, it's a collection of ancillary utilities for the real core classes...

@shoyer
Member

shoyer commented Mar 21, 2025

The more difficult thing is going to be breaking up dataset.py; currently 10K lines long, and is basically just Dataset now.
...
(Unless there's a cleverer approach?)

I asked Claude for ideas but several of them require doing things dynamically upon import.

The second approach, which Claude calls "2. Module-level functions with __all__", looks like the way to go to me.

I don't think you need __all__, though. Something like the following should work:

# xarray/core/curvefitting.py
from __future__ import annotations  # keep the forward reference lazy

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # avoid recursive import
    from xarray.core.dataset import Dataset

def curvefit(self) -> Dataset:  # or use typing.Self
    """Curve fitting optimization for arbitrary functions."""
    ...

# xarray/core/dataset.py
from . import curvefitting

class Dataset:
    curvefit = curvefitting.curvefit
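Here is a runnable, self-contained version of the same pattern, collapsed into one file: the method body is a plain function (which would live in its own module), attached to the class as an attribute. The curve-fitting logic is just a stub standing in for the real code.

```python
def curvefit(self):
    """Stub standing in for the real curve-fitting code."""
    return f"fit({self.name})"


class Dataset:
    def __init__(self, name):
        self.name = name

    # A plain function assigned as a class attribute becomes an ordinary
    # method: `self` is bound on attribute access, just as with `def`.
    curvefit = curvefit
```

This keeps `Dataset` itself as the single source of truth for its method list, while the implementations can live anywhere.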

@max-sixty
Collaborator Author

max-sixty commented Mar 24, 2025

Also @max-sixty what do you think of having some kind of data_structures grouping? I would like to have the code that defines the main xarray classes (i.e. Variable, Coordinates, DataArray, Dataset, DataTree) all live together, separate from all the details of how they are indexed or missing data or dtypes or anything like that.

I don't have a strong view on the exact layout

one option — given we don't have many data structures and would like to split up Dataset, I could imagine an xarray/dataset path which contains the various parts of Dataset, assembled into the Dataset class (like the example above) in an xarray/dataset/dataset.py or xarray/dataset/__init__.py file.

then similarly for the other data structures

core would then gradually & happily devolve...

@max-sixty
Collaborator Author

separate from all the details of how they are indexed or missing data or dtypes or anything like that.

what did you mean by this @TomNicholas ? that there is code which we apply to multiple data structures? I would have thought that then lives in xarray/computation, for example?

or is there code which is data structure related but exists across data structures?

max-sixty added a commit to max-sixty/xarray that referenced this issue Mar 25, 2025
max-sixty added a commit that referenced this issue Mar 25, 2025
* Move chunks-related functions to a new file

Part of #10089

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci


---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
max-sixty added a commit to max-sixty/xarray that referenced this issue Mar 25, 2025
max-sixty added a commit that referenced this issue Mar 25, 2025
* Move fit computation code to dedicated new file

Part of #10089

* .

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>