
Generate batches lazily #111

@maxrjones


Is your feature request related to a problem?

This relates to several existing issues in the xbatcher repository, including #30, #37, and #109. As mentioned in #109 (comment), iterating over the entire dataset to build the batches can be prohibitively slow and memory intensive for large datasets.

Describe the solution you'd like

Based on past discussions with @jhamman as well as TJ’s comment, we should consider a solution in which the batch generator lazily constructs batches without iterating through the entire Dataset/DataArray. One key design decision we should reconsider is whether the `_batches` dictionary needs to store the sliced Datasets/DataArrays. Alternatively, as TJ and Joe have mentioned, the `_batches` attribute could store only the indices associated with each batch, and the Dataset/DataArray would be constructed when `__getitem__()` or `__iter__()` is called. In the simplest case, I expect the indices could be stored as a dict of (dimension, slice) pairs that can be unpacked and passed to `.isel()`, e.g.:

```python
selector = {dim: slc for dim, slc in zip(dims, slices)}
batch = ds.isel(**selector)  # ds is the underlying Dataset/DataArray
```

As another possibility, it could be worthwhile to explore whether a custom index for xarray can be used to define batches.
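The idea above can be sketched as a generator that stores only per-dimension slice selectors and materializes nothing until indexed. This is a minimal illustration of the lazy approach, not xbatcher's actual implementation; the `LazyBatchGenerator` name, the `dim_sizes` argument, and the row-major unraveling of the batch index are all assumptions for the sketch.

```python
import math


class LazyBatchGenerator:
    """Sketch: yield {dim: slice} selectors on demand instead of
    pre-slicing the dataset into a `_batches` dict of Datasets."""

    def __init__(self, dim_sizes, input_dims):
        self.dim_sizes = dim_sizes      # e.g. dict(ds.sizes)
        self.input_dims = input_dims    # {dim: batch length along dim}
        # number of whole batches along each input dim (remainder dropped)
        self._counts = {d: dim_sizes[d] // n for d, n in input_dims.items()}

    def __len__(self):
        return math.prod(self._counts.values())

    def __getitem__(self, idx):
        # Unravel the flat batch index into a per-dimension position,
        # then build the selector; nothing is sliced until the caller
        # passes this to ds.isel(**selector).
        selector = {}
        for dim in reversed(list(self.input_dims)):
            idx, pos = divmod(idx, self._counts[dim])
            length = self.input_dims[dim]
            selector[dim] = slice(pos * length, (pos + 1) * length)
        return selector


gen = LazyBatchGenerator({"x": 10, "y": 6}, {"x": 5, "y": 3})
print(len(gen))   # four batches, but no data touched yet
print(gen[0])     # selector for the first batch
```

Storing selectors instead of sliced objects keeps memory proportional to the number of batches rather than to the data volume, and `__iter__` falls out for free from `__len__`/`__getitem__`.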

One complicating factor is that the indices associated with a batch depend on `concat_input_dims`, such that two schemas seem to be required for defining the coordinates/dimensions associated with a batch. Similar to the suggestion in #93 (comment), I wonder if batch generation for concatenated input dimensions should be separated from the core `BatchGenerator` class.

Describe alternatives you've considered

Optimize the code under the assumption that all batches will be generated eagerly when creating a `BatchGenerator` object, for example by stacking the non-input dims before iterating over `batch_dims` and `input_dims`. This would still not scale well to large datasets.
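For concreteness, the eager stacking alternative could look roughly like the following NumPy sketch (the `(time, y, x)` shape, the flat `sample` axis, and the batch size are purely illustrative, and the up-front materialization is exactly why this does not scale):

```python
import numpy as np

# Hypothetical layout: a (time, y, x) array where batches are taken
# along `time` and y/x are non-input sample dims.
data = np.arange(4 * 3 * 2).reshape(4, 3, 2)

# Eagerly stack the non-input dims into one flat `sample` axis up front,
# then slice all batches along `time` in a single pass.
stacked = data.reshape(data.shape[0], -1)   # (time, sample) == (4, 6)
batch_size = 2
batches = [stacked[i:i + batch_size]
           for i in range(0, stacked.shape[0], batch_size)]
```

Every batch exists in memory as soon as the list comprehension runs, which is the behavior the lazy proposal above is meant to avoid.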

Additional context

This relates to #30 because I expect that the proposal to define a batch based on the indexes will be simpler if samples are sliced first and then combined into batches, rather than the current behavior of slicing samples after slicing the dataset into batches.

Labels: enhancement (New feature or request)