Lazily generate batches #112

Merged
merged 4 commits on Oct 21, 2022
Conversation

@tjvandal (Contributor)

Generating batches in `__init__` is slow and memory intensive, related to #111 and #109. Initialization is changed to load indices into memory rather than the corresponding datasets. With this change, a 1.7 TB dataset with 1M+ samples and 30+ spatial features initializes in about 10 seconds, whereas it previously overloaded memory.

Rather than filling `_batches` with DataArrays and Datasets, it is filled with index selectors built as `selector = {key: slice for key, slice in zip(dims, slices)}`. This required updating the `concat_input_dims` option so that the concatenation operation is now done in `__getitem__()`. It is possible that this change decreases performance when `concat_input_dims=True`.
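For illustration, the core pattern looks roughly like the sketch below. The class name `LazyBatchGenerator`, its constructor signature, and the non-overlapping-window assumption are all illustrative, not xbatcher's actual API:

```python
import itertools

import numpy as np
import xarray as xr


class LazyBatchGenerator:
    """Sketch: store {dim: slice} selectors at init, slice on access."""

    def __init__(self, ds, input_dims):
        self.ds = ds
        dims = list(input_dims)
        sizes = [input_dims[d] for d in dims]
        # Cheap part: enumerate non-overlapping window origins per dimension.
        starts = [range(0, ds.sizes[d] - s + 1, s) for d, s in zip(dims, sizes)]
        # _batches holds only lightweight selector dicts, never sliced data.
        self._batches = [
            {d: slice(o, o + s) for d, o, s in zip(dims, origin, sizes)}
            for origin in itertools.product(*starts)
        ]

    def __len__(self):
        return len(self._batches)

    def __getitem__(self, idx):
        # Expensive part deferred to access time: the actual isel() slicing
        # (and, in the real generator, any concat_input_dims handling).
        return self.ds.isel(self._batches[idx])


# Construction touches no array data, so it stays fast for huge datasets.
ds = xr.Dataset({"tas": (("time", "x"), np.random.rand(100, 16))})
gen = LazyBatchGenerator(ds, {"time": 10})
print(len(gen), dict(gen[0].sizes))  # 10 {'time': 10, 'x': 16}
```

Because each entry in `_batches` is just a dict of Python slices, initialization cost scales with the number of batches rather than with the size of the underlying arrays; the data access itself is deferred until a batch is requested.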

@weiji14 mentioned this pull request on Oct 21, 2022
tjvandal and others added 2 commits October 21, 2022 11:31
Co-authored-by: Joe Hamman <[email protected]>

codecov-commenter commented Oct 21, 2022

Codecov Report

Merging #112 (f6535f5) into main (4d8e2c8) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #112   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            5         5           
  Lines          190       192    +2     
  Branches        35        35           
=========================================
+ Hits           190       192    +2     
Impacted Files Coverage Δ
xbatcher/generators.py 100.00% <100.00%> (ø)


@maxrjones (Member) left a comment

LGTM after Joe's last comment. Thanks @tjvandal for the contribution and @jhamman for the code review!

@jhamman (Contributor) commented Oct 21, 2022

I made 3 small one-liner comments/suggestions, but this looks rad. It's right along the lines of some ideas that @maxrjones and I discussed a while back about schematizing the generation of batch selectors rather than eagerly generating batches. This is a great step in that direction. Thanks @tjvandal!

Labels: enhancement (New feature or request)
4 participants