Lazily generate batches #112
Co-authored-by: Joe Hamman <[email protected]>
Codecov Report
@@            Coverage Diff            @@
##              main      #112   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files            5         5
  Lines          190       192    +2
  Branches        35        35
=========================================
+ Hits           190       192    +2
I made 3 small one-liner comments/suggestions, but this looks rad. It's right along the lines of some ideas that @maxrjones and I discussed a while back about schematizing the generation of batch selectors rather than eagerly generating batches. This is a great step in that direction. Thanks @tjvandal!
Generating batches in `__init__` is slow and memory intensive (related to #111 and #109). Initialization now loads indices into memory rather than the corresponding datasets. This change enabled the initialization of a 1.7 TB dataset with 1M+ samples and 30+ spatial features in about 10 seconds, which previously overloaded memory.

Rather than filling `_batches` with `DataArrays` and `Datasets`, it is now filled with indices from `selector = {key: slice for key, slice in zip(dims, slices)}`. This required an update to the `concat_input_dims` option, where the operation is now done in `__getitem__()`. It is possible that this change decreases performance when `concat_input_dims=True`.
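For readers skimming the thread, here is a minimal sketch of the idea described above: store only per-batch selectors at construction time and defer any data access to `__getitem__()`. It is not the code in this PR. The class `LazyBatchIndex`, the helper `iter_slices`, and the dimension sizes are hypothetical names and values chosen for illustration, and the sketch uses only the standard library so that it runs without a dataset.

```python
import itertools


def iter_slices(size, length):
    """Return non-overlapping slices of `length` along a dimension of `size`."""
    # A trailing partial window is dropped so every batch has the same shape.
    return [slice(start, start + length) for start in range(0, size - length + 1, length)]


class LazyBatchIndex:
    """Hypothetical stand-in for a batch generator that stores only selectors."""

    def __init__(self, dim_sizes, input_dims):
        self.dims = list(input_dims)
        per_dim = [iter_slices(dim_sizes[d], input_dims[d]) for d in self.dims]
        # Only small dicts of slice objects are kept; no data is read here.
        self._batches = [
            {key: slice_ for key, slice_ in zip(self.dims, slices)}
            for slices in itertools.product(*per_dim)
        ]

    def __len__(self):
        return len(self._batches)

    def __getitem__(self, idx):
        # Assumption: with xarray, the data would be materialized at this
        # point, for example with ds.isel(**selector).
        return self._batches[idx]


# Illustrative sizes only: a 100 x 50 grid cut into 10 x 25 windows.
index = LazyBatchIndex({"x": 100, "y": 50}, {"x": 10, "y": 25})
first_selector = index[0]
```

Because `_batches` holds nothing but slice objects, its memory footprint scales with the number of batches rather than with the size of the underlying data. The cost is that each `__getitem__()` call now pays for the selection, and for any concatenation, at access time.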