Remove __len__ from CombinedStreamingDataset #19321
@awaelchli we could make the termination deterministic by breaking when the len has been reached. For very large datasets, it is nice to know the len of the combined dataset.
The termination is deterministic, since we use a seed as input. You could only know the length and how many times each dataset gets called if you know in advance how many times the user calls next() on the combined iterator. But this is not the case.
I argue that the user has access to the list of datasets and can compute the total length themselves. I think providing a length that is inaccurate is not a nice feature. Iterable datasets have no requirement to define a length because of exactly the type of sampling we are dealing with here.
What does this PR do?
The combined dataset is an iterable dataset that samples randomly from a list of datasets. Since the termination is non-deterministic, and the datasets can have different lengths, it is not meaningful to define a length. As far as I know, there is no code that depends on this definition, so it's best to remove it.
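To illustrate why a fixed length is hard to define, here is a minimal sketch in plain Python. It is not the actual `CombinedStreamingDataset` implementation: `combined_iter` is a hypothetical stand-in, and it assumes iteration stops as soon as the randomly picked dataset is exhausted. Under that assumption, the number of yielded items depends on the seed, while the sum of the individual lengths is only an upper bound.

```python
import random
from typing import Iterator, Sequence


def combined_iter(datasets: Sequence[Sequence[int]], seed: int) -> Iterator[int]:
    """Hypothetical stand-in: pick a random dataset at each step.

    Assumption: iteration ends when the picked dataset has no items left.
    """
    rng = random.Random(seed)
    iterators = [iter(d) for d in datasets]
    while True:
        idx = rng.randrange(len(iterators))
        try:
            yield next(iterators[idx])
        except StopIteration:
            # The sampled dataset is exhausted, so the combined stream ends.
            return


# Two toy datasets with different lengths.
datasets = [list(range(10)), list(range(100, 103))]

# Count how many items the combined iterator yields for several seeds.
counts = {seed: sum(1 for _ in combined_iter(datasets, seed)) for seed in range(20)}

# What a user could compute themselves from the list of datasets.
total = sum(len(d) for d in datasets)

print(sorted(set(counts.values())), "upper bound:", total)
```

For a fixed seed the count is reproducible, but it generally differs between seeds, so any single `__len__` value would be inaccurate for most runs.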
📚 Documentation preview 📚: https://pytorch-lightning--19321.org.readthedocs.build/en/19321/
cc @Borda