Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove __len__ from CombinedStreamingDataset #19321

Merged
merged 2 commits into from
Jan 24, 2024
Merged

Conversation

awaelchli
Copy link
Contributor

@awaelchli awaelchli commented Jan 20, 2024

What does this PR do?

The combined dataset is an iterable dataset that samples randomly from a list of datasets. Since the termination is non-deterministic, and the datasets can have different lengths, it is not meaningful to define a length. As far as I know, there is no code that depends on this definition, so it's best to remove it.


📚 Documentation preview 📚: https://pytorch-lightning--19321.org.readthedocs.build/en/19321/

cc @Borda

@awaelchli awaelchli added the data (external) litdata package label Jan 20, 2024
@awaelchli awaelchli added this to the 2.1.x milestone Jan 20, 2024
@awaelchli awaelchli marked this pull request as ready for review January 21, 2024 15:21
@awaelchli awaelchli requested a review from tchaton as a code owner January 21, 2024 15:21
Copy link
Contributor

github-actions bot commented Jan 21, 2024

⚡ Required checks status: All passing 🟢

Groups summary

🟢 lightning_data: CPU workflow
Check ID Status
data-cpu (macOS-11, lightning, 3.10, 2.1) success
data-cpu (ubuntu-20.04, lightning, 3.10, 2.1) success
data-cpu (windows-2022, lightning, 3.10, 2.1) success

These checks are required after the changes to src/lightning/data/streaming/combined.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning/data/streaming/combined.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.11) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.11) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.11) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.11) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.11) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.11) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.11) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.11) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.11) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.11) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.11) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.11) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.11) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.11) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.11) success

These checks are required after the changes to src/lightning/data/streaming/combined.py.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

@tchaton
Copy link
Contributor

tchaton commented Jan 21, 2024

@awaelchli we could make the termination deterministic by breaking when the len has been reached. For very large datasets, it is good to nice to know the len of the combined dataset.

@awaelchli awaelchli added the fun Staff contributions outside working hours - to differentiate from the "community" label label Jan 23, 2024
@awaelchli
Copy link
Contributor Author

make the termination deterministic by breaking when the len has been reached

The termination is deterministic, since we use a seed as input. You could only know the length and how many times each dataset gets called if you know in advance how many times the user calls next() on the combined iterator. But this is not the case.

For very large datasets, it is good to nice to know the len of the combined dataset.

I argue that the user has access to the list of datasets and can compute the total length themselves. I think providing a length that is inaccurate is not a nice feature. Iterable datasets have no requirement to define a length because of exactly the type of sampling we are dealing with here.

@mergify mergify bot added the ready PRs ready to be merged label Jan 24, 2024
@awaelchli awaelchli merged commit 71bfdc3 into master Jan 24, 2024
102 checks passed
@awaelchli awaelchli deleted the data/combined-length branch January 24, 2024 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data (external) litdata package fun Staff contributions outside working hours - to differentiate from the "community" label ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants