Remove `len` from CombinedStreamingDataset #19321

awaelchli · 2024-01-20T13:31:35Z

What does this PR do?

The combined dataset is an iterable dataset that samples randomly from a list of datasets. Since the termination is non-deterministic, and the datasets can have different lengths, it is not meaningful to define a length. As far as I know, there is no code that depends on this definition, so it's best to remove it.
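To illustrate the point, here is a minimal sketch, assuming a simplified sampler. The class `ToyCombinedDataset` and the helper `count_items` are hypothetical and are not the code in `combined.py`. The sketch also assumes iteration stops as soon as the randomly chosen dataset is exhausted, so the number of yielded samples depends on the random draws rather than on a fixed length.

```python
import random
from typing import Iterable, Iterator, List


class ToyCombinedDataset:
    """Hypothetical stand-in for a combined iterable dataset (not the Lightning class)."""

    def __init__(self, datasets: List[Iterable[int]], seed: int) -> None:
        self.datasets = datasets
        self.seed = seed

    def __iter__(self) -> Iterator[int]:
        rng = random.Random(self.seed)
        iterators = [iter(d) for d in self.datasets]
        while True:
            # Pick one of the datasets at random for every sample.
            chosen = rng.randrange(len(iterators))
            try:
                item = next(iterators[chosen])
            except StopIteration:
                # Assumption: stop as soon as the chosen dataset runs out.
                return
            yield item


def count_items(seed: int) -> int:
    """Count how many samples one full pass yields for a given seed."""
    combined = ToyCombinedDataset([range(3), range(100, 110)], seed=seed)
    return sum(1 for _ in combined)


# The count is reproducible for a fixed seed, but it is not a property
# of the datasets alone: it lies anywhere between 3 and 13 here.
counts = {seed: count_items(seed) for seed in range(20)}
```

In this toy setup the only guarantees are the bounds (at least 3, at most 13 samples), which is why a single `__len__` value would be misleading.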

📚 Documentation preview 📚: https://pytorch-lightning--19321.org.readthedocs.build/en/19321/

cc @Borda

github-actions · 2024-01-21T15:21:59Z

⚡ Required checks status: All passing 🟢

Groups summary

🟢 lightning_data: CPU workflow

Check ID	Status
data-cpu (macOS-11, lightning, 3.10, 2.1)	success	✅
data-cpu (ubuntu-20.04, lightning, 3.10, 2.1)	success	✅
data-cpu (windows-2022, lightning, 3.10, 2.1)	success	✅

These checks are required after the changes to src/lightning/data/streaming/combined.py.

🟢 mypy

Check ID	Status
mypy	success	✅

These checks are required after the changes to src/lightning/data/streaming/combined.py.

🟢 install

Check ID	Status
install-pkg (ubuntu-22.04, app, 3.8)	success	✅
install-pkg (ubuntu-22.04, app, 3.11)	success	✅
install-pkg (ubuntu-22.04, fabric, 3.8)	success	✅
install-pkg (ubuntu-22.04, fabric, 3.11)	success	✅
install-pkg (ubuntu-22.04, pytorch, 3.8)	success	✅
install-pkg (ubuntu-22.04, pytorch, 3.11)	success	✅
install-pkg (ubuntu-22.04, lightning, 3.8)	success	✅
install-pkg (ubuntu-22.04, lightning, 3.11)	success	✅
install-pkg (ubuntu-22.04, notset, 3.8)	success	✅
install-pkg (ubuntu-22.04, notset, 3.11)	success	✅
install-pkg (macOS-12, app, 3.8)	success	✅
install-pkg (macOS-12, app, 3.11)	success	✅
install-pkg (macOS-12, fabric, 3.8)	success	✅
install-pkg (macOS-12, fabric, 3.11)	success	✅
install-pkg (macOS-12, pytorch, 3.8)	success	✅
install-pkg (macOS-12, pytorch, 3.11)	success	✅
install-pkg (macOS-12, lightning, 3.8)	success	✅
install-pkg (macOS-12, lightning, 3.11)	success	✅
install-pkg (macOS-12, notset, 3.8)	success	✅
install-pkg (macOS-12, notset, 3.11)	success	✅
install-pkg (windows-2022, app, 3.8)	success	✅
install-pkg (windows-2022, app, 3.11)	success	✅
install-pkg (windows-2022, fabric, 3.8)	success	✅
install-pkg (windows-2022, fabric, 3.11)	success	✅
install-pkg (windows-2022, pytorch, 3.8)	success	✅
install-pkg (windows-2022, pytorch, 3.11)	success	✅
install-pkg (windows-2022, lightning, 3.8)	success	✅
install-pkg (windows-2022, lightning, 3.11)	success	✅
install-pkg (windows-2022, notset, 3.8)	success	✅
install-pkg (windows-2022, notset, 3.11)	success	✅

These checks are required after the changes to src/lightning/data/streaming/combined.py.

Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

tchaton · 2024-01-21T18:20:12Z

@awaelchli we could make the termination deterministic by breaking when the len has been reached. For very large datasets, it is nice to know the len of the combined dataset.

awaelchli · 2024-01-23T01:30:26Z

> make the termination deterministic by breaking when the len has been reached

The termination is deterministic, since we use a seed as input. You could only know the length and how many times each dataset gets called if you know in advance how many times the user calls next() on the combined iterator. But this is not the case.

> For very large datasets, it is nice to know the len of the combined dataset.

I argue that the user has access to the list of datasets and can compute the total length themselves. I think providing a length that is inaccurate is not a nice feature. Iterable datasets have no requirement to define a length because of exactly the type of sampling we are dealing with here.
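As a rough sketch of what computing the total length on the user side could look like, assuming every individual dataset defines `__len__` (the helper `total_length` is hypothetical and not part of the library):

```python
from typing import List, Sized


def total_length(datasets: List[Sized]) -> int:
    """Sum the lengths of the individual datasets."""
    return sum(len(d) for d in datasets)


# Plain lists stand in for the individual datasets here.
datasets = [list(range(3)), list(range(10))]
print(total_length(datasets))  # prints 13
```

This sum is the total number of available samples, not the number of samples a combined iterator will actually yield in one pass.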

Remove length

45a2285

awaelchli added the data (external) litdata package label Jan 20, 2024

awaelchli added this to the 2.1.x milestone Jan 20, 2024

[pre-commit.ci] auto fixes from pre-commit.com hooks

fcffc6d

for more information, see https://pre-commit.ci

awaelchli marked this pull request as ready for review January 21, 2024 15:21

awaelchli requested a review from tchaton as a code owner January 21, 2024 15:21

awaelchli added the fun Staff contributions outside working hours - to differentiate from the "community" label label Jan 23, 2024

tchaton approved these changes Jan 24, 2024

Borda approved these changes Jan 24, 2024

mergify bot added the ready PRs ready to be merged label Jan 24, 2024

awaelchli merged commit 71bfdc3 into master Jan 24, 2024
102 checks passed

awaelchli deleted the data/combined-length branch January 24, 2024 16:07

awaelchli added a commit that referenced this pull request Jan 30, 2024

Remove __len__ from CombinedStreamingDataset (#19321)

021f17d

lexierule pushed a commit that referenced this pull request Jan 31, 2024

Remove __len__ from CombinedStreamingDataset (#19321)

2403d89
