Migrate datasets from OpenML to Figshare #1217

Open
Vincent-Maladiere opened this issue Jan 15, 2025 · 4 comments

Comments

@Vincent-Maladiere (Member)

Problem Description

After IRL discussions with @jeromedockes, we observed that we're still experiencing a lot of CI disruptions and errors when fetching datasets from OpenML. Figshare seems to be a more reliable alternative.

Feature Description

Move all skrub datasets to Figshare.

Alternative Solutions

No response

Additional Context

No response

@Vincent-Maladiere Vincent-Maladiere added the enhancement New feature or request label Jan 15, 2025
@Vincent-Maladiere Vincent-Maladiere self-assigned this Jan 15, 2025
@jeromedockes (Member)

Also, we should mock downloads in unit tests (less urgent).
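A minimal sketch of what mocking a download in a unit test could look like, using only the standard library. The `download` helper and the example URL are hypothetical stand-ins, not skrub's actual fetching code; the point is only that the test never touches the network.

```python
import io
import urllib.request
from unittest import mock


def download(url: str) -> bytes:
    """Hypothetical helper: return the raw bytes served at `url`."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def test_download_is_mocked():
    # BytesIO stands in for the HTTP response: it supports both
    # the context-manager protocol and .read().
    fake_response = io.BytesIO(b"fake parquet bytes")
    with mock.patch(
        "urllib.request.urlopen", return_value=fake_response
    ) as fake_urlopen:
        data = download("https://example.org/dataset.parquet")
    # The helper returned the canned payload and made exactly one call.
    assert data == b"fake parquet bytes"
    fake_urlopen.assert_called_once_with("https://example.org/dataset.parquet")
```

Run under pytest, a test like this passes or fails independently of whether the remote host is reachable.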

@joaquinvanschoren commented Jan 15, 2025

OpenML is currently unreachable because of a cyberattack that hit TU Eindhoven. The service has been very reliable, but this event is sadly out of our control. OpenML itself is not affected, and we're in contact with the university IT team to bring it back soon. We have redundancy, but sadly all of it is within the TU/e network.

In the meantime, we are preparing to set up a secondary deployment in the Dutch supercomputing center so that such an outage won't happen again.

@jeromedockes (Member)

Hi @joaquinvanschoren , thanks very much for explaining this outage -- I hope this attack gets resolved quickly and without major consequences for the university! Indeed OpenML has been generally reliable and we're very grateful for it.

However, for the skrub datasets we don't really need the great features of OpenML: they are just a handful of fixed, pre-defined datasets, so we basically just need a place to store a few small parquet files (and some of the datasets are already stored like this). So I still think it makes sense to have a copy of those datasets on Figshare if their license allows it. Of course, in any case skrub depends on scikit-learn, so skrub users will always have easy access to any OpenML dataset through scikit-learn's fetch_openml. We could consider fetching from OpenML by default and falling back to Figshare if it's not available 🤔
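The fallback idea could be sketched roughly as follows. Everything here is an assumption for illustration: `fetch_with_fallback`, `fetch_from_openml`, and `fetch_from_figshare` are hypothetical names, and the two fetchers are stubs that simulate an OpenML outage rather than performing real downloads.

```python
from typing import Callable, Sequence


def fetch_with_fallback(fetchers: Sequence[Callable[[], object]]):
    """Try each fetcher in order and return the first successful result."""
    errors = []
    for fetch in fetchers:
        try:
            return fetch()
        except OSError as exc:
            # URLError, ConnectionError and TimeoutError all subclass OSError.
            errors.append(exc)
    raise OSError(f"all {len(errors)} sources failed: {errors}")


def fetch_from_openml():
    # Hypothetical stub: simulates OpenML being unreachable.
    raise ConnectionError("OpenML unreachable")


def fetch_from_figshare():
    # Hypothetical stub: a real version would download the parquet file.
    return "dataset.parquet"


# OpenML is tried first; Figshare is used only because OpenML failed.
result = fetch_with_fallback([fetch_from_openml, fetch_from_figshare])
```

Catching only `OSError` keeps programming errors (for example a `TypeError` in a fetcher) visible instead of silently triggering the fallback.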

@glemaitre (Member)

I would be more in favor of having redundancy rather than migrating. We also experienced issues in the past with scikit-learn: scikit-learn/scikit-learn#28297

The resolution of the issue took several months and I would say that the resolution of the ticket was not easy. At least, it is a plus on my side when trying to resolve an issue with OpenML because @joaquinvanschoren and the team do a great job.
