hf_load_dataset split options are limited #43

Open
samterfa opened this issue Jan 3, 2023 · 3 comments

Comments

@samterfa
Collaborator

samterfa commented Jan 3, 2023

There are some slick ways to pull only subsets of splits, which you can read up on here. The current implementation doesn't allow for these. They used to work by being passed in via ... An example is hf_load_dataset("the_dataset", split = "train[:10%]")
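For reference, a minimal Python sketch of the slice syntax being discussed, as accepted by the split argument of load_dataset() in the Python datasets library. The helper names split_slice() and load_slice() are hypothetical and exist only for illustration; only load_dataset() and its split argument come from the library itself.

```python
from typing import Optional


def split_slice(split: str, start: Optional[int] = None,
                stop: Optional[int] = None, percent: bool = False) -> str:
    """Build a split string such as 'train[:10%]' or 'train[10:20]'.

    Hypothetical helper, not part of any library.
    """
    if start is None and stop is None:
        return split  # no slicing requested, e.g. plain 'train'
    unit = "%" if percent else ""
    lo = "" if start is None else f"{start}{unit}"
    hi = "" if stop is None else f"{stop}{unit}"
    return f"{split}[{lo}:{hi}]"


def load_slice(dataset_name: str, split_spec: str):
    """Pass the raw split string straight through to datasets.load_dataset()."""
    # Imported lazily so the string helper above works without the package.
    from datasets import load_dataset
    return load_dataset(dataset_name, split=split_spec)


print(split_slice("train", stop=10, percent=True))  # train[:10%]
print(split_slice("train", start=10, stop=20))      # train[10:20]
```

Passing the string through untouched, as load_slice() does, is the behaviour that used to be available via the dots argument.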

@jpcompartir
Collaborator

jpcompartir commented Jan 3, 2023

If we decide to implement this, I think we should separate it from the current function. Name it something like hf_load_dataset_slice(), thoughts?

Do wonder if re-directing users to sample_n() or one of the slice functions isn't preferable, though?

@samterfa
Collaborator Author

samterfa commented Jan 3, 2023

I'd like to be consistent with other functions we've created. The non-ez functions we've implemented generally allow a user who knows what they're doing to get most/all the functionality of the Python libraries we are emulating, but also abstract a little of the hard stuff away. In that spirit it seems like we should provide a function that accommodates the raw split argument, and perhaps also have a version that abstracts away the hard stuff, or incorporate them into the same function.

The nice thing about split = 'train[:10]' is that you can preview a dataset without downloading the whole thing. I think we should allow the user to pull a portion of any split, and I think we can do this in a single function. Perhaps we should create an hf_dataset_info() function which pulls the available splits and other useful info. If we wanted to be fancy we could make hf_load_dataset() act like a lazy query so you could do things like hf_load_dataset(split = 'train') %>% slice_sample(n = 10).
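A rough Python sketch of how the preview and info ideas could map onto the underlying datasets library. It assumes that streaming mode (streaming=True) is what avoids a full download, which is worth verifying against the library docs before relying on it. The names first_n(), preview_dataset() and available_splits() are hypothetical; load_dataset() and get_dataset_split_names() are the library calls they wrap.

```python
from itertools import islice
from typing import Any, Iterable, List


def first_n(rows: Iterable[Any], n: int) -> List[Any]:
    """Return the first n items, pulling no more from the iterable than needed."""
    return list(islice(rows, n))


def preview_dataset(dataset_name: str, split: str = "train", n: int = 10):
    """Hypothetical helper: stream a split and keep only its first n rows."""
    from datasets import load_dataset  # lazy import; needs the datasets package
    stream = load_dataset(dataset_name, split=split, streaming=True)
    return first_n(stream, n)


def available_splits(dataset_name: str):
    """Hypothetical helper: list split names without loading any rows."""
    from datasets import get_dataset_split_names
    return get_dataset_split_names(dataset_name)


# Offline demonstration of the lazy behaviour, using a stand-in row source
# that records how many rows were actually requested from it.
pulled = []


def fake_rows():
    for i in range(1_000_000):
        pulled.append(i)
        yield {"id": i}


sample = first_n(fake_rows(), 3)
print(sample)       # first three rows only
print(len(pulled))  # number of rows actually pulled from the source
```

The stand-in generator only illustrates that a lazy source is read as far as the preview needs; it says nothing about how much data the real library transfers.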

@samterfa samterfa closed this as completed Jan 3, 2023
@samterfa samterfa reopened this Jan 3, 2023
@jpcompartir
Collaborator

Yeah nice, allowing that functionality would be great - previewing the large datasets would be a real boon. I'm wondering if a hf_preview_dataset() function would do the trick - very wary of trying to do too much with hf_load_dataset() - it already does too much, IMO.
