hf_load_dataset split options are limited #43

samterfa · 2023-01-03T03:10:39Z

There are some slick ways to pull only a subset of a split, which you can read up on here. The current implementation doesn't allow for these. They used to work by being passed in via ... An example is hf_load_dataset("the_dataset", split = "train[:10%]").
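To make the slice syntax concrete, here is a minimal Python sketch of how a split string such as "train[:10%]" could be resolved into a split name and a row range. The helper resolve_split() is hypothetical and is not part of the Python datasets library or of this package. It handles only the simple name[start:stop] form, and its rounding of percentages (rounding down) is an assumption that may differ from what the real library does.

```python
import re

# Hypothetical pattern: a split name, optionally followed by [start:stop],
# where each bound is an integer or an integer percentage.
_SPLIT_RE = re.compile(
    r"^(?P<name>\w+)(?:\[(?P<start>-?\d+%?)?:(?P<stop>-?\d+%?)?\])?$"
)


def _to_index(bound, size, default):
    """Convert one slice bound ('10', '10%', or None) to a row index."""
    if bound is None:
        return default
    if bound.endswith("%"):
        # Assumption: percentages round down to a whole row.
        value = int(bound[:-1]) * size // 100
    else:
        value = int(bound)
    if value < 0:
        value += size  # negative bounds count from the end
    return max(0, min(size, value))  # clamp to the split's range


def resolve_split(spec, size):
    """Return (split_name, start, stop) for a split of `size` rows."""
    match = _SPLIT_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"Unsupported split spec: {spec!r}")
    start = _to_index(match["start"], size, 0)
    stop = _to_index(match["stop"], size, size)
    return match["name"], start, stop


print(resolve_split("train[:10%]", 1000))
print(resolve_split("train[:10]", 1000))
```

The point of the sketch is only that "train[:10%]" and "train[:10]" name different row ranges of the same split, which is why passing the raw string through unchanged preserves the behaviour.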

jpcompartir · 2023-01-03T13:40:34Z

If we decide to implement this, I think we should separate it from the current function. Name it something like hf_load_dataset_slice(), thoughts?

I do wonder whether redirecting users to sample_n() or one of the slice functions would be preferable, though.

samterfa · 2023-01-03T13:59:52Z

I'd like to be consistent with other functions we've created. The non-ez functions we've implemented generally allow a user who knows what they're doing to get most/all the functionality of the python libraries we are emulating, but also abstract a little of the hard stuff away. In that spirit it seems like we should provide a function that accommodates the raw split argument, and perhaps also have a version that abstracts away the hard stuff, or incorporate them into the same function.

The nice thing about split = 'train[:10]' is that you can preview a dataset without downloading the whole thing. I think we should allow the user to pull a portion of any split, and I think we can do this in a single function. Perhaps we should create a hf_dataset_info() function which pulls the available splits and other useful info. If we wanted to be fancy we could make hf_load_dataset() act like a lazy query so you could do things like hf_load_dataset(split = 'train') %>% slice_sample(n = 10).
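The lazy-query idea can be sketched as follows, in Python for illustration. Everything here is hypothetical: LazySplit, head(), collect(), and fake_loader are names invented for this sketch, and head() stands in for the simpler "first n rows" case rather than the random slice_sample() mentioned above. The loader is a stub so the example runs without any download.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class LazySplit:
    """Records what to load; nothing is fetched until collect() is called."""

    loader: Callable[[str], List[dict]]  # takes a split string, returns rows
    split: str
    limit: Optional[int] = None

    def head(self, n: int) -> "LazySplit":
        # Keep the tighter of the existing limit and the new one.
        new_limit = n if self.limit is None else min(n, self.limit)
        return LazySplit(self.loader, self.split, new_limit)

    def split_string(self) -> str:
        # Fold the recorded limit into the raw split syntax.
        if self.limit is None:
            return self.split
        return f"{self.split}[:{self.limit}]"

    def collect(self) -> List[dict]:
        # The only place the loader is actually invoked.
        return self.loader(self.split_string())


calls = []  # records every split string the stub loader receives


def fake_loader(split: str) -> List[dict]:
    # Stand-in for a real dataset loader (assumption: it accepts a split string).
    calls.append(split)
    return [{"split": split}]


query = LazySplit(fake_loader, "train").head(10)  # no load happens here
rows = query.collect()  # the loader is called once, with "train[:10]"
```

The design choice this illustrates is that the row limit is translated into the split string before loading, so only the requested portion would need to be fetched.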

jpcompartir · 2023-01-03T14:51:06Z

Yeah nice, allowing that functionality would be great - previewing the large datasets would be a real boon. I'm wondering if a hf_preview_dataset() function would do the trick - very wary of trying to do too much with hf_load_dataset() - it already does too much, IMO

samterfa closed this as completed Jan 3, 2023

samterfa reopened this Jan 3, 2023
