Adds Ability to Sub-Sample Data for Data Constrained Scaling Law Experiments #872
Conversation
what do you think about making a custom asyncdataset that is basically just a slice, and leaving this logic out of the mixture dataset?
With the idea of having it wrap the mixture dataset in the simulation case? Or would each of the sub-datasets of a mixture be of this type? I think the former makes a lot of sense! I just need to wrap my head around an impl. that would lead to consistent slices of each sub-dataset as well. The latter seems like it would require more configuration wiring than it's worth.
Looking at it more, I don't know if there's actually a clean way to make the former work. The latter feels fine as well, though, and likely has fewer footguns around the effective dataset size differing across all the different utilities, so I'll get that in.
i was thinking the latter, yeah. basically just "if subsample: datasets = map(datasets, slice_dataset)" but I guess your point is we don't know how many samples we're going to go through until we're in the mixture dataset
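A minimal runnable sketch of that "map a slice over each sub-dataset" shape, assuming a levanter-style async dataset with `async_len` and `get_batch`; the `ListAsyncDataset` and `SlicedDataset` names here are hypothetical stand-ins, not the PR's code:

```python
import asyncio
from typing import Sequence


class ListAsyncDataset:
    """Stand-in for a levanter-style AsyncDataset, backed by a list (illustration only)."""

    def __init__(self, items):
        self.items = items

    async def async_len(self) -> int:
        return len(self.items)

    async def get_batch(self, indices: Sequence[int]):
        return [self.items[i] for i in indices]


class SlicedDataset:
    """Hypothetical wrapper exposing only the first max_len items of inner."""

    def __init__(self, inner, max_len: int):
        self.inner = inner
        self.max_len = max_len

    async def async_len(self) -> int:
        return min(self.max_len, await self.inner.async_len())

    async def get_batch(self, indices: Sequence[int]):
        if any(i >= self.max_len for i in indices):
            raise IndexError("index past the end of the slice")
        return await self.inner.get_batch(indices)


async def demo():
    datasets = {"web": ListAsyncDataset(list(range(100)))}
    subsample = True
    if subsample:  # "if subsample: datasets = map(datasets, slice_dataset)"
        datasets = {name: SlicedDataset(ds, max_len=10) for name, ds in datasets.items()}
    print(await datasets["web"].async_len())           # 10
    print(await datasets["web"].get_batch([0, 5, 9]))  # [0, 5, 9]


asyncio.run(demo())
```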
I've got a rework I'll push after NLP lunch that I feel like is probably cleaner.
actually no, i don't understand why it isn't just:
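A guess at the shape being proposed here, based on the later note that this version overrides the length-waiting function rather than `get_batch`; the method names follow levanter's AsyncDataset interface but are assumptions, so treat this as a sketch rather than the original snippet:

```python
class LengthCappedDataset:
    """Hypothetical: override only length reporting; get_batch passes through untouched."""

    def __init__(self, inner, max_len: int):
        self.inner = inner
        self.max_len = max_len

    async def async_len(self) -> int:
        return min(self.max_len, await self.inner.async_len())

    async def wait_until_len_at_least(self, n: int) -> int:
        # cap the reported length so consumers believe the dataset
        # ends at max_len and stop requesting batches past it
        inner_len = await self.inner.wait_until_len_at_least(min(n, self.max_len))
        return min(self.max_len, inner_len)

    def __getattr__(self, name):
        # delegate everything else (get_batch included) to the wrapped dataset
        return getattr(self.inner, name)
```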
Ok - how does this look? The test is probably in the wrong spot though, realistically?
Just saw this! My one concern with this vs. what I have in the current one is that it makes more assumptions about the implementation of the underlying dataset (edit: for example, you couldn't wrap a mixture in this class, right?)
i don't see why not?
Maybe I'm missing something, but this assumes the underlying dataset uses get_batch logic that depends on its own length. For a mixture dataset, this isn't quite true: it instead delegates to its sub-datasets.
Oop, just saw the slice version I have on my branch isn't pushed here. I think that'll mix concerns less than the current PR does.
Ok - pushed. I realized I hadn't pushed before since this currently has a pre-commit type failure, which I'll fix assuming the core logic makes sense to you! The core difference between this and your proposal is overriding get_batch rather than overriding the length-waiting function.
I prefer this because the effects of calling the slice are (to me) clearer from the above code without checking anything else, whereas I feel like understanding the effects of the length override requires reading more of the surrounding machinery. LMK what you think!
```python
simulated_data_ratio = self.experiment_budget / self.target_budget
for name, ds in token_datasets.items():
    # Note(Will): This blocks on datasets being fully processed even for small
    # simulated runs, making simulating data size slightly latency-inducing,
    # but I think that's ok.
    true_length_of_dataset = len(ds.as_sync_dataset())
```
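For scale, a small runnable sketch of how that ratio could then be applied per dataset; the budget numbers, the dataset lengths, and the ceil-then-slice step are illustrative assumptions, not the PR's actual code:

```python
import math

# stand-in numbers; in the PR these come from the mixture config
experiment_budget = 1_000_000_000    # tokens the small run will actually train on
target_budget = 100_000_000_000      # tokens the full-scale run would see
simulated_data_ratio = experiment_budget / target_budget  # 0.01

# per-dataset true lengths, as from len(ds.as_sync_dataset())
true_lengths = {"web": 5_000_000, "code": 1_200_000}
for name, true_length_of_dataset in true_lengths.items():
    # keep the same fraction of every sub-dataset so the mixture's
    # composition is preserved while its effective size shrinks
    simulated_length = math.ceil(true_length_of_dataset * simulated_data_ratio)
    print(name, simulated_length)  # web 50000, code 12000
```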
this is the thing i was trying to work around but I agree it's not worth fixing
SG!
can you fix the mypy errors? then good to merge
Fixed!
Allows mixture datasets to specify a target budget and an experiment budget. This then computes what percentage of the data to sample overall, in order to enable data-constrained experiments like the one shown in the figure above.