Adds Ability to Sub-Sample Data for Data Constrained Scaling Law Experiments #872

Merged: 4 commits into main from will/constrain on Feb 1, 2025

Conversation

@Helw150 (Collaborator) commented Jan 30, 2025

[figure: data-constrained scaling law experiment results]

Allows mixture datasets to specify a target budget and an experiment budget. This then computes what percentage of the data to sample overall, in order to enable data-constrained experiments like the one in the figure above.
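
Concretely, the subsampling ratio is just the quotient of the two budgets. A minimal sketch of the arithmetic (the variable names follow the diff quoted later in this thread; the example numbers are illustrative):

    # With a target budget of 100B tokens and an experiment budget of 10B tokens,
    # each dataset in the mixture is subsampled to 10% of its true size.
    target_budget = 100_000_000_000       # tokens the full-scale run would consume
    experiment_budget = 10_000_000_000    # tokens the scaled-down experiment consumes
    simulated_data_ratio = experiment_budget / target_budget  # -> 0.1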

@Helw150 requested review from @dlwh and @ahmeda14960 on January 30, 2025 at 05:58
@dlwh (Member) left a comment


What do you think about making a custom AsyncDataset that is basically just a slice, and leaving this logic out of the mixture dataset?

@Helw150 (Collaborator, Author) commented Jan 30, 2025

> What do you think about making a custom AsyncDataset that is basically just a slice, and leaving this logic out of the mixture dataset?

With the idea of having it wrap the mixture dataset in the simulation case? Or would each of the sub-datasets of a mixture be of this type?

I think the former makes a lot of sense! I just need to wrap my head around an implementation that would lead to consistent slices of each sub-dataset as well.

The latter seems like it would require more configuration wiring than it's worth.

@Helw150 (Collaborator, Author) commented Jan 30, 2025

Looking at it more, I don't know if there's actually a clean way to make the former work.

The latter feels fine as well, though, and likely has fewer footguns with respect to the effective dataset size differing from what the various utilities report, so I'll get that in.

@dlwh (Member) commented Jan 30, 2025

I was thinking the latter, yeah. Basically just "if subsample: datasets = map(datasets, slice_dataset)". But I guess your point is that we don't know how many samples we're going to go through until we're in the mixture dataset.

@Helw150 (Collaborator, Author) commented Jan 30, 2025

I've got a rework I'll push after NLP lunch that I feel like is probably cleaner.

@dlwh (Member) commented Jan 30, 2025

Actually, no. I don't understand why it isn't just:

    class SimulatedDataRatioDataset:
        async def wait_until_len_at_least(self, length):
            target_for_len = length * self.ratio_target
            actual = await self.ds.wait_until_len_at_least(int(target_for_len))

            return int(actual * self.ratio_target)

@Helw150 (Collaborator, Author) commented Jan 30, 2025

OK, how does this look?

Realistically though, the test is probably in the wrong spot?

@Helw150 (Collaborator, Author) commented Jan 30, 2025

> Actually, no. I don't understand why it isn't just:
>
>     class SimulatedDataRatioDataset:
>         async def wait_until_len_at_least(self, length):
>             target_for_len = length * self.ratio_target
>             actual = await self.ds.wait_until_len_at_least(int(target_for_len))
>             return int(actual * self.ratio_target)

Just saw this! My one concern with this vs. what I have in the current one is that it makes more assumptions about the implementation of the underlying get_batch code.

(Edit: for example, you couldn't wrap a mixture in this class, right?)

@dlwh (Member) commented Jan 31, 2025

I don't see why not?

@Helw150 (Collaborator, Author) commented Jan 31, 2025

> Actually, no. I don't understand why it isn't just:
>
>     class SimulatedDataRatioDataset:
>         async def wait_until_len_at_least(self, length):
>             target_for_len = length * self.ratio_target
>             actual = await self.ds.wait_until_len_at_least(int(target_for_len))
>             return int(actual * self.ratio_target)

Maybe I'm missing something, but this assumes the underlying dataset's get_batch logic depends on its own wait_until_len_at_least to determine that the dataset is exhausted.

For a mixture dataset, this isn't quite true: it instead uses the wait_until_len_at_least of each of the sub-datasets. Since overriding the parent's call doesn't affect these, it wouldn't change the epoching behavior, right?

@Helw150 (Collaborator, Author) commented Jan 31, 2025

Oops, just saw that the slice version I have on my branch isn't pushed here. I think that will mix concerns less than the current PR.

@Helw150 (Collaborator, Author) commented Jan 31, 2025

OK, pushed. I realized I hadn't pushed before because this currently has a pre-commit type failure, which I'll fix assuming the core logic makes sense to you!

The core difference between this and your proposal is overriding get_batch rather than overriding the length-waiting function.

    async def get_batch(self, indices: Sequence[int]) -> Sequence[T_co]:
        # Shift the requested indices into the sliced window of the underlying dataset.
        shifted_indices = [(index + self.start_index) for index in indices]
        max_index = max(shifted_indices)

        if self.end_index is not None and max_index > self.end_index:
            raise ValueError("Requested indices beyond the end of the dataset")

        return await self.dataset.get_batch(shifted_indices)

I prefer this because the effects of the slice are (to me) clear from the above code without checking anything else, whereas understanding the effects of the wait_until_len_at_least override would require knowing how the get_batch of the dataset you are wrapping is implemented.

LMK what you think!
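
For reference, a self-contained sketch of what such a slice wrapper might look like in full. Only the get_batch method above is quoted from the PR; the class name, constructor, and wait_until_len_at_least below are assumptions for illustration:

    from typing import Optional, Sequence, TypeVar

    T_co = TypeVar("T_co", covariant=True)


    class SlicedAsyncDataset:
        """Presents the window [start_index, end_index] (inclusive) of an async dataset."""

        def __init__(self, dataset, start_index: int = 0, end_index: Optional[int] = None):
            self.dataset = dataset
            self.start_index = start_index
            self.end_index = end_index

        async def get_batch(self, indices: Sequence[int]) -> Sequence[T_co]:
            # Shift the requested indices into the underlying dataset's index space.
            shifted_indices = [index + self.start_index for index in indices]
            max_index = max(shifted_indices)

            if self.end_index is not None and max_index > self.end_index:
                raise ValueError("Requested indices beyond the end of the dataset")

            return await self.dataset.get_batch(shifted_indices)

        async def wait_until_len_at_least(self, length: int) -> int:
            # Clamp the request to the slice window, translate it into the underlying
            # dataset's index space, and report the clipped length back.
            if self.end_index is not None:
                length = min(length, self.end_index - self.start_index + 1)
            actual = await self.dataset.wait_until_len_at_least(length + self.start_index)
            available = actual - self.start_index
            if self.end_index is not None:
                available = min(available, self.end_index - self.start_index + 1)
            return available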

    simulated_data_ratio = self.experiment_budget / self.target_budget
    for name, ds in token_datasets.items():
        # Note(Will): This blocks on datasets being fully processed even for small
        # simulated runs, making data-size simulation slightly latency-inducing,
        # but I think that's OK.
        true_length_of_dataset = len(ds.as_sync_dataset())
@dlwh (Member) commented on the diff:

this is the thing I was trying to work around, but I agree it's not worth fixing

@Helw150 (Collaborator, Author) replied:

SG!
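
A hypothetical continuation of the quoted loop, assuming a slice wrapper along the lines sketched above (the wrapper name and its constructor are assumptions, not the PR's actual code):

        # Subsample each dataset down to the simulated ratio of its true length.
        simulated_length = int(true_length_of_dataset * simulated_data_ratio)
        # end_index is inclusive in the sketch above, hence the - 1.
        token_datasets[name] = SlicedAsyncDataset(ds, start_index=0, end_index=simulated_length - 1)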

@dlwh (Member) commented Jan 31, 2025

Can you fix the mypy errors? Then good to merge.

@Helw150 (Collaborator, Author) commented Feb 1, 2025

> Can you fix the mypy errors? Then good to merge.

Fixed!

@dlwh merged commit 1d216d1 into main on Feb 1, 2025 (7 of 8 checks passed).
@dlwh deleted the will/constrain branch on February 1, 2025 at 08:45.