Hi, I just uploaded pyedu, a subset of the "stack-edu" subset of smollm-corpus (https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus).
Although it's high-quality according to the SmolLM2 tech report, it's relatively small at only ~6 GB.
It might be useful for further training, such as annealing or for synthesizing datasets.
https://huggingface.co/datasets/Leon-Leee/unofficial-pyedu