we need more code data #81

@zacliu2023

Hi, I just uploaded pyedu, a subset of the "stack-edu" subset of smollm-corpus (https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus).
According to the smollm-v2 tech report it is high quality, but it is relatively small at only ~6GB.
It might be useful for further training, for example for annealing or for synthesizing datasets.
https://huggingface.co/datasets/Leon-Leee/unofficial-pyedu
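
For anyone who wants to try it in an annealing mix, here is a minimal sketch of pulling a size-capped sample with the Hugging Face `datasets` library. The split name `"train"` and the column name `"text"` are assumptions, as is the helper `take_within_budget` (hypothetical, written for this example); check the dataset card for the actual schema.

```python
from typing import Iterable, Iterator


def take_within_budget(
    rows: Iterable[dict], budget_bytes: int, text_key: str = "text"
) -> Iterator[dict]:
    """Yield rows until adding the next one would exceed budget_bytes (UTF-8)."""
    used = 0
    for row in rows:
        size = len(row[text_key].encode("utf-8"))
        if used + size > budget_bytes:
            break
        used += size
        yield row


def stream_pyedu(split: str = "train"):
    """Stream the dataset from the Hub instead of downloading all ~6GB first.

    Assumption: the repo exposes a split called "train".
    """
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("Leon-Leee/unofficial-pyedu", split=split, streaming=True)
```

Something like `sample = list(take_within_budget(stream_pyedu(), 100 * 1024**2))` would then collect roughly 100 MiB of text, assuming the column name matches.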
