This repo prepares and cleans data for Joey LLM training.
- Use one parquet file from the FineWeb dataset (any file is fine for now).
- Clean the data and tokenize it.
- Build a dataset class in PyTorch to prepare the data for training.
- Output should be ready to plug into the training pipeline in models-magpie.
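A minimal sketch of the load-and-clean step for the list above, assuming pandas and a `text` column in the parquet shard (FineWeb shards expose one); the file path and the exact cleaning rules below are placeholders, not final choices.

```python
# Minimal sketch: load one FineWeb parquet shard and apply basic cleaning.
# Assumptions: a "text" column exists; the path is a placeholder.
import pandas as pd


def load_and_clean(parquet_path: str) -> pd.DataFrame:
    """Load one FineWeb parquet shard and apply basic text cleaning."""
    df = pd.read_parquet(parquet_path, columns=["text"])

    # Drop missing or empty documents.
    df = df.dropna(subset=["text"])
    df["text"] = df["text"].str.strip()
    df = df[df["text"].str.len() > 0]

    # Drop exact duplicates as a cheap first-pass dedup.
    df = df.drop_duplicates(subset=["text"]).reset_index(drop=True)
    return df


if __name__ == "__main__":
    cleaned = load_and_clean("fineweb_shard.parquet")  # placeholder path
    print(f"{len(cleaned)} documents after cleaning")
```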
Deliverable: a script that:
- Loads and cleans a parquet file from FineWeb.
- Tokenizes the data.
- Defines a PyTorch dataset/dataloader that can be used in training.
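A minimal sketch of the tokenize-and-dataset step, using the Hugging Face GPT-2 tokenizer as a stand-in (the tokenizer the models-magpie pipeline actually expects may differ) and an assumed fixed block size; documents are concatenated into one token stream and sliced into equal-length blocks for causal LM training.

```python
# Minimal sketch: tokenize cleaned documents and serve fixed-length blocks.
# Assumptions: GPT-2 tokenizer as a stand-in; block_size of 512 is arbitrary.
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer


class FineWebDataset(Dataset):
    """Tokenizes cleaned documents and serves fixed-length token blocks."""

    def __init__(self, texts, tokenizer_name="gpt2", block_size=512):
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        # Concatenate all documents into one long token stream, separated
        # by the end-of-text token, then slice it into equal blocks.
        ids = []
        for text in texts:
            ids.extend(tokenizer.encode(text))
            ids.append(tokenizer.eos_token_id)
        n_blocks = len(ids) // block_size
        ids = ids[: n_blocks * block_size]
        self.blocks = torch.tensor(ids, dtype=torch.long).view(n_blocks, block_size)

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, idx):
        block = self.blocks[idx]
        # Inputs and shifted next-token targets for causal LM training.
        return block[:-1], block[1:]


# Usage: texts could be the cleaned DataFrame's "text" column from above.
# dataset = FineWebDataset(cleaned["text"].tolist())
# loader = DataLoader(dataset, batch_size=8, shuffle=True)
```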
Later:
- Swap in the full FineWeb dataset once it is available (see the streaming sketch below).
- Optimize cleaning/tokenization for larger-scale data.
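For the full-dataset step, streaming through the Hugging Face `datasets` library avoids downloading every shard up front. This is a hedged sketch: the dataset id `HuggingFaceFW/fineweb` and the per-record cleaning are assumptions to verify against the Hub.

```python
# Hedged sketch: stream FineWeb records instead of loading one local shard.
# Assumption: the Hub dataset id "HuggingFaceFW/fineweb" is correct; verify it.
from datasets import load_dataset


def stream_clean_texts(limit=None):
    """Yield cleaned text records from the streamed FineWeb dataset."""
    stream = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
    for i, record in enumerate(stream):
        if limit is not None and i >= limit:
            break
        text = (record.get("text") or "").strip()
        if text:
            yield text


# Usage: feed a bounded sample into the same FineWebDataset as above.
# sample = list(stream_clean_texts(limit=10_000))
# dataset = FineWebDataset(sample)
```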
Notes:
- Keep it simple: one parquet file, cleaned and tokenized.
- If needed, look up examples of cleaning + tokenizing text for LLM training.
- The important part: working code that outputs a dataset ready for training.