joeyllm/data-magpie

Data Magpie

This repo is for data preparation and cleaning for Joey LLM training.


🎯 Focus for This Sprint

  • Use one parquet file from the FineWeb dataset (any file is fine for now).
  • Clean the data and tokenize it.
  • Build a dataset class in PyTorch to prepare the data for training.
  • Output should be ready to plug into the training pipeline in models-magpie.
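
The cleaning step is not specified in detail here, so the following is only a minimal sketch of one possible approach: Unicode normalization, control-character stripping, whitespace collapsing, a minimum-length filter, and exact-duplicate removal. The names `clean_text`, `clean_corpus`, and `min_chars` are hypothetical, and the threshold of 20 characters is an arbitrary placeholder.

```python
import re
import unicodedata

# Runs of any whitespace (spaces, tabs, newlines) get collapsed to one space.
_WHITESPACE = re.compile(r"\s+")
# ASCII control characters other than tab, newline, and carriage return.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_text(text: str) -> str:
    """Normalize one document (hypothetical helper, not from the repo)."""
    text = unicodedata.normalize("NFKC", text)  # e.g. non-breaking space -> space
    text = _CONTROL.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()


def clean_corpus(texts, min_chars: int = 20):
    """Clean every document, dropping non-strings, short docs, and exact duplicates."""
    seen = set()
    cleaned = []
    for raw in texts:
        if not isinstance(raw, str):
            continue  # skip missing values such as None
        doc = clean_text(raw)
        if len(doc) < min_chars or doc in seen:
            continue
        seen.add(doc)
        cleaned.append(doc)
    return cleaned
```

Assuming the parquet file stores documents in a `text` column, the input could come from something like `pandas.read_parquet(path, columns=["text"])["text"]`. Collapsing all whitespace also removes paragraph breaks, so a real pipeline may want to preserve newlines instead.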

📦 Deliverable

A script that:

  1. Loads and cleans a parquet file from FineWeb.
  2. Tokenizes the data.
  3. Defines a PyTorch dataset/dataloader that can be used in training.
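
Steps 2 and 3 above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual implementation: `PackedTokenDataset` and `byte_encode` are hypothetical names, the byte-level encoder is only a stand-in for whatever tokenizer the training pipeline really uses, and the end-of-document id of 256 is chosen just to sit outside the byte range. The dataset concatenates all token ids into one stream and cuts it into fixed-length blocks, returning each block as an (input, target) pair shifted by one token.

```python
import torch
from torch.utils.data import DataLoader, Dataset


def byte_encode(text):
    """Stand-in tokenizer: UTF-8 bytes as token ids (assumption, not the real tokenizer)."""
    return list(text.encode("utf-8"))


class PackedTokenDataset(Dataset):
    """Packs tokenized documents into fixed-length blocks for next-token prediction."""

    def __init__(self, texts, encode, block_size, eos_id):
        ids = []
        for text in texts:
            ids.extend(encode(text))
            ids.append(eos_id)  # mark the document boundary
        # Each block needs block_size inputs plus one extra token for the shifted target.
        self.block_size = block_size
        self.n_blocks = max(0, (len(ids) - 1) // block_size)
        self.data = torch.tensor(ids[: self.n_blocks * block_size + 1], dtype=torch.long)

    def __len__(self):
        return self.n_blocks

    def __getitem__(self, idx):
        if not 0 <= idx < self.n_blocks:
            raise IndexError(idx)
        start = idx * self.block_size
        chunk = self.data[start : start + self.block_size + 1]
        return chunk[:-1], chunk[1:]  # inputs, targets shifted by one position


# Tiny usage example with two toy documents.
dataset = PackedTokenDataset(["abcd", "efg"], byte_encode, block_size=4, eos_id=256)
loader = DataLoader(dataset, batch_size=2, shuffle=False)
```

Each batch from `loader` is a pair of `(batch, block_size)` tensors of type `torch.long`. Tokens left over after the last full block are dropped, which keeps every sample the same length without padding.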

🔜 Coming Next Sprint

  • Swap in the full FineWeb dataset once it is available.
  • Optimize cleaning/tokenization for larger scale data.

๐Ÿ“ Notes

  • Keep it simple: one parquet file, cleaned and tokenized.
  • If needed, look up examples of cleaning + tokenizing text for LLM training.
  • The important part: working code that outputs a dataset ready for training.

About

Data prep for fine-tuning Mistral 7B
