This repo prepares and cleans data for Joey LLM training.
- Use one parquet file from the FineWeb dataset (any file is fine for now).
- Clean the data and tokenize it.
- Build a dataset class in PyTorch to prepare the data for training.
- Output should be ready to plug into the training pipeline in models-magpie.
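A minimal sketch of the load-and-clean step for the list above, assuming pandas and a `text` column in the parquet shard (FineWeb shards expose one); the file path and the exact cleaning rules below are placeholders, not final choices.

```python
# Minimal sketch: load one FineWeb parquet shard and apply basic cleaning.
# Assumptions: a "text" column exists; the path is a placeholder.
import pandas as pd


def load_and_clean(parquet_path: str) -> pd.DataFrame:
    """Load one FineWeb parquet shard and apply basic text cleaning."""
    df = pd.read_parquet(parquet_path, columns=["text"])

    # Drop missing or empty documents.
    df = df.dropna(subset=["text"])
    df["text"] = df["text"].str.strip()
    df = df[df["text"].str.len() > 0]

    # Drop exact duplicates as a cheap first-pass dedup.
    df = df.drop_duplicates(subset=["text"]).reset_index(drop=True)
    return df


if __name__ == "__main__":
    cleaned = load_and_clean("fineweb_shard.parquet")  # placeholder path
    print(f"{len(cleaned)} documents after cleaning")
```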
Deliverable: a script that:
- Loads and cleans a parquet file from FineWeb.
- Tokenizes the data.
- Defines a PyTorch dataset/dataloader that can be used in training.
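A minimal sketch of the tokenize-and-dataset step, using the Hugging Face GPT-2 tokenizer as a stand-in (the tokenizer the models-magpie pipeline actually expects may differ) and an assumed fixed block size; documents are concatenated into one token stream and sliced into equal-length blocks for causal LM training.

```python
# Minimal sketch: tokenize cleaned documents and serve fixed-length blocks.
# Assumptions: GPT-2 tokenizer as a stand-in; block_size of 512 is arbitrary.
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer


class FineWebDataset(Dataset):
    """Tokenizes cleaned documents and serves fixed-length token blocks."""

    def __init__(self, texts, tokenizer_name="gpt2", block_size=512):
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        # Concatenate all documents into one long token stream, separated
        # by the end-of-text token, then slice it into equal blocks.
        ids = []
        for text in texts:
            ids.extend(tokenizer.encode(text))
            ids.append(tokenizer.eos_token_id)
        n_blocks = len(ids) // block_size
        ids = ids[: n_blocks * block_size]
        self.blocks = torch.tensor(ids, dtype=torch.long).view(n_blocks, block_size)

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, idx):
        block = self.blocks[idx]
        # Inputs and shifted next-token targets for causal LM training.
        return block[:-1], block[1:]


# Usage: texts could be the cleaned DataFrame's "text" column from above.
# dataset = FineWebDataset(cleaned["text"].tolist())
# loader = DataLoader(dataset, batch_size=8, shuffle=True)
```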
Later:
- Swap in the full FineWeb dataset once it is available (see the streaming sketch below).
- Optimize cleaning/tokenization for larger-scale data.
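For the full-dataset step, streaming through the Hugging Face `datasets` library avoids downloading every shard up front. This is a hedged sketch: the dataset id `HuggingFaceFW/fineweb` and the per-record cleaning are assumptions to verify against the Hub.

```python
# Hedged sketch: stream FineWeb records instead of loading one local shard.
# Assumption: the Hub dataset id "HuggingFaceFW/fineweb" is correct; verify it.
from datasets import load_dataset


def stream_clean_texts(limit=None):
    """Yield cleaned text records from the streamed FineWeb dataset."""
    stream = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
    for i, record in enumerate(stream):
        if limit is not None and i >= limit:
            break
        text = (record.get("text") or "").strip()
        if text:
            yield text


# Usage: feed a bounded sample into the same FineWebDataset as above.
# sample = list(stream_clean_texts(limit=10_000))
# dataset = FineWebDataset(sample)
```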
Notes:
- Keep it simple: one parquet file, cleaned and tokenized.
- If needed, look up examples of cleaning + tokenizing text for LLM training.
- The important part: working code that outputs a dataset ready for training.