A minimal yet efficient implementation of causal language modeling in PyTorch.
It features a custom torch-compilable Transformer model implementation supporting RoPE, GLU, and RMSNorm. It supports distributed training via Distributed Data Parallel (DDP).
A dedicated script is included for downloading, tokenizing, and chunking data, making data preparation seamless.
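For reference, two of the building blocks mentioned above can be sketched in a few lines of PyTorch. The snippet below is illustrative only, not necessarily the implementation found in `models/`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Illustrative RMSNorm: scale by the inverse root-mean-square of the
    features and a learned per-feature gain (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class GLUFeedForward(nn.Module):
    """Illustrative GLU-style MLP (SwiGLU variant): gate the hidden activation
    with a second linear projection, then project back to the model dimension."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```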
We recommend running plainLM in a dedicated Python environment. To install dependencies in an Anaconda environment, execute:
```bash
conda create --name plainLM python=3.12 -y && conda activate plainLM && cd plainLM
pip install .
```

We provide a script for downloading, tokenizing, chunking, and saving Hugging Face datasets: `data/datasets/prepare.py`.
You can specify any HF dataset and tokenizer. To avoid downloading the entire corpus, we support streaming, tokenizing, and chunking data on-the-fly. We provide an example for FineWebEdu-100BT in data/datasets/prepare_finewebedu_100BT.sh.
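As a rough illustration of that on-the-fly workflow (the actual interface is defined in `data/datasets/prepare.py`), streaming, tokenizing, and chunking a Hugging Face dataset could look roughly like the sketch below; the dataset name, tokenizer, and sequence length are placeholders:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder dataset/tokenizer: substitute whatever you pass to prepare.py.
stream = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-100BT",
                      split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

seq_len = 2048   # chunk length, illustrative
buffer = []
for example in stream:
    buffer.extend(tokenizer(example["text"])["input_ids"] + [tokenizer.eos_token_id])
    while len(buffer) >= seq_len:
        chunk, buffer = buffer[:seq_len], buffer[seq_len:]
        # ...write `chunk` to disk (e.g. append it to a memory-mapped array)
```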
Specify hyperparameters in `config.yaml` and launch training as follows:

```bash
python train.py --config=config/config.yaml
```

To train on multiple GPUs via DDP, launch with `torchrun`, e.g.:

```bash
torchrun --nnodes=1 --nproc_per_node=4 train.py --config=code/config/sweep.yaml
```
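Under a `torchrun` launch, each process typically initializes a process group and wraps the model in DDP. The sketch below shows that generic pattern with a stand-in model; it is not the actual logic of `train.py` or `torch_utils.py`:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Generic per-process DDP setup under torchrun (illustrative only).
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])        # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).to(local_rank)  # stand-in for the Transformer
model = torch.compile(model)                      # the model code is torch-compilable
model = DDP(model, device_ids=[local_rank])

# ...training loop...

dist.destroy_process_group()
```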
To run a hyperparameter sweep:

- Define the hyperparameter sweep: create a single YAML file with lists of hyperparameter values. Each value in a list represents a different configuration, e.g.:

  ```yaml
  lr: [0.1, 0.01]
  wd: [0.1, 0.2, 0.5]
  beta1: 0.9
  ...
  ```
- Submit the sweep: submit a job array where each job executes the same Python script and reads the same configuration file, but with a different `job_idx`. We use `job_idx` to map a job to its hyperparameters: `job_idx` should range from `0` to `n-1`, where `n` is the number of configurations in the Cartesian product of the YAML lists. This is handled automatically by `cluster/slurm.sh` and `cluster/condor.sub`; on the Python side, each job is assigned the corresponding configuration based on the value of `job_idx` (see the sketch below).
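The `job_idx`-to-configuration mapping amounts to indexing into the Cartesian product of all list-valued hyperparameters. A sketch of that logic, not the exact implementation used here, could look like:

```python
import itertools
import yaml

def config_from_job_idx(sweep_yaml_path: str, job_idx: int) -> dict:
    """Return the job_idx-th configuration from the Cartesian product of all
    list-valued entries in the sweep YAML; scalar entries are kept fixed."""
    with open(sweep_yaml_path) as f:
        sweep = yaml.safe_load(f)
    swept_keys = sorted(k for k, v in sweep.items() if isinstance(v, list))
    grid = list(itertools.product(*(sweep[k] for k in swept_keys)))  # n configurations
    config = dict(sweep)
    config.update(dict(zip(swept_keys, grid[job_idx])))  # job_idx in [0, n-1]
    return config

# With the example sweep above (2 lr values x 3 wd values), job_idx ranges over 0..5.
```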
```
plainLM/
├── cluster/              # HPC scripts (SLURM & Condor)
├── config/               # Configuration files for training and model setup
├── data/                 # Everything regarding data preparation and data streaming
│   ├── datasets/         # Data preprocessing files to download, tokenize, chunk and save data
│   ├── dataloaders.py    # Dataloader utilities
│   └── datasamplers.py   # Custom stateful distributed samplers
├── engine/               # Core implementation of the model engine: a torch.nn.Module implementing training steps and evaluations
├── models/               # Model architectures
├── optim/                # Optimization utilities
├── checkpoint_utils.py   # Checkpoint utilities
├── torch_utils.py        # PyTorch utilities (DDP, seed, TF32, ...)
├── train.py              # Main training script ⭐️
└── utils.py              # Miscellaneous helper functions
```
Planned additions:

- FSDP2 support, ZeRO-2 and tensor parallel compatibility
- dummy data
- unit tests
- add seed to `DistributedSampler`
If you find plainLM useful, consider citing it:

```bibtex
@misc{ajroldi2024plainlm,
  author = {Niccolò Ajroldi},
  title = {plainLM: Language Model Pretraining in PyTorch},
  year = {2024},
  howpublished = {\url{https://github.com/Niccolo-Ajroldi/plainLM}}
}
```

This project was inspired by other open-source codebases; huge thanks to these projects for their contributions to open-source language model pretraining!
Some recent projects using plainLM:
- Orvieto, A., & Gower, R. (2025). In search of Adam’s secret sauce ArXiv.
- Ajroldi, N., Orvieto, A., & Geiping, J. (2025). When, where and why to average weights? In Proceedings of ICML 2025.
- Srećković, T., Geiping, J., & Orvieto, A. (2025). Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling. ArXiv.
- Belloni, A., Noci, L., & Orvieto, A. (2025). Universal Dynamics of Warmup Stable Decay: Understanding WSD Beyond Transformers. [MOSS Workshop, ICML 2025].(https://icml.cc/virtual/2025/47679)