Pretrain a Transformer on Language Modeling

A minimal yet efficient implementation of causal language modeling in PyTorch.

It features a custom, torch-compilable Transformer implementation with RoPE, GLU, and RMSNorm, and it supports distributed training via Distributed Data Parallel (DDP).
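
For orientation, RMSNorm boils down to a few lines of PyTorch. The sketch below is a generic, minimal version (the class name and eps default are assumptions); the actual implementation lives in models/:

  # Minimal RMSNorm sketch (generic PyTorch, not necessarily plainLM's exact code).
  import torch
  import torch.nn as nn

  class RMSNorm(nn.Module):
      """Root-mean-square norm: rescales by the RMS of the features, no mean-centering."""
      def __init__(self, dim: int, eps: float = 1e-6):
          super().__init__()
          self.eps = eps
          self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          # Normalize by the RMS over the feature dimension, then apply the gain.
          rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
          return x * rms * self.weight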

A dedicated script is included for downloading, tokenizing, and chunking data, making data preparation seamless.

🛠 Installation

We recommend running plainLM in a dedicated Python environment. To install dependencies in an Anaconda environment, execute:

git clone https://github.com/Niccolo-Ajroldi/plainLM.git
conda create --name plainLM python=3.12 -y && conda activate plainLM && cd plainLM
pip install .

📚 Data

We provide a script for downloading, tokenizing, chunking, and saving Hugging Face datasets: data/datasets/prepare.py. You can specify any HF dataset and tokenizer. To avoid downloading the entire corpus, we support streaming, tokenizing, and chunking data on-the-fly. We provide an example for the FineWeb-Edu 100BT sample in data/datasets/prepare_finewebedu_100BT.sh.
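
For orientation, the sketch below shows the generic stream-tokenize-chunk pattern such a pipeline follows. The dataset name, subset, tokenizer, and sequence length are placeholder assumptions; data/datasets/prepare.py is the authoritative script:

  # Hypothetical stream -> tokenize -> chunk sketch (not data/datasets/prepare.py itself).
  from datasets import load_dataset
  from transformers import AutoTokenizer

  SEQ_LEN = 2048  # assumed chunk length

  def stream_chunks(dataset="HuggingFaceFW/fineweb-edu", subset="sample-100BT", tokenizer="gpt2"):
      tok = AutoTokenizer.from_pretrained(tokenizer)
      ds = load_dataset(dataset, name=subset, split="train", streaming=True)  # no full download

      buffer = []
      for example in ds:
          # Tokenize each document and append an end-of-text marker.
          buffer.extend(tok(example["text"])["input_ids"] + [tok.eos_token_id])
          # Emit fixed-length chunks as soon as the buffer is large enough.
          while len(buffer) >= SEQ_LEN:
              yield buffer[:SEQ_LEN]
              buffer = buffer[SEQ_LEN:]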

⚡️ Usage

Specify hyperparameters in config.yaml and launch training as follows:

Single GPU/CPU:

  python train.py --config=config/config.yaml

Multiple GPUs:

  torchrun --nnodes=1 --nproc_per_node=4 train.py --config=config/config.yaml
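
Under torchrun, each process typically reads its rank from the environment and wraps the model in DDP. The sketch below shows this generic pattern, not necessarily the exact logic in train.py:

  # Generic DDP setup under torchrun (a sketch; see train.py and torch_utils.py for the real logic).
  import os
  import torch
  import torch.distributed as dist
  from torch.nn.parallel import DistributedDataParallel as DDP

  def setup_ddp(model: torch.nn.Module) -> torch.nn.Module:
      # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every spawned process.
      if int(os.environ.get("WORLD_SIZE", "1")) > 1:
          dist.init_process_group(backend="nccl")
          local_rank = int(os.environ["LOCAL_RANK"])
          torch.cuda.set_device(local_rank)
          model = DDP(model.to(local_rank), device_ids=[local_rank])
      return model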

Run a sweep in parallel on a SLURM or Condor HPC cluster:

  1. Define the hyperparameter sweep: create a single YAML file where each swept hyperparameter is given as a list of values. The Cartesian product of these lists defines the set of configurations, e.g.:
    lr: [0.1, 0.01]
    wd: [0.1, 0.2, 0.5]
    beta1: 0.9
    ...
  2. Submit the sweep: submit a job array in which every job runs the same Python script and reads the same configuration file, but with a different job_idx. We use job_idx to map a job to its hyperparameters; it should range from 0 to n-1, where n is the number of configurations in the Cartesian product defined by the YAML. cluster/slurm.sh and cluster/condor.sub handle this automatically, and the Python side assigns each job its configuration based on job_idx, as in the sketch below.
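
The sketch below illustrates the job_idx-to-configuration mapping on the example sweep above; reading the index from a JOB_IDX environment variable is a hypothetical stand-in for however the cluster script passes it:

  # Illustrative mapping from a job-array index to one configuration out of the
  # Cartesian product of swept values (not plainLM's exact code).
  import itertools
  import os

  sweep = {
      "lr": [0.1, 0.01],
      "wd": [0.1, 0.2, 0.5],
      "beta1": 0.9,  # scalars stay fixed across all configurations
  }

  # Wrap scalars into one-element lists so every key takes part in the product.
  grid = {k: v if isinstance(v, list) else [v] for k, v in sweep.items()}
  configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]

  job_idx = int(os.environ.get("JOB_IDX", "0"))  # hypothetical; set by the job array
  print(len(configs))      # 2 * 3 * 1 = 6 configurations, indexed 0..5
  print(configs[job_idx])  # the hyperparameters assigned to this job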

📂 Structure

plainLM/
├── cluster/             # HPC scripts (SLURM & Condor)
├── config/              # Configuration files for training and model setup
├── data/                # Everything regarding data preparation and data stream
│   ├── datasets/        # Data preprocessing files to download, tokenize, chunk and save data
│   ├── dataloaders.py   # Dataloader utilities
│   └── datasamplers.py  # Custom stateful distributed samplers
├── engine/              # Core implementation of the model engine: a torch.nn.Module implementing training steps and evaluations
├── models/              # Model architectures
├── optim/               # Optimization utilities
├── checkpoint_utils.py  # Checkpoint utilities
├── torch_utils.py       # PyTorch utilities (DDP, seed, TF32...)
├── train.py             # Main training script ⭐️
└── utils.py             # Miscellaneous helper functions

☑️ TODO

  • FSDP2 support, ZeRO-2 and tensor parallel compatibility
  • dummy data
  • unit tests
  • add seed to DistributedSampler

Citation

@misc{ajroldi2024plainlm,
  author = {Niccolò Ajroldi},
  title = {plainLM: Language Model Pretraining in PyTorch},
  year = {2024},
  howpublished = {\url{https://github.com/Niccolo-Ajroldi/plainLM}}
}

Credits

This project was inspired by:

Huge thanks to these projects for their contributions to open-source language model pretraining!

Published works using plainLM

Some recent projects using plainLM:

  • Orvieto, A., & Gower, R. (2025). In search of Adam’s secret sauce. arXiv.
  • Ajroldi, N., Orvieto, A., & Geiping, J. (2025). When, where and why to average weights? In Proceedings of ICML 2025.
  • Srećković, T., Geiping, J., & Orvieto, A. (2025). Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling. arXiv.
  • Belloni, A., Noci, L., & Orvieto, A. (2025). Universal Dynamics of Warmup Stable Decay: Understanding WSD Beyond Transformers. MOSS Workshop, ICML 2025. https://icml.cc/virtual/2025/47679
