Toy implementation of GPT-2. Because language models are cool.
Due to compute constraints I cannot train the full-size GPT-2 model. The largest one I could train is a 352M-parameter variant (can be run with the train.sh
script); it converged to a loss of 3.1, which is alright.
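Reproducing that run should just be a matter of launching the script (this assumes the defaults baked into train.sh match the 352M setup):

```sh
# kick off the 352M training run with the script's default settings
./train.sh
```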
The total footprint of the code is quite small, so it is fairly easy to modify. train.py
exposes a CLI for setting all of the model's hyperparameters, which makes it easy to train and iterate on the model.
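As a sketch, a smaller custom run might look something like the command below. The flag names here are illustrative guesses, not the actual ones; check train.py (or `python train.py --help`) for the real argument names.

```sh
# hypothetical example: train a smaller model by overriding hyperparameters on the CLI
# (flag names are assumptions; see train.py for the actual options)
python train.py --n_layer 12 --n_head 12 --n_embd 768 --batch_size 8 --lr 3e-4
```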