# GPT from Scratch

A complete implementation of a GPT (Generative Pre-trained Transformer) model from scratch using PyTorch. This project demonstrates how to build, train, and run inference with a transformer-based language model similar to OpenAI's GPT architecture.
## Table of Contents

- Overview
- Features
- Project Structure
- Installation
- Architecture
- Data Processing
- Model Components
- Training
- Inference
- Hyperparameters
- Usage Examples
## Overview

This project implements a GPT-style transformer model from the ground up, including:
- Multi-head self-attention mechanism
- Positional encoding
- Feed-forward networks
- Layer normalization
- Residual connections
- Dropout regularization
The model is trained on RFC (Request for Comments) documents and can generate text in a similar style.
## Features

- Full GPT Architecture: Implements all core components of the GPT model
- Flexible Tokenization: Supports both character-level and subword tokenization (tiktoken)
- Training Infrastructure: Includes checkpointing, mixed precision training, and learning rate scheduling
- Efficient Training: Optimized for GPU training with gradient accumulation
- Text Generation: Includes inference code for generating text from trained models
## Project Structure

```
GPT-from-scratch/
├── GPT_from_scratch.ipynb   # Main notebook with all code
└── README.md                # This file
```
## Installation

```bash
pip install torch numpy tiktoken
```

Dependencies:

- PyTorch: Deep learning framework
- NumPy: Numerical computations
- tiktoken: Fast BPE tokenizer (optional; character-level tokenization needs no extra dependency)
## Architecture

The model follows the standard GPT architecture:

```
Input Tokens
      ↓
Token Embeddings + Positional Embeddings
      ↓
[Transformer Block] × N layers
  ├── Multi-Head Self-Attention
  ├── Feed-Forward Network
  └── Layer Normalization + Residual Connections
      ↓
Final Layer Normalization
      ↓
Language Model Head (Linear Layer)
      ↓
Output Logits (Vocabulary Size)
```
## Data Processing

The model loads text data from a file (default: RFC documents). The data is then tokenized and split into training and validation sets.

```python
# Load text data
with open(input_file, 'r', encoding='utf-8') as f:
    text = f.read()
```

The project supports two tokenization methods:
1. Character-level tokenization: maps each character to a unique integer.

   ```python
   the_chars = sorted(list(set(text)))
   stoi = {ch: i for i, ch in enumerate(the_chars)}
   encode = lambda s: [stoi[c] for c in s]
   ```

2. Subword tokenization (tiktoken): uses GPT-2's BPE tokenizer.

   ```python
   tokenizer = tiktoken.get_encoding("gpt2")
   encode = lambda s: tokenizer.encode(s)
   ```
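The snippets above define only `encode`, but the inference example later in this README also calls a matching `decode`. A minimal character-level round trip (sketch; the toy string stands in for the real corpus):

```python
# Character-level vocabulary built from the text (here a toy string)
text = "hello world"
the_chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(the_chars)}
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]             # string -> token IDs
decode = lambda ids: ''.join(itos[i] for i in ids)  # token IDs -> string

assert decode(encode(text)) == text  # lossless round trip
```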
The dataset is split into 90% training and 10% validation:

```python
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
```

The `get_batch()` function creates random batches of sequences for training:

- Randomly selects starting positions in the dataset
- Creates input sequences of length `block_size`
- Creates target sequences (shifted by one position)
- Returns batches of shape `(batch_size, block_size)`
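The steps above can be sketched as follows (parameter names follow the hyperparameter tables below; the notebook's version may differ in details):

```python
import torch

def get_batch(data: torch.Tensor, batch_size: int, block_size: int):
    """Sample random (input, target) batches; targets are inputs shifted by one."""
    ix = torch.randint(len(data) - block_size, (batch_size,))  # random start positions
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y
```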
## Model Components

### Single Attention Head

Implements a single attention head with scaled dot-product attention.

Key components:

- Query, Key, Value projections: linear transformations of input embeddings
- Causal masking: prevents attention to future tokens (lower triangular matrix)
- Scaled dot-product attention: `Attention(Q, K, V) = softmax(QKᵀ / √d_k) V`
- Dropout: regularization during training

Forward pass:

1. Compute Q, K, V from input embeddings
2. Calculate attention weights: `wei = Q @ Kᵀ / √head_size`
3. Apply causal mask (set future positions to −∞)
4. Apply softmax to get attention probabilities
5. Weighted aggregation: `out = wei @ V`
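The forward pass above can be sketched as a module (class and attribute names are illustrative; the notebook's implementation may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention (sketch)."""
    def __init__(self, n_embd, head_size, block_size, dropout=0.2):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower triangular matrix used as the causal mask
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5        # (B, T, T), scaled
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # hide the future
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v                                              # (B, T, head_size)
```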
### Multi-Head Attention

Combines multiple attention heads in parallel:

- Creates `num_heads` attention heads
- Concatenates outputs from all heads
- Projects concatenated output back to embedding dimension
- Applies dropout for regularization

Purpose: allows the model to attend to different types of information simultaneously.
### Feed-Forward Network

Two-layer MLP with ReLU activation:

```
Input (n_embd) → Linear(4 × n_embd) → ReLU → Linear(n_embd) → Output
```

Purpose: provides non-linearity and allows the model to process information from attention layers.
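A runnable sketch of this MLP (the class name and dropout placement are illustrative):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise two-layer MLP with a 4x hidden expansion (sketch)."""
    def __init__(self, n_embd, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand
            nn.ReLU(),                      # non-linearity
            nn.Linear(4 * n_embd, n_embd),  # project back
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```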
### Transformer Block

A complete transformer block combining attention and feed-forward layers.

Structure:

```
x → LayerNorm → MultiHeadAttention → + (residual) → LayerNorm → FeedForward → + (residual) → output
```

Key features:

- Residual connections: help with gradient flow and training stability
- Layer normalization: applied before each sub-layer (pre-norm architecture)
- Dropout: applied in the attention and feed-forward layers
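A runnable sketch of this pre-norm structure; for brevity it uses PyTorch's built-in `nn.MultiheadAttention` with an explicit causal mask, rather than the hand-written heads described above:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block (sketch; built-in attention for brevity)."""
    def __init__(self, n_embd, n_head, dropout=0.2):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout,
                                          batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout),
        )

    def forward(self, x):
        T = x.shape[1]
        # True above the diagonal = future positions are not attended to
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device),
                          diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a                       # residual around attention
        x = x + self.ffwd(self.ln2(x))  # residual around feed-forward
        return x
```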
### GPT Model

The complete GPT model architecture.

Components:

1. Token embedding table: maps token IDs to dense vectors

   ```python
   self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
   ```

2. Positional embedding table: adds positional information

   ```python
   self.pos_emb_table = nn.Embedding(block_size, n_embd)
   ```

3. Transformer blocks: stack of N transformer blocks

   ```python
   self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
   ```

4. Final layer normalization: applied before the language model head

5. Language model head: projects embeddings to vocabulary logits

   ```python
   self.lm_head = nn.Linear(n_embd, vocab_size)
   ```

Forward pass:

1. Embed tokens and add positional encodings
2. Pass through the transformer blocks
3. Apply final layer normalization
4. Project to vocabulary logits
5. Compute cross-entropy loss (if targets are provided)
Generation method:

- Autoregressively generates tokens one at a time
- Uses the last `block_size` tokens as context
- Samples the next token from the probability distribution
- Appends it to the sequence and repeats
## Training

The training process includes:
- Gradient Accumulation: Accumulates gradients over multiple batches for effective larger batch sizes
- Mixed Precision Training: Uses automatic mixed precision (AMP) for faster training on GPUs
- Learning Rate Scheduling: Cosine annealing with warmup
- Checkpointing: Saves model checkpoints periodically
- Validation: Evaluates on validation set at regular intervals
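The accumulation and mixed-precision pieces above can be sketched in a single training step. The tiny linear model and MSE loss here are stand-ins so the snippet runs anywhere; the real loop uses the GPT model and cross-entropy loss:

```python
import torch
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'
grad_accum = 2  # matches the default in the hyperparameter tables below

# Stand-in model, optimizer, and data so the sketch is runnable end to end
model = torch.nn.Linear(8, 8).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == 'cuda'))

def get_batch():
    x = torch.randn(4, 8, device=device)
    return x, x  # dummy (input, target) pair

optimizer.zero_grad(set_to_none=True)
for micro_step in range(grad_accum):
    xb, yb = get_batch()
    # Autocast runs the forward pass in reduced precision on GPU
    with torch.autocast(device_type=device, enabled=(device == 'cuda')):
        loss = F.mse_loss(model(xb), yb)
    # Divide so the accumulated gradient averages over micro-batches
    scaler.scale(loss / grad_accum).backward()
scaler.step(optimizer)  # unscales gradients, then calls optimizer.step()
scaler.update()
```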
### Learning Rate Schedule

The learning rate follows a warmup + cosine decay schedule:

- Warmup phase: linear increase from 0 to `learning_rate` over `warmup_steps`
- Decay phase: cosine decay from `learning_rate` to 10% of `learning_rate`
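One way to write such a schedule (the function name is illustrative; defaults mirror the hyperparameter tables below):

```python
import math

def get_lr(step, learning_rate=3e-4, warmup_steps=2000,
           max_iters=135_000, final_lr_ratio=0.1):
    """Warmup then cosine decay to final_lr_ratio * learning_rate (sketch)."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps   # linear warmup from 0
    min_lr = final_lr_ratio * learning_rate
    progress = (step - warmup_steps) / max(1, max_iters - warmup_steps)
    return min_lr + 0.5 * (learning_rate - min_lr) * (1 + math.cos(math.pi * progress))
```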
### Checkpointing

The training script saves:
- Latest checkpoint: Most recent training state
- Best checkpoint: Model with lowest validation loss
- Periodic checkpoints: Saved at regular intervals
Each checkpoint contains:
- Model state dictionary
- Optimizer state
- Gradient scaler state (for mixed precision)
- Training step number
- Best validation loss
- Model configuration
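Bundling that state into a single file might look like this (the key names here are illustrative, not necessarily the notebook's exact ones):

```python
import torch

def save_checkpoint(path, model, optimizer, scaler, step, best_val_loss, config):
    """Save everything needed to resume training (sketch)."""
    torch.save({
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scaler': scaler.state_dict(),   # mixed-precision GradScaler state
        'step': step,
        'best_val_loss': best_val_loss,
        'config': config,                # e.g. n_embd, n_head, n_layer, ...
    }, path)
```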
### Loss Estimation

The `estimate_loss()` function:

- Evaluates the model on multiple batches
- Computes the average loss on the training and validation sets
- Uses `@torch.no_grad()` for efficiency
- Sets the model to eval mode during evaluation
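A sketch of such a function; `model` is assumed to return `(logits, loss)` and `get_batch(split)` an `(inputs, targets)` pair, as elsewhere in this README:

```python
import torch

@torch.no_grad()
def estimate_loss(model, get_batch, eval_iters=200):
    """Average loss over eval_iters random batches for each split (sketch)."""
    out = {}
    model.eval()  # disable dropout during evaluation
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out
```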
## Inference

### Loading a Trained Model

1. Load the checkpoint file
2. Reconstruct the model architecture from the saved configuration
3. Load the model weights
4. Set the model to evaluation mode
### Text Generation

The generation process:

1. Encode the input prompt to token IDs
2. Convert to a tensor and add a batch dimension
3. Call `model.generate()` with:
   - the input context
   - the maximum number of tokens to generate
4. Decode the generated token IDs back to text
Generation strategy:

- Uses the last `block_size` tokens as context (crops if longer)
- Samples from the probability distribution (not greedy)
- Autoregressive: each new token depends on all previous tokens
## Hyperparameters

### Model Hyperparameters

| Parameter | Default Value | Description |
|---|---|---|
| `n_embd` | 512 | Embedding dimension |
| `n_head` | 8 | Number of attention heads |
| `n_layer` | 6 | Number of transformer blocks |
| `dropout` | 0.2 | Dropout probability |
| `vocab_size` | 50257 (tiktoken) or variable (char-level) | Vocabulary size |
### Training Hyperparameters

| Parameter | Default Value | Description |
|---|---|---|
| `block_size` | 512 | Context window size (sequence length) |
| `batch_size` | 64 | Batch size |
| `max_iters` | 135,000 | Maximum training iterations |
| `learning_rate` | 3e-4 | Initial learning rate |
| `eval_interval` | 500 | Evaluation frequency |
| `eval_iters` | 200 | Number of batches for evaluation |
| `grad_accum` | 2 | Gradient accumulation steps |
### Learning Rate Schedule Parameters

| Parameter | Default Value | Description |
|---|---|---|
| `warmup_steps` | 2,000 | Warmup iterations |
| `final_lr_ratio` | 0.1 | Final LR as fraction of initial LR |
## Usage Examples

### Training a Model

```python
# Set hyperparameters
block_size = 512
batch_size = 64
n_embd = 512
n_head = 8
n_layer = 6

# Load and prepare data
with open('data.txt', 'r') as f:
    text = f.read()

# Tokenize and create datasets
data = torch.tensor(encode(text), dtype=torch.long)
train_data = data[:int(0.9 * len(data))]
val_data = data[int(0.9 * len(data)):]

# Initialize model
model = GPTModel().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for iter in range(max_iters):
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

### Generating Text

```python
# Load trained model
ckpt = torch.load('checkpoint.pt', map_location=device)
model.load_state_dict(ckpt['model'])
model.eval()

# Generate text
context = encode("What is IP?")
context_tensor = torch.tensor(context, dtype=torch.long, device=device).unsqueeze(0)
generated = model.generate(context_tensor, max_new_tokens=400)
output_text = decode(generated[0].tolist())
print(output_text)
```

## Key Concepts

### Self-Attention

Self-attention allows each position in the sequence to attend to all previous positions. The attention mechanism computes:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
The attention score between positions i and j is: Q_i · K_j, which determines how much position i should attend to position j.
### Causal Masking

Causal masking ensures the model can only attend to previous tokens (not future ones), making it suitable for autoregressive generation. This is implemented using a lower triangular matrix.
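A tiny demonstration of the lower-triangular mask in action (attention scores here are all zero, so each row attends uniformly over its allowed positions):

```python
import torch

T = 4
scores = torch.zeros(T, T)                              # stand-in attention scores
tril = torch.tril(torch.ones(T, T))                     # lower triangular matrix
scores = scores.masked_fill(tril == 0, float('-inf'))   # hide future positions
probs = torch.softmax(scores, dim=-1)
# Row i now distributes attention only over positions 0..i:
# position 0 attends only to itself; position 3 attends to all four tokens.
```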
### Positional Encoding

Since transformers have no inherent notion of sequence order, positional encodings are added to token embeddings to provide information about token positions in the sequence.
### Residual Connections

Residual connections (skip connections) allow gradients to flow directly through the network, helping with training deep models. They enable the model to learn identity mappings when needed.
### Layer Normalization

Layer normalization stabilizes training by normalizing activations across the embedding dimension. It's applied before each sub-layer in the transformer block.
## Performance Notes

- GPU Training: The code automatically detects and uses CUDA if available
- Mixed Precision: Uses automatic mixed precision (AMP) to reduce memory usage and speed up training
- Gradient Accumulation: Allows effective larger batch sizes without increasing memory requirements
- Efficient Batching: Random batch sampling for better generalization
## Future Improvements

Potential enhancements:
- Add support for different model sizes (GPT-2 small, medium, large)
- Implement gradient checkpointing for memory efficiency
- Add support for distributed training
- Implement different sampling strategies (top-k, top-p, temperature)
- Add support for fine-tuning on specific tasks
- Implement model quantization for inference
## License

This project is for educational purposes. Feel free to use and modify as needed.
## Acknowledgements

This implementation is inspired by:
- The original GPT paper: "Improving Language Understanding by Generative Pre-Training"
- Andrej Karpathy's "Let's build GPT" series
- The transformer architecture from "Attention is All You Need"
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.