- PyTorch is a tensor library, an automatic differentiation engine, and a deep learning library.
- It is the library used here to do deep learning: building, training, and fine-tuning a model.
This Jupyter notebook demonstrates the process of building a Large Language Model (LLM) from scratch, covering essential components such as data preparation, tokenization, model architecture (embedding, attention mechanisms, feed-forward networks), training, and evaluation. It aims to provide a fundamental understanding of LLM construction without relying on high-level frameworks for core components.
- Introduction
- Credits
- Setup and Dependencies
- Dataset
- Tokenization
- Model Architecture
- Training
- Inference
This notebook provides a step-by-step guide to understanding the internal workings of a transformer-based LLM. It focuses on implementing the foundational blocks of such a model from first principles, making it an excellent resource for those who want to grasp the core concepts behind modern LLMs.
A few images and concepts in this notebook are inspired by Sebastian Raschka's blog.
To run this notebook, you will need the following libraries:
- torch: For building and training the neural network.
- torch.nn: Neural network modules.
- torch.nn.functional (as F): Common neural network functions.
- torch.utils.data: Utilities for data loading.
- tiktoken: For advanced tokenization (specifically, OpenAI's cl100k_base tokenizer is used).
- tqdm: For progress bars during training.
You can install these dependencies using pip:
```bash
pip install torch tiktoken tqdm
```
The notebook utilizes a simple dataset for demonstration purposes. The dataset consists of pairs of input and target sequences.
- Input Sample: [1, 5, 2, 6, 3, 7, 4, 8]
- Target Sample: [5, 2, 6, 3, 7, 4, 8, 9]
This setup is characteristic of sequence-to-sequence or next-token prediction tasks common in LLMs.
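As a hedged illustration of this shift-by-one setup, a minimal torch.utils.data.Dataset might look like the sketch below. The class name NextTokenDataset, the block_size argument, and the variable names are illustrative assumptions, not the notebook's exact code.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NextTokenDataset(Dataset):
    """Illustrative dataset: each target is the input shifted one token to the right."""

    def __init__(self, token_ids, block_size):
        self.token_ids = token_ids
        self.block_size = block_size

    def __len__(self):
        return len(self.token_ids) - self.block_size

    def __getitem__(self, idx):
        x = torch.tensor(self.token_ids[idx : idx + self.block_size])
        y = torch.tensor(self.token_ids[idx + 1 : idx + 1 + self.block_size])
        return x, y

# The sample pair above falls out of a single shift-by-one window:
data = [1, 5, 2, 6, 3, 7, 4, 8, 9]
ds = NextTokenDataset(data, block_size=8)
x, y = ds[0]   # x = [1, 5, 2, 6, 3, 7, 4, 8], y = [5, 2, 6, 3, 7, 4, 8, 9]
loader = DataLoader(ds, batch_size=2, shuffle=True)
```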
The notebook implements a custom tokenization process and also leverages tiktoken for more robust tokenization.
- Custom Tokenizer: A basic tokenizer is created to map characters to integers and vice versa. It builds a vocabulary from the input text and provides encode and decode functions.
- tiktoken Integration: The cl100k_base tokenizer from tiktoken (used by OpenAI's models like gpt-4, gpt-3.5-turbo, text-embedding-ada-002) is also demonstrated for more practical tokenization (see the sketch below).
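A rough sketch of both approaches follows. The CharTokenizer class is an illustrative stand-in for the notebook's custom tokenizer; the tiktoken part uses its public get_encoding/encode/decode API.

```python
import tiktoken

# Illustrative character-level tokenizer: characters <-> integers.
class CharTokenizer:
    def __init__(self, text: str):
        vocab = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(vocab)}   # char -> id
        self.itos = {i: ch for ch, i in self.stoi.items()}  # id -> char

    def encode(self, s: str) -> list[int]:
        return [self.stoi[ch] for ch in s]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)

# tiktoken's cl100k_base tokenizer for more practical, subword-level tokenization.
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Building an LLM from scratch")
assert enc.decode(ids) == "Building an LLM from scratch"
```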
The LLM is built component by component, detailing the implementation of each key layer:
- Token Embeddings: Maps input tokens (integers) to dense vector representations.
- Positional Embeddings: Adds positional information to the token embeddings, crucial for transformers to understand the order of tokens in a sequence. The notebook implements fixed positional embeddings (sine and cosine functions); a sketch follows below.
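A minimal sketch of fixed sine/cosine positional embeddings, following the standard Transformer formulation. The function name and the assumption that d_model is even are illustrative choices, not necessarily the notebook's exact code.

```python
import math
import torch

def sinusoidal_positional_embeddings(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) table of fixed positional embeddings (d_model assumed even)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)              # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )                                                                               # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
    return pe

# The table is added to the token embeddings:
# x = token_embeddings + sinusoidal_positional_embeddings(seq_len, d_model)
```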
Attention is the core of the Transformer architecture; the notebook details the SelfAttention mechanism:
- Query, Key, Value (QKV): Explains how input sequences are transformed into Query, Key, and Value matrices.
- Scaled Dot-Product Attention: Attention scores are calculated using the formula softmax(Q * K^T / sqrt(d_k)) * V.
- Masked Self-Attention: For decoder-only models, a look-ahead mask is applied to prevent tokens from attending to future tokens during training (see the sketch below).
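A compact sketch of masked scaled dot-product attention for a single head. The function and variable names are illustrative; the notebook's SelfAttention class may organize this differently.

```python
import math
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_k). Returns (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)         # (batch, seq_len, seq_len)

    # Look-ahead mask: position i may only attend to positions <= i.
    seq_len = q.size(1)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))

    weights = F.softmax(scores, dim=-1)                        # attention weights
    return weights @ v                                         # weighted sum of the values
```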
- Multi-Head Attention: Extends the single attention mechanism by performing multiple attention calculations in parallel and combining the outputs of the individual heads to capture diverse relationships within the sequence.
- Feed Forward Network: A simple two-layer neural network with a ReLU activation, applied independently to each position in the sequence.
- Transformer Block: Combines Multi-Head Attention and a Feed Forward Network, along with Layer Normalization and residual connections (see the sketch after this list).
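To make the composition concrete, here is a hedged sketch of one Transformer Block built from PyTorch's nn.MultiheadAttention, a two-layer ReLU feed-forward network, LayerNorm, and residual connections. The class names and the post-norm ordering are illustrative assumptions, not necessarily the notebook's exact design.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Two-layer position-wise network with a ReLU activation."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class TransformerBlock(nn.Module):
    """Multi-head causal self-attention + feed-forward, each with a residual and LayerNorm."""
    def __init__(self, d_model: int, num_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        seq_len = x.size(1)
        # Boolean look-ahead mask: True marks positions that must not be attended to.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        x = self.norm1(x + attn_out)     # residual connection + layer norm
        x = self.norm2(x + self.ff(x))   # residual connection + layer norm
        return x
```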
The complete LLM architecture stacks multiple Transformer Blocks and is suitable for generative tasks like text completion (a model sketch follows the list below).
- Input: Token IDs.
- Output: Logits for the next token in the sequence.
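A hedged sketch of how the pieces could be assembled into the full decoder-only model, reusing the TransformerBlock and sinusoidal_positional_embeddings sketches above. The class name MiniLLM and its hyperparameters are illustrative.

```python
import torch.nn as nn

class MiniLLM(nn.Module):
    """Decoder-only model: token IDs in, next-token logits out."""
    def __init__(self, vocab_size, d_model, num_heads, d_ff, num_layers, max_seq_len):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Fixed sine/cosine table, registered as a buffer (not a trained parameter).
        self.register_buffer("pos_emb", sinusoidal_positional_embeddings(max_seq_len, d_model))
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)]
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):                       # (batch, seq_len)
        seq_len = token_ids.size(1)
        x = self.token_emb(token_ids) + self.pos_emb[:seq_len]
        for block in self.blocks:
            x = block(x)
        return self.lm_head(x)                          # (batch, seq_len, vocab_size) logits
```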
The training process involves:
- Loss Function: Cross-entropy loss is used, suitable for multi-class classification (predicting the next token from the vocabulary).
- Optimizer: AdamW optimizer is employed for efficient training.
- Batching: Data is processed in batches to improve training stability and speed.
- Training Loop: Iterates over the dataset, performs forward and backward passes, and updates model weights.
- Evaluation: The model is evaluated on a test set to monitor performance (a minimal training-loop sketch follows this list).
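A minimal training-loop sketch under these assumptions: it presumes a model plus train_loader/test_loader DataLoaders yielding (input, target) batches, and the learning rate and epoch count are illustrative values.

```python
import torch
import torch.nn.functional as F
from tqdm import tqdm

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)   # AdamW optimizer
num_epochs = 10                                              # illustrative value

for epoch in range(num_epochs):
    model.train()
    for x, y in tqdm(train_loader, desc=f"epoch {epoch}"):   # progress bar over batches
        logits = model(x)                                     # (batch, seq_len, vocab_size)
        # Cross-entropy over the vocabulary at every position in the sequence.
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Evaluate on held-out data to monitor performance.
    model.eval()
    with torch.no_grad():
        total = 0.0
        for x, y in test_loader:
            logits = model(x)
            total += F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1)).item()
    print(f"epoch {epoch}: eval loss {total / len(test_loader):.4f}")
```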
After training, the notebook demonstrates how to use the trained model for inference to generate new sequences.
- Greedy Decoding: The model predicts the next token with the highest probability at each step (sketched below).
- Generating Text: Shows how to seed the model with an initial sequence and have it generate a continuation.
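A hedged sketch of greedy decoding: the function name, the block_size argument, and the seed example are illustrative, not the notebook's exact interface.

```python
import torch

@torch.no_grad()
def generate_greedy(model, token_ids, max_new_tokens, block_size):
    """token_ids: (1, seq_len) seed sequence; appends the argmax token step by step."""
    model.eval()
    for _ in range(max_new_tokens):
        context = token_ids[:, -block_size:]                       # crop to the context window
        logits = model(context)                                    # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)    # highest-probability token
        token_ids = torch.cat([token_ids, next_id], dim=1)
    return token_ids

# Usage (illustrative):
# seed = torch.tensor([[1, 5, 2]])
# out = generate_greedy(model, seed, max_new_tokens=5, block_size=8)
```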
This README provides a high-level overview of the llm-model-from-scratch.ipynb notebook. For detailed implementation and further understanding, please refer to the notebook itself.