From c7523380951b906c1de32839d8ef03fa26f434f8 Mon Sep 17 00:00:00 2001 From: Tim Date: Fri, 20 Dec 2024 05:58:00 -0500 Subject: [PATCH 1/4] Create adam.md #Adam Optimizer The Adam optimizer is an adaptive learning rate optimization algorithm commonly used in deep learning models. --- content/pytorch/concepts/optimizers/adam/adam.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 content/pytorch/concepts/optimizers/adam/adam.md diff --git a/content/pytorch/concepts/optimizers/adam/adam.md b/content/pytorch/concepts/optimizers/adam/adam.md new file mode 100644 index 00000000000..8b137891791 --- /dev/null +++ b/content/pytorch/concepts/optimizers/adam/adam.md @@ -0,0 +1 @@ + From 65bb4c7b8b06c5b12d3e9033e090490d4ea13060 Mon Sep 17 00:00:00 2001 From: timothydrew92 Date: Fri, 20 Dec 2024 08:19:59 -0500 Subject: [PATCH 2/4] added adam optimizer codecademy doc --- .../pytorch/concepts/optimizers/adam/adam.md | 202 ++++++++++++++++++ 1 file changed, 202 insertions(+) diff --git a/content/pytorch/concepts/optimizers/adam/adam.md b/content/pytorch/concepts/optimizers/adam/adam.md index 8b137891791..fc922734a39 100644 --- a/content/pytorch/concepts/optimizers/adam/adam.md +++ b/content/pytorch/concepts/optimizers/adam/adam.md @@ -1 +1,203 @@ +--- +Title: How to Use the Adam Optimizer in PyTorch +Description: Learn how the Adam optimizer works and how to implement it in PyTorch to train neural networks. +Subjects: [Machine Learning, Deep Learning, Neural Networks, Artificial Intelligence] +Tags: [PyTorch, Adam, Backpropagation, Gradient Descent, Algorithm, Optimization, Weight Updates] +CatalogContent: [Machine Learning, Deep Learning, PyTorch, Optimizers] +--- +> This document is part of the broader [Optimizers](../optimizers.md) category. Learn how optimizers help train machine learning models by adjusting their weights and biases. +# Adam Optimizer + +The **Adam optimizer** (short for **Adaptive Moment Estimation**) is one of the most widely used optimizers in machine learning and deep learning. It adjusts the learning rate for each parameter dynamically, making it more efficient than standard gradient descent. The Adam optimizer is used in frameworks like **PyTorch** and **TensorFlow** to train neural networks effectively. + +### **What is Gradient Descent?** +**Gradient Descent** is an optimization algorithm used to minimize a model's **loss function** by updating its weights in the direction of the negative gradient. The goal is to find the lowest point (minimum) of the loss function, allowing the model to make better predictions. + +#### **How It Relates to Adam** +**Adam** builds on gradient descent by using **momentum** and **adaptive learning rates**, which make the updates smarter, faster, and more stable. + +## 📘 **How Adam Works** +Adam is a combination of two other optimizers: +- **Momentum**: It remembers past gradients to accelerate convergence. +- **RMSProp**: It scales the learning rate using a moving average of squared gradients. + +--- + +## 🔍 **Key Steps of the Adam Optimizer** +The **Adam optimizer** is a smarter version of gradient descent. It improves how neural networks learn by using momentum, variance tracking, and bias correction to make training more stable. Here’s a step-by-step explanation of how it works: + +1️. **Calculate the Gradient**: + - Just like gradient descent, Adam calculates how much the weights need to change to reduce the model's error. + - Think of this as figuring out the "steepness" of the slope so you know which way to move. + +2️. 
**Track Past Changes (Momentum)**: + - Instead of only looking at the current gradient, Adam "remembers" previous gradients to make smarter updates. + - This is like having a mental **"memory of past moves"** so you can avoid overcorrecting. + +3️. **Track Size of Changes (Variance)**: + - Adam also tracks how much the gradient is changing (the "size" of the change). + - If the changes are too big, Adam slows things down to avoid large, erratic movements. + +4️. **Correct for Initial Errors (Bias Correction)**: + - At the start of training, the momentum and variance estimates can be off. Adam corrects for this so the updates aren’t too large. + - This is like giving the model a little **"warm-up" period** so it doesn't make big mistakes early on. + +5️. **Update the Weights**: + - Finally, Adam adjusts the weights of the model using the learning rate, momentum, and variance. + - This allows for **more stable, controlled weight updates** that avoid wild jumps in learning. + +--- + +## 📘 **Syntax and Usage in PyTorch** +In PyTorch, the **Adam optimizer** is part of the `torch.optim` module. Here's the syntax for initializing it: + +```python +import torch +import torch.optim as optim + +# Define a simple model +model = torch.nn.Linear(10, 2) # 10 input features, 2 output features + +# Create an Adam optimizer +optimizer = optim.Adam(model.parameters(), lr=0.001) +``` +--- + +### 🔍 **Explanation of the Syntax** +Here’s a breakdown of each key part of the syntax for `optim.Adam()`: + +- **`optim.Adam()`**: This is part of **`torch.optim`**, which is PyTorch’s official optimizer library. It provides access to optimizers like **Adam**, **SGD**, and **RMSProp**. +- **`model.parameters()`**: This returns the weights (parameters) of the model that need to be updated during training. The optimizer uses these to compute weight updates. +- **`lr=0.001`**: This sets the learning rate (step size) for the optimizer. You can customize it based on your model's needs. Other hyperparameters like **`betas`** (for momentum) and **`eps`** (for numerical stability) can also be set here. + +--- +## 📘 **Common Use Cases for the Adam Optimizer** +The **Adam optimizer** is widely used in machine learning and deep learning because of its ability to converge quickly, handle noisy gradients, and work efficiently with large datasets. Below are some of the most common use cases where **Adam** is preferred over other optimizers like **SGD** or **RMSProp**. + +--- + +### 🔍 **1️. Training Deep Neural Networks (DNNs)** +- **Why Use Adam?** Deep Neural Networks (DNNs) often have millions of parameters that need to be updated. Adam’s ability to adapt the learning rate and track momentum makes it ideal for fast, stable training. +- **Example Use Case**: Classifying handwritten digits with a neural network like **MNIST**. + +**Example Code:** +```python +# Import necessary libraries +import torch +import torch.nn as nn +import torch.optim as optim + +#Deep Neural Network (DNN) +class NeuralNet(nn.Module): + def __init__(self): + super(NeuralNet, self).__init__() + self.layer1 = nn.Linear(784, 128) # 784 inputs, 128 outputs + self.layer2 = nn.Linear(128, 64) + self.layer3 = nn.Linear(64, 10) # 10 classes for digits 0-9 + + def forward(self, x): + x = torch.relu(self.layer1(x)) + x = torch.relu(self.layer2(x)) + x = self.layer3(x) + return x + +model_dnn = NeuralNet() +optimizer_dnn = optim.Adam(model_dnn.parameters(), lr=0.001) +``` + +### 🔍 **2️. 
Convolutional Neural Networks (CNNs)** +- **Why Use Adam?** CNNs are used for image recognition tasks like object detection and classification. Adam’s momentum and variance tracking ensure stable updates when training large image datasets. +- **Example Use Case**: Training an image classification model on the **CIFAR-10** dataset. + +**Example Code:** +```python +import torch +import torchvision.models as models +import torch.optim as optim +from torchvision.models import ResNet18_Weights + +# Convolutional Neural Network (CNN) +model_cnn = models.resnet18(weights=ResNet18_Weights.IMAGENET1K_V1) # Load a pre-trained ResNet18 model +optimizer_cnn = optim.Adam(model_cnn.parameters(), lr=0.0001) # Smaller learning rate for fine-tuning +``` +### 🔍 **3️. Recurrent Neural Networks (RNNs) and LSTMs** +- **Why Use Adam?** RNNs and LSTMs process sequences of data, like time series or natural language. Adam helps prevent the **exploding/vanishing gradient problem** and stabilizes training. +- **Example Use Case**: Predicting **stock prices** or **weather forecasting** with an RNN or LSTM. + +**Example Code:** +```python +import torch +import torch.nn as nn +import torch.optim as optim + +#Recurrent Neural Network (RNN) +class RNN(nn.Module): + def __init__(self, input_size, hidden_size, output_size): + super(RNN, self).__init__() + self.rnn = nn.RNN(input_size, hidden_size, batch_first=True) + self.fc = nn.Linear(hidden_size, output_size) + + def forward(self, x): + out, _ = self.rnn(x) + out = self.fc(out[:, -1, :]) + return out + +model_rnn = RNN(input_size=10, hidden_size=32, output_size=1) +optimizer_rnn = optim.Adam(model_rnn.parameters(), lr=0.001) +``` +### 🔍 **4️. Natural Language Processing (NLP) and Transformers** +- **Why Use Adam?** NLP models like **Transformers** and **BERT** require huge computational resources. Adam’s adaptive learning rate and momentum make it ideal for large language models. +- **Example Use Case**: Fine-tuning a pre-trained **BERT** model for sentiment analysis. + +**Example Code:** +```python +from transformers import BertModel, AdamW + +#Natural Language Processing (NLP) with BERT +model_nlp = BertModel.from_pretrained('bert-base-uncased') +optimizer_nlp = AdamW(model_nlp.parameters(), lr=2e-5) # AdamW is a variant of Adam used in NLP +``` +___ + +### 📘 **Advanced Insight** +If you'd like a more technical explanation, here’s the **mathematical process** behind Adam. These steps are a more precise version of the Key Steps listed above. If you’re a beginner, feel free to skip this section for now. + +#### **Advanced Explanation of the Key Steps** + +1️. **Momentum (mean) update**: +\[ +m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t +\] +This tracks the "average" of the previous gradients, so instead of using only the current gradient, it combines it with past ones. + +2️. **Variance (size) update**: +\[ +v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2 +\] +This tracks the "size" of past gradient changes to prevent wild jumps. If the gradients are too large, the optimizer takes smaller steps. + +3️. **Correct for early errors (Bias Correction)**: +\[ +\hat{m_t} = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v_t} = \frac{v_t}{1 - \beta_2^t} +\] +At the beginning of training, the initial estimates for **momentum** and **variance** are not accurate. Bias correction "fixes" these early inaccuracies so training is more stable. + +4️. 
**Update the weights**: +\[ +\theta_{t+1} = \theta_t - \frac{\eta \cdot \hat{m_t}}{\sqrt{\hat{v_t}} + \epsilon} +\] +This is the final step where the weights of the model are updated. Instead of using just the gradient (like in regular gradient descent), Adam combines momentum, variance, and a small constant \(\epsilon\) to avoid division by zero. + +#### **What the Variables Mean** +- \( m_t \) = **momentum** (like the "memory" of past changes) +- \( v_t \) = **variance** (how big the changes are) +- \( \beta_1, \beta_2 \) = decay rates (how quickly momentum and variance "forget" past values) +- \( \eta \) = **learning rate** (step size for updates) +- \( \epsilon \) = small constant to avoid division by zero + +This process allows the model to make more **precise, stable updates** than regular gradient descent. + +## 🔗 **Related Resources** +- [Optimizers](../optimizers.md) — Learn about other popular optimizers like **SGD** and **RMSProp**. +- [PyTorch](https://pytorch.org/docs/) — Official PyTorch documentation. From 7a8c5da1d2f7322e848d41693cba7e2db338376a Mon Sep 17 00:00:00 2001 From: Tim Date: Mon, 23 Dec 2024 13:48:11 +0000 Subject: [PATCH 3/4] fixed header structure and spelling/grammar --- .../pytorch/concepts/optimizers/adam/adam.md | 180 +++++++++++------- 1 file changed, 107 insertions(+), 73 deletions(-) diff --git a/content/pytorch/concepts/optimizers/adam/adam.md b/content/pytorch/concepts/optimizers/adam/adam.md index fc922734a39..9bbf33a3086 100644 --- a/content/pytorch/concepts/optimizers/adam/adam.md +++ b/content/pytorch/concepts/optimizers/adam/adam.md @@ -1,55 +1,71 @@ --- Title: How to Use the Adam Optimizer in PyTorch Description: Learn how the Adam optimizer works and how to implement it in PyTorch to train neural networks. -Subjects: [Machine Learning, Deep Learning, Neural Networks, Artificial Intelligence] -Tags: [PyTorch, Adam, Backpropagation, Gradient Descent, Algorithm, Optimization, Weight Updates] +Subjects: + [Machine Learning, Deep Learning, Neural Networks, Artificial Intelligence] +Tags: + [ + PyTorch, + Adam, + Backpropagation, + Gradient Descent, + Algorithm, + Optimization, + Weight Updates, + ] CatalogContent: [Machine Learning, Deep Learning, PyTorch, Optimizers] --- -> This document is part of the broader [Optimizers](../optimizers.md) category. Learn how optimizers help train machine learning models by adjusting their weights and biases. -# Adam Optimizer +## Adam Optimizer -The **Adam optimizer** (short for **Adaptive Moment Estimation**) is one of the most widely used optimizers in machine learning and deep learning. It adjusts the learning rate for each parameter dynamically, making it more efficient than standard gradient descent. The Adam optimizer is used in frameworks like **PyTorch** and **TensorFlow** to train neural networks effectively. +The **Adam optimizer** (short for Adaptive Moment Estimation) is one of the most commonly used optimizers in machine learning and deep learning. It adjusts the learning rate for each parameter dynamically, making it more efficient than standard gradient descent. The **Adam optimizer** is used in frameworks like **PyTorch** and **TensorFlow** to train neural networks effectively. -### **What is Gradient Descent?** -**Gradient Descent** is an optimization algorithm used to minimize a model's **loss function** by updating its weights in the direction of the negative gradient. The goal is to find the lowest point (minimum) of the loss function, allowing the model to make better predictions. 
+**What is Gradient Descent?**
+**Gradient Descent** is an optimization algorithm used to minimize a model's loss function by updating its weights in the direction of the negative gradient. The goal of gradient descent is to find the lowest point of the loss function, which allows the model to make better predictions.

-#### **How It Relates to Adam**
-**Adam** builds on gradient descent by using **momentum** and **adaptive learning rates**, which make the updates smarter, faster, and more stable.
+**How It Relates to Adam**
+The **Adam optimizer** builds on gradient descent by using **momentum** and **adaptive learning rates**, which make the updates smarter, faster, and more stable.

-## 📘 **How Adam Works**
-Adam is a combination of two other optimizers:
-- **Momentum**: It remembers past gradients to accelerate convergence.
-- **RMSProp**: It scales the learning rate using a moving average of squared gradients.
+## How Adam Works
+
+The **Adam optimizer** is a combination of two other optimizers:
+
+- Momentum: Remembers past gradients to accelerate convergence.
+- RMSProp: Scales the learning rate using a moving average of squared gradients.

---

-## 🔍 **Key Steps of the Adam Optimizer**
-The **Adam optimizer** is a smarter version of gradient descent. It improves how neural networks learn by using momentum, variance tracking, and bias correction to make training more stable. Here’s a step-by-step explanation of how it works:
+## Key Steps of the Adam Optimizer
+
+The **Adam optimizer** is a smarter version of gradient descent. It improves upon how neural networks learn by using momentum, variance tracking, and bias correction to make training more stable. Here’s a step-by-step explanation of how it works:
+
+1️. **Calculate the Gradient**:
+
+- Just like gradient descent, Adam calculates how much the weights need to change to reduce the model's error.
+
+2️. **Track Past Changes (Momentum)**:
+
+- Instead of only looking at the current gradient, the **Adam optimizer** tracks previous gradients to make smarter updates.
+- This is like having a record of past steps so you can avoid overcorrecting.
+
+3️. **Track Size of Changes (Variance)**:

-1️. **Calculate the Gradient**:
- Just like gradient descent, Adam calculates how much the weights need to change to reduce the model's error.
- Think of this as figuring out the "steepness" of the slope so you know which way to move.
+- The **Adam optimizer** also tracks how much the gradient is changing (the size of the change).
+- If the changes are too big, it slows things down to avoid large, erratic movements.

-2️. **Track Past Changes (Momentum)**:
- Instead of only looking at the current gradient, Adam "remembers" previous gradients to make smarter updates.
- This is like having a mental **"memory of past moves"** so you can avoid overcorrecting.
+4️. **Correct for Initial Errors (Bias Correction)**:

-3️. **Track Size of Changes (Variance)**:
- Adam also tracks how much the gradient is changing (the "size" of the change).
- If the changes are too big, Adam slows things down to avoid large, erratic movements.
+- At the start of training, the momentum and variance estimates can be off. The **Adam optimizer** corrects for this to prevent overly large updates early on in training.

-4️. **Correct for Initial Errors (Bias Correction)**:
- At the start of training, the momentum and variance estimates can be off. Adam corrects for this so the updates aren’t too large.
- This is like giving the model a little **"warm-up" period** so it doesn't make big mistakes early on.
+5️. **Update the Weights**:

-5️. **Update the Weights**:
- Finally, Adam adjusts the weights of the model using the learning rate, momentum, and variance.
- This allows for **more stable, controlled weight updates** that avoid wild jumps in learning.
+- Finally, the **Adam optimizer** adjusts the weights of the model using the learning rate, momentum, and variance.
+- This allows for a more stable, controlled weight update that avoids erratic jumps in learning.

---

-## 📘 **Syntax and Usage in PyTorch**
+## Syntax and Usage in PyTorch
+
In PyTorch, the **Adam optimizer** is part of the `torch.optim` module. Here's the syntax for initializing it:

```python
@@ -57,45 +73,51 @@ import torch
import torch.optim as optim

# Define a simple model
-model = torch.nn.Linear(10, 2) # 10 input features, 2 output features
+model = torch.nn.Linear(8, 4) # 8 input features, 4 output features

# Create an Adam optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
```
+
---

-### 🔍 **Explanation of the Syntax**
+### Explanation of the Syntax
+
Here’s a breakdown of each key part of the syntax for `optim.Adam()`:

- **`optim.Adam()`**: This is part of **`torch.optim`**, which is PyTorch’s official optimizer library. It provides access to optimizers like **Adam**, **SGD**, and **RMSProp**.
-- **`model.parameters()`**: This returns the weights (parameters) of the model that need to be updated during training. The optimizer uses these to compute weight updates.
-- **`lr=0.001`**: This sets the learning rate (step size) for the optimizer. You can customize it based on your model's needs. Other hyperparameters like **`betas`** (for momentum) and **`eps`** (for numerical stability) can also be set here.
+- **`model.parameters()`**: This returns the weights of the model that need to be updated during training. The optimizer uses these to compute weight updates.
+- **`lr=0.001`**: This sets the learning rate for the optimizer. You can customize it based on your model's needs.

---
-## 📘 **Common Use Cases for the Adam Optimizer**
-The **Adam optimizer** is widely used in machine learning and deep learning because of its ability to converge quickly, handle noisy gradients, and work efficiently with large datasets. Below are some of the most common use cases where **Adam** is preferred over other optimizers like **SGD** or **RMSProp**.
+
+## Common Use Cases for the Adam Optimizer
+
+The **Adam optimizer** is widely used in machine learning and deep learning because of its ability to converge quickly, handle noisy gradients, and work efficiently with large datasets. Below are some of the most common use cases where The **Adam Optimizer** is preferred over other optimizers.

---

-### 🔍 **1️. Training Deep Neural Networks (DNNs)**
-- **Why Use Adam?** Deep Neural Networks (DNNs) often have millions of parameters that need to be updated. Adam’s ability to adapt the learning rate and track momentum makes it ideal for fast, stable training.
-- **Example Use Case**: Classifying handwritten digits with a neural network like **MNIST**.
+### 1️. Training Deep Neural Networks (DNNs)
+
+- Deep Neural Networks (DNNs) often have millions of parameters that need to be updated. Adam’s ability to adapt the learning rate and track momentum makes it ideal for fast, stable training.
+- Examples: Classifying handwritten digits from the MNIST dataset with a simple neural network.
**Example Code:**
+
```python
# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim

-#Deep Neural Network (DNN)
+# Deep Neural Network (DNN)
class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.layer1 = nn.Linear(784, 128) # 784 inputs, 128 outputs
        self.layer2 = nn.Linear(128, 64)
        self.layer3 = nn.Linear(64, 10) # 10 classes for digits 0-9
-
+
    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
@@ -106,11 +128,13 @@ model_dnn = NeuralNet()
optimizer_dnn = optim.Adam(model_dnn.parameters(), lr=0.001)
```

-### 🔍 **2️. Convolutional Neural Networks (CNNs)**
-- **Why Use Adam?** CNNs are used for image recognition tasks like object detection and classification. Adam’s momentum and variance tracking ensure stable updates when training large image datasets.
-- **Example Use Case**: Training an image classification model on the **CIFAR-10** dataset.
+### 2️. Convolutional Neural Networks (CNNs)
+
+- CNNs are used for image recognition tasks like object detection and classification. Adam’s momentum and variance tracking ensure stable updates when training large image datasets.
+- Examples: Training an image classification model.

**Example Code:**
+
```python
import torch
import torchvision.models as models
@@ -121,59 +145,67 @@ from torchvision.models import ResNet18_Weights
model_cnn = models.resnet18(weights=ResNet18_Weights.IMAGENET1K_V1) # Load a pre-trained ResNet18 model
optimizer_cnn = optim.Adam(model_cnn.parameters(), lr=0.0001) # Smaller learning rate for fine-tuning
```
-### 🔍 **3️. Recurrent Neural Networks (RNNs) and LSTMs**
-- **Why Use Adam?** RNNs and LSTMs process sequences of data, like time series or natural language. Adam helps prevent the **exploding/vanishing gradient problem** and stabilizes training.
-- **Example Use Case**: Predicting **stock prices** or **weather forecasting** with an RNN or LSTM.
+
+### 3️. Recurrent Neural Networks (RNNs) and LSTMs
+
+- RNNs and LSTMs process sequences of data, like time series or natural language. Adam helps prevent the exploding/vanishing gradient problem and stabilizes training.
+- Examples: Predicting stock prices or weather forecasting with an RNN or LSTM.

**Example Code:**
+
```python
import torch
import torch.nn as nn
import torch.optim as optim

-#Recurrent Neural Network (RNN)
+# Recurrent Neural Network (RNN)
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
-
+
    def forward(self, x):
        out, _ = self.rnn(x)
-        out = self.fc(out[:, -1, :])
+        out = self.fc(out[:, -1, :])
        return out

model_rnn = RNN(input_size=10, hidden_size=32, output_size=1)
optimizer_rnn = optim.Adam(model_rnn.parameters(), lr=0.001)
```
-### 🔍 **4️. Natural Language Processing (NLP) and Transformers**
-- **Why Use Adam?** NLP models like **Transformers** and **BERT** require huge computational resources. Adam’s adaptive learning rate and momentum make it ideal for large language models.
-- **Example Use Case**: Fine-tuning a pre-trained **BERT** model for sentiment analysis.
+
+### 4️. Natural Language Processing (NLP) and Transformers
+
+- NLP models like Transformers and BERT require huge computational resources. The **Adam optimizer's** adaptive learning rate and momentum make it ideal for large language models.
+- Examples: Fine-tuning a pre-trained BERT model for sentiment analysis.
**Example Code:**
+
```python
from transformers import BertModel, AdamW

-#Natural Language Processing (NLP) with BERT
+# Natural Language Processing (NLP) with BERT
model_nlp = BertModel.from_pretrained('bert-base-uncased')
optimizer_nlp = AdamW(model_nlp.parameters(), lr=2e-5) # AdamW is a variant of Adam used in NLP
```
-___
+---
+
-### 📘 **Advanced Insight**
-If you'd like a more technical explanation, here’s the **mathematical process** behind Adam. These steps are a more precise version of the Key Steps listed above. If you’re a beginner, feel free to skip this section for now.
+### Advanced Insight
-#### **Advanced Explanation of the Key Steps**
+For a more technical explinattion, here is the mathematical process behind the **Adam optimizer**.
+
+#### Advanced Explanation of the Key Steps

1️. **Momentum (mean) update**:
\[
m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t
\]
-This tracks the "average" of the previous gradients, so instead of using only the current gradient, it combines it with past ones.
+This tracks the "average" of the previous gradients, so instead of using only the current gradient, it combines it with past gradients.

2️. **Variance (size) update**:
\[
v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2
\]
This tracks the "size" of past gradient changes to prevent wild jumps. If the gradients are too large, the optimizer takes smaller steps.

3️. **Correct for early errors (Bias Correction)**:
\[
\hat{m_t} = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v_t} = \frac{v_t}{1 - \beta_2^t}
\]
-At the beginning of training, the initial estimates for **momentum** and **variance** are not accurate. Bias correction "fixes" these early inaccuracies so training is more stable.
+At the beginning of training, the initial estimates for momentum and variance are not accurate. Bias correction fixes these early inaccuracies so training is more stable.

4️. **Update the weights**:
\[
\theta_{t+1} = \theta_t - \frac{\eta \cdot \hat{m_t}}{\sqrt{\hat{v_t}} + \epsilon}
\]
-This is the final step where the weights of the model are updated. Instead of using just the gradient (like in regular gradient descent), Adam combines momentum, variance, and a small constant \(\epsilon\) to avoid division by zero.
+This is the final step in which the weights of the model are updated. Instead of using just the gradient (like in regular gradient descent), the **Adam optimizer** combines momentum, variance, and a small constant \(\epsilon\) to avoid division by zero.
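Putting these four updates together, the short sketch below applies them to a single scalar parameter in plain Python, using the commonly cited default hyperparameters (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`). It is only an illustration of the equations above; the function name `adam_step` and the toy quadratic objective are invented for this example, and it is not the code that runs inside `torch.optim.Adam`:

```python
# Illustrative sketch of the Adam update for one scalar parameter.
# Not the actual torch.optim.Adam implementation.

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply a single Adam update and return the new (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # 1. momentum (mean) update
    v = beta2 * v + (1 - beta2) * grad ** 2     # 2. variance (size) update
    m_hat = m / (1 - beta1 ** t)                # 3. bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (v_hat ** 0.5 + eps)  # 4. weight update
    return theta, m, v

# Toy usage: minimize f(theta) = theta**2, whose gradient is 2 * theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):
    theta, m, v = adam_step(theta, grad=2 * theta, m=m, v=v, t=t)
print(round(theta, 4))  # theta drifts steadily toward the minimum at 0
```

When the gradient keeps the same sign, the bias-corrected ratio \( \hat{m_t} / \sqrt{\hat{v_t}} \) stays close to 1, so each step is roughly the learning rate in size; this is why Adam's updates stay controlled instead of exploding.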
+ +#### What the Variables Mean + +- \( m_t \) = momentum (like the "memory" of past changes) +- \( v_t \) = variance (how big the changes are) +- \( \beta_1, \beta_2 \) = decay rates (how quickly momentum and variance forget past values) +- \( \eta \) = learning rate (step size for updates) +- \( \epsilon \) = small constant to avoid division by zero -#### **What the Variables Mean** -- \( m_t \) = **momentum** (like the "memory" of past changes) -- \( v_t \) = **variance** (how big the changes are) -- \( \beta_1, \beta_2 \) = decay rates (how quickly momentum and variance "forget" past values) -- \( \eta \) = **learning rate** (step size for updates) -- \( \epsilon \) = small constant to avoid division by zero +This process allows the model to make more precise and stable updates than regular gradient descent. -This process allows the model to make more **precise, stable updates** than regular gradient descent. +## Related Resources -## 🔗 **Related Resources** - [Optimizers](../optimizers.md) — Learn about other popular optimizers like **SGD** and **RMSProp**. - [PyTorch](https://pytorch.org/docs/) — Official PyTorch documentation. From d829a12e63e783e9ed5868935640af3f29ad3d8e Mon Sep 17 00:00:00 2001 From: Tim Date: Tue, 24 Dec 2024 11:36:15 +0000 Subject: [PATCH 4/4] added general references and double checked spelling/formatting --- .../pytorch/concepts/optimizers/adam/adam.md | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/content/pytorch/concepts/optimizers/adam/adam.md b/content/pytorch/concepts/optimizers/adam/adam.md index 9bbf33a3086..9835f9af561 100644 --- a/content/pytorch/concepts/optimizers/adam/adam.md +++ b/content/pytorch/concepts/optimizers/adam/adam.md @@ -18,13 +18,13 @@ CatalogContent: [Machine Learning, Deep Learning, PyTorch, Optimizers] ## Adam Optimizer -The **Adam optimizer** (short for Adaptive Moment Estimation) is one of the most commonly used optimizers in machine learning and deep learning. It adjusts the learning rate for each parameter dynamically, making it more efficient than standard gradient descent. The **Adam optimizer** is used in frameworks like **PyTorch** and **TensorFlow** to train neural networks effectively. +The **Adam optimizer** (short for Adaptive Moment Estimation) is one of the most commonly used optimizers in machine learning and deep learning. It adjusts the learning rate for each parameter dynamically, making it more efficient than standard gradient descent. The **Adam optimizer** is used in frameworks like PyTorch and TensorFlow to train neural networks effectively. **What is Gradient Descent?** **Gradient Descent** is an optimization algorithm used to minimize a model's loss function by updating its weights in the direction of the negative gradient. The goal of gradient descent is to find the lowest point of the loss function, which allows the model to make better predictions. **How It Relates to Adam** -The **Adam optimizer** builds on gradient descent by using **momentum** and **adaptive learning rates**, which make the updates smarter, faster, and more stable. +The **Adam optimizer** builds on gradient descent by using momentum and adaptive learning rates, which make the updates smarter, faster, and more stable. 
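In PyTorch this relationship is easy to see side by side: a plain gradient-descent optimizer and Adam are constructed the same way, and Adam simply exposes extra moment-related knobs. The tiny model and the hyperparameter values in the sketch below are arbitrary placeholders chosen only for illustration:

```python
import torch
import torch.optim as optim

model = torch.nn.Linear(8, 4)  # placeholder model, sizes chosen arbitrarily

# Plain stochastic gradient descent, optionally with classical momentum
sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam: a momentum-like first moment plus a per-parameter adaptive step size.
# betas control how quickly the moment estimates forget old gradients,
# and eps guards against division by zero in the update.
adam = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
```

Both optimizers are driven the same way in a training loop (`loss.backward()` followed by `optimizer.step()`); only the update rule applied inside `step()` differs.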
## How Adam Works
@@ -93,7 +93,7 @@

## Common Use Cases for the Adam Optimizer

-The **Adam optimizer** is widely used in machine learning and deep learning because of its ability to converge quickly, handle noisy gradients, and work efficiently with large datasets. Below are some of the most common use cases where The **Adam Optimizer** is preferred over other optimizers.
+The **Adam optimizer** is widely used in machine learning and deep learning because of its ability to converge quickly, handle noisy gradients, and work efficiently with large datasets. Below are some of the most common use cases where the **Adam optimizer** is preferred over other optimizers.

---

@@ -193,7 +193,7 @@

### Advanced Insight

-For a more technical explinattion, here is the mathematical process behind the **Adam optimizer**.
+For a more technical explanation, here is the mathematical process behind the **Adam optimizer**.

#### Advanced Explanation of the Key Steps

@@ -231,7 +231,12 @@ This is the final step in which the weights of the model are updated. Instead of

This process allows the model to make more precise and stable updates than regular gradient descent.

-## Related Resources
+## General Resources

-- [Optimizers](../optimizers.md) — Learn about other popular optimizers like **SGD** and **RMSProp**.
-- [PyTorch](https://pytorch.org/docs/) — Official PyTorch documentation.
+1. Kingma, D. P., & Ba, J. (2014). _Adam: A method for stochastic optimization_. arXiv preprint arXiv:1412.6980. Retrieved from [https://arxiv.org/abs/1412.6980](https://arxiv.org/abs/1412.6980)
+
+2. PyTorch. (n.d.). _torch.optim.Adam_. Retrieved from [https://pytorch.org/docs/stable/generated/torch.optim.Adam.html](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html)
+
+3. TensorFlow. (n.d.). _Adam optimizer_. Retrieved from [https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam)
+
+4. Brownlee, J. (2020). _Gentle Introduction to the Adam Optimization Algorithm for Deep Learning_. Machine Learning Mastery. Retrieved from [https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)
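As a final, self-contained illustration of how the Adam optimizer fits into a full training loop, here is a minimal sketch; the synthetic data, network shape, and epoch count are arbitrary choices made purely so the example runs on its own:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Synthetic regression data, used only so the example is runnable end to end
torch.manual_seed(0)
X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(100):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = criterion(model(X), y)  # forward pass and loss
    loss.backward()                # backpropagation computes the gradients
    optimizer.step()               # Adam updates every model parameter
print(f"final training loss: {loss.item():.4f}")
```

The `zero_grad()`, `backward()`, `step()` pattern is the same for any optimizer in `torch.optim`; swapping in Adam changes only how each parameter is adjusted at every step.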