[Term Entry] PyTorch Optimizers: Adam #5879

content/pytorch/concepts/optimizers/adam/adam.md (242 additions, 0 deletions)

---
Title: How to Use the Adam Optimizer in PyTorch
Description: Learn how the Adam optimizer works and how to implement it in PyTorch to train neural networks.
Subjects:
[Machine Learning, Deep Learning, Neural Networks, Artificial Intelligence]
Tags:
[
PyTorch,
Adam,
Backpropagation,
Gradient Descent,
Algorithm,
Optimization,
Weight Updates,
]
CatalogContent: [Machine Learning, Deep Learning, PyTorch, Optimizers]
---

## Adam Optimizer

The **Adam optimizer** (short for Adaptive Moment Estimation) is one of the most commonly used optimizers in machine learning and deep learning. It adapts the learning rate for each parameter individually, which often makes it converge faster than standard gradient descent. The **Adam optimizer** is available in frameworks like PyTorch and TensorFlow for training neural networks effectively.

**What is Gradient Descent?**
**Gradient Descent** is an optimization algorithm that minimizes a model's loss function by updating its weights in the direction of the negative gradient. The goal is to reach a low point (minimum) of the loss function, which allows the model to make better predictions.

**How It Relates to Adam**
The **Adam optimizer** builds on gradient descent by adding momentum and adaptive learning rates, which make the updates faster and more stable.

## How Adam Works

The **Adam optimizer** combines ideas from two other optimization methods:

- Momentum: Keeps a running average of past gradients to accelerate convergence.
- RMSProp: Scales the learning rate for each parameter using a moving average of squared gradients.

---

## Key Steps of the Adam Optimizer

The **Adam optimizer** is a refinement of gradient descent. It improves how neural networks learn by combining momentum, variance tracking, and bias correction to make training more stable. Here’s a step-by-step explanation of how it works (a minimal code sketch of a single update step follows this list):

1. **Calculate the Gradient**:

- Just like gradient descent, Adam computes how much each weight needs to change to reduce the model's error.

2. **Track Past Changes (Momentum)**:

- Instead of looking only at the current gradient, the **Adam optimizer** keeps a running average of previous gradients to make smoother updates.
- This is like having a record of past runs so you can avoid overcorrecting.

3. **Track the Size of Changes (Variance)**:

- The **Adam optimizer** also tracks how large the gradients have been (a running average of their squared values).
- If recent gradients have been large, it slows the update down to avoid big, erratic steps.

4. **Correct for Initial Errors (Bias Correction)**:

- At the start of training, the momentum and variance estimates are biased toward zero. The **Adam optimizer** corrects for this to prevent overly large updates early in training.

5. **Update the Weights**:

- Finally, the **Adam optimizer** adjusts the model's weights using the learning rate, the momentum estimate, and the variance estimate.
- This produces a more stable, controlled weight update than plain gradient descent.
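
To make these steps concrete, here is a minimal, illustrative sketch of a single Adam update written with plain PyTorch tensors. The values and variable names (`w`, `grad`, `m`, `v`, `beta1`, `beta2`, `lr`, `eps`) are made up for this example and are not part of any PyTorch API:

```python
import torch

# Hyperparameters (PyTorch's defaults for Adam)
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

# A single weight and its current gradient (made-up values)
w = torch.tensor([0.5])
grad = torch.tensor([0.2])

# Running estimates start at zero before training
m = torch.zeros_like(w)  # momentum: running mean of gradients
v = torch.zeros_like(w)  # variance: running mean of squared gradients

t = 1  # step number

m = beta1 * m + (1 - beta1) * grad         # step 2: momentum update
v = beta2 * v + (1 - beta2) * grad ** 2    # step 3: variance update
m_hat = m / (1 - beta1 ** t)               # step 4: bias correction
v_hat = v / (1 - beta2 ** t)
w = w - lr * m_hat / (v_hat.sqrt() + eps)  # step 5: weight update

print(w)  # tensor([0.4990]) -- a small, controlled step against the gradient
```

In practice you never write this bookkeeping by hand; `torch.optim.Adam` performs it for every parameter in the model.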

---

## Syntax and Usage in PyTorch

In PyTorch, the **Adam optimizer** is part of the `torch.optim` module. Here's the syntax for initializing it:

```python
import torch
import torch.optim as optim

# Define a simple model
model = torch.nn.Linear(8, 4) # 8 input features, 4 output features

# Create an Adam optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
```

---

### Explanation of the Syntax

Here’s a breakdown of each key part of the syntax for `optim.Adam()`:

- **`optim.Adam()`**: This comes from **`torch.optim`**, PyTorch’s optimizer module, which also provides optimizers such as **SGD** and **RMSProp**.
- **`model.parameters()`**: This returns an iterator over the model’s learnable parameters (weights and biases), which the optimizer updates during training.
- **`lr=0.001`**: This sets the learning rate, the step size for each update (0.001 is also PyTorch’s default for Adam). You can tune it based on your model's needs; a full training step using this optimizer is sketched below.
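
To show where the optimizer fits in practice, here is a minimal training-loop sketch using the same model; the loss function and the random inputs and targets are made up for illustration:

```python
import torch
import torch.optim as optim

model = torch.nn.Linear(8, 4)
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = torch.nn.MSELoss()

# Dummy data: 16 samples with 8 features each, and 16 four-value targets
inputs = torch.randn(16, 8)
targets = torch.randn(16, 4)

for epoch in range(5):
    optimizer.zero_grad()             # clear gradients from the previous step
    outputs = model(inputs)           # forward pass
    loss = loss_fn(outputs, targets)  # compute the loss
    loss.backward()                   # backpropagation: compute gradients
    optimizer.step()                  # Adam updates the weights
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```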

---

## Common Use Cases for the Adam Optimizer

The **Adam optimizer** is widely used in machine learning and deep learning because it converges quickly, handles noisy gradients, and works efficiently with large datasets. Below are some of the most common use cases where the **Adam optimizer** is preferred over other optimizers.

---

### 1. Training Deep Neural Networks (DNNs)

- Deep Neural Networks (DNNs) often have many parameters that need to be updated. Adam’s ability to adapt the learning rate and track momentum makes it ideal for fast, stable training.
- Example: Classifying handwritten digits from a dataset like MNIST.

**Example Code:**

```python
# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim

# Deep Neural Network (DNN)
class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.layer1 = nn.Linear(784, 128) # 784 inputs, 128 outputs
        self.layer2 = nn.Linear(128, 64)
        self.layer3 = nn.Linear(64, 10) # 10 classes for digits 0-9

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        x = self.layer3(x)
        return x

model_dnn = NeuralNet()
optimizer_dnn = optim.Adam(model_dnn.parameters(), lr=0.001)
```
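
As a quick illustration of how this model and optimizer might be used together, here is a sketch of one training step on a dummy batch shaped like flattened 28x28 MNIST images (random data stands in for the real dataset, continuing from the snippet above):

```python
# Dummy batch: 32 flattened 28x28 images and their integer class labels
images = torch.randn(32, 784)
labels = torch.randint(0, 10, (32,))

criterion = nn.CrossEntropyLoss()

optimizer_dnn.zero_grad()          # clear any previous gradients
outputs = model_dnn(images)        # forward pass: (32, 10) class scores
loss = criterion(outputs, labels)  # classification loss
loss.backward()                    # backpropagate
optimizer_dnn.step()               # Adam updates all three layers
```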

### 2. Convolutional Neural Networks (CNNs)

- CNNs are used for image recognition tasks like object detection and classification. Adam’s momentum and variance tracking help keep updates stable when training on large image datasets.
- Example: Fine-tuning a pre-trained model for image classification.

**Example Code:**

```python
import torch
import torchvision.models as models
import torch.optim as optim
from torchvision.models import ResNet18_Weights

# Convolutional Neural Network (CNN)
model_cnn = models.resnet18(weights=ResNet18_Weights.IMAGENET1K_V1) # Load a pre-trained ResNet18 model
optimizer_cnn = optim.Adam(model_cnn.parameters(), lr=0.0001) # Smaller learning rate for fine-tuning
```
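
When fine-tuning, a common pattern is to freeze the pre-trained backbone and let Adam update only the final classification layer. Here is a minimal sketch continuing from the snippet above, assuming a hypothetical two-class task:

```python
import torch.nn as nn

# Freeze all pre-trained weights
for param in model_cnn.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for the new task (2 classes here)
model_cnn.fc = nn.Linear(model_cnn.fc.in_features, 2)

# Pass only the new layer's parameters to Adam
optimizer_cnn = optim.Adam(model_cnn.fc.parameters(), lr=0.0001)
```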

### 3. Recurrent Neural Networks (RNNs) and LSTMs

- RNNs and LSTMs process sequences of data, like time series or natural language. Adam’s adaptive step sizes help keep training stable when gradient magnitudes vary widely across time steps.
- Examples: Predicting stock prices or weather forecasting with an RNN or LSTM.

**Example Code:**

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Recurrent Neural Network (RNN)
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.rnn(x)
        out = self.fc(out[:, -1, :])  # use the last time step's output
        return out

model_rnn = RNN(input_size=10, hidden_size=32, output_size=1)
optimizer_rnn = optim.Adam(model_rnn.parameters(), lr=0.001)
```
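
Because the RNN is created with `batch_first=True`, it expects input of shape `(batch, sequence_length, input_size)`. Here is a sketch of one training step on a dummy batch of sequences (random data for illustration, continuing from the snippet above):

```python
# Dummy batch: 8 sequences, 20 time steps each, 10 features per step
sequences = torch.randn(8, 20, 10)
targets = torch.randn(8, 1)         # one value to predict per sequence

criterion = nn.MSELoss()

optimizer_rnn.zero_grad()
predictions = model_rnn(sequences)  # shape: (8, 1)
loss = criterion(predictions, targets)
loss.backward()
optimizer_rnn.step()                # Adam updates the RNN and linear layers
```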

### 4. Natural Language Processing (NLP) and Transformers

- NLP models like Transformers and BERT have very large numbers of parameters and require substantial computational resources. The **Adam optimizer's** adaptive learning rate and momentum make it a standard choice for large language models.
- Example: Fine-tuning a pre-trained BERT model for sentiment analysis.

**Example Code:**

```python
from transformers import BertModel, AdamW

# Natural Language Processing (NLP) with BERT
model_nlp = BertModel.from_pretrained('bert-base-uncased')
optimizer_nlp = AdamW(model_nlp.parameters(), lr=2e-5) # AdamW is a variant of Adam with decoupled weight decay, widely used in NLP
```
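
Note that the `AdamW` class bundled with the `transformers` library has been deprecated in newer releases in favor of PyTorch's own implementation. An equivalent setup using `torch.optim.AdamW` (shown here as a sketch with the same illustrative learning rate) would be:

```python
import torch.optim as optim
from transformers import BertModel

model_nlp = BertModel.from_pretrained('bert-base-uncased')
optimizer_nlp = optim.AdamW(model_nlp.parameters(), lr=2e-5)  # Adam with decoupled weight decay
```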

---

### Advanced Insight

For a more technical explanation, here is the mathematical process behind the **Adam optimizer**.

#### Advanced Explanation of the Key Steps

1. **Momentum (mean) update**:
\[
m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t
\]
This tracks the "average" of the previous gradients, so instead of using only the current gradient, it combines it with past gradients.

2. **Variance (size) update**:
\[
v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2
\]
This tracks the "size" of past gradient changes to prevent wild jumps. If the gradients are too large, the optimizer takes smaller steps.

3. **Correct for early errors (Bias Correction)**:
\[
\hat{m_t} = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v_t} = \frac{v_t}{1 - \beta_2^t}
\]
At the beginning of training, the initial estimates for momentum and variance are not accurate. Bias correction fixes these early inaccuracies so training is more stable.

4. **Update the weights**:
\[
\theta_{t+1} = \theta_t - \frac{\eta \cdot \hat{m_t}}{\sqrt{\hat{v_t}} + \epsilon}
\]
This is the final step in which the weights of the model are updated. Instead of using just the gradient (as in regular gradient descent), the **Adam optimizer** combines momentum, variance, and a small constant \(\epsilon\) to avoid division by zero.

#### What the Variables Mean

- \( g_t \) = gradient of the loss with respect to the weights at step \( t \)
- \( m_t \) = momentum (like the "memory" of past changes)
- \( v_t \) = variance (how big the changes are)
- \( \beta_1, \beta_2 \) = decay rates (how quickly momentum and variance forget past values)
- \( \eta \) = learning rate (step size for updates)
- \( \epsilon \) = small constant to avoid division by zero
- \( \theta_t \) = model weights at step \( t \)

This process allows the model to make more precise and stable updates than regular gradient descent.
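
These quantities map directly onto the keyword arguments of `optim.Adam`. As a reference, here is the constructor written out with PyTorch's default values, where `betas` corresponds to \( (\beta_1, \beta_2) \) and `eps` to \( \epsilon \):

```python
import torch
import torch.optim as optim

model = torch.nn.Linear(8, 4)

# Explicitly passing PyTorch's default hyperparameters for Adam
optimizer = optim.Adam(
    model.parameters(),
    lr=0.001,            # eta: step size for each update
    betas=(0.9, 0.999),  # beta_1 and beta_2: decay rates for momentum and variance
    eps=1e-08,           # epsilon: small constant to avoid division by zero
    weight_decay=0,      # optional L2 penalty (off by default)
)
```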

## General Resources

1. Kingma, D. P., & Ba, J. (2014). _Adam: A method for stochastic optimization_. arXiv preprint arXiv:1412.6980. Retrieved from [https://arxiv.org/abs/1412.6980](https://arxiv.org/abs/1412.6980)

2. PyTorch. (n.d.). _torch.optim.Adam_. Retrieved from [https://pytorch.org/docs/stable/generated/torch.optim.Adam.html](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html)

3. TensorFlow. (n.d.). _Adam optimizer_. Retrieved from [https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam)

4. Brownlee, J. (2020). _Gentle Introduction to the Adam Optimization Algorithm for Deep Learning_. Machine Learning Mastery. Retrieved from [https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)