[Term Entry] PyTorch Optimizers: Adam #5879

content/pytorch/concepts/optimizers/adam/adam.md (242 additions, 0 deletions)

---
Title: How to Use the Adam Optimizer in PyTorch
Description: Learn how the Adam optimizer works and how to implement it in PyTorch to train neural networks.
Subjects:
[Machine Learning, Deep Learning, Neural Networks, Artificial Intelligence]
Tags:
[
PyTorch,
Adam,
Backpropagation,
Gradient Descent,
Algorithm,
Optimization,
Weight Updates,
]
CatalogContent: [Machine Learning, Deep Learning, PyTorch, Optimizers]
---

## Adam Optimizer

The **Adam optimizer** (short for Adaptive Moment Estimation) is one of the most commonly used optimizers in machine learning and deep learning. It adapts the learning rate for each parameter individually, which often makes it converge faster than standard gradient descent. The **Adam optimizer** is available in frameworks like PyTorch and TensorFlow for training neural networks effectively.

**What is Gradient Descent?**
**Gradient Descent** is an optimization algorithm that minimizes a model's loss function by updating its weights in the direction of the negative gradient. The goal is to reach a low point (minimum) of the loss function, which allows the model to make better predictions.

**How It Relates to Adam**
The **Adam optimizer** builds on gradient descent by adding momentum and adaptive learning rates, which make the updates faster and more stable.

## How Adam Works

The **Adam optimizer** combines ideas from two other optimization methods:

- Momentum: Keeps a running average of past gradients to accelerate convergence.
- RMSProp: Scales the learning rate for each parameter using a moving average of squared gradients.

---

## Key Steps of the Adam Optimizer

The **Adam optimizer** is a refinement of gradient descent. It improves how neural networks learn by combining momentum, variance tracking, and bias correction to make training more stable. Here’s a step-by-step explanation of how it works (a minimal code sketch of a single update step follows this list):

1. **Calculate the Gradient**:

- Just like gradient descent, Adam computes how much each weight needs to change to reduce the model's error.

2. **Track Past Changes (Momentum)**:

- Instead of looking only at the current gradient, the **Adam optimizer** keeps a running average of previous gradients to make smoother updates.
- This is like having a record of past runs so you can avoid overcorrecting.

3. **Track the Size of Changes (Variance)**:

- The **Adam optimizer** also tracks how large the gradients have been (a running average of their squared values).
- If recent gradients have been large, it slows the update down to avoid big, erratic steps.

4. **Correct for Initial Errors (Bias Correction)**:

- At the start of training, the momentum and variance estimates are biased toward zero. The **Adam optimizer** corrects for this to prevent overly large updates early in training.

5. **Update the Weights**:

- Finally, the **Adam optimizer** adjusts the model's weights using the learning rate, the momentum estimate, and the variance estimate.
- This produces a more stable, controlled weight update than plain gradient descent.
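
To make these steps concrete, here is a minimal, illustrative sketch of a single Adam update written with plain PyTorch tensors. The values and variable names (`w`, `grad`, `m`, `v`, `beta1`, `beta2`, `lr`, `eps`) are made up for this example and are not part of any PyTorch API:

```python
import torch

# Hyperparameters (PyTorch's defaults for Adam)
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

# A single weight and its current gradient (made-up values)
w = torch.tensor([0.5])
grad = torch.tensor([0.2])

# Running estimates start at zero before training
m = torch.zeros_like(w)  # momentum: running mean of gradients
v = torch.zeros_like(w)  # variance: running mean of squared gradients

t = 1  # step number

m = beta1 * m + (1 - beta1) * grad         # step 2: momentum update
v = beta2 * v + (1 - beta2) * grad ** 2    # step 3: variance update
m_hat = m / (1 - beta1 ** t)               # step 4: bias correction
v_hat = v / (1 - beta2 ** t)
w = w - lr * m_hat / (v_hat.sqrt() + eps)  # step 5: weight update

print(w)  # tensor([0.4990]) -- a small, controlled step against the gradient
```

In practice you never write this bookkeeping by hand; `torch.optim.Adam` performs it for every parameter in the model.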

---

## Syntax and Usage in PyTorch

In PyTorch, the **Adam optimizer** is part of the `torch.optim` module. Here's the syntax for initializing it:

```python
import torch
import torch.optim as optim

# Define a simple model
model = torch.nn.Linear(8, 4) # 8 input features, 4 output features

# Create an Adam optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)
```

---

### Explanation of the Syntax

Here’s a breakdown of each key part of the syntax for `optim.Adam()`:

- **`optim.Adam()`**: This comes from **`torch.optim`**, PyTorch’s optimizer module, which also provides optimizers such as **SGD** and **RMSProp**.
- **`model.parameters()`**: This returns an iterator over the model’s learnable parameters (weights and biases), which the optimizer updates during training.
- **`lr=0.001`**: This sets the learning rate, the step size for each update (0.001 is also PyTorch’s default for Adam). You can tune it based on your model's needs; a full training step using this optimizer is sketched below.
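
To show where the optimizer fits in practice, here is a minimal training-loop sketch using the same model; the loss function and the random inputs and targets are made up for illustration:

```python
import torch
import torch.optim as optim

model = torch.nn.Linear(8, 4)
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = torch.nn.MSELoss()

# Dummy data: 16 samples with 8 features each, and 16 four-value targets
inputs = torch.randn(16, 8)
targets = torch.randn(16, 4)

for epoch in range(5):
    optimizer.zero_grad()             # clear gradients from the previous step
    outputs = model(inputs)           # forward pass
    loss = loss_fn(outputs, targets)  # compute the loss
    loss.backward()                   # backpropagation: compute gradients
    optimizer.step()                  # Adam updates the weights
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```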

---

## Common Use Cases for the Adam Optimizer

The **Adam optimizer** is widely used in machine learning and deep learning because it converges quickly, handles noisy gradients, and works efficiently with large datasets. Below are some of the most common use cases where the **Adam optimizer** is preferred over other optimizers.

---

### 1. Training Deep Neural Networks (DNNs)

- Deep Neural Networks (DNNs) often have many parameters that need to be updated. Adam’s ability to adapt the learning rate and track momentum makes it ideal for fast, stable training.
- Example: Classifying handwritten digits from a dataset like MNIST.

**Example Code:**

```python
# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim

# Deep Neural Network (DNN)
class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.layer1 = nn.Linear(784, 128) # 784 inputs, 128 outputs
        self.layer2 = nn.Linear(128, 64)
        self.layer3 = nn.Linear(64, 10) # 10 classes for digits 0-9

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        x = torch.relu(self.layer2(x))
        x = self.layer3(x)
        return x

model_dnn = NeuralNet()
optimizer_dnn = optim.Adam(model_dnn.parameters(), lr=0.001)
```
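
As a quick illustration of how this model and optimizer might be used together, here is a sketch of one training step on a dummy batch shaped like flattened 28x28 MNIST images (random data stands in for the real dataset, continuing from the snippet above):

```python
# Dummy batch: 32 flattened 28x28 images and their integer class labels
images = torch.randn(32, 784)
labels = torch.randint(0, 10, (32,))

criterion = nn.CrossEntropyLoss()

optimizer_dnn.zero_grad()          # clear any previous gradients
outputs = model_dnn(images)        # forward pass: (32, 10) class scores
loss = criterion(outputs, labels)  # classification loss
loss.backward()                    # backpropagate
optimizer_dnn.step()               # Adam updates all three layers
```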

### 2. Convolutional Neural Networks (CNNs)

- CNNs are used for image recognition tasks like object detection and classification. Adam’s momentum and variance tracking help keep updates stable when training on large image datasets.
- Example: Fine-tuning a pre-trained model for image classification.

**Example Code:**

```python
import torch
import torchvision.models as models
import torch.optim as optim
from torchvision.models import ResNet18_Weights

# Convolutional Neural Network (CNN)
model_cnn = models.resnet18(weights=ResNet18_Weights.IMAGENET1K_V1) # Load a pre-trained ResNet18 model
optimizer_cnn = optim.Adam(model_cnn.parameters(), lr=0.0001) # Smaller learning rate for fine-tuning
```
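
When fine-tuning, a common pattern is to freeze the pre-trained backbone and let Adam update only the final classification layer. Here is a minimal sketch continuing from the snippet above, assuming a hypothetical two-class task:

```python
import torch.nn as nn

# Freeze all pre-trained weights
for param in model_cnn.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for the new task (2 classes here)
model_cnn.fc = nn.Linear(model_cnn.fc.in_features, 2)

# Pass only the new layer's parameters to Adam
optimizer_cnn = optim.Adam(model_cnn.fc.parameters(), lr=0.0001)
```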

### 3. Recurrent Neural Networks (RNNs) and LSTMs

- RNNs and LSTMs process sequences of data, like time series or natural language. Adam’s adaptive step sizes help keep training stable when gradient magnitudes vary widely across time steps.
- Examples: Predicting stock prices or weather forecasting with an RNN or LSTM.

**Example Code:**

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Recurrent Neural Network (RNN)
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.rnn(x)
        out = self.fc(out[:, -1, :])  # use the last time step's output
        return out

model_rnn = RNN(input_size=10, hidden_size=32, output_size=1)
optimizer_rnn = optim.Adam(model_rnn.parameters(), lr=0.001)
```
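
Because the RNN is created with `batch_first=True`, it expects input of shape `(batch, sequence_length, input_size)`. Here is a sketch of one training step on a dummy batch of sequences (random data for illustration, continuing from the snippet above):

```python
# Dummy batch: 8 sequences, 20 time steps each, 10 features per step
sequences = torch.randn(8, 20, 10)
targets = torch.randn(8, 1)         # one value to predict per sequence

criterion = nn.MSELoss()

optimizer_rnn.zero_grad()
predictions = model_rnn(sequences)  # shape: (8, 1)
loss = criterion(predictions, targets)
loss.backward()
optimizer_rnn.step()                # Adam updates the RNN and linear layers
```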

### 4. Natural Language Processing (NLP) and Transformers

- NLP models like Transformers and BERT have very large numbers of parameters and require substantial computational resources. The **Adam optimizer's** adaptive learning rate and momentum make it a standard choice for large language models.
- Example: Fine-tuning a pre-trained BERT model for sentiment analysis.

**Example Code:**

```python
from transformers import BertModel, AdamW

# Natural Language Processing (NLP) with BERT
model_nlp = BertModel.from_pretrained('bert-base-uncased')
optimizer_nlp = AdamW(model_nlp.parameters(), lr=2e-5) # AdamW is a variant of Adam with decoupled weight decay, widely used in NLP
```
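
Note that the `AdamW` class bundled with the `transformers` library has been deprecated in newer releases in favor of PyTorch's own implementation. An equivalent setup using `torch.optim.AdamW` (shown here as a sketch with the same illustrative learning rate) would be:

```python
import torch.optim as optim
from transformers import BertModel

model_nlp = BertModel.from_pretrained('bert-base-uncased')
optimizer_nlp = optim.AdamW(model_nlp.parameters(), lr=2e-5)  # Adam with decoupled weight decay
```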

---

### Advanced Insight

For a more technical explanation, here is the mathematical process behind the **Adam optimizer**.

#### Advanced Explanation of the Key Steps

1. **Momentum (mean) update**:
\[
m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t
\]
This tracks the "average" of the previous gradients, so instead of using only the current gradient, it combines it with past gradients.

2. **Variance (size) update**:
\[
v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2
\]
This tracks the "size" of past gradient changes to prevent wild jumps. If the gradients are too large, the optimizer takes smaller steps.

3. **Correct for early errors (Bias Correction)**:
\[
\hat{m_t} = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v_t} = \frac{v_t}{1 - \beta_2^t}
\]
At the beginning of training, the initial estimates for momentum and variance are not accurate. Bias correction fixes these early inaccuracies so training is more stable.

4. **Update the weights**:
\[
\theta_{t+1} = \theta_t - \frac{\eta \cdot \hat{m_t}}{\sqrt{\hat{v_t}} + \epsilon}
\]
This is the final step in which the weights of the model are updated. Instead of using just the gradient (as in regular gradient descent), the **Adam optimizer** combines momentum, variance, and a small constant \(\epsilon\) to avoid division by zero.

#### What the Variables Mean

- \( g_t \) = gradient of the loss with respect to the weights at step \( t \)
- \( m_t \) = momentum (like the "memory" of past changes)
- \( v_t \) = variance (how big the changes are)
- \( \beta_1, \beta_2 \) = decay rates (how quickly momentum and variance forget past values)
- \( \eta \) = learning rate (step size for updates)
- \( \epsilon \) = small constant to avoid division by zero
- \( \theta_t \) = model weights at step \( t \)

This process allows the model to make more precise and stable updates than regular gradient descent.
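
These quantities map directly onto the keyword arguments of `optim.Adam`. As a reference, here is the constructor written out with PyTorch's default values, where `betas` corresponds to \( (\beta_1, \beta_2) \) and `eps` to \( \epsilon \):

```python
import torch
import torch.optim as optim

model = torch.nn.Linear(8, 4)

# Explicitly passing PyTorch's default hyperparameters for Adam
optimizer = optim.Adam(
    model.parameters(),
    lr=0.001,            # eta: step size for each update
    betas=(0.9, 0.999),  # beta_1 and beta_2: decay rates for momentum and variance
    eps=1e-08,           # epsilon: small constant to avoid division by zero
    weight_decay=0,      # optional L2 penalty (off by default)
)
```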

## General Resources

1. Kingma, D. P., & Ba, J. (2014). _Adam: A method for stochastic optimization_. arXiv preprint arXiv:1412.6980. Retrieved from [https://arxiv.org/abs/1412.6980](https://arxiv.org/abs/1412.6980)

2. PyTorch. (n.d.). _torch.optim.Adam_. Retrieved from [https://pytorch.org/docs/stable/generated/torch.optim.Adam.html](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html)

3. TensorFlow. (n.d.). _Adam optimizer_. Retrieved from [https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam)

4. Brownlee, J. (2020). _Gentle Introduction to the Adam Optimization Algorithm for Deep Learning_. Machine Learning Mastery. Retrieved from [https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)