|
115 | 115 | "\n",
|
116 | 116 | "In order to learn by example, we first need examples. In machine learning, we construct datasets of the form:\n",
|
117 | 117 | "\n",
|
118 |
| - "$$\\mathcal{D} = \\left\\{\\left(x^{(i)}, y^{(i)}\\right)\\right\\}_{i=1}^N$$\n", |
| 118 | + "$$D = \\left\\{\\left(x^{(i)}, y^{(i)}\\right)\\right\\}_{i=1}^N$$\n", |
119 | 119 | "\n",
|
120 |
| - "Written in English, dataset $\\mathcal{D}$ is composed of $N$ pairs of inputs $x$ and expected outputs $y$. $x$ and $y$ can be tabular data, images, text, or any other object that can be represented mathematically.\n", |
| 120 | + "Written in English, dataset $D$ is composed of $N$ pairs of inputs $x$ and expected outputs $y$. $x$ and $y$ can be tabular data, images, text, or any other object that can be represented mathematically.\n", |
121 | 121 | "\n",
|
122 | 122 | "\n",
|
123 | 123 | "\n",
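As a concrete sketch of the definition above, a dataset of $(x, y)$ pairs can be represented as a simple list; the feature values and labels here are made up purely for illustration:

```python
# A minimal sketch of a dataset D = {(x^(i), y^(i))} for i = 1..N.
# x is a small feature vector, y an integer class label (illustrative values).
dataset = [
    ([0.2, 0.7], 1),   # (x^(1), y^(1))
    ([0.9, 0.1], 0),   # (x^(2), y^(2))
    ([0.4, 0.5], 1),   # (x^(3), y^(3))
]
N = len(dataset)

for i, (x, y) in enumerate(dataset, start=1):
    print(f"pair {i}: x={x}, y={y}")
```

In practice $x$ might be an image tensor and $y$ a land-cover class, but the structure, $N$ input/output pairs, is the same.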
|
|
261 | 261 | "\n",
|
262 | 262 | "If $y$ is our expected output (also called \"ground truth\") and $\\hat{y}$ is our predicted output, our goal is to minimize the difference between $y$ and $\\hat{y}$. This difference is referred to as *error* or *loss*, and the loss function tells us how big of a mistake we made. For regression tasks, a simple mean squared error is sufficient:\n",
|
263 | 263 | "\n",
|
264 |
| - "$$\\mathcal{L}(y, \\hat{y}) = \\left(y - \\hat{y}\\right)^2$$\n", |
| 264 | + "$$L(y, \\hat{y}) = \\left(y - \\hat{y}\\right)^2$$\n", |
265 | 265 | "\n",
|
266 | 266 | "For classification tasks, such as EuroSAT, we instead use a negative log-likelihood:\n",
|
267 | 267 | "\n",
|
268 |
| - "$$\\mathcal{L}_c(y, \\hat{y}) = - \\sum_{c=1}^C \\mathbb{1}_{y=\\hat{y}}\\log{p_c}$$\n", |
| 268 | + "$$L_c(y, \\hat{y}) = - \\sum_{c=1}^C \\mathbb{1}_{y=c}\\log{p_c}$$\n", |
269 | 269 | "\n",
|
270 | 270 | "where $\\mathbb{1}$ is the indicator function and $p_c$ is the probability the model assigns to class $c$. When the predicted probabilities are normalized over all classes (e.g. by a softmax), this is the cross-entropy loss."
|
271 | 271 | ]
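The two losses above can be written directly in plain Python; the sample values below are illustrative, not taken from EuroSAT:

```python
import math

def mse(y, y_hat):
    """Squared error for a single regression target: (y - y_hat)^2."""
    return (y - y_hat) ** 2

def nll(y, probs):
    """Negative log-likelihood: -log of the probability assigned to the
    true class y. The indicator in the sum selects this single term."""
    return -math.log(probs[y])

# Regression: true value 3.0, prediction 2.5
print(mse(3.0, 2.5))       # 0.25

# Classification over C = 3 classes, true class index 2
probs = [0.1, 0.2, 0.7]    # model's predicted class probabilities (sum to 1)
print(nll(2, probs))       # -log(0.7), small because the model is confident
```

Note that `nll` punishes confident mistakes heavily: as the probability assigned to the true class approaches zero, $-\log p_c$ grows without bound.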
|
|
289 | 289 | "\n",
|
290 | 290 | "In order to minimize our loss, we compute the gradient of the loss function with respect to model parameters $\\theta$ in a process called *backpropagation*. We then take a small step $\\alpha$ (also called the *learning rate*) in the direction of the negative gradient to update our model parameters:\n",
|
291 | 291 | "\n",
|
292 |
| - "$$\\theta \\leftarrow \\theta - \\alpha \\nabla_\\theta \\mathcal{L}(y, \\hat{y})$$\n", |
| 292 | + "$$\\theta \\leftarrow \\theta - \\alpha \\nabla_\\theta L(y, \\hat{y})$$\n", |
293 | 293 | "\n",
|
294 | 294 | "When done one image or one mini-batch at a time, this is known as *stochastic gradient descent* (SGD)."
|
295 | 295 | ]
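The update rule above can be sketched on a toy one-parameter model; the data, learning rate, and step count here are all illustrative:

```python
# One-parameter linear model y_hat = theta * x with squared-error loss
# L = (y - y_hat)^2, trained by repeated gradient-descent updates
# theta <- theta - alpha * dL/dtheta.
theta = 0.0        # model parameter, initialized at zero
alpha = 0.1        # learning rate
x, y = 2.0, 4.0    # a single training example (so the optimum is theta = 2)

for step in range(20):
    y_hat = theta * x
    grad = -2.0 * (y - y_hat) * x     # dL/dtheta by the chain rule
    theta = theta - alpha * grad      # gradient-descent update

print(theta)  # converges toward 2.0, since y = 2 * x
```

With one example per update, as here, this is exactly the stochastic (single-sample) variant of gradient descent; mini-batch SGD averages the gradient over a small batch before each update.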
|
|