
Commit f27b2c9: changed variable names
1 parent 831312c

17 files changed, +454 -461 lines

doc/pub/week37/html/._week37-bs022.html
Lines changed: 2 additions & 2 deletions

@@ -339,8 +339,8 @@ <h2 id="stochastic-gradient-descent" class="anchor">Stochastic Gradient Descent
 sum over \( n \) data points \( \{\mathbf{x}_i\}_{i=1}^n \),
 </p>
 $$
-C(\mathbf{\beta}) = \sum_{i=1}^n c_i(\mathbf{x}_i,
-\mathbf{\beta}).
+C(\mathbf{\theta}) = \sum_{i=1}^n c_i(\mathbf{x}_i,
+\mathbf{\theta}).
 $$

doc/pub/week37/html/._week37-bs023.html
Lines changed: 2 additions & 2 deletions

@@ -334,8 +334,8 @@ <h2 id="computation-of-gradients" class="anchor">Computation of gradients </h2>
 computed as a sum over \( i \)-gradients
 </p>
 $$
-\nabla_\beta C(\mathbf{\beta}) = \sum_i^n \nabla_\beta c_i(\mathbf{x}_i,
-\mathbf{\beta}).
+\nabla_\theta C(\mathbf{\theta}) = \sum_i^n \nabla_\theta c_i(\mathbf{x}_i,
+\mathbf{\theta}).
 $$

 <p>Stochasticity/randomness is introduced by only taking the
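
To make the two sums in these hunks concrete, here is a rough illustrative sketch (not code from the commit; the quadratic per-point cost, the toy data and all names are assumptions):

import numpy as np

# Hypothetical per-point cost and gradient for a linear model y_i ~ x_i^T theta
def c_i(x_i, y_i, theta):
    return (y_i - x_i @ theta)**2

def grad_c_i(x_i, y_i, theta):
    return -2.0*(y_i - x_i @ theta)*x_i

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))        # five data points, two features
y = rng.normal(size=5)
theta = np.zeros(2)

# C(theta) and its gradient assembled as sums over the data points
C = sum(c_i(X[i], y[i], theta) for i in range(len(y)))
grad_C = sum(grad_c_i(X[i], y[i], theta) for i in range(len(y)))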

doc/pub/week37/html/._week37-bs024.html
Lines changed: 4 additions & 4 deletions

@@ -344,10 +344,10 @@ <h2 id="sgd-example" class="anchor">SGD example </h2>
 picked at random in each gradient descent step
 </p>
 $$
-\nabla_{\beta}
-C(\mathbf{\beta}) = \sum_{i=1}^n \nabla_\beta c_i(\mathbf{x}_i,
-\mathbf{\beta}) \rightarrow \sum_{i \in B_k}^n \nabla_\beta
-c_i(\mathbf{x}_i, \mathbf{\beta}).
+\nabla_{\theta}
+C(\mathbf{\theta}) = \sum_{i=1}^n \nabla_\theta c_i(\mathbf{x}_i,
+\mathbf{\theta}) \rightarrow \sum_{i \in B_k}^n \nabla_\theta
+c_i(\mathbf{x}_i, \mathbf{\theta}).
 $$

doc/pub/week37/html/._week37-bs025.html
Lines changed: 2 additions & 2 deletions

@@ -332,8 +332,8 @@ <h2 id="the-gradient-step" class="anchor">The gradient step </h2>

 <p>Thus a gradient descent step now looks like </p>
 $$
-\beta_{j+1} = \beta_j - \gamma_j \sum_{i \in B_k}^n \nabla_\beta c_i(\mathbf{x}_i,
-\mathbf{\beta})
+\theta_{j+1} = \theta_j - \gamma_j \sum_{i \in B_k}^n \nabla_\theta c_i(\mathbf{x}_i,
+\mathbf{\theta})
 $$

 <p>where \( k \) is picked at random with equal
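
A minimal end-to-end sketch of this minibatch update (illustrative only: the OLS-style per-point cost, the toy data, the fixed step length gamma and the minibatch layout are assumptions, not code from the repository):

import numpy as np

rng = np.random.default_rng(1)
n, M = 100, 5                        # n data points, minibatch size M
m = n // M                           # number of minibatches
X = np.c_[np.ones(n), rng.normal(size=n)]
y = 4.0 + 3.0*X[:, 1] + rng.normal(size=n)

theta = rng.normal(size=2)
gamma = 0.01                         # fixed step length, for simplicity

for j in range(1000):
    k = rng.integers(m)              # pick minibatch B_k at random
    Xk = X[k*M:(k+1)*M]
    yk = y[k*M:(k+1)*M]
    # gradient of the summed per-point costs c_i over B_k, for c_i = (y_i - x_i^T theta)^2
    grad = -2.0*Xk.T @ (yk - Xk @ theta)
    theta = theta - gamma*grad

With a fixed gamma the iterates keep fluctuating around the minimum; the decaying step length in the later ._week37-bs029.html hunk addresses exactly this.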

doc/pub/week37/html/._week37-bs027.html
Lines changed: 1 addition & 1 deletion

@@ -338,7 +338,7 @@ <h2 id="when-do-we-stop" class="anchor">When do we stop? </h2>
 that we are close to a local/global minimum. However, we could also
 evaluate the cost function at this point, store the result and
 continue the search. If the test kicks in at a later stage we can
-compare the values of the cost function and keep the \( \beta \) that
+compare the values of the cost function and keep the \( \theta \) that
 gave the lowest value.
 </p>
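
The "store the result and keep the best theta" idea can be sketched as follows (a placeholder quadratic cost stands in for C(theta), and a random draw stands in for one full SGD run):

import numpy as np

def cost(theta):
    # placeholder for the real cost function C(theta)
    return float(np.sum(theta**2))

rng = np.random.default_rng(2)
best_cost, best_theta = np.inf, None
for restart in range(5):
    theta = rng.normal(size=2)       # stands in for the theta returned by one SGD run
    c = cost(theta)                  # evaluate the cost at this candidate
    if c < best_cost:                # keep the theta that gave the lowest value
        best_cost, best_theta = c, theta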

doc/pub/week37/html/._week37-bs029.html
Lines changed: 3 additions & 3 deletions

@@ -332,10 +332,10 @@ <h2 id="time-decay-rate" class="anchor">Time decay rate </h2>

 <p>As an example, let \( e = 0,1,2,3,\cdots \) denote the current epoch and let \( t_0, t_1 > 0 \) be two fixed numbers. Furthermore, let \( t = e \cdot m + i \) where \( m \) is the number of minibatches and \( i=0,\cdots,m-1 \). Then the function $$\gamma_j(t; t_0, t_1) = \frac{t_0}{t+t_1} $$ goes to zero as the number of epochs gets large. I.e. we start with a step length \( \gamma_j (0; t_0, t_1) = t_0/t_1 \) which decays in <em>time</em> \( t \).</p>

-<p>In this way we can fix the number of epochs, compute \( \beta \) and
+<p>In this way we can fix the number of epochs, compute \( \theta \) and
 evaluate the cost function at the end. Repeating the computation will
 give a different result since the scheme is random by design. Then we
-pick the final \( \beta \) that gives the lowest value of the cost
+pick the final \( \theta \) that gives the lowest value of the cost
 function.
 </p>

@@ -364,7 +364,7 @@ <h2 id="time-decay-rate" class="anchor">Time decay rate </h2>
 for i in range(m):
     k = np.random.randint(m) #Pick the k-th minibatch at random
     #Compute the gradient using the data in minibatch Bk
-    #Compute new suggestion for beta
+    #Compute new suggestion for theta
     t = epoch*m+i
     gamma_j = step_length(t,t0,t1)
     j += 1
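
A self-contained sketch of the decaying step length and the loop in this hunk (the values of n_epochs, m, t0 and t1 are illustrative; the gradient and parameter update are left as comments, as in the notes themselves):

import numpy as np

def step_length(t, t0, t1):
    # gamma(t; t0, t1) = t0/(t + t1): starts at t0/t1 and decays with time t
    return t0/(t + t1)

n_epochs, m = 50, 10          # number of epochs and number of minibatches (illustrative)
t0, t1 = 1.0, 10.0            # fixed schedule parameters (illustrative)

j = 0
for epoch in range(n_epochs):
    for i in range(m):
        k = np.random.randint(m)           # pick the k-th minibatch at random
        # compute the gradient using the data in minibatch Bk
        # compute the new suggestion for theta
        t = epoch*m + i
        gamma_j = step_length(t, t0, t1)   # decaying step length
        j += 1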

doc/pub/week37/html/._week37-bs041.html
Lines changed: 2 additions & 2 deletions

@@ -333,10 +333,10 @@ <h2 id="rmsprop-adaptive-learning-rates" class="anchor">RMSProp: Adaptive Learni
 Uses a decaying average of squared gradients (instead of a cumulative sum):
 </p>
 $$
-v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, (\nabla L(w_t))^2,
+v_t = \theta_2\, v_{t-1} + (1-\theta_2)\, (\nabla L(w_t))^2,
 $$

-<p>with \( \beta_2 \) typically \( 0.9 \) (or \( 0.99 \)).</p>
+<p>with \( \theta_2 \) typically \( 0.9 \) (or \( 0.99 \)).</p>
 <ol>
 <li> Update: \( w_{t+1} = w_t - \frac{\alpha}{\sqrt{v_t + \epsilon}} \nabla L(w_t) \).</li>
 <li> Recent gradients have more weight, so \( v_t \) adapts to the current landscape.</li>
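
A small sketch of this decaying-average update on a toy quadratic loss (alpha, the decay factor rho, the value of eps and the number of iterations are example choices, not taken from the notes):

import numpy as np

def grad_L(w):
    # gradient of the toy loss L(w) = sum(w**2)
    return 2.0*w

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
alpha, rho, eps = 0.01, 0.9, 1e-8      # rho plays the role of the decay factor above

for t in range(500):
    g = grad_L(w)
    v = rho*v + (1.0 - rho)*g**2       # decaying average of squared gradients
    w = w - alpha/np.sqrt(v + eps)*g   # per-parameter adaptive step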

doc/pub/week37/html/._week37-bs043.html
Lines changed: 3 additions & 3 deletions

@@ -340,13 +340,13 @@ <h2 id="rms-prop" class="anchor">RMS prop </h2>
 \begin{align}
 \mathbf{g}_t &= \nabla_\theta E(\boldsymbol{\theta})
 \tag{3}\\
-\mathbf{s}_t &=\beta \mathbf{s}_{t-1} +(1-\beta)\mathbf{g}_t^2 \nonumber \\
+\mathbf{s}_t &=\theta \mathbf{s}_{t-1} +(1-\theta)\mathbf{g}_t^2 \nonumber \\
 \boldsymbol{\theta}_{t+1}&=&\boldsymbol{\theta}_t - \eta_t { \mathbf{g}_t \over \sqrt{\mathbf{s}_t +\epsilon}}, \nonumber
 \end{align}
 $$

-<p>where \( \beta \) controls the averaging time of the second moment and is
-typically taken to be about \( \beta=0.9 \), \( \eta_t \) is a learning rate
+<p>where \( \theta \) controls the averaging time of the second moment and is
+typically taken to be about \( \theta=0.9 \), \( \eta_t \) is a learning rate
 typically chosen to be \( 10^{-3} \), and \( \epsilon\sim 10^{-8} \) is a
 small regularization constant to prevent divergences. Multiplication
 and division by vectors is understood as an element-wise operation. It

doc/pub/week37/html/._week37-bs044.html
Lines changed: 5 additions & 5 deletions

@@ -355,16 +355,16 @@ <h2 id="adam-optimizer-https-arxiv-org-abs-1412-6980" class="anchor"><a href="ht
 \begin{align}
 \mathbf{g}_t &= \nabla_\theta E(\boldsymbol{\theta})
 \tag{4}\\
-\mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} + (1-\beta_1) \mathbf{g}_t \nonumber \\
-\mathbf{s}_t &=\beta_2 \mathbf{s}_{t-1} +(1-\beta_2)\mathbf{g}_t^2 \nonumber \\
-\boldsymbol{\mathbf{m}}_t&={\mathbf{m}_t \over 1-\beta_1^t} \nonumber \\
-\boldsymbol{\mathbf{s}}_t &={\mathbf{s}_t \over1-\beta_2^t} \nonumber \\
+\mathbf{m}_t &= \theta_1 \mathbf{m}_{t-1} + (1-\theta_1) \mathbf{g}_t \nonumber \\
+\mathbf{s}_t &=\theta_2 \mathbf{s}_{t-1} +(1-\theta_2)\mathbf{g}_t^2 \nonumber \\
+\boldsymbol{\mathbf{m}}_t&={\mathbf{m}_t \over 1-\theta_1^t} \nonumber \\
+\boldsymbol{\mathbf{s}}_t &={\mathbf{s}_t \over1-\theta_2^t} \nonumber \\
 \boldsymbol{\theta}_{t+1}&=\boldsymbol{\theta}_t - \eta_t { \boldsymbol{\mathbf{m}}_t \over \sqrt{\boldsymbol{\mathbf{s}}_t} +\epsilon}, \nonumber \\
 \tag{5}
 \end{align}
 $$

-<p>where \( \beta_1 \) and \( \beta_2 \) set the memory lifetime of the first and
+<p>where \( \theta_1 \) and \( \theta_2 \) set the memory lifetime of the first and
 second moment and are typically taken to be \( 0.9 \) and \( 0.99 \)
 respectively, and \( \eta \) and \( \epsilon \) are identical to RMSprop.
 </p>
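
And a corresponding sketch of the Adam update in equations (4)-(5), again on a toy quadratic cost (the decay factors 0.9 and 0.99 and eta = 1e-3 follow the typical values quoted above; rho1 and rho2 below stand for the two decay factors, and everything else is an assumption):

import numpy as np

def grad_E(theta):
    # gradient of the toy cost E(theta) = sum(theta**2)
    return 2.0*theta

theta = np.array([1.0, -2.0])
m_t = np.zeros_like(theta)             # first moment
s_t = np.zeros_like(theta)             # second moment
eta, rho1, rho2, eps = 1e-3, 0.9, 0.99, 1e-8

for t in range(1, 2001):
    g = grad_E(theta)
    m_t = rho1*m_t + (1.0 - rho1)*g            # decaying average of gradients
    s_t = rho2*s_t + (1.0 - rho2)*g**2         # decaying average of squared gradients
    m_hat = m_t/(1.0 - rho1**t)                # bias corrections
    s_hat = s_t/(1.0 - rho2**t)
    theta = theta - eta*m_hat/(np.sqrt(s_hat) + eps)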

doc/pub/week37/html/._week37-bs047.html
Lines changed: 2 additions & 2 deletions

@@ -356,8 +356,8 @@ <h2 id="sneaking-in-automatic-differentiation-using-autograd" class="anchor">Sne
 import matplotlib.pyplot as plt
 from autograd import grad

-def CostOLS(beta):
-    return (1.0/n)*np.sum((y-X @ beta)**2)
+def CostOLS(theta):
+    return (1.0/n)*np.sum((y-X @ theta)**2)

 n = 100
 x = 2*np.random.rand(n,1)
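
For context, a self-contained variant of this snippet that also runs plain gradient descent on CostOLS with the autograd gradient (only the imports, CostOLS, n and x appear in the hunk; the data model, the design matrix, the learning rate and the loop are illustrative assumptions, and np is assumed to be autograd's NumPy wrapper):

import autograd.numpy as np            # autograd's NumPy wrapper, so grad can trace the cost
from autograd import grad

n = 100
x = 2*np.random.rand(n, 1)
y = 4 + 3*x + np.random.randn(n, 1)    # assumed linear data model
X = np.c_[np.ones((n, 1)), x]          # design matrix with an intercept column

def CostOLS(theta):
    return (1.0/n)*np.sum((y - X @ theta)**2)

training_gradient = grad(CostOLS)      # automatic gradient of the cost w.r.t. theta

theta = np.random.randn(2, 1)
eta = 0.1
for _ in range(1000):
    theta = theta - eta*training_gradient(theta)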
