
Commit 553bcab

Update week47.do.txt
1 parent e162f98 commit 553bcab

File tree

1 file changed: +274 -0 lines changed

doc/src/week47/week47.do.txt

Lines changed: 274 additions & 0 deletions
@@ -894,3 +894,277 @@ if __name__ == "__main__":
evaluate_and_plot(model, dataset, seq_len=seq_len)

!ec

Simple RNN Lecture Series

Lecture 1: Introduction to Simple RNNs

\begin{frame}{Why Recurrent Networks?}
\begin{itemize}
\item Traditional feedforward networks process fixed-size inputs and ignore temporal order. RNNs incorporate recurrence to handle sequential data such as time series or language.
\item At each time step, an RNN cell processes the input $x_t$ and the hidden state $h_{t-1}$ from the previous step, producing a new hidden state $h_t$ and (optionally) an output $y_t$.
\item This hidden state acts as a ``memory'' carrying information forward. For example, predicting stock prices or the next word in a sentence relies on past inputs.
\item RNNs share parameters across time steps, so they can generalize patterns regardless of sequence length.
\end{itemize}
\end{frame}

\begin{frame}{RNN Forward Pass Equations}
\begin{itemize}
\item For a simple (vanilla) RNN with one hidden layer and no bias, the state update and output are
\[
\mathbf{h}_t = \phi(\mathbf{W}_{xh}\mathbf{x}_t + \mathbf{W}_{hh}\mathbf{h}_{t-1}),\qquad
\mathbf{y}_t = \mathbf{W}_{yh}\mathbf{h}_t,
\]
where $\phi$ is a nonlinear activation (e.g.\ $\tanh$ or ReLU).
\item In matrix form, $\mathbf{W}_{xh}\in\mathbb{R}^{h\times d}$, $\mathbf{W}_{hh}\in\mathbb{R}^{h\times h}$, and $\mathbf{W}_{yh}\in\mathbb{R}^{q\times h}$ for input dimension $d$, hidden dimension $h$, and output dimension $q$.
\item We often also write $\mathbf{y}_t = f(\mathbf{o}_t)$ with $\mathbf{o}_t=\mathbf{W}_{yh}\mathbf{h}_t$ to include a final activation for classification.
\item Because the same weight matrices are used at each step, gradients during training propagate through time (see Lecture 2). A NumPy sketch of the forward pass follows this frame.
\end{itemize}
\end{frame}
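
\begin{frame}[fragile]{Code Sketch: Forward Pass in NumPy}
A minimal NumPy sketch of the update equations above (our own illustration; the dimensions and random weights are arbitrary placeholder choices).
\begin{lstlisting}
import numpy as np

d, h, q, T = 3, 4, 2, 5            # input, hidden, output dims; steps
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(h, d))
W_hh = rng.normal(scale=0.1, size=(h, h))
W_yh = rng.normal(scale=0.1, size=(q, h))

x = rng.normal(size=(T, d))        # one input sequence of length T
h_t = np.zeros(h)                  # initial hidden state
for t in range(T):
    h_t = np.tanh(W_xh @ x[t] + W_hh @ h_t)   # state update
    y_t = W_yh @ h_t                          # per-step output
    print(t, y_t)
\end{lstlisting}
\end{frame}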

\begin{frame}{Unrolled RNN in Time}
\begin{itemize}
\item Unrolling the RNN cell over, say, three time steps yields three copies of the cell, all sharing the same weights.
\item The inputs $x_1,x_2,x_3$ are fed sequentially; the hidden state flows from one step to the next, capturing past context.
\item After processing the final input $x_T$, the network can make a single prediction (many-to-one), or outputs can be produced at each step (many-to-many).
\item Unrolling clarifies that training an RNN is like training a deep feedforward network of depth $T$, with recurrent connections tying the layers together.
\end{itemize}
\end{frame}

\begin{frame}{Example Task: Character-level RNN Classification}
\begin{itemize}
\item A classic example: feed a name (a sequence of characters) one character at a time, and classify its language of origin.
\item At each step, the RNN outputs a hidden state; we use the final hidden state to predict the class of the entire sequence.
\item ``A character-level RNN reads words as a series of characters---outputting a prediction and `hidden state' at each step, feeding the previous hidden state into the next step. We take the final prediction to be the output.''
\item This illustrates sequence-to-one modeling: the output depends on all previous inputs.
\end{itemize}
\end{frame}

\begin{frame}[fragile]{PyTorch: Defining a Simple RNN}
\begin{lstlisting}
import torch, torch.nn as nn

# A simple RNN followed by a linear classification layer
rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=2, batch_first=True)
fc = nn.Linear(20, 5)            # 5 output classes

# Example input: batch of 3 sequences, each of length 7, input dim 10
x = torch.randn(3, 7, 10)
output, hn = rnn(x)              # output: (3, 7, 20), hn: (2, 3, 20)
logits = fc(output[:, -1, :])    # class scores from last step: (3, 5)
\end{lstlisting}
\begin{itemize}
\item PyTorch's \texttt{nn.RNN(input\_size, hidden\_size, num\_layers)} stacks RNN layers (here 2).
\item The \texttt{output} tensor has shape \texttt{(batch, seq\_len, hidden\_size)}; \texttt{hn} contains the last hidden state of each layer.
\item Since \texttt{nn.RNN} returns the pair \texttt{(output, hn)}, we apply the linear layer explicitly to the last time step rather than wrapping everything in \texttt{nn.Sequential}.
\item This code example is adapted from the PyTorch documentation.
\end{itemize}
\end{frame}

\begin{frame}[fragile]{TensorFlow: Defining a Simple RNN}
\begin{lstlisting}
from tensorflow.keras import layers, models

# Sequential model with a single SimpleRNN layer
model = models.Sequential([
    layers.SimpleRNN(64, input_shape=(None, 10), activation='tanh'),
    layers.Dense(5, activation='softmax')
])
model.summary()
\end{lstlisting}
\begin{itemize}
\item This Keras model takes variable-length sequences of 10-dimensional inputs; the \texttt{SimpleRNN(64)} layer returns a 64-dimensional vector.
\item A dense output layer is then used for classification. For example, \texttt{model.add(layers.SimpleRNN(128))} produces output shape \texttt{(None, 128)}.
\item Both frameworks require you to handle batching and sequence lengths: \texttt{batch\_first=True} in PyTorch makes the input shape \texttt{(batch, seq, features)}, whereas in Keras the batch dimension comes first by default.
\end{itemize}
\end{frame}

Lecture 2: Backpropagation Through Time (BPTT) and Gradients

\begin{frame}{Backpropagation Through Time (BPTT)}
\begin{itemize}
\item Training an RNN involves computing gradients through time by unfolding the network: treat the unrolled RNN as a very deep feedforward net.
\item We compute the loss $L = \frac{1}{T}\sum_{t=1}^T \ell(y_t,\hat y_t)$ and backpropagate from $t=T$ down to $t=1$.
\item The computational graph (next slide) shows how each hidden state depends on the inputs and parameters across time.
\item BPTT applies the chain rule along this graph, accumulating gradients from each time step into the shared parameters.
\end{itemize}
\end{frame}

\begin{frame}{RNN Computational Graph}
\begin{itemize}
\item In the computational graph, boxes (variables) and circles (operations) illustrate the dependencies: each $h_t$ depends on $W_{xh}$, $W_{hh}$, $h_{t-1}$, and $x_t$.
\item During the backward pass, we traverse this graph in reverse, summing gradients that flow in from future time steps.
\item Note how the hidden-state paths merge: gradients at step $t$ come both from the loss at $t$ and from the next step $t+1$.
\end{itemize}
\end{frame}

\begin{frame}{Gradient Computation (BPTT Equations)}
\begin{itemize}
\item Let $L$ be the total loss. For the final time step $T$,
\[
\frac{\partial L}{\partial h_T} = W_{yh}^T \frac{\partial L}{\partial y_T},
\]
using $y_T=W_{yh}h_T$.
\item For any intermediate step $t<T$, the gradient with respect to the hidden state obeys the recurrence
\[
\frac{\partial L}{\partial h_t} = W_{hh}^T\frac{\partial L}{\partial h_{t+1}} + W_{yh}^T\frac{\partial L}{\partial y_t}.
\]
\item In words: error from the future step ($\partial L/\partial h_{t+1}$) is backpropagated through $W_{hh}$, and error from the current output is backpropagated through $W_{yh}$.
\item The parameter gradients are sums over time, e.g.\ $\partial L/\partial W_{yh} = \sum_{t=1}^T \frac{\partial L}{\partial y_t}h_t^T$, and similarly for $W_{xh}$ and $W_{hh}$. A small numerical check follows this frame.
\end{itemize}
\end{frame}
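
\begin{frame}[fragile]{Code Sketch: Checking the BPTT Recurrence}
A small numerical check of the hidden-state recurrence above, using autograd in PyTorch. It assumes the same simplified setting as the slide (identity activation, MSE loss averaged over $T$ steps); all sizes are arbitrary.
\begin{lstlisting}
import torch
torch.manual_seed(0)
d, h, q, T = 3, 4, 2, 5
W_xh = torch.randn(h, d, requires_grad=True)
W_hh = torch.randn(h, h, requires_grad=True)
W_yh = torch.randn(q, h, requires_grad=True)
x, target = torch.randn(T, d), torch.randn(T, q)

hs, ys, h_t = [], [], torch.zeros(h)
for t in range(T):
    h_t = W_xh @ x[t] + W_hh @ h_t   # identity activation
    h_t.retain_grad()                # keep dL/dh_t for inspection
    hs.append(h_t); ys.append(W_yh @ h_t)
L = sum(((y - tg)**2).sum() for y, tg in zip(ys, target)) / T
L.backward()

t = 2                                # check an intermediate step
dL_dy = 2 * (ys[t].detach() - target[t]) / T
manual = W_hh.detach().T @ hs[t + 1].grad + W_yh.detach().T @ dL_dy
print(torch.allclose(manual, hs[t].grad, atol=1e-5))   # True
\end{lstlisting}
\end{frame}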

\begin{frame}{Exploding and Vanishing Gradients}
\begin{itemize}
\item Unfolding BPTT shows that the gradients involve powers of $W_{hh}$: expanding the recurrence gives terms like $(W_{hh}^T)^{k}$ multiplying gradients from $k$ steps ahead.
\item Vanishing gradients: if $\|W_{hh}\|<1$ (eigenvalues $<1$), repeated multiplications shrink gradients exponentially, making it hard to learn long-range dependencies.
\item Exploding gradients: if $\|W_{hh}\|>1$, gradients grow exponentially, causing instability.
\item As noted in the literature: ``eigenvalues smaller than 1 vanish and eigenvalues larger than 1 diverge. This is numerically unstable\ldots manifesting as vanishing and exploding gradients.''
\item In practice one mitigates this with careful weight initialization and gradient clipping (e.g.\ \texttt{clip\_grad\_norm\_} in PyTorch), or by truncating BPTT. A small numerical illustration follows this frame.
\end{itemize}
\end{frame}
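
\begin{frame}[fragile]{Code Sketch: Why Gradients Vanish or Explode}
A small NumPy illustration (our own, not from the notes): the backpropagated error picks up a factor $W_{hh}^T$ per step, so after $k$ steps its norm scales roughly like $\rho(W_{hh})^k$, where $\rho$ is the spectral radius.
\begin{lstlisting}
import numpy as np

rng = np.random.default_rng(0)
h, k = 16, 50
M = rng.normal(size=(h, h))
M /= np.max(np.abs(np.linalg.eigvals(M)))  # spectral radius 1

g = rng.normal(size=h)                     # some upstream gradient
for rho, label in [(0.9, "vanishing"), (1.1, "exploding")]:
    W_hh = rho * M                         # spectral radius rho
    grad = g.copy()
    for _ in range(k):                     # k steps of backprop
        grad = W_hh.T @ grad
    print(label, np.linalg.norm(grad))     # roughly rho**k * |g|
\end{lstlisting}
\end{frame}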

\begin{frame}{Truncated BPTT and Gradient Clipping}
\begin{itemize}
\item Truncated BPTT: instead of backpropagating through all $T$ steps, we backpropagate through a fixed window of length $\tau$. This approximates the full gradient and reduces computation.
\item Concretely, one computes gradients up to $\tau$ steps back and treats gradients beyond that as zero. This still allows learning short-term patterns efficiently.
\item Gradient clipping: cap the gradient norm at a maximum value to prevent explosion. For example, in PyTorch, \texttt{torch.nn.utils.clip\_grad\_norm\_(model.parameters(), max\_norm=1.0)} ensures $\|\nabla\|\le 1$ (see the training-loop sketch after this frame).
\item These techniques help stabilize training, but the fundamental vanishing problem motivates alternative RNN cells (LSTM/GRU) in practice (not covered here).
\end{itemize}
\end{frame}
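
\begin{frame}[fragile]{Code Sketch: Truncated BPTT with Gradient Clipping}
A sketch of the two techniques above in a PyTorch training loop (our own toy setup: a long sine series cut into windows of length $\tau=50$; detaching the hidden state truncates the gradient at the window boundary).
\begin{lstlisting}
import torch, torch.nn as nn

rnn, fc = nn.RNN(1, 32, batch_first=True), nn.Linear(32, 1)
params = list(rnn.parameters()) + list(fc.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

series = torch.sin(torch.linspace(0, 100, 2000)).view(1, -1, 1)
tau, h = 50, None
for start in range(0, series.size(1) - tau - 1, tau):
    x = series[:, start:start + tau, :]
    y = series[:, start + 1:start + tau + 1, :]  # next-step targets
    out, h = rnn(x, h)
    h = h.detach()       # truncate: no gradient beyond this window
    loss = nn.functional.mse_loss(fc(out), y)
    opt.zero_grad(); loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    opt.step()
\end{lstlisting}
\end{frame}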

\begin{frame}{Mathematical Insight on Gradients}
\begin{itemize}
\item In a linearized RNN one can derive (cf.\ eq.\ 9.7.15 in Dive into Deep Learning)
\[
\frac{\partial L}{\partial h_t} = \sum_{i=t}^T \big(W_{hh}^T\big)^{T-i}\, W_{yh}^T\, \frac{\partial L}{\partial y_{T+t-i}}.
\]
\item From this, any eigenvalue $\lambda$ of $W_{hh}$ contributes factors up to $\lambda^{T-t}$ to the gradient. If $|\lambda|<1$, these terms vanish as $T\to\infty$; if $|\lambda|>1$, they explode.
\item Thus simple RNNs struggle with long-term dependencies: information far back in time has negligible influence on the loss gradient.
\item Truncation effectively ignores terms beyond a window, trading accuracy for stability.
\end{itemize}
\end{frame}

Lecture 3: Applications of Simple RNNs

\begin{frame}{RNNs for Time Series Forecasting}
\begin{itemize}
\item Forecasting: RNNs can predict future values from historical data. Example tasks include stock prices, weather patterns, or any temporal signal.
\item By feeding in the sequence $\{x_1,x_2,\dots,x_T\}$, an RNN can output a one-step-ahead prediction $y_T$ or even a full sequence $\{y_2,\dots,y_{T+1}\}$.
\item Unlike linear models, RNNs can capture complex temporal patterns (trends, seasonality, autocorrelation) in a data-driven way.
\item Preprocessing (normalization, sliding windows) is important, and the data should be split into train/test sets by time, with no shuffling (see the sketch after this frame).
\end{itemize}
\end{frame}
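
\begin{frame}[fragile]{Code Sketch: Windowing a Time Series}
A sketch of the preprocessing mentioned above (our own toy series; window length and split fraction are arbitrary): normalize with training-set statistics, build sliding windows, and split chronologically.
\begin{lstlisting}
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 50, 500)
series = np.sin(t) + 0.1 * rng.normal(size=t.size)   # toy data

split_t = int(0.8 * len(series))                  # split by time
mean, std = series[:split_t].mean(), series[:split_t].std()
series = (series - mean) / std                    # normalize

window = 20
X = np.stack([series[i:i + window]
              for i in range(len(series) - window)])
y = series[window:]                               # next-value targets

split = int(0.8 * len(X))                         # no shuffling!
X_train, y_train = X[:split], y[:split]
X_test,  y_test  = X[split:], y[split:]
print(X_train.shape, X_test.shape)   # (384, 20) (96, 20)
# reshape to (samples, window, 1) before feeding an RNN
\end{lstlisting}
\end{frame}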

\begin{frame}{Sequence Modeling Tasks}
\begin{itemize}
\item Many-to-one: classify or predict one value from an entire sequence (e.g.\ sentiment analysis of a movie review, or classifying a time series). We use the final hidden state as a summary of the sequence.
\item Many-to-many: predict an output at each time step (e.g.\ language modeling or sequential regression). The RNN output is used at every step (both modes are sketched in code after this frame).
\item Encoder--decoder (seq2seq, advanced): map input sequences to output sequences of different lengths. Typically LSTM-based, but conceptually possible with simple RNNs.
\item RNNs also apply in physics and biology, e.g.\ modeling dynamical systems, protein sequences, or neuroscience time series. Any domain with sequential data can use RNN-based modeling.
\end{itemize}
\end{frame}
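
\begin{frame}[fragile]{Code Sketch: Many-to-One vs.\ Many-to-Many}
A short PyTorch illustration of the two output modes described above (dimensions are arbitrary placeholder choices).
\begin{lstlisting}
import torch, torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 3)            # 3 classes or 3 targets

x = torch.randn(4, 12, 8)          # batch=4, seq_len=12
out, h_n = rnn(x)                  # out: (4, 12, 16)

many_to_one = head(out[:, -1, :])  # one output per sequence: (4, 3)
many_to_many = head(out)           # output at every step: (4, 12, 3)
print(many_to_one.shape, many_to_many.shape)
\end{lstlisting}
\end{frame}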

\begin{frame}[fragile]{PyTorch Example: Time Series Regression}
\begin{lstlisting}
import torch
import torch.nn as nn

# Simple RNN for sequence-to-value regression
class ForecastRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)
    def forward(self, x):
        out, _ = self.rnn(x)           # out: (batch, seq_len, hidden_dim)
        return self.fc(out[:, -1, :])  # use last output for the prediction

model = ForecastRNN(input_dim=1, hidden_dim=20)
x = torch.randn(16, 10, 1)   # batch=16, seq_len=10, one feature
y_pred = model(x)            # shape (16, 1)
\end{lstlisting}
\begin{itemize}
\item This RNN reads 10 time steps of a univariate series and outputs a single one-step-ahead prediction.
\item A linear layer applied to the last RNN output (\texttt{out[:, -1, :]}) produces the forecast.
\item Training would minimize the MSE between \texttt{y\_pred} and the true next value.
\end{itemize}
\end{frame}

\begin{frame}[fragile]{TensorFlow Example: Time Series Regression}
\begin{lstlisting}
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.SimpleRNN(20, input_shape=(None, 1), activation='tanh'),
    layers.Dense(1)
])
model.compile(loss='mse', optimizer='adam')
model.summary()
\end{lstlisting}
\begin{itemize}
\item Here \texttt{SimpleRNN(20)} processes sequences of shape \texttt{(batch, seq\_len, 1)} and returns a 20-dimensional vector for the last time step.
\item A single value is then predicted with \texttt{Dense(1)}; the model is compiled with mean-squared error for regression.
\item This mirrors the PyTorch example: both frameworks allow easy stacking of an RNN layer followed by a dense output layer.
\end{itemize}
\end{frame}

\begin{frame}{Other Sequence Applications}
\begin{itemize}
\item Sequence classification: use the final RNN hidden state to predict a class label, for example classifying a time series as anomalous or normal.
\item Sequence labeling: predict a label at each time step (e.g.\ part-of-speech tagging). The RNN output at every step is passed through a classification layer (see the sketch after this frame).
\item Language and text (advanced): character- or word-level models use RNNs to generate text or classify documents, e.g.\ predicting the next character from the previous ones (an RNN language model).
\item Physically motivated data: RNNs can model dynamical systems (e.g.\ rolling-ball trajectories, neuron spikes over time, climate data), learning temporal patterns directly from data without explicit equations.
\end{itemize}
\end{frame}
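
\begin{frame}[fragile]{Code Sketch: Sequence Labeling in Keras}
A minimal Keras sketch of per-step classification (sequence labeling); the dimensions and the number of classes are placeholder choices.
\begin{lstlisting}
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.SimpleRNN(32, return_sequences=True,
                     input_shape=(None, 10)),
    layers.Dense(5, activation='softmax')   # label per time step
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam')
model.summary()   # output shape: (None, None, 5)
\end{lstlisting}
\end{frame}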

\begin{frame}{Training and Practical Tips}
\begin{itemize}
\item Loss functions: use MSE for regression tasks and cross-entropy for classification tasks; sum or average losses over time steps as needed.
\item Batching sequences: handle variable-length sequences by padding or masking. PyTorch's \texttt{pack\_padded\_sequence} or Keras masking can help (see the sketch after this frame).
\item Optimization: standard optimizers (SGD, Adam) work; the learning rate may need tuning due to sequential correlations.
\item Initial hidden state: usually initialized to zeros. One can also learn an initial state or carry the hidden state across batches for very long sequences (\texttt{stateful=True} in Keras).
\item Regularization: dropout can be applied to inputs or recurrent states (PyTorch's \texttt{nn.RNN} has a \texttt{dropout} option between stacked layers; Keras has \texttt{dropout}/\texttt{recurrent\_dropout}).
\end{itemize}
\end{frame}
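
\begin{frame}[fragile]{Code Sketch: Padding and Packing in PyTorch}
A minimal sketch of batching variable-length sequences with padding and packing (the three random sequences are made up for illustration).
\begin{lstlisting}
import torch, torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

seqs = [torch.randn(n, 10) for n in (7, 5, 3)]   # three sequences
lengths = torch.tensor([s.size(0) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)    # (3, 7, 10)
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=True)

rnn = nn.RNN(10, 20, batch_first=True)
packed_out, h_n = rnn(packed)
print(h_n.shape)   # (1, 3, 20): last valid state per sequence
\end{lstlisting}
\end{frame}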

Lecture 4: Advanced Topics in Simple RNNs

\begin{frame}{Stacked (Deep) RNNs}
\begin{itemize}
\item In a deep RNN, multiple RNN layers are stacked: the output of layer $l-1$ at time $t$ feeds into layer $l$ at the same time step (cf.\ Fig.\ 10.3.1 in Dive into Deep Learning).
\item Formally, for layer $l$ (in row-vector notation):
\[
H_t^{(l)} = \phi\big(H_t^{(l-1)}W_{xh}^{(l)} + H_{t-1}^{(l)}W_{hh}^{(l)} + b_h^{(l)}\big).
\]
\item Stacking increases model capacity, allowing more complex input-to-output transformations and temporal modeling.
\item In PyTorch, use \texttt{num\_layers>1} or explicitly stack \texttt{nn.RNN} layers. In Keras, add multiple \texttt{SimpleRNN} layers with \texttt{return\_sequences=True} on all but the last (see the sketch after this frame).
\end{itemize}
\end{frame}
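
\begin{frame}[fragile]{Code Sketch: A Stacked SimpleRNN in Keras}
A sketch of a two-layer stacked SimpleRNN (our own example; layer sizes are arbitrary). The first layer must return full sequences so that the second layer receives an input at every time step.
\begin{lstlisting}
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.SimpleRNN(32, return_sequences=True,
                     input_shape=(None, 10)),
    layers.SimpleRNN(32),               # final state only
    layers.Dense(5, activation='softmax')
])
model.summary()
\end{lstlisting}
\end{frame}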

\begin{frame}{Bidirectional RNNs (Brief Mention)}
\begin{itemize}
\item Though beyond ``simple'' RNNs, note that bidirectional RNNs process the sequence forward and backward and concatenate the two hidden states. They are useful when the entire sequence is known in advance (e.g.\ offline tasks).
\item In PyTorch: \texttt{nn.RNN(..., bidirectional=True)} makes a two-directional RNN. In Keras: \texttt{layers.Bidirectional(layers.SimpleRNN(...))}.
\item This doubles the number of parameters and the output size but can improve context capture. It still uses simple RNN cells internally (no gates).
\end{itemize}
\end{frame}

\begin{frame}{Parameter and Output Shapes}
\begin{itemize}
\item For PyTorch's \texttt{nn.RNN} with \texttt{batch\_first=True}, the input has shape \texttt{(batch, seq\_len, input\_dim)} and the output has shape \texttt{(batch, seq\_len, hidden\_size)}.
\item Keras \texttt{SimpleRNN(units, input\_shape=(None, input\_dim))} outputs shape \texttt{(batch, units)} by default (last output). To get full sequences, use \texttt{return\_sequences=True}.
\item Hidden-state shapes: PyTorch returns \texttt{h\_n} of shape \texttt{(num\_layers * num\_directions, batch, hidden\_size)}.
\item Always check the framework documentation for the exact conventions (e.g.\ PyTorch's \texttt{batch\_first} flag); a shape-check sketch follows this frame.
\end{itemize}
\end{frame}
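
\begin{frame}[fragile]{Code Sketch: Checking Shapes in PyTorch}
A quick shape check (our own example) for a stacked, bidirectional \texttt{nn.RNN}, matching the conventions listed above.
\begin{lstlisting}
import torch, torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=2,
             batch_first=True, bidirectional=True)
x = torch.randn(3, 7, 10)   # (batch, seq_len, input_dim)
out, h_n = rnn(x)
print(out.shape)   # (3, 7, 40): hidden_size * num_directions
print(h_n.shape)   # (4, 3, 20): layers * directions, batch, hidden
\end{lstlisting}
\end{frame}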

\begin{frame}{Limitations and Considerations}
\begin{itemize}
\item Vanishing gradients: simple RNNs have fundamental difficulty learning long-term dependencies due to gradient decay.
\item Capacity: without gates, RNNs may struggle with tasks that require remembering inputs far in the past, and training can be slow since it is inherently sequential.
\item Alternatives: in practice, gated RNNs (LSTM/GRU) or Transformers are often used for long-range dependencies. Simple RNNs remain instructive and are sometimes sufficient for short sequences.
\item Regularization: weight decay or dropout (on inputs/states) can help generalization but must be applied carefully because of temporal correlations.
\item Statefulness: for very long sequences, one can preserve the hidden state across batches (a stateful RNN) to avoid resetting the memory.
\end{itemize}
\end{frame}

\begin{frame}{Summary of Forward/Backward}
\begin{itemize}
\item Forward pass: $h_t = \phi(W_{xh} x_t + W_{hh} h_{t-1} + b)$, $y_t = W_{yh} h_t + c$.
\item Backward pass (BPTT): gradients flow as
\[
\frac{\partial L}{\partial h_t} = W_{hh}^T \frac{\partial L}{\partial h_{t+1}} + W_{yh}^T \frac{\partial L}{\partial y_t},
\]
accumulating through time.
\item Training: use gradient descent (SGD, Adam, etc.) to update the weights. Monitor for vanishing/exploding gradients and use clipping or truncation as needed.
\item Usage: RNNs excel at modeling sequences with short- to mid-term dependencies. They have been widely used for time series, NLP, and any domain with temporal structure.
\end{itemize}
\end{frame}

\begin{frame}{Further Resources}
\begin{itemize}
\item Goodfellow, Bengio, and Courville, \emph{Deep Learning} (2016) -- chapter on RNNs, BPTT, and gradient issues.
\item \emph{Dive into Deep Learning} (https://d2l.ai) -- Sections 9.4--9.7 cover RNN implementation and BPTT.
\item PyTorch and TensorFlow official tutorials -- practical code examples for RNNs.
\item Original papers: Pascanu et al.\ (2013), \emph{On the Difficulty of Training Recurrent Neural Networks}; Bengio et al.\ (1994), \emph{Learning Long-Term Dependencies with Gradient Descent is Difficult}.
\end{itemize}
\end{frame}
