\item Traditional feedforward networks process fixed-size inputs and ignore temporal order. RNNs incorporate recurrence to handle sequential data such as time series or language.
\item At each time step, an RNN cell processes the input $x_t$ and the hidden state $h_{t-1}$ from the previous step, producing a new hidden state $h_t$ and (optionally) an output $y_t$.
\item This hidden state acts as a “memory” carrying information forward. For example, predicting stock prices or the next word in a sentence relies on past inputs.
\item RNNs share parameters across time steps, so they can generalize patterns regardless of sequence length.
\end{itemize}
\end{frame}
\begin{frame}{RNN Forward Pass Equations}
\begin{itemize}
\item For a simple (vanilla) RNN with one hidden layer and no bias, the state update and output are
\[
h_t = \phi\!\left(\mathbf{W}_{xh}\, x_t + \mathbf{W}_{hh}\, h_{t-1}\right), \qquad
y_t = \mathbf{W}_{yh}\, h_t,
\]
where $\phi$ is a nonlinear activation (e.g.\ tanh or ReLU).
\item In matrix form, $\mathbf{W}_{xh}\in\mathbb{R}^{h\times d}$, $\mathbf{W}_{hh}\in\mathbb{R}^{h\times h}$, $\mathbf{W}_{yh}\in\mathbb{R}^{q\times h}$ for input dimension $d$, hidden dimension $h$, and output dimension $q$.
\item We often also write $y_t = f(\mathbf{o}_t)$ with $\mathbf{o}_t=\mathbf{W}_{yh}h_t$ to include a final activation for classification.
\item Because the same weights $\mathbf{W}$ are used at every step, gradients during training propagate through time (see Lecture 2). A minimal code sketch of this forward pass follows on the next slide.
\end{itemize}
\end{frame}
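
\begin{frame}[fragile]{Code Sketch: Vanilla RNN Forward Pass}
A minimal NumPy sketch of the forward pass above (illustrative only; shapes follow the previous slide, with $\mathbf{W}_{xh}\in\mathbb{R}^{h\times d}$, $\mathbf{W}_{hh}\in\mathbb{R}^{h\times h}$, $\mathbf{W}_{yh}\in\mathbb{R}^{q\times h}$):
\begin{verbatim}
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_yh, h0=None):
    # Sketch: h_t = tanh(W_xh x_t + W_hh h_{t-1}); y_t = W_yh h_t (no bias)
    h_t = np.zeros(W_hh.shape[0]) if h0 is None else h0
    ys = []
    for x_t in xs:                    # xs: sequence of input vectors of size d
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_t)
        ys.append(W_yh @ h_t)         # each y_t has size q
    return np.stack(ys), h_t          # outputs (T, q) and final hidden state
\end{verbatim}
\end{frame}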
\begin{frame}{Unrolled RNN in Time}
\begin{itemize}
\item The diagram above shows an RNN cell unrolled over three time steps. Each copy of the cell shares the same weights.
\item Inputs $x_1,x_2,x_3$ feed in sequentially; the hidden state flows from one step to the next, capturing past context.
\item After processing the final input $x_T$, the network can make a single prediction (many-to-one), or outputs can be produced at each step (many-to-many).
\item Unrolling clarifies that training an RNN is like training a deep feedforward network of depth $T$, with recurrent connections tying the layers together.
\item A classic example: feed a name (a sequence of characters) one character at a time, and classify its language of origin.
\item At each step, the RNN outputs a hidden state; we use the final hidden state to predict the class of the entire sequence.
\item “A character-level RNN reads words as a series of characters—outputting a prediction and ‘hidden state’ at each step, feeding the previous hidden state into the next step. We take the final prediction to be the output.”
\item This illustrates sequence-to-one modeling: every output depends on all previous inputs.
\item Both PyTorch and TensorFlow Keras require you to handle batching and sequence lengths (e.g., \texttt{batch\_first=True} in PyTorch makes the input shape \texttt{(batch, seq, features)}, whereas in TensorFlow Keras the batch dimension is first by default). A sketch of a many-to-one classifier in PyTorch follows on the next slide.
\end{itemize}
\end{frame}
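
\begin{frame}[fragile]{Code Sketch: Many-to-One Classifier in PyTorch}
A minimal sketch of a character-level, many-to-one classifier (the class name and sizes are illustrative; this is not the tutorial's exact code):
\begin{verbatim}
import torch.nn as nn

class CharRNNClassifier(nn.Module):
    def __init__(self, n_chars, hidden_size, n_classes):
        super().__init__()
        self.rnn = nn.RNN(input_size=n_chars, hidden_size=hidden_size,
                          batch_first=True)          # input: (batch, seq, features)
        self.fc = nn.Linear(hidden_size, n_classes)  # classify from final state

    def forward(self, x):                 # x: (batch, seq_len, n_chars) one-hot
        _, h_n = self.rnn(x)              # h_n: (num_layers, batch, hidden_size)
        return self.fc(h_n[-1])           # logits: (batch, n_classes)
\end{verbatim}
\end{frame}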
Lecture 2: Backpropagation Through Time (BPTT) and Gradients
\begin{frame}{Backpropagation Through Time (BPTT)}
\begin{itemize}
\item Training an RNN involves computing gradients through time by unfolding the network: treat the unrolled RNN as a very deep feedforward net.
\item We compute the loss $L = \frac{1}{T}\sum_{t=1}^T \ell(y_t,\hat y_t)$ and backpropagate from $t=T$ down to $t=1$.
\item The computational graph below (for 3 steps) shows how each hidden state depends on inputs and parameters across time.
\item BPTT applies the chain rule along this graph, accumulating gradients from each time step into the shared parameters.
\end{itemize}
\end{frame}
\begin{frame}{RNN Computational Graph}
\begin{itemize}
\item Boxes (variables) and circles (operations) illustrate dependencies: each $h_t$ depends on $W_{xh}$, $W_{hh}$, $h_{t-1}$, and $x_t$.
\item During the backward pass, we traverse this graph in reverse, summing gradients that flow in from future time steps.
\item Note how the hidden-state paths merge: the gradient at step $t$ comes both from the loss at $t$ and from the next step $t+1$.
\item In words: error from the future step ($\partial L/\partial h_{t+1}$) is backpropagated through $W_{hh}$, and error from the current output is backpropagated through $W_{yh}$.
\item The parameter gradients are sums over time, e.g.\ $\partial L/\partial W_{yh} = \sum_{t=1}^T \frac{\partial L}{\partial y_t}\, h_t^{\top}$, and similarly for $W_{xh}$ and $W_{hh}$. The hidden-state recursion is written out below.
\end{itemize}
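In equation form (a compact restatement of the two bullets above, ignoring the nonlinearity as in the linearized analysis later in this lecture):
\[
\frac{\partial L}{\partial h_t}
  = W_{yh}^{\top}\,\frac{\partial L}{\partial y_t}
  + W_{hh}^{\top}\,\frac{\partial L}{\partial h_{t+1}} .
\]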
\end{frame}
\begin{frame}{Exploding and Vanishing Gradients}
\begin{itemize}
\item Unfolding BPTT shows that gradients involve powers of $W_{hh}$: expanding the recurrence gives terms like $(W_{hh}^{\top})^{k}$ multiplying gradients from $k$ steps ahead.
\item Vanishing gradients: if $\|W_{hh}\|<1$ (eigenvalues $<1$), repeated multiplication shrinks gradients exponentially, making it hard to learn long-range dependencies.
\item Exploding gradients: if $\|W_{hh}\|>1$, gradients grow exponentially, causing instability.
\item As noted: “eigenvalues smaller than 1 vanish and eigenvalues larger than 1 diverge. This is numerically unstable… manifesting as vanishing and exploding gradients.”
\item In practice, one mitigates this with careful weight initialization and gradient clipping (e.g., \texttt{clip\_grad\_norm\_} in PyTorch), or by truncating BPTT.
\end{itemize}
\end{frame}
\begin{frame}{Truncated BPTT and Gradient Clipping}
\begin{itemize}
\item Truncated BPTT: instead of backpropagating through all $T$ steps, we backpropagate through a fixed window of length $\tau$. This approximates the full gradient and reduces computation.
\item Concretely, one computes gradients up to $\tau$ steps back and treats gradients beyond that as zero. This still allows learning short-term patterns efficiently.
\item Gradient clipping: cap the gradient norm at a maximum value to prevent explosion; see the PyTorch sketch on the next slide.
\item These techniques help stabilize training, but the fundamental vanishing problem motivates using alternative RNN cells (LSTM/GRU) in practice (though we do not cover them here).
\end{itemize}
\end{frame}
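
\begin{frame}[fragile]{Code Sketch: Clipping and Truncation in PyTorch}
A minimal training-step sketch (illustrative only; \texttt{model.rnn}, \texttt{model.fc}, \texttt{criterion}, and \texttt{optimizer} are assumed to exist):
\begin{verbatim}
from torch.nn.utils import clip_grad_norm_

def train_step(model, criterion, optimizer, x, y, h=None, max_norm=1.0):
    optimizer.zero_grad()
    out, h = model.rnn(x, h)        # model.rnn: e.g. nn.RNN(batch_first=True)
    loss = criterion(model.fc(out), y)
    loss.backward()                 # BPTT within this window only
    clip_grad_norm_(model.parameters(), max_norm)   # cap the gradient norm
    optimizer.step()
    return loss.item(), h.detach()  # detach() truncates BPTT across windows
\end{verbatim}
\end{frame}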
\begin{frame}{Mathematical Insight on Gradients}
\begin{itemize}
\item In a linearized RNN, one can derive (cf.\ Eq.~9.7.15 in d2l.ai)
\[
\frac{\partial L}{\partial h_t} = \sum_{i=t}^{T} \left(W_{hh}^{\top}\right)^{T-i} W_{yh}^{\top}\, \frac{\partial L}{\partial y_{T+t-i}} .
\]
\item From this, an eigenvalue $\lambda$ of $W_{hh}$ contributes factors up to $\lambda^{T-t}$ to the gradient. If $|\lambda|<1$, the terms vanish as $T\to\infty$; if $|\lambda|>1$, they explode.
\item Thus simple RNNs struggle with long-term dependencies: information far back in time has negligible influence on the loss gradient.
\item Truncation effectively ignores terms beyond a window, trading off gradient accuracy for stability.
\end{itemize}
\end{frame}
Lecture 3: Applications of Simple RNNs
\begin{frame}{RNNs for Time Series Forecasting}
\begin{itemize}
\item Forecasting: RNNs can predict future values from historical data. Example tasks include stock prices, weather patterns, or any temporal signal.
\item By feeding in a sequence $\{x_1,x_2,\dots,x_T\}$, an RNN can output a one-step-ahead prediction $y_T$ or even a full sequence $\{y_2,\dots,y_{T+1}\}$.
\item Unlike linear models, RNNs can capture complex temporal patterns (trends, seasonality, autocorrelation) in a data-driven way.
\item Preprocessing (normalization, sliding windows) is important. Split the data into train/test sets by time (no shuffling); a sliding-window sketch follows on the next slide.
\end{itemize}
\end{frame}
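
\begin{frame}[fragile]{Code Sketch: Sliding Windows for Forecasting}
A minimal NumPy sketch of windowing a 1-D series into (input window, next value) pairs; the toy signal, window length \texttt{tau}, and 80/20 chronological split are illustrative choices:
\begin{verbatim}
import numpy as np

def make_windows(series, tau):
    # series: 1-D array -> X of shape (N, tau, 1), targets y of shape (N,)
    X = np.stack([series[i:i + tau] for i in range(len(series) - tau)])
    return X[..., None], series[tau:]

series = np.sin(np.linspace(0, 30, 500))            # toy signal
series = (series - series.mean()) / series.std()    # normalize first
X, y = make_windows(series, tau=20)
split = int(0.8 * len(X))                           # split by time, no shuffling
X_train, y_train, X_test, y_test = X[:split], y[:split], X[split:], y[split:]
\end{verbatim}
\end{frame}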
\begin{frame}{Sequence Modeling Tasks}
\begin{itemize}
\item Many-to-One: classify or predict one value from an entire sequence (e.g., sentiment analysis of a movie review, or classifying a time series). We use the final hidden state as a summary of the sequence.
\item Many-to-Many (Prediction): predict an output at each time step (e.g., language modeling or sequential regression). The RNN's output is used at every step.
\item Encoder–Decoder (Seq2Seq): (advanced) map input sequences to output sequences of different lengths. Though typically LSTM-based, it is conceptually possible with simple RNNs.
\item RNNs also apply to physics and biology: e.g., modeling dynamical systems, protein sequences, or neuroscience time series. Any domain with sequential data can use RNN-based modeling.
\end{itemize}
\end{frame}
\begin{frame}{Keras Example: Time Series Regression}
\begin{itemize}
\item Here \texttt{SimpleRNN(20)} processes sequences of shape \texttt{(batch, seq\_len, 1)}, outputting a 20-dimensional vector for the last time step.
\item We then predict a single value with \texttt{Dense(1)}. The model is compiled with mean-squared error for regression.
\item This parallels the PyTorch example: both frameworks allow easy stacking of an RNN layer followed by a dense output layer. A code sketch follows on the next slide.
\end{itemize}
\end{frame}
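
\begin{frame}[fragile]{Code Sketch: Keras SimpleRNN Regressor}
A minimal sketch of the model described on the previous slide (layer sizes as stated; other settings, e.g.\ the Adam optimizer, are illustrative choices):
\begin{verbatim}
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.SimpleRNN(20, input_shape=(None, 1)),  # (batch, seq_len, 1) -> 20-dim
    layers.Dense(1),                              # single regression output
])
model.compile(optimizer="adam", loss="mse")       # mean-squared error
# model.fit(X_train, y_train, epochs=10, batch_size=32)
\end{verbatim}
\end{frame}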
\begin{frame}{Other Sequence Applications}
\begin{itemize}
\item Sequence Classification: use the RNN hidden state to predict a class label, e.g., classify a time series as anomalous vs.\ normal.
\item Sequence Labeling: predict a label at each time step (e.g., part-of-speech tagging). The RNN output at each step is passed through a classification layer; see the sketch on the next slide.
\item Language and Text: (advanced) character- or word-level models use RNNs to generate text or classify documents, e.g., predicting the next character from the previous ones (an RNN language model).
\item Physically Motivated Data: RNNs can model dynamical systems (e.g., rolling-ball trajectories, neuron spikes over time, climate data). They learn temporal patterns directly from data without explicit equations.
\end{itemize}
\end{frame}
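
\begin{frame}[fragile]{Code Sketch: Per-Step Labeling in PyTorch}
A minimal sketch of sequence labeling (one label per time step); the class name and sizes are illustrative:
\begin{verbatim}
import torch.nn as nn

class SequenceTagger(nn.Module):
    def __init__(self, input_dim, hidden_size, n_tags):
        super().__init__()
        self.rnn = nn.RNN(input_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_tags)   # applied at every step

    def forward(self, x):              # x: (batch, seq_len, input_dim)
        out, _ = self.rnn(x)           # out: (batch, seq_len, hidden_size)
        return self.fc(out)            # logits: (batch, seq_len, n_tags)
\end{verbatim}
\end{frame}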
\begin{frame}{Training and Practical Tips}
\begin{itemize}
\item Loss Functions: use MSE for regression tasks and cross-entropy for classification tasks. Sum or average losses over time steps as needed.
\item Batching Sequences: handle variable-length sequences by padding and masking. PyTorch's \texttt{pack\_padded\_sequence} or Keras masking can help; see the sketch on the next slide.
\item Optimization: standard optimizers (SGD, Adam) work. The learning rate may need tuning due to sequential correlations.
\item Initial Hidden State: usually initialized to zeros. One can also learn an initial state, or carry the hidden state across batches for very long sequences (\texttt{stateful=True} in Keras).
\item Regularization: dropout can be applied to inputs or recurrent states (PyTorch's \texttt{nn.RNN} has a \texttt{dropout} option, applied between stacked layers; Keras has \texttt{dropout}/\texttt{recurrent\_dropout}).
\end{itemize}
\end{frame}
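
\begin{frame}[fragile]{Code Sketch: Variable-Length Batches in PyTorch}
A minimal sketch of padding a batch of variable-length sequences and packing it for an RNN; the toy data and sizes are illustrative:
\begin{verbatim}
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

seqs = [torch.randn(5, 3), torch.randn(3, 3), torch.randn(7, 3)]  # (T_i, feat)
lengths = torch.tensor([s.size(0) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)          # (batch, max_T, 3)
packed = pack_padded_sequence(padded, lengths,
                              batch_first=True, enforce_sorted=False)

rnn = nn.RNN(input_size=3, hidden_size=8, batch_first=True)
packed_out, h_n = rnn(packed)      # padding steps are skipped inside the RNN
\end{verbatim}
\end{frame}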
Lecture 4: Advanced Topics in Simple RNNs
\begin{frame}{Stacked (Deep) RNNs}
\begin{itemize}
\item In a deep RNN, multiple RNN layers are stacked: the output of layer $l-1$ at time $t$ feeds into layer $l$ at the same time step (see Fig.~10.3.1 in d2l.ai).
\item Stacking increases model capacity, allowing more complex input-to-output transformations and temporal modeling.
\item In PyTorch, use \texttt{num\_layers>1} or explicitly stack \texttt{nn.RNN} layers. In Keras, add multiple \texttt{SimpleRNN} layers with \texttt{return\_sequences=True} on all but the last; see the sketch on the next slide.
\end{itemize}
\end{frame}
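
\begin{frame}[fragile]{Code Sketch: Stacked RNNs}
Minimal sketches of the two stacking options mentioned above (layer sizes are illustrative):
\begin{verbatim}
# PyTorch: one module with two stacked recurrent layers
import torch.nn as nn
deep_rnn = nn.RNN(input_size=10, hidden_size=32,
                  num_layers=2, batch_first=True)

# Keras: stack SimpleRNN layers; all but the last return full sequences
from tensorflow.keras import layers, models
model = models.Sequential([
    layers.SimpleRNN(32, return_sequences=True, input_shape=(None, 10)),
    layers.SimpleRNN(32),            # last layer returns only the final output
    layers.Dense(1),
])
\end{verbatim}
\end{frame}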
\begin{frame}{Bidirectional RNNs (Brief Mention)}
\begin{itemize}
\item Though beyond “simple” RNNs, note: bidirectional RNNs process the sequence forward and backward and concatenate the two hidden states. Useful when the entire sequence is known in advance (e.g., offline tasks).
\item In PyTorch, \texttt{nn.RNN(..., bidirectional=True)} makes a two-directional RNN; in Keras, \texttt{layers.Bidirectional(layers.SimpleRNN(...))}. A shape check follows on the next slide.
\item This doubles the parameters and the output size but can improve context capture. It still uses simple RNN cells internally (no gates).
\end{itemize}
\end{frame}
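
\begin{frame}[fragile]{Code Sketch: Bidirectional Output Shapes}
A small shape check for the PyTorch call above (sizes are illustrative):
\begin{verbatim}
import torch
import torch.nn as nn

birnn = nn.RNN(input_size=10, hidden_size=32,
               batch_first=True, bidirectional=True)
x = torch.randn(4, 15, 10)          # (batch, seq_len, input_dim)
out, h_n = birnn(x)
print(out.shape)   # (4, 15, 64): forward and backward states concatenated
print(h_n.shape)   # (2, 4, 32): num_layers * num_directions = 2
\end{verbatim}
\end{frame}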
\begin{frame}{Parameter and Output Shapes}
\begin{itemize}
\item For PyTorch's \texttt{nn.RNN} with \texttt{batch\_first=True}, inputs have shape \texttt{(batch, seq\_len, input\_dim)} and the output has shape \texttt{(batch, seq\_len, hidden\_size)}.
\item Keras \texttt{SimpleRNN(units, input\_shape=(None, input\_dim))} outputs shape \texttt{(batch, units)} by default (the last time step's output). To get full sequences, use \texttt{return\_sequences=True}; see the sketch on the next slide.
\item Hidden-state tensor shapes: PyTorch returns \texttt{h\_n} of shape \texttt{(num\_layers * num\_directions, batch, hidden\_size)}.
\end{itemize}
\end{frame}
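
\begin{frame}[fragile]{Code Sketch: Keras Output Shapes}
A quick check of the Keras shapes above (sizes are illustrative):
\begin{verbatim}
import numpy as np
from tensorflow.keras import layers

x = np.random.randn(4, 15, 10).astype("float32")   # (batch, seq_len, input_dim)

last_only = layers.SimpleRNN(32)                    # default: last output only
print(last_only(x).shape)                           # (4, 32)

full_seq = layers.SimpleRNN(32, return_sequences=True)
print(full_seq(x).shape)                            # (4, 15, 32)
\end{verbatim}
\end{frame}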

\begin{frame}{Limitations of Simple RNNs}
\begin{itemize}
\item Vanishing Gradients: simple RNNs have fundamental difficulty learning long-term dependencies due to gradient decay.
\item Capacity: without gates, RNNs may struggle with tasks that require remembering far-back inputs. Training can also be slow because it is inherently sequential.
\item Alternatives: in practice, gated RNNs (LSTM/GRU) or Transformers are often used for long-range dependencies. However, simple RNNs are still instructive and sometimes sufficient for short sequences.
\item Regularization: weight decay or dropout (on inputs/states) can help generalization but must be applied carefully due to temporal correlations.
\item Statefulness: for very long sequences, one can preserve the hidden state across batches (a stateful RNN) to avoid resetting memory.
\end{itemize}
\end{frame}

\begin{frame}{Summary}
\begin{itemize}
\item Training: use gradient descent (SGD, Adam, etc.) to update the weights $\mathbf{W}$. Monitor for vanishing/exploding gradients and use clipping or truncation as needed.
\item Usage: RNNs excel at modeling sequences with short- to mid-term dependencies. They have been widely used in time series, NLP, and any domain with temporal structure.
\end{itemize}
\end{frame}
\begin{frame}{Further Resources}
\begin{itemize}
\item \emph{Deep Learning} (Goodfellow et al., 2016) -- chapter on RNNs, BPTT, and gradient issues.
\item \emph{Dive into Deep Learning} (https://d2l.ai) -- Sections 9.4--9.7 cover RNN implementation and BPTT.
\item PyTorch and TensorFlow official tutorials -- practical code examples of RNNs.
\item Original papers: Pascanu et al. (2013), \emph{On the Difficulty of Training Recurrent Neural Networks}; Bengio et al. (1994), \emph{Learning Long-Term Dependencies with Gradient Descent is Difficult}.
\end{itemize}
\end{frame}