Let's turn our attention back to models for sequential data for the next few lectures. With all the excitement around large language modeling, this is a timely topic.
Consider a sequence of observations, $\mbx_{1:T} = (\mbx_1, \ldots, \mbx_T)$, with $\mbx_t \in \reals^D$. Autoregressive models factor the joint distribution into a product of conditionals,
\begin{align*}
p(\mbx_{1:T}) &= \prod_{t=1}^T p(\mbx_t \mid \mbx_{1:t-1}).
\end{align*}
Autoregressive models are well-suited to sequential modeling since they make forward generation or forecasting easy. As long as we have access to the conditionals, we can sample forward indefinitely.
The question is, how should we parameterize these conditional distributions? It looks challenging since each one takes in a variable-length history of previous observations.
Recurrent Neural Networks (RNNs) are autoregressive models in which the conditional distributions are functions of a finite-dimensional hidden state, $\mbh_t \in \reals^K$,
\begin{align*}
p(\mbx_t \mid \mbx_{1:t-1}) &= p(\mbx_t \mid g(\mbh_t; \mbtheta)), \\
\mbh_{t+1} &= f(\mbh_t, \mbx_t; \mbtheta),
\end{align*}
where $f$ is the state update and $g$ is the read-out.
:::{admonition} Defining Property of an RNN
:class: tip
The conditional distribution over the next observation is a function of the hidden state. The hidden state is updated recursively, and its size is fixed regardless of the sequence length.
:::
The standard, "vanilla" RNN consists of a linear-nonlinear state update. For example, \begin{align*} f(\mbh_t, \mbx_t; \mbtheta) &= \tanh \left(\mbW \mbh_t + \mbB \mbx_t \right), \end{align*} where
- $\mbW \in \reals^{K \times K}$ are the dynamics weights,
- $\mbB \in \reals^{K \times D}$ are the input weights,
- $\tanh(\cdot)$ is the hyperbolic tangent function.
:::{admonition} Hyperbolic Tangent and the Logistic Function
:class: dropdown
The hyperbolic tangent function is typically written as,
\begin{align*}
\tanh(a) &= \frac{e^a - e^{-a}}{e^a + e^{-a}}.
\end{align*}
We can rewrite it as,
\begin{align*}
\tanh(a)
&= 1 - \frac{2e^{-a}}{e^a + e^{-a}} \\
&= 1 - \frac{2}{e^{2a} + 1} \\
&= 1 - 2 \sigma(-2a) \\
&= 1 - 2 (1 - \sigma(2a)) \\
&= 2 \sigma(2a) - 1,
\end{align*}
where $\sigma(a) = (1 + e^{-a})^{-1}$ is the logistic function. Thus, we see that the hyperbolic tangent is simply a scaled and shifted logistic function.
:::
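As a quick numerical sanity check of this identity (a throwaway snippet in JAX, which we'll also use for later sketches):

```python
import jax.numpy as jnp
from jax.nn import sigmoid

a = jnp.linspace(-3.0, 3.0, 101)
assert jnp.allclose(jnp.tanh(a), 2 * sigmoid(2 * a) - 1, atol=1e-6)
```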
The "read-out" of a vanilla RNN is typically a simple linear or generalized linear model, depending on the type of observations. For example, \begin{align*} g(\mbh_t, \mbx_t; \mbtheta) &= \mbC \mbh_t + \mbd, \end{align*} where
- $\mbC \in \reals^{D \times K}$ are the read-out weights and
- $\mbd \in \reals^D$ is the bias.
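In code, one step of this vanilla RNN could look like the following minimal sketch (the function and argument names here are our own):

```python
import jax.numpy as jnp

def rnn_step(W, B, C, d, h, x):
    """One step of a vanilla RNN: update the hidden state, then read out."""
    h_next = jnp.tanh(W @ h + B @ x)   # f(h_t, x_t; theta)
    y_next = C @ h_next + d            # g(h_{t+1}; theta)
    return h_next, y_next
```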
Let $\mbtheta = (\mbW, \mbB, \mbC, \mbd)$ denote the full set of RNN parameters.
From a machine learning perspective, RNNs are useful function approximators for sequential data. From a neuroscience perspective, they have a long history as theoretical models of neural computation.
In such models, the hidden state $\mbh_t$ represents the activity (e.g., the firing rates) of a population of $K$ neurons.
The dynamics weights correspond to synaptic connections, with positive weights as excitatory synapses and negative weights as inhibitory. When a presynaptic neuron spikes, it induces an electrical current in postsynaptic neurons. Under this interpretation, the activation $\mbW \mbh_t + \mbB \mbx_t$ represents the total input current to each neuron, which the nonlinearity converts into a firing rate.
As a neuron receives input current, its voltage steadily increases until it reaches a threshold, at which point the voltage spikes and the neuron fires an action potential. These spikes induce currents on downstream neurons, as described above. After a cell fires, there is a short refractory period before the neuron can spike again. Thus, there is an upper bound on firing rates, which the hyperbolic tangent is meant to capture.
Artificial RNNs are trained using stochastic gradient descent (SGD) to minimize the negative log likelihood,
\begin{align*}
\cL(\mbtheta)
&= - \sum_{t=1}^T \log p(\mbx_t \mid \mbx_{1:t-1}) \\
&= - \sum_{t=1}^T \log p(\mbx_t \mid g(\mbh_t; \mbtheta)) \\
&= - \sum_{t=1}^T \log p(\mbx_t \mid g(f(\mbh_{t-1}, \mbx_{t-1}; \mbtheta); \mbtheta)) \\
&= - \sum_{t=1}^T \log p(\mbx_t \mid g(f(\cdots f(\mbh_{1}, \mbx_{1}; \mbtheta) \cdots, \mbx_{t-1}; \mbtheta); \mbtheta)).
\end{align*}
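As a concrete example, here is a sketch of this loss for a vanilla RNN with Gaussian emissions (the unit-variance Gaussian emission model is an assumption for illustration), written with `jax.lax.scan` so the whole computation is differentiable:

```python
import jax
import jax.numpy as jnp

def rnn_nll(params, xs, h1):
    """Negative log likelihood of x_{1:T}, assuming p(x_t | h_t) is an
    isotropic Gaussian with mean g(h_t; theta) = C h_t + d and unit variance."""
    W, B, C, d = params

    def step(h, x):
        # Score x_t under the current prediction, then update the state.
        mean = C @ h + d
        ll = -0.5 * jnp.sum((x - mean) ** 2) - 0.5 * x.size * jnp.log(2 * jnp.pi)
        h_next = jnp.tanh(W @ h + B @ x)
        return h_next, ll

    _, lls = jax.lax.scan(step, h1, xs)
    return -jnp.sum(lls)

# Automatic differentiation handles backpropagation through time for us.
loss_grad = jax.grad(rnn_nll)
```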
In practice, you would simply use automatic differentiation to compute the necessary gradients to minimize this loss, but we can gain some insight by working them out manually. With some algebra, we can show that the Jacobian of the loss with respect to the dynamics weights (the other parameters are similar) is,
\begin{align*}
\frac{\partial \cL(\mbtheta)}{\partial \mbW} &= \sum_{t=1}^T \frac{\partial \cL(\mbtheta)}{\partial \mbh_t} \frac{ \partial \mbh_t}{\partial \mbW}
\end{align*}
We need the Jacobians of the loss with respect to the hidden states. These can be computed recursively by backpropagation through time (BPTT),
\begin{align*}
\frac{\partial \cL(\mbtheta)}{\partial \mbh_t}
&= \frac{\partial \cL(\mbtheta)}{\partial \mbh_{t+1}} \frac{\partial \mbh_{t+1}}{\partial \mbh_t} - \frac{\partial \log p(\mbx_t \mid g(\mbh_t; \mbtheta))}{\partial \mbh_t}
\end{align*}
For a vanilla RNN, the Jacobian of the next state with respect to the current state is,
\begin{align*}
\frac{\partial \mbh_{t+1}}{\partial \mbh_t}
&= \diag \left(1 - \mbh_{t+1}^2 \right) \mbW
\end{align*}
since $\mbh_{t+1} = \tanh(\mbW \mbh_t + \mbB \mbx_t)$ and the derivative of the hyperbolic tangent is $\tanh'(a) = 1 - \tanh(a)^2$.
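We can verify this Jacobian against automatic differentiation (a standalone sketch with arbitrary dimensions):

```python
import jax
import jax.numpy as jnp

K, D = 4, 3
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
W = jax.random.normal(k1, (K, K)) / jnp.sqrt(K)
B = jax.random.normal(k2, (K, D))
x = jnp.ones(D)

step = lambda h: jnp.tanh(W @ h + B @ x)   # h_{t+1} as a function of h_t
h_t = 0.1 * jnp.ones(K)
h_next = step(h_t)

# diag(1 - h_{t+1}^2) W matches the autodiff Jacobian.
assert jnp.allclose(jax.jacobian(step)(h_t),
                    jnp.diag(1 - h_next**2) @ W, atol=1e-5)
```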
:::{admonition} BPTT is a linear dynamical system
The "state" of the BPTT recursions is the Jacobian $\partial \cL(\mbtheta) / \partial \mbh_t$, or equivalently its transpose, the gradient $\nabla_{\mbh_t} \cL(\mbtheta)$. Each step of the recursion applies a linear map (multiplication by $\partial \mbh_{t+1} / \partial \mbh_t$) and adds an input (the gradient of the log likelihood at time $t$), so BPTT is itself a linear dynamical system running backward in time.
:::
:::{admonition} Biological Plausibility
:class: tip, dropdown
Backpropagation is the most effective way we know of to train artificial RNNs, so it's reasonable to think that the brain might be using a similar learning algorithm. Unfortunately, it's not clear how the backpropagation through time algorithm could be implemented by a neural circuit. The multiplication by the transposed weights $\mbW^\top$ in the backward pass, for example, would require error signals to travel backward through the same synapses used in the forward pass, a requirement known as the weight transport problem.
:::
When the Jacobians $\partial \mbh_{t+1} / \partial \mbh_t$ have singular values less than one, the products of Jacobians in the BPTT recursion shrink exponentially as they propagate backward in time, and the gradients vanish. Conversely, when the singular values exceed one, the gradients can explode.
Vanishing gradients are especially problematic when the data have long-range dependencies: the learning signal from an observation decays exponentially with the time lag, so distant inputs have vanishing influence on the gradient.
One way to combat the vanishing gradient problem is by modifying the RNN architecture. Architectures like long short-term memory (LSTM) networks achieve this via gated units.
An LSTM has internal (aka "cell") states $\mbc_t \in \reals^K$ that are updated by gating the previous cell state and adding an affine term,
\begin{align*}
\mbc_t &= \mbf_t \odot \mbc_{t-1} + \mbb_t,
\end{align*}
where $\mbf_t \in (0,1)^K$ is a forget gate and $\odot$ denotes elementwise multiplication.
The affine term is determined by,
\begin{align*}
\mbb_t &= \mbg_t \odot \mbi_t \\
\mbg_t &= \sigma(\mbW^{(g)} \mbh_{t-1} + \mbB^{(g)} \mbx_{t-1}) \\
\mbi_t &= \sigma(\mbW^{(i)} \mbh_{t-1} + \mbB^{(i)} \mbx_{t-1})
\end{align*}
The vector $\mbg_t \in (0,1)^K$ acts as a gate on the input $\mbi_t$, determining how much of it is written into the cell state.
Finally, the hidden states are obtained by passing the cell states through a hyperbolic tangent and gating the result,
\begin{align*}
\mbh_t &= \mbo_t \odot \tanh(\mbc_t),
\end{align*}
where $\mbo_t \in (0,1)^K$ is an output gate of the same form as $\mbg_t$ and $\mbi_t$.
We can think of an LSTM as an RNN that operates on an extended state $(\mbc_t, \mbh_t)$, with a particular functional form for the state update.
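A minimal sketch of one LSTM step following the equations above (the forget and output gates are written by analogy to $\mbg_t$ and $\mbi_t$; all parameter names here are our own):

```python
import jax.numpy as jnp
from jax.nn import sigmoid

def lstm_step(params, c, h, x):
    """One LSTM step on the extended state (c_t, h_t)."""
    Wf, Bf, Wg, Bg, Wi, Bi, Wo, Bo = params
    f = sigmoid(Wf @ h + Bf @ x)   # forget gate on the previous cell state
    g = sigmoid(Wg @ h + Bg @ x)   # gate on the input
    i = sigmoid(Wi @ h + Bi @ x)   # input
    o = sigmoid(Wo @ h + Bo @ x)   # output gate
    c_next = f * c + g * i         # gated affine update of the cell state
    h_next = o * jnp.tanh(c_next)  # gated, squashed read-out of the cell state
    return c_next, h_next
```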
There are many variants of gated RNNs. Besides the LSTM, the most commonly used is the gated recurrent unit (GRU), which has a slightly simplified architecture. See {cite:t}`goodfellow2016deep` (ch. 10) for more detail.
We motivated RNNs from an autoregressive modeling perspective, but they are useful in other sequential data settings as well. For example, suppose we want to predict the sentiment of a review, $y$, from its sequence of words, $\mbx_{1:T}$. One approach is to run an RNN over the sequence and predict the label from the final hidden state, $\mbh_T$.
Sometimes we want to map one sequence, $\mbx_{1:T}$, to another, $\mby_{1:T'}$, as in machine translation.
In the example above, one challenge is that the hidden state must summarize all of the relevant information from the input sequence in a single fixed-size vector.
As with deep neural networks that stack layer upon layer, we can stack RNN upon RNN to construct a deeper model. In such models, the outputs of one layer, $\mbh_t^{(\ell)}$, become the inputs to the layer above, $\mbh_t^{(\ell+1)}$.
It turns out, we've already seen an RNN in this class!
We presented Hidden Markov Models (HMMs) as latent variable models with discrete hidden states $z_t \in \{1, \ldots, K\}$. However, marginalizing over the hidden states yields an autoregressive model whose predictive distributions are functions of a recursively updated belief state.
Consider an HMM with categorical emissions,
\begin{align*}
p(\mbx_t \mid z_t) &= \mathrm{Cat}(\mbx_t \mid \mbc_{z_t}),
\end{align*}
where $\mbc_k$ is a probability vector over the $V$ possible values of the observations, $x_t \in \{1, \ldots, V\}$.
For a categorical HMM, we can write the likelihoods as
\begin{align*}
p(\mbx_t \mid \mbx_{1:t-1}) &= \sum_{k=1}^K p(z_t = k \mid \mbx_{1:t-1}) \, C_{x_t, k},
\end{align*}
where the predictive probabilities $p(z_t = k \mid \mbx_{1:t-1})$ form a belief state: a fixed-size summary of the history that can be updated recursively, just like an RNN's hidden state.
This formulation suggests that we could estimate the parameters of an HMM by directly maximizing the log likelihood,
\begin{align*}
\cL(\mbtheta) &= \log p(\mbx_{1:T}; \mbtheta) = \sum_{t=1}^T \log p(\mbx_t \mid \mbx_{1:t-1}; \mbtheta).
\end{align*}
With automatic differentiation at our disposal, this sounds like it might be a lot easier!
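Here is a sketch of that approach: the HMM log likelihood computed with the normalized forward recursion, where the belief state plays the role of the RNN's hidden state (the argument names are our own):

```python
import jax
import jax.numpy as jnp

def hmm_log_likelihood(pi0, P, C, xs):
    """log p(x_{1:T}) for a categorical HMM via the forward recursion.

    pi0: (K,) initial state distribution
    P:   (K, K) transitions, P[i, j] = p(z_t = j | z_{t-1} = i)
    C:   (V, K) emissions, C[v, k] = p(x_t = v | z_t = k)
    xs:  (T,) integer observations in {0, ..., V-1}
    """
    def step(belief, x):
        # belief[k] = p(z_t = k | x_{1:t-1}), the recursively updated state.
        w = belief * C[x]               # p(z_t = k, x_t | x_{1:t-1})
        ll = jnp.log(jnp.sum(w))        # log p(x_t | x_{1:t-1})
        belief_next = (w / jnp.sum(w)) @ P
        return belief_next, ll

    _, lls = jax.lax.scan(step, pi0, xs)
    return jnp.sum(lls)

# Gradients for direct maximum likelihood estimation come for free:
grad_fn = jax.grad(hmm_log_likelihood, argnums=(0, 1, 2))
```

Of course, the parameters here are constrained to the simplex, so we can't just take unconstrained gradient steps, which brings us to the next point.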
Let's pursue this idea a little further. First, we'd prefer to do unconstrained optimization, so let's parameterize the model in terms of logits like $\log \mbc_k$, which map to the simplex via the softmax function, $\mbc_k = \mathrm{softmax}(\log \mbc_k)$.
:::{admonition} Softmax is translation invariant
Note that the softmax operation is translation invariant,
\begin{align*}
\mathrm{softmax}(\log \mbc_k) &=
\mathrm{softmax}(\log \mbc_k + a)
\end{align*}
for any constant $a \in \reals$, so the logits are only identified up to an additive constant.
:::
To maximize the likelihood with gradient ascent, we need the Jacobians of the log likelihood with respect to the unconstrained parameters, e.g., $\partial \cL(\mbtheta) / \partial \log \mbC$ and $\partial \cL(\mbtheta) / \partial \log \mbP$.
First, note that the posterior distribution in an HMM can be written as an exponential family,
\begin{align*}
p(\mbz_{1:T} \mid \mbx_{1:T}; \mbtheta)
&= \exp \left\{ \log p(\mbz_{1:T}, \mbx_{1:T}; \mbtheta) - \log p(\mbx_{1:T}; \mbtheta) \right\} \\
&= \exp \left\{
\log p(z_1; \mbpi_0)
+ \sum_{t=2}^T \log p(z_t \mid z_{t-1}; \mbP)
+ \sum_{t=1}^T \log p(\mbx_t \mid z_{t}; \mbC)
- \log p(\mbx_{1:T}; \mbtheta)
\right\} \\
&= \exp \bigg\{
\sum_{k=1}^K \langle \bbI[z_1 =k], \log \pi_{0,k} \rangle \\
&\hspace{6em}
+ \sum_{t=2}^T \sum_{i=1}^K \sum_{j=1}^K \langle \bbI[z_{t-1}=i \wedge z_{t} = j], \log P_{i,j} \rangle \\
&\hspace{12em}
+ \sum_{t=1}^T \sum_{v=1}^V \sum_{k=1}^K \langle \bbI[x_t = v \wedge z_t = k], \log C_{v,k} \rangle \\
&\hspace{18em} - A(\mbtheta)
\bigg\}
\end{align*}
where the log marginal likelihood $A(\mbtheta) = \log p(\mbx_{1:T}; \mbtheta)$ plays the role of the log normalizer.
Recall that for exponential family distributions, gradients of the log normalizer yield expected sufficient statistics. Thus,
\begin{align*}
\frac{\partial A(\mbtheta)}{\partial \log C_{v,k}} &= \sum_{t=1}^T \bbI[x_{t} = v] \cdot \bbE_{p(\mbz_{1:T} \mid \mbx_{1:T}; \mbtheta)} \left[\bbI[z_{t}=k] \right]
\end{align*}
and
\begin{align*}
\frac{\partial A(\mbtheta)}{\partial \log P_{i,j}} &= \sum_{t=2}^T \bbE_{p(\mbz_{1:T} \mid \mbx_{1:T}; \mbtheta)} \left[ \bbI[z_{t-1}=i \wedge z_{t} = j] \right].
\end{align*}
The gradients are essentially the posterior marginals and pairwise marginals we computed in the EM algorithm!
In the M-step of EM, we solve for the parameters that satisfy the constrained optimality conditions, whereas in SGD we just take a step in the direction of the gradient. EM tends to converge much faster in practice for this reason.
Consider the special case of linear RNNs,
\begin{align*}
\mbh_t &= f(\mbh_{t-1}, \mbx_t; \mbtheta) = \mbA \mbh_{t-1} + \mbB \mbx_t,
\end{align*}
with read-out $\mby_t = \mbC \mbh_t + \mbd$.
Since linear operators compose, we can write the entire input-output map as a linear function,
\begin{align*}
\mby_t
&= \mbC \mbh_t + \mbd \\
&= \mbC (\mbA \mbh_{t-1} + \mbB \mbx_t) + \mbd \\
&= \mbC (\mbA (\mbA \mbh_{t-2} + \mbB \mbx_{t-1}) + \mbB \mbx_t) + \mbd \\
&= \ldots \\
&= \mbC \left(\sum_{s=0}^{t-1} \mbA^s \mbB \mbx_{t-s} \right) + \mbd
\end{align*}
with the convention that $\mbh_0 = \boldsymbol{0}$ and $\mbA^0 = \mbI$.
We recognize this as a convolution,
\begin{align*} \mby_t &= \left[ \mbK \circledast \mbx \right]_t \end{align*}
with kernel,
\begin{align*}
\mbK &= \begin{bmatrix} \mbC \mbB & \mbC \mbA \mbB & \mbC \mbA^2 \mbB & \cdots & \mbC \mbA^{T-1} \mbB \end{bmatrix}.
\end{align*}
Computationally, representing the RNN as a convolution leads to efficient implementations, since modern deep learning libraries like PyTorch and JAX have highly optimized convolution routines. Likewise, sampling the model is straightforward and efficient using the RNN formulation.
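The following sketch spells out both forms and their equivalence (a naive kernel construction for clarity; practical implementations exploit structure in $\mbA$, as noted below):

```python
import jax
import jax.numpy as jnp

def linear_rnn_scan(A, B, C, d, xs):
    """Recurrent form: h_t = A h_{t-1} + B x_t, y_t = C h_t + d, with h_0 = 0."""
    def step(h, x):
        h = A @ h + B @ x
        return h, C @ h + d
    _, ys = jax.lax.scan(step, jnp.zeros(A.shape[0]), xs)
    return ys

def linear_rnn_conv(A, B, C, d, xs):
    """Convolutional form: y_t = sum_{s=0}^{t-1} (C A^s B) x_{t-s} + d."""
    T = xs.shape[0]
    powers = [jnp.eye(A.shape[0])]                 # A^0, A^1, ..., A^{T-1}
    for _ in range(T - 1):
        powers.append(A @ powers[-1])
    K = jnp.stack([C @ As @ B for As in powers])   # kernel, (T, D_out, D_in)
    ys = jnp.stack([
        sum(K[s] @ xs[t - s] for s in range(t + 1)) for t in range(T)
    ])
    return ys + d
```

Both functions return the same outputs; the convolutional form exposes the computation to optimized parallel routines, while the recurrent form is the natural way to sample.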
While these linear RNNs may seem too simple to capture complex sequential dependencies, it turns out that stacks of linear layers with simple nonlinearities in between (a deep linear state space model {cite:p}`gu2021efficiently`) can be highly expressive!
:::{admonition} Note
Note that the kernel involves matrix powers $\mbA^s$ for $s = 0, \ldots, T-1$, which can be expensive to compute. {cite:t}`gu2021efficiently` derived clever algorithms for efficiently computing the kernel and evaluating the convolution for certain structured classes of dynamics matrices.
:::
One limitation of the convolutional formulation above is that it requires the dynamics matrix $\mbA$ to be constant over time. {cite:t}`smith2023simplified` showed an alternative way to evaluate the linear RNN that relaxes this constraint.
Consider two consecutive state updates, now with time-dependent dynamics matrices $\mbA_t$ and affine terms $\mbb_t = \mbB_t \mbx_t$,
\begin{align*}
\mbh_t &= \mbA_t \mbh_{t-1} + \mbb_t \\
&= \mbA_t (\mbA_{t-1} \mbh_{t-2} + \mbb_{t-1}) + \mbb_t \\
&= \mbA_{t-2:t} \mbh_{t-2} + \mbb_{t-2:t}
\end{align*}
where $\mbA_{t-2:t} = \mbA_t \mbA_{t-1}$ and $\mbb_{t-2:t} = \mbA_t \mbb_{t-1} + \mbb_t$.
This is just another affine map, but now it takes $\mbh_{t-2}$ directly to $\mbh_t$, covering two time steps at once.
In the next iteration, we can combine these affine maps to obtain maps spanning four time steps, e.g., $\mbh_t = \mbA_{t-4:t} \mbh_{t-4} + \mbb_{t-4:t}$, and so on.
After $\lceil \log_2 T \rceil$ rounds of this doubling procedure, we obtain all of the hidden states $\mbh_{1:T}$.
This algorithm is called a parallel scan or binary associative scan, since it is based on a binary associative operator that composes consecutive affine maps,
\begin{align*}
(\mbA_j, \mbb_j) \bullet (\mbA_i, \mbb_i) &= (\mbA_j \mbA_i, \; \mbA_j \mbb_i + \mbb_j),
\end{align*}
where $(\mbA_i, \mbb_i)$ is the earlier map. With enough parallel processors, the scan yields all of the hidden states in $\cO(\log T)$ time, rather than the $\cO(T)$ sequential steps of a standard RNN evaluation.
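In JAX, this is a one-liner with `jax.lax.associative_scan`. Here is a sketch for diagonal dynamics matrices stored as vectors (the diagonal restriction, anticipating the note below, is an assumption that keeps each composition cheap):

```python
import jax
import jax.numpy as jnp

def linear_rnn_parallel(As, bs):
    """Compute h_t = A_t h_{t-1} + b_t for t = 1..T with a parallel scan.

    As: (T, K) diagonals of the (assumed diagonal) dynamics matrices A_t.
    bs: (T, K) affine terms, e.g., b_t = B_t x_t. Assumes h_0 = 0.
    """
    def combine(left, right):
        # Compose two affine maps: h -> A_r (A_l h + b_l) + b_r.
        A_l, b_l = left
        A_r, b_r = right
        return A_r * A_l, A_r * b_l + b_r

    # Inclusive scan: element t holds the composition of maps 1..t, whose
    # affine part (the composite applied to h_0 = 0) is exactly h_t.
    _, hs = jax.lax.associative_scan(combine, (As, bs))
    return hs
```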
:::{admonition} Note
Again, notice that each update requires matrix multiplication, which is typically cubic in the state dimension. {cite:t}`smith2023simplified` proposed to work with complex diagonal matrices instead, which keeps the time and memory costs in check.
:::
The parallel scan relied on the linearity of the RNN dynamics. Can the same trick be applied to nonlinear RNNs?
It turns out that yes, in many cases we can speed up the evaluation of nonlinear RNNs using a similar trick!
Consider an RNN with nonlinear dynamics function $f$, so that $\mbh_t = f(\mbh_{t-1})$ (absorbing the inputs into the dynamics function for simplicity). Given a current guess of the trajectory, $\tilde{\mbh}_{1:T}$, we can linearize the dynamics around it with a first-order Taylor approximation,
\begin{align*}
f(\mbh_{t-1}) &\approx f(\tilde{\mbh}_{t-1}) + \mbJ_{t-1}(\mbh_{t-1} - \tilde{\mbh}_{t-1}),
\end{align*}
where $\mbJ_{t-1} = \frac{\partial f}{\partial \mbh}\big|_{\tilde{\mbh}_{t-1}}$ is the Jacobian of the dynamics at the current guess.
The Taylor approximations yield a linear RNN with time-varying dynamics, $\mbA_t = \mbJ_{t-1}$ and $\mbb_t = f(\tilde{\mbh}_{t-1}) - \mbJ_{t-1} \tilde{\mbh}_{t-1}$, which can be evaluated with a parallel scan to obtain a new sequence of latent states, around which we linearize again.
Repeating this process until convergence is equivalent to the Gauss-Newton method for minimizing the sum of squares loss function,
\begin{align*}
\cL(\mbh_{1:T}) &= \sum_{t=1}^T \|\mbh_t - f(\mbh_{t-1})\|_2^2.
\end{align*}
This idea was proposed by Lim et al (2023) and called DEER. Gonzalez et al (2024) explained the connection to the Gauss-Newton method and proposed an extension called ELK.
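A rough sketch of this iteration (our own code, using full Jacobians for clarity; DEER and ELK use cheaper structured approximations in practice):

```python
import jax
import jax.numpy as jnp

def solve_affine_recursion(As, bs, h0):
    """Solve h_t = A_t h_{t-1} + b_t for t = 1..T with an associative scan."""
    bs = bs.at[0].add(As[0] @ h0)   # fold the initial condition into b_1
    def combine(left, right):
        A_l, b_l = left
        A_r, b_r = right
        A = jnp.einsum('...ij,...jk->...ik', A_r, A_l)
        b = jnp.einsum('...ij,...j->...i', A_r, b_l) + b_r
        return A, b
    _, hs = jax.lax.associative_scan(combine, (As, bs))
    return hs   # hs[t-1] = h_t

def deer(f, h0, T, num_iters=10):
    """Iteratively linearize h_t = f(h_{t-1}) around the current trajectory
    and solve the resulting time-varying linear RNN in parallel."""
    hs = jnp.tile(h0, (T, 1))                          # initial trajectory guess
    for _ in range(num_iters):
        h_prev = jnp.concatenate([h0[None], hs[:-1]])  # h_{0:T-1}
        As = jax.vmap(jax.jacobian(f))(h_prev)         # J_{t-1}, shape (T, K, K)
        bs = jax.vmap(f)(h_prev) - jnp.einsum('tij,tj->ti', As, h_prev)
        hs = solve_affine_recursion(As, bs, h0)
    return hs
```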
Recurrent neural networks are foundational models for sequential data. They're useful machine learning models, and they're standard models in theoretical neuroscience. While Transformers are the star of modern large language models, RNNs are making a comeback as efficient techniques for capturing long-range dependencies in sequential data.