Stochastic gradient methods
There are several ways to perform the parameter update. The standard (explicit) stochastic gradient update is
θ_n = θ_{n-1} + α_n ∇l(θ_{n-1}; y_n)
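As a rough illustration, here is a minimal NumPy sketch of the explicit update for a linear model with Gaussian log-likelihood, so that ∇l(θ; x, y) = x (y - xᵀθ). The function name, the 1/n learning-rate schedule, and the simulated data are illustrative assumptions, not taken from the source above.

```python
import numpy as np

def explicit_sgd(X, y, lr0=0.5, n_passes=5, seed=0):
    """Explicit SGD for a linear model with Gaussian log-likelihood.

    Update: theta_n = theta_{n-1} + a_n * grad_l(theta_{n-1}; y_n).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    t = 0
    for _ in range(n_passes):
        for i in rng.permutation(n):
            t += 1
            a_t = lr0 / t                          # illustrative 1/n learning-rate schedule
            grad = X[i] * (y[i] - X[i] @ theta)    # per-observation log-likelihood gradient
            theta = theta + a_t * grad             # ascent step on the log-likelihood
    return theta

# Toy usage on simulated data.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
theta_true = np.arange(1.0, 6.0)
y = X @ theta_true + rng.normal(size=1000)
print(explicit_sgd(X, y))
```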
For the implicit update, we follow Toulis and Airoldi (2014). The idea is to use information from the gradient at the current iterate rather than the previous one in order to make more informed updates:
θ_n = θ_{n-1} + α_n ∇l(θ_n; y_n)
Since θ_n appears on both sides, each iteration amounts to solving a fixed-point equation, and the update can be shown to be equivalent to a proximal method.
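To make the fixed-point aspect concrete, here is a matching sketch for the same toy linear model. In that case the implicit update can be solved in closed form by taking the inner product of both sides with x_n, which gives the shrinkage factor below; the learning-rate schedule is again an illustrative assumption.

```python
import numpy as np

def implicit_sgd(X, y, lr0=0.5, n_passes=5, seed=0):
    """Implicit SGD for a linear model with Gaussian log-likelihood.

    The implicit update theta_n = theta_{n-1} + a_n * grad_l(theta_n; y_n)
    defines theta_n only implicitly, but for grad_l(theta; x, y) = x * (y - x @ theta)
    it reduces to
        theta_n = theta_{n-1} + a_n / (1 + a_n * ||x||^2) * x * (y - x @ theta_{n-1}).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    t = 0
    for _ in range(n_passes):
        for i in rng.permutation(n):
            t += 1
            a_t = lr0 / t                             # illustrative learning-rate schedule
            xi, yi = X[i], y[i]
            shrink = a_t / (1.0 + a_t * (xi @ xi))    # proximal/implicit shrinkage of the step
            theta = theta + shrink * xi * (yi - xi @ theta)
    return theta
```

The step is damped by 1 / (1 + a_n ‖x_n‖²), which is why the implicit update tends to be much less sensitive to a poorly chosen learning rate than the explicit one.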
For classical momentum, we follow Sutskever (2012, Section 7.2), which gives an exposition of momentum covering both the classical version and Nesterov's accelerated gradient (1983):
v_n = μ v_{n-1} + α_n ∇l(θ_{n-1}; y_n)
θ_n = θ_{n-1} + v_n
The hyperparameter μ determines how much to weight the gradient history.
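Here is a minimal sketch of classical momentum on the same toy linear model. The value μ = 0.9 is shown only because it is a common choice in practice (the truncated default above is not reproduced here), and the constant learning rate is an illustrative assumption.

```python
import numpy as np

def classical_momentum(X, y, lr=0.005, mu=0.9, n_passes=5, seed=0):
    """SGD with classical momentum for a linear model with Gaussian log-likelihood.

    v_n     = mu * v_{n-1} + a_n * grad_l(theta_{n-1}; y_n)
    theta_n = theta_{n-1} + v_n
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    v = np.zeros(p)
    for _ in range(n_passes):
        for i in rng.permutation(n):
            grad = X[i] * (y[i] - X[i] @ theta)  # gradient at the previous iterate theta_{n-1}
            v = mu * v + lr * grad               # velocity: mu weights the gradient history
            theta = theta + v
    return theta
```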
For Nesterov's accelerated gradient (1983), we again follow Sutskever (2012, Section 7.2):
v_n = μ v_{n-1} + α_n ∇l(θ_{n-1} + μ v_{n-1}; y_n)
θ_n = θ_{n-1} + v_n
As before, μ determines how much to weight the gradient history. Note that the only difference from classical momentum is that the gradient is evaluated at θ_{n-1} + μ v_{n-1} instead of θ_{n-1}. Nesterov momentum almost always performs better in practice than classical momentum. The intuition is that Nesterov momentum first makes a partial update, moving closer to θ_n, which allows it to change its velocity more quickly and responsively.
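A matching sketch of Nesterov momentum; the only change from the classical version is the look-ahead point at which the gradient is evaluated. μ = 0.9 and the constant learning rate are again illustrative choices.

```python
import numpy as np

def nesterov_momentum(X, y, lr=0.005, mu=0.9, n_passes=5, seed=0):
    """SGD with Nesterov's accelerated gradient for a linear model.

    v_n     = mu * v_{n-1} + a_n * grad_l(theta_{n-1} + mu * v_{n-1}; y_n)
    theta_n = theta_{n-1} + v_n
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    v = np.zeros(p)
    for _ in range(n_passes):
        for i in rng.permutation(n):
            lookahead = theta + mu * v                 # partial update toward theta_n
            grad = X[i] * (y[i] - X[i] @ lookahead)    # gradient at the look-ahead point
            v = mu * v + lr * grad
            theta = theta + v
    return theta
```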
Averaging of the iterates (Polyak-Ruppert averaging) can be used in conjunction with any of the parameter updates above: one maintains the running average
θ_bar_n = ((n-1)/n) θ_bar_{n-1} + (1/n) θ_n
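A minimal sketch of iterate averaging layered on top of the explicit update for the same toy model; the slowly decaying learning-rate schedule is an illustrative assumption (averaging is typically paired with larger, slowly decaying rates).

```python
import numpy as np

def averaged_sgd(X, y, lr0=0.1, n_passes=5, seed=0):
    """Explicit SGD with iterate averaging (Polyak-Ruppert) for a linear model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    theta_bar = np.zeros(p)
    t = 0
    for _ in range(n_passes):
        for i in rng.permutation(n):
            t += 1
            a_t = lr0 / t ** 0.75                                # slowly decaying rate (assumption)
            theta = theta + a_t * X[i] * (y[i] - X[i] @ theta)   # ordinary explicit update
            theta_bar = (t - 1) / t * theta_bar + theta / t      # running average of the iterates
    return theta_bar
```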
For the stochastic average gradient (SAG), we follow Le Roux et al. (2012).
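Below is a generic sketch of SAG on the same toy linear model (a simplified reading of the method, not the reference implementation): it stores the most recently computed gradient for each observation and steps along the average of the stored gradients, refreshing one entry per iteration. The constant learning rate is an illustrative assumption.

```python
import numpy as np

def sag(X, y, lr=0.005, n_iters=10_000, seed=0):
    """Sketch of the stochastic average gradient (SAG) method for a linear model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    grad_table = np.zeros((n, p))              # last computed gradient for each observation
    grad_sum = np.zeros(p)                     # sum of the rows of grad_table
    for _ in range(n_iters):
        i = rng.integers(n)
        g_new = X[i] * (y[i] - X[i] @ theta)   # fresh gradient for observation i
        grad_sum += g_new - grad_table[i]      # keep the sum up to date in O(p)
        grad_table[i] = g_new
        theta = theta + lr * grad_sum / n      # step along the average stored gradient
    return theta
```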
For the stochastic variance reduced gradient (SVRG), we follow Johnson and Zhang (2013).
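And a generic sketch of SVRG on the same toy model: each epoch fixes a snapshot of the iterate together with its full-data gradient, and the inner stochastic steps use the control variate g_i(θ) - g_i(θ_snapshot) + full gradient, which is unbiased for the full gradient but has much smaller variance near the snapshot. The learning rate and epoch count are illustrative assumptions.

```python
import numpy as np

def svrg(X, y, lr=0.01, n_epochs=10, seed=0):
    """Sketch of stochastic variance reduced gradient (SVRG) for a linear model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):
        theta_tilde = theta.copy()                 # snapshot of the current iterate
        resid_tilde = y - X @ theta_tilde
        full_grad = X.T @ resid_tilde / n          # full-data gradient at the snapshot
        for i in rng.integers(n, size=n):          # one pass of inner updates
            g_i = X[i] * (y[i] - X[i] @ theta)     # gradient at the current iterate
            g_i_tilde = X[i] * resid_tilde[i]      # gradient at the snapshot
            theta = theta + lr * (g_i - g_i_tilde + full_grad)
    return theta
```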
Other useful references:
- Course notes for Stanford's CS231n Convolutional Neural Networks for Visual Recognition, by Andrej Karpathy.
- Advances in optimizing Recurrent Networks by Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu.