Stochastic gradient methods
There are several ways to perform the parameter update. The standard (explicit) stochastic gradient update is
θ_n = θ_{n-1} + α_n ∇l(θ_{n-1}; y_n)
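As a rough illustration, here is a minimal NumPy sketch of the explicit update for a linear model with Gaussian log-likelihood, so that ∇l(θ; x, y) = x (y - xᵀθ). The function name, the 1/n learning-rate schedule, and the simulated data are illustrative assumptions, not taken from the source above.

```python
import numpy as np

def explicit_sgd(X, y, lr0=0.5, n_passes=5, seed=0):
    """Explicit SGD for a linear model with Gaussian log-likelihood.

    Update: theta_n = theta_{n-1} + a_n * grad_l(theta_{n-1}; y_n).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    t = 0
    for _ in range(n_passes):
        for i in rng.permutation(n):
            t += 1
            a_t = lr0 / t                          # illustrative 1/n learning-rate schedule
            grad = X[i] * (y[i] - X[i] @ theta)    # per-observation log-likelihood gradient
            theta = theta + a_t * grad             # ascent step on the log-likelihood
    return theta

# Toy usage on simulated data.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
theta_true = np.arange(1.0, 6.0)
y = X @ theta_true + rng.normal(size=1000)
print(explicit_sgd(X, y))
```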
For the implicit update, we follow Toulis and Airoldi (2014). The idea is to use information from the gradient at the current iterate rather than the previous one in order to make more informed updates:
θ_n = θ_{n-1} + α_n ∇l(θ_n; y_n)
Since θ_n appears on both sides, each iteration amounts to solving a fixed-point equation, and the update can be shown to be equivalent to a proximal method.
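To make the fixed-point aspect concrete, here is a matching sketch for the same toy linear model. In that case the implicit update can be solved in closed form by taking the inner product of both sides with x_n, which gives the shrinkage factor below; the learning-rate schedule is again an illustrative assumption.

```python
import numpy as np

def implicit_sgd(X, y, lr0=0.5, n_passes=5, seed=0):
    """Implicit SGD for a linear model with Gaussian log-likelihood.

    The implicit update theta_n = theta_{n-1} + a_n * grad_l(theta_n; y_n)
    defines theta_n only implicitly, but for grad_l(theta; x, y) = x * (y - x @ theta)
    it reduces to
        theta_n = theta_{n-1} + a_n / (1 + a_n * ||x||^2) * x * (y - x @ theta_{n-1}).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    t = 0
    for _ in range(n_passes):
        for i in rng.permutation(n):
            t += 1
            a_t = lr0 / t                             # illustrative learning-rate schedule
            xi, yi = X[i], y[i]
            shrink = a_t / (1.0 + a_t * (xi @ xi))    # proximal/implicit shrinkage of the step
            theta = theta + shrink * xi * (yi - xi @ theta)
    return theta
```

The step is damped by 1 / (1 + a_n ‖x_n‖²), which is why the implicit update tends to be much less sensitive to a poorly chosen learning rate than the explicit one.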
For classical momentum, we follow Sutskever (2012, Section 7.2), which gives an exposition of momentum covering both the classical version and Nesterov's accelerated gradient (1983):
v_n = μ v_{n-1} + α_n ∇l(θ_{n-1}; y_n)
θ_n = θ_{n-1} + v_n
The hyperparameter μ determines how much to weight the gradient history.
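Here is a minimal sketch of classical momentum on the same toy linear model. The value μ = 0.9 is shown only because it is a common choice in practice (the truncated default above is not reproduced here), and the constant learning rate is an illustrative assumption.

```python
import numpy as np

def classical_momentum(X, y, lr=0.005, mu=0.9, n_passes=5, seed=0):
    """SGD with classical momentum for a linear model with Gaussian log-likelihood.

    v_n     = mu * v_{n-1} + a_n * grad_l(theta_{n-1}; y_n)
    theta_n = theta_{n-1} + v_n
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    v = np.zeros(p)
    for _ in range(n_passes):
        for i in rng.permutation(n):
            grad = X[i] * (y[i] - X[i] @ theta)  # gradient at the previous iterate theta_{n-1}
            v = mu * v + lr * grad               # velocity: mu weights the gradient history
            theta = theta + v
    return theta
```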
For Nesterov's accelerated gradient (1983), we again follow Sutskever (2012, Section 7.2):
v_n = μ v_{n-1} + α_n ∇l(θ_{n-1} + μ v_{n-1}; y_n)
θ_n = θ_{n-1} + v_n
As before, μ determines how much to weight the gradient history. Note that the only difference from classical momentum is that the gradient is evaluated at θ_{n-1} + μ v_{n-1} instead of θ_{n-1}. Nesterov momentum almost always performs better in practice than classical momentum. The intuition is that Nesterov momentum first makes a partial update, moving closer to θ_n, which allows it to change its velocity more quickly and responsively.
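A matching sketch of Nesterov momentum; the only change from the classical version is the look-ahead point at which the gradient is evaluated. μ = 0.9 and the constant learning rate are again illustrative choices.

```python
import numpy as np

def nesterov_momentum(X, y, lr=0.005, mu=0.9, n_passes=5, seed=0):
    """SGD with Nesterov's accelerated gradient for a linear model.

    v_n     = mu * v_{n-1} + a_n * grad_l(theta_{n-1} + mu * v_{n-1}; y_n)
    theta_n = theta_{n-1} + v_n
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    v = np.zeros(p)
    for _ in range(n_passes):
        for i in rng.permutation(n):
            lookahead = theta + mu * v                 # partial update toward theta_n
            grad = X[i] * (y[i] - X[i] @ lookahead)    # gradient at the look-ahead point
            v = mu * v + lr * grad
            theta = theta + v
    return theta
```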
Averaging of the iterates (Polyak-Ruppert averaging) can be used in conjunction with any of the parameter updates above: one maintains the running average
θ_bar_n = ((n-1)/n) θ_bar_{n-1} + (1/n) θ_n
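A minimal sketch of iterate averaging layered on top of the explicit update for the same toy model; the slowly decaying learning-rate schedule is an illustrative assumption (averaging is typically paired with larger, slowly decaying rates).

```python
import numpy as np

def averaged_sgd(X, y, lr0=0.1, n_passes=5, seed=0):
    """Explicit SGD with iterate averaging (Polyak-Ruppert) for a linear model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    theta_bar = np.zeros(p)
    t = 0
    for _ in range(n_passes):
        for i in rng.permutation(n):
            t += 1
            a_t = lr0 / t ** 0.75                                # slowly decaying rate (assumption)
            theta = theta + a_t * X[i] * (y[i] - X[i] @ theta)   # ordinary explicit update
            theta_bar = (t - 1) / t * theta_bar + theta / t      # running average of the iterates
    return theta_bar
```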
For the stochastic average gradient (SAG), we follow Le Roux et al. (2012).
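Below is a generic sketch of SAG on the same toy linear model (a simplified reading of the method, not the reference implementation): it stores the most recently computed gradient for each observation and steps along the average of the stored gradients, refreshing one entry per iteration. The constant learning rate is an illustrative assumption.

```python
import numpy as np

def sag(X, y, lr=0.005, n_iters=10_000, seed=0):
    """Sketch of the stochastic average gradient (SAG) method for a linear model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    grad_table = np.zeros((n, p))              # last computed gradient for each observation
    grad_sum = np.zeros(p)                     # sum of the rows of grad_table
    for _ in range(n_iters):
        i = rng.integers(n)
        g_new = X[i] * (y[i] - X[i] @ theta)   # fresh gradient for observation i
        grad_sum += g_new - grad_table[i]      # keep the sum up to date in O(p)
        grad_table[i] = g_new
        theta = theta + lr * grad_sum / n      # step along the average stored gradient
    return theta
```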
For the stochastic variance reduced gradient (SVRG), we follow Johnson and Zhang (2013).
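And a generic sketch of SVRG on the same toy model: each epoch fixes a snapshot of the iterate together with its full-data gradient, and the inner stochastic steps use the control variate g_i(θ) - g_i(θ_snapshot) + full gradient, which is unbiased for the full gradient but has much smaller variance near the snapshot. The learning rate and epoch count are illustrative assumptions.

```python
import numpy as np

def svrg(X, y, lr=0.01, n_epochs=10, seed=0):
    """Sketch of stochastic variance reduced gradient (SVRG) for a linear model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_epochs):
        theta_tilde = theta.copy()                 # snapshot of the current iterate
        resid_tilde = y - X @ theta_tilde
        full_grad = X.T @ resid_tilde / n          # full-data gradient at the snapshot
        for i in rng.integers(n, size=n):          # one pass of inner updates
            g_i = X[i] * (y[i] - X[i] @ theta)     # gradient at the current iterate
            g_i_tilde = X[i] * resid_tilde[i]      # gradient at the snapshot
            theta = theta + lr * (g_i - g_i_tilde + full_grad)
    return theta
```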
Other useful references:
- Course notes for Stanford's CS231n Convolutional Neural Networks for Visual Recognition, by Andrej Karpathy.
- Advances in optimizing Recurrent Networks by Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu.