- Generalization
- Loss Surface
- Batch Size
- General
- Adaptive Gradient Methods
- Distributed Optimization
- Initialization
- Low Precision
- Normalization
- Regularization
- Meta Learning

## Generalization

- 2018 ICLR Sensitivity and Generalization in Neural Networks: an Empirical Study
- 2018 arXiv On Characterizing the Capacity of Neural Networks using Algebraic Topology
- 2017 arXiv Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data
- 2017 NIPS Exploring Generalization in Deep Learning
- 2017 NIPS Train longer, generalize better: closing the generalization gap in large batch training of neural networks
- 2017 ICML A Closer Look at Memorization in Deep Networks
- 2017 ICLR Understanding deep learning requires rethinking generalization
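
One recurring empirical probe in these papers is how sharply a trained network's output changes under small input perturbations. Below is a minimal NumPy sketch of that idea, a finite-difference estimate of the input-output Jacobian norm; the two-layer `model` is a random stand-in for a trained network, not anything taken from the papers.

```python
import numpy as np

def jacobian_frobenius_norm(f, x, eps=1e-4):
    """Finite-difference estimate of ||df/dx||_F at a single input x.

    f : callable mapping a 1-D input vector to a 1-D output vector.
    x : 1-D NumPy array (one data point).
    """
    y0 = f(x)
    jac = np.zeros((y0.size, x.size))
    for i in range(x.size):
        x_pert = x.copy()
        x_pert[i] += eps
        jac[:, i] = (f(x_pert) - y0) / eps   # forward difference for column i
    return np.linalg.norm(jac)

# Toy usage: a random two-layer tanh network standing in for a trained model.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(32, 10)), rng.normal(size=(3, 32))
model = lambda x: W2 @ np.tanh(W1 @ x)

x = rng.normal(size=10)
print("sensitivity at x:", jacobian_frobenius_norm(model, x))
```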

## Loss Surface

- 2018 NIPS Visualizing the Loss Landscape of Neural Nets
- 2018 ICML Essentially No Barriers in Neural Network Energy Landscape
- 2018 arXiv Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
- 2018 ICML Optimization Landscape and Expressivity of Deep CNNs
- 2018 ICLR Measuring the Intrinsic Dimension of Objective Landscapes
- 2017 ICML The Loss Surface of Deep and Wide Neural Networks
- 2017 ICML Geometry of Neural Network Loss Surfaces via Random Matrix Theory
- 2017 ICML Sharp Minima Can Generalize For Deep Nets
- 2017 ICLR Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
- 2017 ICLR On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
- 2017 arXiv An empirical analysis of the optimization of deep network loss surfaces
- 2016 ICMLW Visualizing Deep Network Training Trajectories with PCA
- 2016 ICLRW Stuck in a What? Adventures in Weight Space
- 2015 ICLR Qualitatively Characterizing Neural Network Optimization Problems
- 2015 AISTATS The Loss Surfaces of Multilayer Networks
- 2014 NIPS Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
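
A common tool in these papers is the 1-D slice: evaluate the loss along the straight line between two parameter vectors (for example, the initialization and the point SGD converges to) and plot it. A minimal NumPy sketch, with a toy non-convex `loss` standing in for a real network's training loss:

```python
import numpy as np

def interpolate_loss(loss_fn, theta_a, theta_b, num=51):
    """Evaluate loss_fn along the straight line between two parameter vectors,
    extending slightly past both endpoints, as in 1-D loss-surface plots."""
    alphas = np.linspace(-0.5, 1.5, num)
    thetas = [(1 - a) * theta_a + a * theta_b for a in alphas]
    return alphas, np.array([loss_fn(t) for t in thetas])

# Toy non-convex loss standing in for a network's training loss.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
loss = lambda w: np.mean((np.tanh(A @ w) - b) ** 2)

theta_init  = rng.normal(size=5)    # e.g. the initialization
theta_final = rng.normal(size=5)    # e.g. a point found by SGD
alphas, losses = interpolate_loss(loss, theta_init, theta_final)
print(np.round(losses, 3))
```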

## Batch Size

- 2018 NIPS Hessian-based Analysis of Large Batch Training and Robustness to Adversaries
- 2018 ICLR Don't Decay the Learning Rate, Increase the Batch Size
- 2017 arXiv Scaling SGD Batch Size to 32K for ImageNet Training
- 2017 arXiv Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
- 2017 ICML Sharp Minima Can Generalize For Deep Nets
- 2017 ICLR On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
- 2016 ICML Train faster, generalize better: Stability of stochastic gradient descent
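
Several of the large-batch papers rely on a linear learning-rate scaling rule combined with a gradual warmup. A sketch of that schedule; the base rate, batch sizes, and warmup length here are placeholders, not values from the papers:

```python
def scaled_lr(step, base_lr=0.1, base_batch=256, batch=8192, warmup_steps=500):
    """Linear scaling rule with gradual warmup (all values are placeholders).

    The target rate is base_lr * batch / base_batch; during the first
    warmup_steps the rate ramps linearly from base_lr up to that target.
    """
    target = base_lr * batch / base_batch
    if step < warmup_steps:
        return base_lr + (target - base_lr) * step / warmup_steps
    return target

# Example: the learning rate at a few points of the schedule.
print([round(scaled_lr(s), 3) for s in (0, 250, 500, 10000)])
```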

## General

- 2016 arXiv Optimization Methods for Large-Scale Machine Learning
- 2016 Blog An overview of gradient descent optimization algorithms
- 2015 DL Summer School Non-Smooth, Non-Finite, and Non-Convex Optimization
- 2015 NIPS Training Very Deep Networks
- 2015 AISTATS Deeply-Supervised Nets
- 2014 OSLW On the Computational Complexity of Deep Learning
- 2011 ICML On Optimization Methods for Deep Learning
- 2010 AISTATS Understanding the difficulty of training deep feedforward neural networks
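
Several of these are surveys of first-order methods; as a reference point, here is the baseline they all discuss, plain mini-batch SGD with classical momentum, sketched in NumPy on a toy least-squares problem (the problem and hyperparameters are illustrative only):

```python
import numpy as np

def sgd_momentum(grad_fn, w0, data, lr=0.01, momentum=0.9,
                 batch_size=32, epochs=10, seed=0):
    """Plain mini-batch SGD with classical momentum over a finite dataset.

    grad_fn(w, batch) must return the average gradient over the batch.
    Assumes len(data) is divisible by batch_size to keep the sketch short.
    """
    rng = np.random.default_rng(seed)
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(epochs):
        for idx in rng.permutation(len(data)).reshape(-1, batch_size):
            g = grad_fn(w, data[idx])
            v = momentum * v - lr * g      # velocity update
            w = w + v                      # parameter update
    return w

# Toy least-squares problem: columns 0-4 of `data` are inputs, column 5 the target.
rng = np.random.default_rng(1)
w_true = rng.normal(size=5)
X = rng.normal(size=(512, 5))
data = np.hstack([X, (X @ w_true + 0.01 * rng.normal(size=512))[:, None]])
grad = lambda w, b: b[:, :5].T @ (b[:, :5] @ w - b[:, 5]) / len(b)
print(np.round(sgd_momentum(grad, np.zeros(5), data) - w_true, 3))  # near zero
```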

## Adaptive Gradient Methods

- 2017 NIPS The Marginal Value of Adaptive Gradient Methods in Machine Learning
- 2017 ICLR SGDR: Stochastic Gradient Descent with Warm Restarts (SGDR)
- 2015 ICLR Adam: A Method for Stochastic Optimization (Adam)
- 2013 ICML On the importance of initialization and momentum in deep learning (NAG)
- 2012 Lecture RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning (RMSProp)
- 2011 JMLR Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (Adagrad)
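
The parenthetical tags mark the update rules these papers introduce. As a concrete reference, here is a NumPy sketch of one Adam step; RMSProp keeps only the second-moment average (no first moment, no bias correction), and Adagrad accumulates squared gradients instead of averaging them.

```python
import numpy as np

def adam_step(w, g, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2015). `state` holds (m, v, t)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * g            # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g        # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)

# Usage on a toy quadratic, minimizing 0.5 * ||w - target||^2.
target = np.array([1.0, -2.0, 3.0])
w, state = np.zeros(3), (np.zeros(3), np.zeros(3), 0)
for _ in range(2000):
    w, state = adam_step(w, w - target, state, lr=1e-2)
print(np.round(w, 3))   # approaches `target`
```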

## Distributed Optimization

- 2017 arXiv Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
- 2017 NIPS TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
- 2017 NIPS QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding (QSGD)
- 2016 ICML Training Neural Networks Without Gradients: A Scalable ADMM Approach
- 2016 IJCAI Staleness-aware Async-SGD for Distributed Deep Learning
- 2016 ICLRW Revisiting Distributed Synchronous SGD
- 2016 Thesis Distributed Stochastic Optimization for Deep Learning (EASGD)
- 2015 NIPS Deep learning with Elastic Averaging SGD (EASGD)
- 2015 ICLR Parallel training of Deep Neural Networks with Natural Gradient and Parameter Averaging
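
Among these, EASGD has a particularly compact update rule: each worker takes a local SGD step plus an elastic pull toward a shared center variable, and the center is pulled back toward the workers. Below is a single-process simulation sketch; the worker count, learning rate, elastic coefficient, and toy objective are placeholders.

```python
import numpy as np

def easgd_simulate(grad_fn, dim, workers=4, steps=200, lr=0.05, rho=1.0, seed=0):
    """Synchronous elastic averaging SGD (EASGD), simulated in one process.

    Each worker keeps its own parameters x[i]; an elastic term pulls every
    worker toward the center variable x_tilde, and the center toward them.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(workers, dim))      # per-worker parameters
    x_tilde = np.zeros(dim)                  # center (shared) variable
    alpha = lr * rho                         # elastic coefficient
    for _ in range(steps):
        for i in range(workers):
            g = grad_fn(x[i], rng)           # worker-local stochastic gradient
            x[i] -= lr * g + alpha * (x[i] - x_tilde)
        x_tilde += alpha * (x - x_tilde).sum(axis=0)
    return x_tilde

# Toy objective: E[0.5 * ||w - target||^2] with noisy gradients.
target = np.array([1.0, -1.0, 0.5])
noisy_grad = lambda w, rng: (w - target) + 0.1 * rng.normal(size=w.size)
print(np.round(easgd_simulate(noisy_grad, dim=3), 3))   # approaches `target`
```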

## Initialization

- 2016 NIPS Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity
- 2016 ICLR All You Need is a Good Init
- 2016 ICLR Data-dependent Initializations of Convolutional Neural Networks
- 2015 ICCV Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (MSRAinit)
- 2014 ICLR Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
- 2013 ICML On the importance of initialization and momentum in deep learning
- 2010 AISTATS Understanding the difficulty of training deep feedforward neural networks (Xavier initialization)
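
The two schemes named in parentheses are simple enough to state directly: Xavier/Glorot initialization scales weights by fan-in plus fan-out, while He/MSRA initialization scales by fan-in alone with an extra factor of 2 for ReLU units. A NumPy sketch; the layer sizes below are arbitrary examples.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Glorot & Bengio (2010): keeps activation/gradient variance roughly
    constant across layers for tanh-like units."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out, rng):
    """He et al. (2015, MSRA init): variance 2 / fan_in, suited to ReLU units."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W1 = xavier_uniform(784, 256, rng)
W2 = he_normal(256, 10, rng)
print(W1.std(), W2.std())   # roughly sqrt(2/(784+256)) and sqrt(2/256)
```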

## Low Precision

- 2017 arXiv Gradient Descent for Spiking Neural Networks
- 2017 arXiv Training Quantized Nets: A Deeper Understanding
- 2017 arXiv TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
- 2017 ICML ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning
- 2016 arXiv QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks
- 2015 NIPS Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms
- 2013 arXiv Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
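
As one concrete example of the gradient-quantization idea shared by TernGrad and QSGD, the sketch below stochastically rounds each gradient coordinate to {-s, 0, +s} so that the quantized gradient is unbiased in expectation. It follows the TernGrad recipe in spirit only and is not the papers' exact implementation.

```python
import numpy as np

def ternarize_gradient(g, rng):
    """TernGrad-style stochastic ternarization: each coordinate becomes
    0 or +/- max|g|, chosen so that the quantized gradient is unbiased."""
    s = np.abs(g).max()
    if s == 0.0:
        return np.zeros_like(g)
    keep = rng.random(g.shape) < np.abs(g) / s   # P(keep_i) = |g_i| / s
    return s * np.sign(g) * keep

rng = np.random.default_rng(0)
g = rng.normal(size=5)
samples = np.mean([ternarize_gradient(g, rng) for _ in range(20000)], axis=0)
print(np.round(g, 3))
print(np.round(samples, 3))   # close to g, confirming unbiasedness
```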

## Normalization

- 2017 arXiv Self-Normalizing Neural Networks
- 2017 arXiv Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models
- 2016 NIPS Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
- 2016 NIPS Layer Normalization
- 2016 ICML Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks
- 2016 ICLR Data-Dependent Path Normalization in Neural Networks
- 2015 NIPS Path-SGD: Path-Normalized Optimization in Deep Neural Networks
- 2015 ICML Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
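
Most of these methods are variations on one pattern: normalize activations by some statistics, then apply a learned scale and shift. A minimal batch-normalization forward pass (training and inference modes) in NumPy; layer normalization would instead compute the statistics over features rather than over the batch.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running, momentum=0.9, eps=1e-5, training=True):
    """Batch normalization over a (batch, features) array.

    `running` is a dict with 'mean' and 'var' used at inference time.
    """
    if training:
        mu, var = x.mean(axis=0), x.var(axis=0)
        running['mean'] = momentum * running['mean'] + (1 - momentum) * mu
        running['var'] = momentum * running['var'] + (1 - momentum) * var
    else:
        mu, var = running['mean'], running['var']
    x_hat = (x - mu) / np.sqrt(var + eps)        # normalize per feature
    return gamma * x_hat + beta                  # learned scale and shift

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.normal(size=(64, 4))
running = {'mean': np.zeros(4), 'var': np.ones(4)}
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4), running=running)
print(np.round(y.mean(axis=0), 3), np.round(y.std(axis=0), 3))   # ~0 and ~1
```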

## Regularization

- 2017 arXiv L2 Regularization versus Batch and Weight Normalization
- 2014 JMLR Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Dropout)
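
For reference, a sketch of dropout in its common "inverted" form, which rescales at training time so that nothing changes at test time (the original paper instead rescales the weights at test time):

```python
import numpy as np

def dropout(x, p_drop, rng, training=True):
    """Inverted dropout: at train time zero each unit with probability p_drop
    and rescale the survivors by 1 / (1 - p_drop)."""
    if not training or p_drop == 0.0:
        return x
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones((4, 8))
print(dropout(h, p_drop=0.5, rng=rng))
```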

## Meta Learning

- 2017 ICML Neural Optimizer Search with Reinforcement Learning
- 2017 ICML Learned Optimizers that Scale and Generalize
- 2017 ICML Learning to Learn without Gradient Descent by Gradient Descent
- 2017 ICLR Learning to Optimize
- 2016 arXiv Learning to reinforcement learn
- 2016 NIPSW Learning to Learn for Global Optimization of Black Box Functions
- 2016 NIPS Learning to learn by gradient descent by gradient descent
- 2016 ICML Meta-learning with memory-augmented neural networks
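
These papers replace the hand-designed update rule with a learned one, typically a coordinate-wise RNN trained by unrolling the optimizee's loss. The sketch below only shows that interface: a tiny hand-set parametric rule stands in for the learned optimizer, and nothing here is actually meta-trained.

```python
import numpy as np

def learned_update(g, state, phi):
    """Stand-in for a learned per-coordinate optimizer: the update is a small
    parametric function of the gradient and an exponential gradient memory.
    In the papers above this function is an RNN trained by unrolling the
    optimizee's loss; here `phi` is fixed by hand just to show the interface."""
    memory = phi['decay'] * state + (1 - phi['decay']) * g
    return -phi['scale'] * memory, memory

# Apply the (untrained) update rule to a toy quadratic optimizee.
phi = {'decay': 0.9, 'scale': 0.1}             # would be meta-learned in practice
target = np.array([2.0, -1.0])
w, state = np.zeros(2), np.zeros(2)
for _ in range(300):
    grad = w - target                          # gradient of 0.5 * ||w - target||^2
    step, state = learned_update(grad, state, phi)
    w = w + step
print(np.round(w, 3))                          # approaches `target`
```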