- Generalization
- Loss Surface
- Batch Size
- General
- Adaptive Gradient Methods
- Distributed Optimization
- Initialization
- Low Precision
- Normalization
- Regularization
- Meta Learning

## Generalization

- 2018 ICLR Sensitivity and Generalization in Neural Networks: an Empirical Study
- 2018 arXiv On Characterizing the Capacity of Neural Networks using Algebraic Topology
- 2017 arXiv Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data
- 2017 NIPS Exploring Generalization in Deep Learning
- 2017 NIPS Train longer, generalize better: closing the generalization gap in large batch training of neural networks
- 2017 ICML A Closer Look at Memorization in Deep Networks
- 2017 ICLR Understanding deep learning requires rethinking generalization
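
One recurring empirical probe in these papers is how sharply a trained network's output changes under small input perturbations. Below is a minimal NumPy sketch of that idea, a finite-difference estimate of the input-output Jacobian norm; the two-layer `model` is a random stand-in for a trained network, not anything taken from the papers.

```python
import numpy as np

def jacobian_frobenius_norm(f, x, eps=1e-4):
    """Finite-difference estimate of ||df/dx||_F at a single input x.

    f : callable mapping a 1-D input vector to a 1-D output vector.
    x : 1-D NumPy array (one data point).
    """
    y0 = f(x)
    jac = np.zeros((y0.size, x.size))
    for i in range(x.size):
        x_pert = x.copy()
        x_pert[i] += eps
        jac[:, i] = (f(x_pert) - y0) / eps   # forward difference for column i
    return np.linalg.norm(jac)

# Toy usage: a random two-layer tanh network standing in for a trained model.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(32, 10)), rng.normal(size=(3, 32))
model = lambda x: W2 @ np.tanh(W1 @ x)

x = rng.normal(size=10)
print("sensitivity at x:", jacobian_frobenius_norm(model, x))
```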

## Loss Surface

- 2018 NIPS Visualizing the Loss Landscape of Neural Nets
- 2018 ICML Essentially No Barriers in Neural Network Energy Landscape
- 2018 arXiv Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
- 2018 ICML Optimization Landscape and Expressivity of Deep CNNs
- 2018 ICLR Measuring the Intrinsic Dimension of Objective Landscapes
- 2017 ICML The Loss Surface of Deep and Wide Neural Networks
- 2017 ICML Geometry of Neural Network Loss Surfaces via Random Matrix Theory
- 2017 ICML Sharp Minima Can Generalize For Deep Nets
- 2017 ICLR Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
- 2017 ICLR On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
- 2017 arXiv An empirical analysis of the optimization of deep network loss surfaces
- 2016 ICMLW Visualizing Deep Network Training Trajectories with PCA
- 2016 ICLRW Stuck in a What? Adventures in Weight Space
- 2015 ICLR Qualitatively Characterizing Neural Network Optimization Problems
- 2015 AISTATS The Loss Surfaces of Multilayer Networks
- 2014 NIPS Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
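
A common tool in these papers is the 1-D slice: evaluate the loss along the straight line between two parameter vectors (for example, the initialization and the point SGD converges to) and plot it. A minimal NumPy sketch, with a toy non-convex `loss` standing in for a real network's training loss:

```python
import numpy as np

def interpolate_loss(loss_fn, theta_a, theta_b, num=51):
    """Evaluate loss_fn along the straight line between two parameter vectors,
    extending slightly past both endpoints, as in 1-D loss-surface plots."""
    alphas = np.linspace(-0.5, 1.5, num)
    thetas = [(1 - a) * theta_a + a * theta_b for a in alphas]
    return alphas, np.array([loss_fn(t) for t in thetas])

# Toy non-convex loss standing in for a network's training loss.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
loss = lambda w: np.mean((np.tanh(A @ w) - b) ** 2)

theta_init  = rng.normal(size=5)    # e.g. the initialization
theta_final = rng.normal(size=5)    # e.g. a point found by SGD
alphas, losses = interpolate_loss(loss, theta_init, theta_final)
print(np.round(losses, 3))
```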

## Batch Size

- 2018 NIPS Hessian-based Analysis of Large Batch Training and Robustness to Adversaries
- 2018 ICLR Don't Decay the Learning Rate, Increase the Batch Size
- 2017 arXiv Scaling SGD Batch Size to 32K for ImageNet Training
- 2017 arXiv Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
- 2017 ICML Sharp Minima Can Generalize For Deep Nets
- 2017 ICLR On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
- 2016 ICML Train faster, generalize better: Stability of stochastic gradient descent
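
Several of the large-batch papers rely on a linear learning-rate scaling rule combined with a gradual warmup. A sketch of that schedule; the base rate, batch sizes, and warmup length here are placeholders, not values from the papers:

```python
def scaled_lr(step, base_lr=0.1, base_batch=256, batch=8192, warmup_steps=500):
    """Linear scaling rule with gradual warmup (all values are placeholders).

    The target rate is base_lr * batch / base_batch; during the first
    warmup_steps the rate ramps linearly from base_lr up to that target.
    """
    target = base_lr * batch / base_batch
    if step < warmup_steps:
        return base_lr + (target - base_lr) * step / warmup_steps
    return target

# Example: the learning rate at a few points of the schedule.
print([round(scaled_lr(s), 3) for s in (0, 250, 500, 10000)])
```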

## General

- 2016 arXiv Optimization Methods for Large-Scale Machine Learning
- 2016 Blog An overview of gradient descent optimization algorithms
- 2015 DL Summer School Non-Smooth, Non-Finite, and Non-Convex Optimization
- 2015 NIPS Training Very Deep Networks
- 2015 AISTATS Deeply-Supervised Nets
- 2014 OSLW On the Computational Complexity of Deep Learning
- 2011 ICML On Optimization Methods for Deep Learning
- 2010 AISTATS Understanding the difficulty of training deep feedforward neural networks
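
Several of these are surveys of first-order methods; as a reference point, here is the baseline they all discuss, plain mini-batch SGD with classical momentum, sketched in NumPy on a toy least-squares problem (the problem and hyperparameters are illustrative only):

```python
import numpy as np

def sgd_momentum(grad_fn, w0, data, lr=0.01, momentum=0.9,
                 batch_size=32, epochs=10, seed=0):
    """Plain mini-batch SGD with classical momentum over a finite dataset.

    grad_fn(w, batch) must return the average gradient over the batch.
    Assumes len(data) is divisible by batch_size to keep the sketch short.
    """
    rng = np.random.default_rng(seed)
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(epochs):
        for idx in rng.permutation(len(data)).reshape(-1, batch_size):
            g = grad_fn(w, data[idx])
            v = momentum * v - lr * g      # velocity update
            w = w + v                      # parameter update
    return w

# Toy least-squares problem: columns 0-4 of `data` are inputs, column 5 the target.
rng = np.random.default_rng(1)
w_true = rng.normal(size=5)
X = rng.normal(size=(512, 5))
data = np.hstack([X, (X @ w_true + 0.01 * rng.normal(size=512))[:, None]])
grad = lambda w, b: b[:, :5].T @ (b[:, :5] @ w - b[:, 5]) / len(b)
print(np.round(sgd_momentum(grad, np.zeros(5), data) - w_true, 3))  # near zero
```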

## Adaptive Gradient Methods

- 2017 NIPS The Marginal Value of Adaptive Gradient Methods in Machine Learning
- 2017 ICLR SGDR: Stochastic Gradient Descent with Warm Restarts (SGDR)
- 2015 ICLR Adam: A Method for Stochastic Optimization (Adam)
- 2013 ICML On the importance of initialization and momentum in deep learning (NAG)
- 2012 Lecture RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning (RMSProp)
- 2011 JMLR Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (Adagrad)
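
The parenthetical tags mark the update rules these papers introduce. As a concrete reference, here is a NumPy sketch of one Adam step; RMSProp keeps only the second-moment average (no first moment, no bias correction), and Adagrad accumulates squared gradients instead of averaging them.

```python
import numpy as np

def adam_step(w, g, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2015). `state` holds (m, v, t)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * g            # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g        # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)

# Usage on a toy quadratic, minimizing 0.5 * ||w - target||^2.
target = np.array([1.0, -2.0, 3.0])
w, state = np.zeros(3), (np.zeros(3), np.zeros(3), 0)
for _ in range(2000):
    w, state = adam_step(w, w - target, state, lr=1e-2)
print(np.round(w, 3))   # approaches `target`
```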

## Distributed Optimization

- 2017 arXiv Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
- 2017 NIPS TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
- 2017 NIPS QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding (QSGD)
- 2016 ICML Training Neural Networks Without Gradients: A Scalable ADMM Approach
- 2016 IJCAI Staleness-aware Async-SGD for Distributed Deep Learning
- 2016 ICLRW Revisiting Distributed Synchronous SGD
- 2016 Thesis Distributed Stochastic Optimization for Deep Learning (EASGD)
- 2015 NIPS Deep learning with Elastic Averaging SGD (EASGD)
- 2015 ICLR Parallel training of Deep Neural Networks with Natural Gradient and Parameter Averaging
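
Among these, EASGD has a particularly compact update rule: each worker takes a local SGD step plus an elastic pull toward a shared center variable, and the center is pulled back toward the workers. Below is a single-process simulation sketch; the worker count, learning rate, elastic coefficient, and toy objective are placeholders.

```python
import numpy as np

def easgd_simulate(grad_fn, dim, workers=4, steps=200, lr=0.05, rho=1.0, seed=0):
    """Synchronous elastic averaging SGD (EASGD), simulated in one process.

    Each worker keeps its own parameters x[i]; an elastic term pulls every
    worker toward the center variable x_tilde, and the center toward them.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(workers, dim))      # per-worker parameters
    x_tilde = np.zeros(dim)                  # center (shared) variable
    alpha = lr * rho                         # elastic coefficient
    for _ in range(steps):
        for i in range(workers):
            g = grad_fn(x[i], rng)           # worker-local stochastic gradient
            x[i] -= lr * g + alpha * (x[i] - x_tilde)
        x_tilde += alpha * (x - x_tilde).sum(axis=0)
    return x_tilde

# Toy objective: E[0.5 * ||w - target||^2] with noisy gradients.
target = np.array([1.0, -1.0, 0.5])
noisy_grad = lambda w, rng: (w - target) + 0.1 * rng.normal(size=w.size)
print(np.round(easgd_simulate(noisy_grad, dim=3), 3))   # approaches `target`
```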

## Initialization

- 2016 NIPS Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity
- 2016 ICLR All You Need is a Good Init
- 2016 ICLR Data-dependent Initializations of Convolutional Neural Networks
- 2015 ICCV Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (MSRAinit)
- 2014 ICLR Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
- 2013 ICML On the importance of initialization and momentum in deep learning
- 2010 AISTATS Understanding the difficulty of training deep feedforward neural networks (Xavier initialization)
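
The two schemes named in parentheses are simple enough to state directly: Xavier/Glorot initialization scales weights by fan-in plus fan-out, while He/MSRA initialization scales by fan-in alone with an extra factor of 2 for ReLU units. A NumPy sketch; the layer sizes below are arbitrary examples.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Glorot & Bengio (2010): keeps activation/gradient variance roughly
    constant across layers for tanh-like units."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out, rng):
    """He et al. (2015, MSRA init): variance 2 / fan_in, suited to ReLU units."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W1 = xavier_uniform(784, 256, rng)
W2 = he_normal(256, 10, rng)
print(W1.std(), W2.std())   # roughly sqrt(2/(784+256)) and sqrt(2/256)
```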

## Low Precision

- 2017 arXiv Gradient Descent for Spiking Neural Networks
- 2017 arXiv Training Quantized Nets: A Deeper Understanding
- 2017 arXiv TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
- 2017 ICML ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning
- 2016 arXiv QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks
- 2015 NIPS Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms
- 2013 arXiv Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
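
As one concrete example of the gradient-quantization idea shared by TernGrad and QSGD, the sketch below stochastically rounds each gradient coordinate to {-s, 0, +s} so that the quantized gradient is unbiased in expectation. It follows the TernGrad recipe in spirit only and is not the papers' exact implementation.

```python
import numpy as np

def ternarize_gradient(g, rng):
    """TernGrad-style stochastic ternarization: each coordinate becomes
    0 or +/- max|g|, chosen so that the quantized gradient is unbiased."""
    s = np.abs(g).max()
    if s == 0.0:
        return np.zeros_like(g)
    keep = rng.random(g.shape) < np.abs(g) / s   # P(keep_i) = |g_i| / s
    return s * np.sign(g) * keep

rng = np.random.default_rng(0)
g = rng.normal(size=5)
samples = np.mean([ternarize_gradient(g, rng) for _ in range(20000)], axis=0)
print(np.round(g, 3))
print(np.round(samples, 3))   # close to g, confirming unbiasedness
```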

## Normalization

- 2017 arXiv Self-Normalizing Neural Networks
- 2017 arXiv Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models
- 2016 NIPS Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
- 2016 NIPS Layer Normalization
- 2016 ICML Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks
- 2016 ICLR Data-Dependent Path Normalization in Neural Networks
- 2015 NIPS Path-SGD: Path-Normalized Optimization in Deep Neural Networks
- 2015 ICML Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
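
Most of these methods are variations on one pattern: normalize activations by some statistics, then apply a learned scale and shift. A minimal batch-normalization forward pass (training and inference modes) in NumPy; layer normalization would instead compute the statistics over features rather than over the batch.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running, momentum=0.9, eps=1e-5, training=True):
    """Batch normalization over a (batch, features) array.

    `running` is a dict with 'mean' and 'var' used at inference time.
    """
    if training:
        mu, var = x.mean(axis=0), x.var(axis=0)
        running['mean'] = momentum * running['mean'] + (1 - momentum) * mu
        running['var'] = momentum * running['var'] + (1 - momentum) * var
    else:
        mu, var = running['mean'], running['var']
    x_hat = (x - mu) / np.sqrt(var + eps)        # normalize per feature
    return gamma * x_hat + beta                  # learned scale and shift

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.normal(size=(64, 4))
running = {'mean': np.zeros(4), 'var': np.ones(4)}
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4), running=running)
print(np.round(y.mean(axis=0), 3), np.round(y.std(axis=0), 3))   # ~0 and ~1
```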

## Regularization

- 2017 arXiv L2 Regularization versus Batch and Weight Normalization
- 2014 JMLR Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Dropout)
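
For reference, a sketch of dropout in its common "inverted" form, which rescales at training time so that nothing changes at test time (the original paper instead rescales the weights at test time):

```python
import numpy as np

def dropout(x, p_drop, rng, training=True):
    """Inverted dropout: at train time zero each unit with probability p_drop
    and rescale the survivors by 1 / (1 - p_drop)."""
    if not training or p_drop == 0.0:
        return x
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones((4, 8))
print(dropout(h, p_drop=0.5, rng=rng))
```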

## Meta Learning

- 2017 ICML Neural Optimizer Search with Reinforcement Learning
- 2017 ICML Learned Optimizers that Scale and Generalize
- 2017 ICML Learning to Learn without Gradient Descent by Gradient Descent
- 2017 ICLR Learning to Optimize
- 2016 arXiv Learning to reinforcement learn
- 2016 NIPSW Learning to Learn for Global Optimization of Black Box Functions
- 2016 NIPS Learning to learn by gradient descent by gradient descent
- 2016 ICML Meta-learning with memory-augmented neural networks
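
These papers replace the hand-designed update rule with a learned one, typically a coordinate-wise RNN trained by unrolling the optimizee's loss. The sketch below only shows that interface: a tiny hand-set parametric rule stands in for the learned optimizer, and nothing here is actually meta-trained.

```python
import numpy as np

def learned_update(g, state, phi):
    """Stand-in for a learned per-coordinate optimizer: the update is a small
    parametric function of the gradient and an exponential gradient memory.
    In the papers above this function is an RNN trained by unrolling the
    optimizee's loss; here `phi` is fixed by hand just to show the interface."""
    memory = phi['decay'] * state + (1 - phi['decay']) * g
    return -phi['scale'] * memory, memory

# Apply the (untrained) update rule to a toy quadratic optimizee.
phi = {'decay': 0.9, 'scale': 0.1}             # would be meta-learned in practice
target = np.array([2.0, -1.0])
w, state = np.zeros(2), np.zeros(2)
for _ in range(300):
    grad = w - target                          # gradient of 0.5 * ||w - target||^2
    step, state = learned_update(grad, state, phi)
    w = w + step
print(np.round(w, 3))                          # approaches `target`
```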