Finding Flatter Minima with SGD
Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey
Feb 12, 2018 (modified: Feb 12, 2018) · ICLR 2018 Workshop Submission · Readers: everyone
Abstract: It has been discussed in prior work that over-parameterized deep neural networks (DNNs) trained using stochastic gradient descent (SGD) with smaller batch sizes generalize better than those trained with larger batch sizes. Additionally, the model parameters found by small-batch SGD tend to lie in flatter regions of the loss landscape. We extend these empirical observations and show experimentally that both a large learning rate and a small batch size contribute to SGD finding flatter minima that generalize well. Conversely, we find that small learning rates and large batch sizes lead to sharper minima that correlate with poor generalization in DNNs.
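The following is a minimal sketch, not the authors' experimental setup, of how one could probe the abstract's claim: the toy model, the synthetic data, the perturbation-based sharpness proxy, and the two (learning rate, batch size) settings are all illustrative assumptions. It trains the same network with a large-learning-rate/small-batch configuration and a small-learning-rate/large-batch configuration, then measures how much the full-batch loss rises under small random parameter perturbations; a smaller rise suggests a flatter minimum.

```python
# Sketch only: compare a crude "sharpness" proxy (loss increase under random
# parameter perturbations) for two SGD configurations. All choices below
# (model, data, eps, step counts) are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1024, 20)
y = (X[:, 0] > 0).long()  # synthetic binary labels

def train(lr, batch_size, steps=500):
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        idx = torch.randint(0, X.size(0), (batch_size,))
        opt.zero_grad()
        loss = loss_fn(model(X[idx]), y[idx])
        loss.backward()
        opt.step()
    return model

def sharpness(model, eps=1e-2, trials=10):
    """Average increase in full-batch loss under random parameter perturbations."""
    loss_fn = nn.CrossEntropyLoss()
    with torch.no_grad():
        base = loss_fn(model(X), y).item()
        increases = []
        for _ in range(trials):
            backup = [p.clone() for p in model.parameters()]
            for p in model.parameters():
                p.add_(eps * torch.randn_like(p))
            increases.append(loss_fn(model(X), y).item() - base)
            for p, b in zip(model.parameters(), backup):
                p.copy_(b)
    return sum(increases) / trials

# Large lr / small batch vs. small lr / large batch (values are assumptions).
for lr, bs in [(0.5, 32), (0.01, 512)]:
    model = train(lr, bs)
    print(f"lr={lr}, batch={bs}: sharpness proxy = {sharpness(model):.4f}")
```

Under the paper's claim one would expect the large-learning-rate/small-batch run to show the smaller loss increase, though a single random-perturbation proxy on toy data is only a rough stand-in for the sharpness measures used in the literature.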
TL;DR: Small batch size and large learning rate steer SGD towards flat minima.
Keywords: SGD, flat minima