On Batch Adaptive Training for Deep Learning: Lower Loss and Larger Step Size

Runyao Chen, Kun Wu, Ping Luo

Feb 15, 2018 (modified: Oct 27, 2017) ICLR 2018 Conference Blind Submission readers: everyone Show Bibtex
  • Abstract: Mini-batch gradient descent and its variants are commonly used in deep learning. The principle of mini-batch gradient descent is to use noisy gradient calculated on a batch to estimate the real gradient, thus balancing the computation cost per iteration and the uncertainty of noisy gradient. However, its batch size is a fixed hyper-parameter requiring manual setting before training the neural network. Yin et al. (2017) proposed a batch adaptive stochastic gradient descent (BA-SGD) that can dynamically choose a proper batch size as learning proceeds. We extend the BA-SGD to momentum algorithm and evaluate both the BA-SGD and the batch adaptive momentum (BA-Momentum) on two deep learning tasks from natural language processing to image classification. Experiments confirm that batch adaptive methods can achieve a lower loss compared with mini-batch methods after scanning the same epochs of data. Furthermore, our BA-Momentum is more robust against larger step sizes, in that it can dynamically enlarge the batch size to reduce the larger uncertainty brought by larger step sizes. We also identified an interesting phenomenon, batch size boom. The code implementing batch adaptive framework is now open source, applicable to any gradient-based optimization problems.
  • TL;DR: We developed a batch adaptive momentum that can achieve lower loss compared with mini-batch methods after scanning same epochs of data, and it is more robust against large step size.
  • Keywords: deep learning, optimization
0 Replies