Inefficiency of stochastic gradient descent with larger mini-batches (and more learners)

18 Apr 2024 (modified: 21 Jul 2022) · Submitted to ICLR 2017 · Readers: Everyone
Abstract: Stochastic Gradient Descent (SGD) and its variants are the most important optimization algorithms used in large-scale machine learning. The mini-batch version of SGD is often used in practice to take advantage of hardware parallelism. In this work, we analyze the effect of mini-batch size on SGD convergence for general non-convex objective functions. Building on past analyses, we show mathematically that there can often be a large difference between the convergence guarantees provided by small and large mini-batches (given that each instance processes an equal number of training samples), and we provide experimental evidence of this gap. Moving to distributed settings, we show that an analogous effect holds for popular Asynchronous Gradient Descent (ASGD): there can be a large difference between convergence guarantees as the number of learners increases, given that the cumulative number of training samples processed remains the same. Thus an inherent (and similar) inefficiency is introduced into the convergence behavior when we attempt to take advantage of parallelism, either by increasing the mini-batch size or by increasing the number of learners.
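The fixed-sample-budget comparison described in the abstract can be illustrated with a toy experiment: run mini-batch SGD on a simple non-convex regression problem with a small and a large mini-batch size, holding the cumulative number of processed training samples constant, so the larger batch gets correspondingly fewer parameter updates. The sketch below is not the paper's experimental setup; the objective, hyperparameters, and helper names (grad, run_sgd) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 20
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = np.tanh(X @ w_true) + 0.1 * rng.normal(size=N)  # non-convex fitting target

def grad(w, idx):
    # Stochastic gradient of the squared loss of tanh(X w) on a mini-batch.
    Xb, yb = X[idx], y[idx]
    pred = np.tanh(Xb @ w)
    residual = pred - yb
    return Xb.T @ (residual * (1.0 - pred ** 2)) / len(idx)

def run_sgd(batch_size, total_samples=200_000, lr=0.1):
    # Run mini-batch SGD until `total_samples` training samples are consumed,
    # so a larger batch size means proportionally fewer parameter updates.
    w = np.zeros(d)
    steps = total_samples // batch_size
    for _ in range(steps):
        idx = rng.integers(0, N, size=batch_size)
        w -= lr * grad(w, idx)
    return np.mean((np.tanh(X @ w) - y) ** 2)  # final training loss

for b in (16, 1024):
    print(f"batch={b:5d}  final loss={run_sgd(b):.4f}")

Under this equal-sample budget, the larger batch performs far fewer updates, which is the kind of convergence gap the paper quantifies; the specific numbers produced by this toy script are only illustrative.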
TL;DR: We show theoretically that increasing the mini-batch size or the number of learners can lead to slower SGD/ASGD convergence.
Conflicts: ibm.com
Keywords: Deep learning, Optimization
