Abstract: SGD and its variants are widespread in machine learning. Although there is extensive research on step-size schedules that guarantee convergence of these methods, substantially less work examines the influence of the batch size on optimization. The latter is typically kept constant and chosen via experimental validation.
In this work, we take a stochastic optimal control perspective to understand the effect of the batch size when optimizing non-convex functions with SGD. Specifically, we define an optimal control problem that considers the \emph{entire} trajectory of SGD in order to choose the optimal batch size for a noisy quadratic model. We show that the batch size is inherently coupled with the step size and that, for saddle points, there is a transition time $t^*$ after which it is beneficial to increase the batch size to reduce the covariance of the stochastic gradients. We verify our results empirically on various convex and non-convex problems.
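To make the noisy quadratic setting concrete, the following minimal sketch simulates SGD on a two-dimensional noisy quadratic, where a mini-batch of size $B$ averages $B$ independent per-sample gradients, so the stochastic-gradient noise covariance scales as $\Sigma/B$. The curvature matrix, noise covariance, step size, and all numeric values below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy quadratic model: f(x) = 0.5 * x^T H x, with
# per-sample gradient noise of covariance Sigma. Averaging a batch
# of B samples scales the noise covariance down to Sigma / B.
H = np.diag([1.0, 0.1])   # assumed curvature
Sigma = np.eye(2)         # assumed per-sample noise covariance

def sgd_noisy_quadratic(x0, eta, B, steps):
    """Run SGD with step size eta and batch size B on the noisy quadratic."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        noise = rng.multivariate_normal(np.zeros(2), Sigma / B)  # shrinks with B
        g = H @ x + noise    # stochastic gradient estimate
        x = x - eta * g
    return x

# Larger batches reduce gradient noise: the mean squared distance from
# the optimum after a fixed number of steps drops as B grows.
for B in (1, 16, 256):
    runs = [sgd_noisy_quadratic([3.0, 3.0], eta=0.1, B=B, steps=500)
            for _ in range(200)]
    print(B, np.mean([x @ x for x in runs]))
```

This illustrates only the covariance-reduction mechanism the abstract refers to; the paper's contribution is the optimal control problem that decides \emph{when} along the trajectory increasing $B$ pays off.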