Abstract: The choice of batch size in minibatch stochastic gradient optimization is critical for both optimization and generalization performance in large-scale model training. Although large-batch training is arguably the dominant paradigm in large-scale deep learning because of hardware advances, model generalization often deteriorates relative to small-batch training, leading to the so-called "generalization gap." To mitigate this issue, we investigate adaptive batch size strategies derived from adaptive sampling methods, which were originally developed for stochastic gradient descent. Given the strong interplay between learning rates and batch sizes, together with the prevalence of adaptive gradient methods in deep learning, we emphasize the need for adaptive batch size strategies in these settings. We introduce AdAdaGrad and its scalar variant AdAdaGradNorm, which progressively increase batch sizes during training while performing updates with AdaGrad and AdaGradNorm, respectively. We prove that AdAdaGradNorm converges with high probability at a rate of $\mathscr{O}(1/K)$ to a first-order stationary point of a smooth nonconvex function within $K$ iterations. AdAdaGrad also exhibits similar convergence properties when combined with a novel coordinate-wise variant of our adaptive batch size strategy. We corroborate our theoretical claims with image-classification experiments that highlight the merits of the proposed schemes in terms of both training efficiency and model generalization. Our work highlights the potential of adaptive batch size strategies for adaptive gradient optimizers in large-scale model training.
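As a rough illustration of the idea described in the abstract, the following minimal sketch couples an AdaGradNorm update with a batch size that only grows when a variance check fails. It is not the authors' implementation. It assumes the norm test from the adaptive sampling literature as the batch size rule (the abstract does not say which test is used), a toy least-squares objective, and hypothetical names and defaults throughout (`adadagrad_norm_sketch`, `theta`, `eta`, `b0`).

```python
import math
import random


def per_sample_grads(w, batch):
    """Per-sample gradients of 0.5 * (x.w - y)^2 for each (x, y) in the batch."""
    grads = []
    for x, y in batch:
        residual = sum(xi * wi for xi, wi in zip(x, w)) - y
        grads.append([residual * xi for xi in x])
    return grads


def mean_vec(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_norm(v):
    return sum(vi * vi for vi in v)


def full_loss(w, data):
    """Average least-squares loss over the whole dataset."""
    total = 0.0
    for x, y in data:
        total += 0.5 * (sum(xi * wi for xi, wi in zip(x, w)) - y) ** 2
    return total / len(data)


def next_batch_size(grads, g, theta, max_size):
    """Assumed rule (norm test): keep the size m if
    sample_variance / m <= theta^2 * ||g||^2, otherwise grow it."""
    m = len(grads)
    g2 = sq_norm(g)
    if m < 2 or g2 == 0.0:
        return m
    var = sum(sq_norm([a - b for a, b in zip(gi, g)]) for gi in grads) / (m - 1)
    if var / m <= theta ** 2 * g2:
        return m
    return min(max_size, math.ceil(var / (theta ** 2 * g2)))


def adadagrad_norm_sketch(data, dim, steps=200, eta=1.0, b0=1e-2,
                          theta=0.5, batch_size=4, seed=0):
    """AdaGradNorm steps with a non-decreasing, test-driven batch size."""
    rng = random.Random(seed)
    w = [0.0] * dim
    acc = b0 ** 2  # running sum of squared minibatch gradient norms
    sizes = []
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        grads = per_sample_grads(w, batch)
        g = mean_vec(grads)
        acc += sq_norm(g)
        step = eta / math.sqrt(acc)  # one scalar step size for all coordinates
        w = [wi - step * gi for wi, gi in zip(w, g)]
        sizes.append(batch_size)
        batch_size = next_batch_size(grads, g, theta, len(data))
    return w, sizes


# Toy data: y = 2*x0 - x1 + noise (purely illustrative).
_rng = random.Random(0)
data = []
for _ in range(500):
    x = [_rng.gauss(0.0, 1.0), _rng.gauss(0.0, 1.0)]
    data.append((x, 2.0 * x[0] - x[1] + _rng.gauss(0.0, 0.5)))

w, sizes = adadagrad_norm_sketch(data, dim=2)
print("final weights:", w)
print("first and last batch sizes:", sizes[0], sizes[-1])
```

In this sketch the batch size is revised after each step using the gradients already computed, so the check adds no extra gradient evaluations. A coordinate-wise variant, as mentioned for AdAdaGrad, would keep one accumulator per coordinate instead of the single scalar `acc`.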
Submission Type: Special issue on Statistics and AI
Submission Number: 4