Keywords: mini-batch SGD, batch size, wide neural network, saturation
Abstract: The performance of the mini-batch stochastic gradient method depends strongly on the batch size that is used. In the classical convex setting with interpolation, prior work showed that increasing the batch size linearly increases the convergence speed, but only up to a point: once the batch size exceeds a certain threshold (the critical batch size), further increases yield only negligible improvement.
The goal of this work is to investigate the relationship between the batch size and the convergence speed for a broader class of nonconvex problems. Building on recent improved convergence guarantees for SGD, we prove that a similar linear-scaling and batch-size-saturation phenomenon occurs when training sufficiently wide neural networks. We conduct a number of numerical experiments on benchmark datasets, which corroborate our findings.
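The linear-scaling and saturation behaviour described in the abstract can be illustrated with a small toy experiment. The sketch below is a hypothetical illustration, not the paper's experiments or proofs: it runs mini-batch SGD on an overparameterized least-squares problem that satisfies interpolation, using the step-size rule eta(b) = b / (L_max + (b - 1) L) borrowed from the classical convex interpolation analysis (an assumption here). Under that rule the iteration count drops roughly linearly in the batch size until b is on the order of L_max / L, after which larger batches give diminishing returns.

```python
# Toy sketch (assumed setup, not the paper's method): mini-batch SGD with interpolation.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 400                                # d > n, so an interpolating solution exists
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = X @ rng.standard_normal(d)                 # noiseless targets

L_max = np.max(np.sum(X**2, axis=1))           # largest per-sample smoothness constant
L = np.linalg.eigvalsh(X.T @ X / n).max()      # smoothness of the full-batch loss

def steps_to_tol(b, tol=1e-3, max_steps=200_000):
    """Mini-batch SGD (sampling with replacement); returns steps to reach loss <= tol."""
    w = np.zeros(d)
    eta = b / (L_max + (b - 1) * L)            # assumed step-size schedule (see lead-in)
    for step in range(1, max_steps + 1):
        idx = rng.integers(0, n, size=b)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / b
        w -= eta * grad
        if step % 50 == 0 and np.mean((X @ w - y) ** 2) <= tol:
            return step
    return max_steps

print(f"estimated critical batch size ~ L_max / L = {L_max / L:.1f}")
for b in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    print(f"batch size {b:4d}: {steps_to_tol(b):6d} steps")
```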
Submission Number: 39