Don't Use Large Mini-batches, Use Local SGD

Tao Lin; Sebastian U. Stich; Kumar Kshitij Patel; Martin Jaggi

Don't Use Large Mini-batches, Use Local SGD

Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi

Published: 20 Dec 2019, Last Modified: 22 Jun 2025ICLR 2020 Conference Blind SubmissionReaders: Everyone

Abstract: Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have lead to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. As a remedy, we propose a \emph{post-local} SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of \emph{local SGD} variants.

Code: [![github](/images/github_icon.svg) epfml/LocalSGD-Code](https://github.com/epfml/LocalSGD-Code) + [![Papers with Code](/images/pwc_icon.svg) 1 community implementation](https://paperswithcode.com/paper/?openreview=B1eyO1BFPr)

Data: [CIFAR-10](https://paperswithcode.com/dataset/cifar-10), [CIFAR-100](https://paperswithcode.com/dataset/cifar-100), [ImageNet](https://paperswithcode.com/dataset/imagenet)

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 3 code implementations](https://www.catalyzex.com/paper/don-t-use-large-mini-batches-use-local-sgd/code)

Original Pdf: pdf

18 Replies

Loading