Large Batch Training of Convolutional Networks with Layer-wise Adaptive Rate Scaling

Boris Ginsburg; Igor Gitman; Yang You

15 Feb 2018 (modified: 22 Jun 2025) | ICLR 2018 Conference Blind Submission | Readers: Everyone

Abstract: A common way to speed up training of large convolutional networks is to add computational units. Training is then performed using data-parallel synchronous Stochastic Gradient Descent (SGD) with a mini-batch divided between computational units. With an increase in the number of nodes, the batch size grows. However, training with a large batch often results in lower model accuracy. We argue that the current recipe for large batch training (linear learning rate scaling with warm-up) is not general enough and training may diverge. To overcome these optimization difficulties, we propose a new training algorithm based on Layer-wise Adaptive Rate Scaling (LARS). Using LARS, we scaled AlexNet and ResNet-50 to a batch size of 16K.
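To make the two recipes named in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the default `trust_coef` and `weight_decay` values, and the `eps` stabilizer are all hypothetical, and momentum is omitted for brevity. The first function shows the baseline (linear learning rate scaling with warm-up); the others show a per-layer rate scaled by the ratio of the weight norm to the gradient norm, which is our reading of the LARS rule in the full paper.

```python
import math


def warmup_linear_scaled_lr(base_lr, base_batch, batch, step, warmup_steps):
    """Baseline recipe: scale the LR linearly with batch size, with warm-up."""
    target_lr = base_lr * batch / base_batch
    if step < warmup_steps:
        # Ramp linearly from base_lr to target_lr over the warm-up period.
        return base_lr + (target_lr - base_lr) * step / warmup_steps
    return target_lr


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0005, eps=1e-12):
    """Per-layer rate: trust_coef * ||w|| / (||g|| + weight_decay * ||w||).

    trust_coef, weight_decay, and eps are illustrative assumptions.
    """
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)


def lars_sgd_step(weights, grads, global_lr, trust_coef=0.001, weight_decay=0.0005):
    """One momentum-free SGD step for a single layer using the local rate."""
    local_lr = lars_local_lr(weights, grads, trust_coef, weight_decay)
    return [
        w - global_lr * local_lr * (g + weight_decay * w)
        for w, g in zip(weights, grads)
    ]
```

In this sketch each layer is passed to `lars_sgd_step` separately, so a layer whose gradient norm is large relative to its weight norm takes a proportionally smaller step. With `weight_decay=0.0`, the size of the update depends on the weight norm rather than on the gradient scale.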

TL;DR: A new large batch training algorithm based on Layer-wise Adaptive Rate Scaling (LARS); using LARS, we scaled AlexNet and ResNet-50 to a batch of 16K.

Keywords: large batch, LARS, adaptive rate scaling

Community Implementations: [6 code implementations](https://www.catalyzex.com/paper/large-batch-training-of-convolutional/code)
