Bias Decay Matters: Improving Large Batch Optimization with Connectivity Sharpness

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission
Keywords: large batch optimization, sharpness/flatness
Abstract: As deep learning becomes more computationally intensive, data parallelism is an essential option for training high-performance models efficiently. Accordingly, recent studies have investigated methods for increasing the batch size used in training. Many of these focus on the learning rate, which determines the noise scale of parameter updates~\citep{goyal2017accurate, you2017large, You2020Large}, and find that a high learning rate is essential for maintaining both generalization performance and the flatness of the local minimizers~\citep{Jastrzebski2020The, cohen2021gradient, lewkowycz2020large}. To close the performance gap that still exists in large batch optimization, we instead study a method to directly control the flatness of local minima. Toward this end, we define a new sharpness measure called \textit{Connectivity sharpness}, which is reparameterization-invariant and structurally separable. Armed with this measure, we experimentally find that the standard \textit{no bias decay heuristic}~\citep{goyal2017accurate, he2019bag}, which recommends leaving the bias parameters and the $\gamma$ and $\beta$ parameters of BN layers unregularized during training, is a crucial cause of the performance degradation in large batch optimization. To mitigate this issue, we propose simple bias decay methods, including a novel adaptive one, and find that this simple remedy closes a large portion of the performance gap that occurs in large batch optimization.
One-sentence Summary: We propose a novel sharpness metric and remedy the pitfalls of the no bias decay heuristic in large batch optimization.
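
The no bias decay heuristic referenced in the abstract is commonly implemented by splitting model parameters into a weight-decay group and a decay-free group. The sketch below is a minimal, assumed PyTorch implementation of that grouping, with a flag to re-enable decay on biases and BN $\gamma$/$\beta$ in the spirit of the simple (non-adaptive) bias decay remedy described above; the function name, model, and hyperparameters are illustrative, and the paper's adaptive bias decay method is not reproduced here.

    import torch
    import torch.nn as nn

    def build_param_groups(model: nn.Module, weight_decay: float, apply_bias_decay: bool):
        # apply_bias_decay=False reproduces the standard no-bias-decay heuristic
        # (biases and BatchNorm gamma/beta are left unregularized);
        # apply_bias_decay=True applies uniform weight decay to every parameter.
        decay, no_decay = [], []
        for module in model.modules():
            for name, param in module.named_parameters(recurse=False):
                if not param.requires_grad:
                    continue
                bias_like = name.endswith("bias") or isinstance(
                    module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
                )
                if bias_like and not apply_bias_decay:
                    no_decay.append(param)  # heuristic: leave unregularized
                else:
                    decay.append(param)     # regularize weights (and biases/BN if enabled)
        return [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},
        ]

    # Illustrative usage with SGD; the model and hyperparameters are placeholders.
    model = nn.Sequential(nn.Linear(784, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Linear(256, 10))
    optimizer = torch.optim.SGD(
        build_param_groups(model, weight_decay=1e-4, apply_bias_decay=True),
        lr=0.1,
        momentum=0.9,
    )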