Enjoy Your Layer Normalization with the Computation Efficiency of RMSNorm

ICLR 2026 Conference Submission 9170 Authors

17 Sept 2025 (modified: 03 Dec 2025), ICLR 2026 Conference Submission, License: CC BY 4.0
Keywords: Layer normalization, RMSNorm, Deep Learning
Abstract: Layer normalization (LN) is a milestone technique in deep learning and has been widely adopted across various network architectures. However, LN introduces additional computational costs during inference. This issue has been addressed by its counterpart, RMSNorm, which removes the centering operation. This paper explores how to retain the theoretical advantages of LN while achieving the computational efficiency of RMSNorm. We first propose a general framework for determining whether an LN in any DNN can be equivalently replaced with RMSNorm. We introduce a methodology for removing the centering operation of an LN that follows a linear layer while preserving mathematical equivalence, by proposing column-based weight centering (CBWC) for the linear layer. We further define the foldable LN, i.e., an LN that can be replaced by RMSNorm without altering model behavior after applying constraints to certain layers, and introduce the zero-mean graph to analyze whether any LN in an arbitrary given neural network is foldable. We present an algorithm that automatically detects foldable LNs and show that most LNs in widely used architectures are foldable, which yields a straightforward reduction in computational costs during inference. Additionally, we conduct extensive experiments showing that 'CBWC+RMSNorm' achieves performance comparable to vanilla LN while improving efficiency during training, even in cases where the LN is not foldable.
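The following is a minimal sketch, not the authors' implementation, of the equivalence the abstract describes for an LN placed after a linear layer. It assumes CBWC amounts to subtracting each column's mean from the linear layer's weight matrix (and the mean from its bias), so the layer's output is zero-mean across features and LayerNorm reduces to RMSNorm; the helper `rmsnorm` and the names `W_c`, `b_c` are illustrative.

```python
import torch

torch.manual_seed(0)
d_in, d_out, eps = 16, 32, 1e-5
x = torch.randn(4, d_in)
linear = torch.nn.Linear(d_in, d_out)

def rmsnorm(y, eps=eps):
    # RMSNorm without affine parameters: divide by the root mean square of the features.
    return y / torch.sqrt(y.pow(2).mean(dim=-1, keepdim=True) + eps)

# Vanilla path: LayerNorm (no affine) applied after the linear layer.
ln = torch.nn.LayerNorm(d_out, eps=eps, elementwise_affine=False)
y_ln = ln(linear(x))

# CBWC path (assumed reading of the abstract): center each column of W over the
# output dimension, center the bias, then apply RMSNorm instead of LayerNorm.
W, b = linear.weight, linear.bias          # W has shape [d_out, d_in]
W_c = W - W.mean(dim=0, keepdim=True)      # column-based weight centering
b_c = b - b.mean()
y_rms = rmsnorm(x @ W_c.T + b_c)

print(torch.allclose(y_ln, y_rms, atol=1e-5))  # True: the two paths agree
```

The check holds because centering the columns of W and the bias makes the linear output zero-mean over the feature dimension, at which point LN's centering step is a no-op and only the RMS scaling remains.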
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 9170