LayerNorm vs RMSNorm: Geometric Perspective and a Case Against Mean Subtraction

ACL ARR 2025 July Submission818 Authors

28 Jul 2025 (modified: 30 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0

Abstract: This paper presents a novel geometric interpretation of LayerNorm and explores how LayerNorm influences the norm and orientation of hidden vectors in the representation space. We show that the definition of LayerNorm is innately linked to the uniform vector, defined as $\boldsymbol{1} = [1, 1, 1, 1, \cdots, 1]^T \in \mathbb{R}^d$. We then show that the standardization step in LayerNorm can be understood as three simple steps: (i) remove the component of a vector along the uniform vector, (ii) normalize the remaining vector, and (iii) scale the resultant vector by $\sqrt{d}$, where $d$ is the dimensionality of the representation space. Finally, we compare the hidden representations of LayerNorm-based LLMs with those of models trained using RMSNorm and show that all LLMs naturally operate orthogonal to the uniform vector: on average, their hidden representations have no component along the uniform vector during either training or inference. This presents the first mechanistic evidence that removing the component along the uniform vector in LayerNorm is a redundant step. These results advocate for using RMSNorm over LayerNorm, as RMSNorm is also more computationally efficient.
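The three-step reading of the standardization step described in the abstract can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the function names are hypothetical, and it assumes the usual LayerNorm definition with the learnable gain and bias omitted and the small epsilon in the denominator set to zero. It also illustrates why mean subtraction would be redundant for a vector that is already orthogonal to the uniform vector: on such a vector, RMSNorm (again without a learnable gain) gives the same output.

```python
import numpy as np

def layernorm_standardize(x):
    """Standardization step of LayerNorm (assumed: no gain/bias, no epsilon)."""
    mu = x.mean()
    sigma = x.std()  # population standard deviation (ddof=0)
    return (x - mu) / sigma

def layernorm_geometric(x):
    """The same step written as the three geometric operations."""
    d = x.shape[0]
    ones = np.ones(d)  # the uniform vector 1 in R^d
    # (i) remove the component of x along the uniform vector
    residual = x - (x @ ones) / (ones @ ones) * ones
    # (ii) normalize the remaining vector to unit length
    unit = residual / np.linalg.norm(residual)
    # (iii) scale the result by sqrt(d)
    return np.sqrt(d) * unit

def rmsnorm(x):
    """RMSNorm without a learnable gain (assumed: no epsilon)."""
    return x / np.sqrt(np.mean(x ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=16)

standardized = layernorm_standardize(x)
geometric = layernorm_geometric(x)

# The two formulations agree up to floating-point error.
print(np.allclose(standardized, geometric))

# If the input already has zero mean (no component along the uniform
# vector), RMSNorm and LayerNorm standardization coincide.
x_centered = x - x.mean()
print(np.allclose(rmsnorm(x_centered), layernorm_standardize(x_centered)))
```

The agreement follows from $\sigma = \lVert x - \mu \boldsymbol{1} \rVert / \sqrt{d}$, so dividing the centered vector by $\sigma$ is the same as normalizing it and scaling by $\sqrt{d}$.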

Paper Type: Long

Research Area: Interpretability and Analysis of Models for NLP

Research Area Keywords: probing, feature attribution

Contribution Types: Model analysis & interpretability

Languages Studied: English

Previous URL: https://openreview.net/forum?id=R9qI0fhkpu

Explanation Of Revisions PDF: pdf

Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).

Reassignment Request Reviewers: Yes, I want a different set of reviewers

A1 Limitations Section: This paper has a limitations section.

A2 Potential Risks: N/A

B Use Or Create Scientific Artifacts: Yes

B1 Cite Creators Of Artifacts: Yes

B1 Elaboration: citations happen in the entire paper

B2 Discuss The License For Artifacts: No

B2 Elaboration: everything is open source

B3 Artifact Use Consistent With Intended Use: Yes

B3 Elaboration: everything is open source

B4 Data Contains Personally Identifying Info Or Offensive Content: No

B4 Elaboration: Pretraining was done on Wikipedia data

B5 Documentation Of Artifacts: No

B5 Elaboration: Pretraining was done on Wikipedia data

B6 Statistics For Data: Yes

B6 Elaboration: section 3

C Computational Experiments: Yes

C1 Model Size And Budget: Yes

C1 Elaboration: 3

C2 Experimental Setup And Hyperparameters: Yes

C2 Elaboration: 3

C3 Descriptive Statistics: Yes

C3 Elaboration: 3

C4 Parameters For Packages: Yes

C4 Elaboration: 3

D Human Subjects Including Annotators: No

D1 Instructions Given To Participants: N/A

D2 Recruitment And Payment: N/A

D3 Data Consent: N/A

D4 Ethics Review Board Approval: N/A

D5 Characteristics Of Annotators: N/A

E Ai Assistants In Research Or Writing: No

E1 Information About Use Of Ai Assistants: N/A

Author Submission Checklist: yes

Submission Number: 818
