Abstract: This paper presents a novel geometric interpretation of LayerNorm and explores how LayerNorm influences the norm and orientation of hidden vectors in the representation space. We show that the definition of LayerNorm is innately linked to the uniform vector, defined as $\boldsymbol{1} = [1, 1, 1, 1, \cdots, 1]^T \in \mathbb{R}^d$. We then show that the standardization step in LayerNorm can be understood in three simple steps: (i) remove the component of a vector along the uniform vector, (ii) normalize the remaining vector, and (iii) scale the resultant vector by $\sqrt{d}$, where $d$ is the dimensionality of the representation space. Finally, we compare the hidden representations of LayerNorm-based LLMs with models trained using RMSNorm and show that all LLMs naturally operate orthogonal to the uniform vector both during training and inference, that is, on average they do not have a component along the uniform vector during training or inference. This presents the first mechanistic evidence that removing the component along the uniform vector in LayerNorm is a redundant step. These results advocate for using RMSNorm over LayerNorm which is also more computationally efficient.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing, feature attribution
Contribution Types: Model analysis & interpretability
Languages Studied: English
Previous URL: https://openreview.net/forum?id=R9qI0fhkpu
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: Yes, I want a different set of reviewers
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: citations happen in the entire paper
B2 Discuss The License For Artifacts: No
B2 Elaboration: everything is open source
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: everything is open source
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: Pretraining done on wikipedia data
B5 Documentation Of Artifacts: No
B5 Elaboration: Pretraining done on wikipedia data
B6 Statistics For Data: Yes
B6 Elaboration: section 3
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 3
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 3
C3 Descriptive Statistics: Yes
C3 Elaboration: 3
C4 Parameters For Packages: Yes
C4 Elaboration: 3
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 818
Loading