Your Transformer is Secretly Linear

Published: 01 Jan 2024 · Last Modified: 20 May 2025 · ACL 2024 · CC BY-SA 4.0
Abstract: This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, and BLOOM. We analyze embedding transformations between sequential layers and uncover an almost perfectly linear relationship (a Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed, owing to the consistently low norm of the transformer layer output. Our experiments show that pruning or linearly approximating some of the layers does not significantly affect loss or model performance. Moreover, we introduce a cosine-similarity-based regularization in our pretraining experiments on smaller models, aimed at reducing layer linearity. This regularization not only improves performance metrics on benchmarks such as Tiny Stories and SuperGLUE but also successfully decreases the linearity of the models. This study challenges the existing understanding of transformer architectures, suggesting that their operation may be more linear than previously assumed.
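The abstract reports a Procrustes similarity score between embeddings of sequential layers. The exact normalization used in the paper is not given here; below is a minimal sketch of one plausible way to compute such a linearity score, assuming centered, Frobenius-normalized embedding matrices and a least-squares linear fit. The function name `linearity_score` and these exact choices are illustrative, not the paper's definitive implementation.

```python
import torch

def linearity_score(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Procrustes-style linearity score between embeddings of two consecutive layers.

    X, Y: (n_tokens, hidden_dim) hidden states from layer l and layer l+1.
    Returns a value in [0, 1]; values near 1 indicate that Y is almost
    a linear function of X.
    """
    # Center and normalize each embedding matrix (unit Frobenius norm),
    # so the residual of the best linear fit is scale-invariant.
    Xc = X - X.mean(dim=0, keepdim=True)
    Yc = Y - Y.mean(dim=0, keepdim=True)
    Xc = Xc / Xc.norm()
    Yc = Yc / Yc.norm()

    # Best linear map A minimizing ||Xc @ A - Yc||_F via least squares.
    A = torch.linalg.lstsq(Xc, Yc).solution

    # Score = 1 - squared residual of the linear fit (in [0, 1] after normalization).
    residual = (Xc @ A - Yc).norm() ** 2
    return float(1.0 - residual)
```

In practice one could feed this function pairs of consecutive entries from `model(..., output_hidden_states=True).hidden_states`, flattened over the batch and sequence dimensions.

The abstract also mentions a cosine-similarity-based regularization used during pretraining to reduce layer linearity. Its precise form is not stated here; the sketch below shows one plausible variant that penalizes high per-token cosine similarity between consecutive-layer embeddings, with the helper name `cosine_regularizer` and the `weight` coefficient introduced purely for illustration.

```python
import torch
import torch.nn.functional as F

def cosine_regularizer(hidden_states: list[torch.Tensor], weight: float = 0.1) -> torch.Tensor:
    """Illustrative penalty on the mean cosine similarity between embeddings
    of consecutive layers; intended to be added to the language-modeling loss.

    hidden_states: list of (n_tokens, hidden_dim) tensors, one per layer.
    """
    penalty = hidden_states[0].new_zeros(())
    for prev, curr in zip(hidden_states[:-1], hidden_states[1:]):
        # Average per-token cosine similarity between layer l and layer l+1.
        penalty = penalty + F.cosine_similarity(prev, curr, dim=-1).mean()
    return weight * penalty / (len(hidden_states) - 1)
```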