What Matters in Transformers? Not All Attention is Needed

26 Sept 2024 (modified: 23 Jan 2025) · ICLR 2025 Conference Withdrawn Submission · Readers: Everyone · CC BY 4.0
Keywords: Transformer, Attention, Model Compression, Efficiency
TL;DR: By exploring and leveraging redundancy in Transformer-based LLMs, we uncover the surprising extent of redundancy in attention layers, providing valuable insights for future model design.
Abstract: While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Although redundancy in LLMs has received some recognition, how it varies across different Transformer modules, such as MLP and Attention layers, remains under-explored. In this work, we investigate redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric. Surprisingly, despite the critical role of attention layers in distinguishing Transformers from other architectures, we find that a large portion of these layers exhibit excessively high similarity and can be pruned without degrading performance. For instance, Llama-2-70B achieves a 48.4% speedup with only a 2.4% performance drop when half of its attention layers are pruned. Furthermore, by tracing model checkpoints throughout the training process, we observe that attention-layer redundancy is inherent and consistent across training stages. We further propose a method that jointly drops Attention and MLP layers, allowing additional layers to be removed more aggressively. For instance, when dropping 31 layers (Attention + MLP), Llama-2-13B still retains 90% of its performance on the MMLU task. Our work provides valuable insights for future network architecture design. The code will be released upon acceptance.
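The abstract does not spell out the similarity-based metric, so the sketch below is a minimal, hypothetical PyTorch illustration (not the authors' released code) of one common way to score a sub-layer's redundancy: compare its input and output hidden states, and treat near-identical activations as a sign the layer could be dropped. The function name, tensor shapes, and pruning recipe in the comments are illustrative assumptions.

import torch

def layer_redundancy_score(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    # hidden_in / hidden_out: [batch, seq_len, hidden_dim] activations captured
    # immediately before and after an Attention (or MLP) sub-layer.
    # A mean cosine similarity close to 1.0 means the sub-layer barely changes
    # the representation, making it a candidate for pruning.
    sim = torch.nn.functional.cosine_similarity(hidden_in, hidden_out, dim=-1)
    return sim.mean().item()

# Illustrative usage (an assumption, not the paper's exact procedure): score every
# attention sub-layer on a small calibration set, rank layers by score, drop the
# most redundant half, and then re-evaluate downstream accuracy.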
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8274
