Abstract: While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces architectural redundancy, posing efficiency challenges for real-world deployment. Although redundancy in LLMs has received some recognition, how it varies across different Transformer modules, such as MLP and attention layers, remains under-explored. In this work, we investigate redundancy across Transformer modules, including blocks, MLP layers, and attention layers, through the lens of layer dropping. Surprisingly, despite the pivotal role of the attention mechanism in distinguishing Transformers from other architectures, we find that a large portion of attention layers are excessively redundant and can be pruned without degrading performance. For example, LLaMA-3-70B achieves a 43.4\% speedup with only a 1.8\% drop in performance when half of its attention layers are pruned. In contrast, dropping MLP layers severely impairs the model's ability to distinguish between tokens, leading to catastrophic performance degradation. Moreover, our analysis reveals that attention-layer redundancy not only persists throughout training but is also evident in randomly initialized models. We attribute this redundancy to three key factors that constrain the representational updates contributed by attention layers: sparse attention patterns, over-smoothed token embeddings, and the low representational magnitude of attention outputs. Overall, our findings offer valuable insights into the internal redundancy of Transformer architectures and provide practical guidance for designing more efficient LLMs. Code will be released upon acceptance.
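As a concrete illustration of the attention-layer dropping described in the abstract, the following is a minimal sketch, assuming a standard pre-LN decoder block in PyTorch; the module names, dimensions, and the `drop_attn`/`drop_mlp` flags are illustrative assumptions, not the paper's released code. Dropping an attention layer simply skips its residual update x = x + Attn(LN(x)) while keeping the MLP path.

```python
# Minimal sketch (illustrative, not the paper's implementation): a pre-LN Transformer
# block in which the attention or MLP sub-layer can be dropped by skipping its
# residual-branch update. All hyperparameters below are placeholders.
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x, drop_attn=False, drop_mlp=False):
        if not drop_attn:
            # Pruning the attention layer amounts to skipping this residual update.
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        if not drop_mlp:
            # Dropping the MLP path instead is far more damaging, per the paper's findings.
            x = x + self.mlp(self.ln2(x))
        return x

# Example usage: compare the full block with an attention-dropped block.
block = PreLNBlock()
x = torch.randn(2, 16, 512)
out_full = block(x)
out_no_attn = block(x, drop_attn=True)
```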
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=xnYT0HjBsT
Changes Since Last Submission: In this revision, we have added several new analyses and experiments to further support our findings:
- Visualization of redundancy in attention. We provide additional visualizations, including attention maps, relative magnitudes in the residual branch, and token-wise similarity (over-smoothing effects). We also analyze the impact of dropping MLP vs. attention layers on the output tokens, which further confirms that attention layers are more redundant.
- Evaluation on Peri-LN models. Beyond Pre-LN architectures, we include experiments on models with Peri-LN normalization, demonstrating that our method generalizes across different normalization designs.
- Additional explanation of cosine similarity. We expand the discussion in the appendix to clarify how magnitude information is implicitly captured, and provide supporting evidence through comparisons of attention and MLP output magnitudes (a small illustrative sketch of this measurement follows this list).
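The sketch below shows, under stated assumptions, how such redundancy statistics can be computed: the cosine similarity between a sub-layer's input x and its output x + f(x), together with the relative magnitude ||f(x)|| / ||x|| of the residual-branch update. The tensor shapes and the synthetic inputs are assumptions for illustration only and do not reproduce the paper's measurements.

```python
# Illustrative sketch (not the released code): redundancy statistics for a residual
# sub-layer, given its input x and its residual-branch output f(x).
import torch
import torch.nn.functional as F

def redundancy_stats(x, f_x):
    """x: (batch, seq, d) sub-layer input; f_x: residual-branch output f(x)."""
    out = x + f_x
    # Cosine similarity near 1 means the sub-layer acts almost as an identity map.
    cos = F.cosine_similarity(x, out, dim=-1).mean()
    # Small relative magnitude means a weak representational update on the residual stream.
    rel_mag = (f_x.norm(dim=-1) / x.norm(dim=-1)).mean()
    return cos.item(), rel_mag.item()

# Example with random tensors standing in for hidden states (purely synthetic):
x = torch.randn(2, 16, 512)
attn_out = 0.05 * torch.randn(2, 16, 512)  # attention-style update: small magnitude
mlp_out = 0.5 * torch.randn(2, 16, 512)    # MLP-style update: larger magnitude
print(redundancy_stats(x, attn_out))  # high cosine similarity, small relative magnitude
print(redundancy_stats(x, mlp_out))
```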
Assigned Action Editor: ~Murat_A_Erdogdu1
Submission Number: 6033