Abstract: While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces architectural redundancy, posing efficiency challenges for real-world deployment. Although redundancy in LLMs has received some recognition, how it varies across different Transformer modules, such as MLP and attention layers, remains under-explored. In this work, we investigate redundancy across Transformer modules, including blocks, MLP layers, and attention layers, through the lens of layer dropping. Surprisingly, despite the pivotal role of the attention mechanism in distinguishing Transformers from other architectures, we find that a large portion of attention layers are excessively redundant and can be pruned without degrading performance. For example, LLaMA-3-70B achieves a 43.4\% speedup with only a 1.8\% drop in performance when half of its attention layers are pruned. In contrast, dropping MLP layers severely impairs the model's ability to distinguish between tokens, leading to catastrophic performance degradation. Moreover, our analysis reveals that attention-layer redundancy not only persists throughout training but is also evident in randomly initialized models. We attribute this redundancy to three key factors that constrain the representational updates contributed by attention layers: sparse attention patterns, over-smoothed token embeddings, and the low representational magnitude of attention outputs. Overall, our findings offer valuable insights into the internal redundancy of Transformer architectures and provide practical guidance for designing more efficient LLMs. Code will be released upon acceptance.
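As a concrete illustration of the attention-layer dropping described in the abstract, the following is a minimal sketch, assuming a standard pre-LN decoder block in PyTorch; the module names, dimensions, and the `drop_attn`/`drop_mlp` flags are illustrative assumptions, not the paper's released code. Dropping an attention layer simply skips its residual update x = x + Attn(LN(x)) while keeping the MLP path.

```python
# Minimal sketch (illustrative, not the paper's implementation): a pre-LN Transformer
# block in which the attention or MLP sub-layer can be dropped by skipping its
# residual-branch update. All hyperparameters below are placeholders.
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x, drop_attn=False, drop_mlp=False):
        if not drop_attn:
            # Pruning the attention layer amounts to skipping this residual update.
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        if not drop_mlp:
            # Dropping the MLP path instead is far more damaging, per the paper's findings.
            x = x + self.mlp(self.ln2(x))
        return x

# Example usage: compare the full block with an attention-dropped block.
block = PreLNBlock()
x = torch.randn(2, 16, 512)
out_full = block(x)
out_no_attn = block(x, drop_attn=True)
```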
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=xnYT0HjBsT
Changes Since Last Submission: In this revision, we have added several new analyses and experiments to further support our findings:
- Visualization of redundancy in attention. We provide additional visualizations, including attention maps, relative magnitudes in the residual branch, and token-wise similarity (over-smoothing effects). We also analyze the impact of dropping MLP vs. attention layers on the output tokens, which further confirms that attention layers are more redundant.
- Evaluation on Peri-LN models. Beyond Pre-LN architectures, we include experiments on models with Peri-LN normalization, demonstrating that our method generalizes across different normalization designs.
- Additional explanation of cosine similarity. We expand the discussion in the appendix to clarify how magnitude information is implicitly captured, and provide supporting evidence through comparisons of attention and MLP output magnitudes (a small illustrative sketch of this measurement follows this list).
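The sketch below shows, under stated assumptions, how such redundancy statistics can be computed: the cosine similarity between a sub-layer's input x and its output x + f(x), together with the relative magnitude ||f(x)|| / ||x|| of the residual-branch update. The tensor shapes and the synthetic inputs are assumptions for illustration only and do not reproduce the paper's measurements.

```python
# Illustrative sketch (not the released code): redundancy statistics for a residual
# sub-layer, given its input x and its residual-branch output f(x).
import torch
import torch.nn.functional as F

def redundancy_stats(x, f_x):
    """x: (batch, seq, d) sub-layer input; f_x: residual-branch output f(x)."""
    out = x + f_x
    # Cosine similarity near 1 means the sub-layer acts almost as an identity map.
    cos = F.cosine_similarity(x, out, dim=-1).mean()
    # Small relative magnitude means a weak representational update on the residual stream.
    rel_mag = (f_x.norm(dim=-1) / x.norm(dim=-1)).mean()
    return cos.item(), rel_mag.item()

# Example with random tensors standing in for hidden states (purely synthetic):
x = torch.randn(2, 16, 512)
attn_out = 0.05 * torch.randn(2, 16, 512)  # attention-style update: small magnitude
mlp_out = 0.5 * torch.randn(2, 16, 512)    # MLP-style update: larger magnitude
print(redundancy_stats(x, attn_out))  # high cosine similarity, small relative magnitude
print(redundancy_stats(x, mlp_out))
```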
Assigned Action Editor: ~Murat_A_Erdogdu1
Submission Number: 6033