On the Limitation and Redundancy of Transformers: A Rank Perspective

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Transformer efficiency, attention mechanism, model capacity analysis, matrix rank
TL;DR: This paper reveals an inherent low-rank limitation in the attention matrices of Transformers, showing that increasing the head dimension eventually yields diminishing returns, with theoretical and empirical support for both rank and performance saturation.
Abstract: Transformers have demonstrated superior performance across a variety of real-world applications, most notably underpinning the success of large foundation models. However, the computation and memory demands of these models, trained on web-scale datasets, continue to grow considerably, calling for more efficient learning methods. In this work, we take a step in this direction by investigating the architectural limitations and redundancy of Transformers through the ranks of their attention score matrices. On one hand, we conduct extensive experiments across model configurations (model dimensions, heads, layers, etc.) and data distributions (both synthetic and real-world datasets with varied sequence lengths), uncovering two key properties as the head dimension $d_h$ increases: the attention rank is eventually upper bounded (limitation) and becomes saturated (redundancy). We call these the low-rank barrier and the model-reduction effect, respectively. Most importantly, the redundancy manifests as only marginal gains in both attention rank and learning performance when model parameters are increased. On the other hand, we rigorously establish these observations under idealized settings via a fine-grained mathematical analysis, highlighting (i) a consistent theoretical upper bound ($\approx 0.63n$, where $n$ is the sequence length) on the attention rank, regardless of $d_h$, under random weights; and (ii) a critical threshold $d_h=\Omega(\log n)$ at which the rank saturates. These results contribute to a principled understanding and assessment of the model capacity and efficiency of Transformers, and are further verified in practical settings such as the multi-head \emph{latent} attention (MLA) used in DeepSeek-V3.
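As an illustration of where a constant like $\approx 0.63n$ can arise, below is a minimal numerical sketch (not the authors' code). It assumes the hard (argmax) attention idealization with i.i.d. Gaussian queries and keys: each row of the hard attention matrix is one-hot, so its rank equals the number of distinct argmax keys, and when the argmax positions are roughly uniform this count concentrates near $(1 - 1/e)n \approx 0.63n$. The function name `hard_attention_rank`, the choice $n = 512$, and the Gaussian setup are illustrative assumptions and not necessarily the paper's exact setting.

```python
import numpy as np

def hard_attention_rank(n, d_h, seed=0):
    """Rank of the one-hot (argmax) attention matrix for i.i.d. Gaussian Q, K.

    Each row of the hard attention matrix is a one-hot vector at the argmax key,
    so the matrix rank equals the number of distinct argmax indices over queries.
    """
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((n, d_h))
    K = rng.standard_normal((n, d_h))
    scores = Q @ K.T                      # (n, n) attention scores before softmax
    argmax_keys = scores.argmax(axis=1)   # hardmax: each query attends to one key
    return len(np.unique(argmax_keys))

n = 512
print(f"(1 - 1/e) * n ~ {(1 - np.e**-1) * n:.0f}")
for d_h in (1, 2, 4, 8, 16, 32, 64, 128):
    ranks = [hard_attention_rank(n, d_h, seed=s) for s in range(5)]
    print(f"d_h = {d_h:4d}   mean hard-attention rank ~ {np.mean(ranks):.0f}")
```

In this toy simulation the rank grows with $d_h$ for small head dimensions (e.g., it is only about 2 when $d_h = 1$) and then levels off near, though not exactly at, $0.63n$, since the argmax positions are only approximately uniform; the softmax case and the $d_h=\Omega(\log n)$ threshold are treated by the paper's own analysis.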
Primary Area: learning theory
Submission Number: 14520