Keywords: Transformer efficiency, attention mechanism, model capacity analysis, matrix rank
TL;DR: This paper reveals an inherent low-rank limitation in attention matrices of Transformers, showing that increasing head dimensions eventually yields diminishing returns, with theoretical and empirical support for both rank and performance saturation.
Abstract: Transformers have demonstrated superior performance across a variety of real-world applications, most notably driving the unparalleled success of large foundation models. However, the computation and memory demands of these large models, trained on web-scale datasets, continue to grow considerably, calling for more *efficient* learning methods. In this work, we take a step in this direction by exploring the architectural limitations and *redundancy* of Transformers through an investigation of the ranks of attention score matrices. On the one hand, extensive experiments are conducted across various model configurations (model dimensions, numbers of heads and layers, etc.) and data distributions (both synthetic and real-world datasets with varied sequence lengths), uncovering two key properties: as the head dimension $d_h$ increases, the attention rank is eventually upper bounded (limitation) and saturates (redundancy). We call these the *low-rank barrier* and the *model-reduction effect*, respectively. Most importantly, the redundancy manifests in that *both the attention rank and the learning performance gain only marginal improvements as model parameters increase*. On the other hand, we rigorously establish these observations in idealized settings through a fine-grained mathematical analysis, highlighting (i) a consistent theoretical upper bound ($\approx 0.63n$, where $n$ is the sequence length) on the attention rank, regardless of $d_h$, given random weights; and (ii) a critical threshold for rank saturation at $d_h=\Omega(\log n)$. These results contribute to a principled understanding and assessment of the model capacity and efficiency of Transformers, and are further verified in practical settings such as the multi-head *latent* attention (MLA) used in DeepSeek-V3.
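As a rough illustration of the quantities discussed in the abstract (not the paper's experimental setup), the minimal Python sketch below probes the numerical rank of a single-head attention score matrix with random weights as the head dimension $d_h$ varies, and also counts distinct row-wise argmax positions in the hard-attention (one-hot) limit, where a constant near $1-1/e\approx 0.63$ arises if the argmax positions are roughly uniform. All dimensions, tolerances, and the uniform-argmax intuition are assumptions made for illustration only.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact setup):
# probe the numerical rank of A = softmax(Q K^T / sqrt(d_h)) for random
# weights and inputs, sweeping d_h, and compare the hard-attention limit
# against the (1 - 1/e) * n heuristic for roughly uniform row-wise argmaxes.
import numpy as np

def attention_rank(n, d_model, d_h, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d_model)) / np.sqrt(d_model)      # random token embeddings
    W_q = rng.standard_normal((d_model, d_h)) / np.sqrt(d_model)  # random query projection
    W_k = rng.standard_normal((d_model, d_h)) / np.sqrt(d_model)  # random key projection
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d_h)               # (n, n) attention logits
    scores -= scores.max(axis=-1, keepdims=True)                  # numerically stable softmax
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)                            # attention score matrix
    svals = np.linalg.svd(A, compute_uv=False)
    num_rank = int((svals > tol * svals[0]).sum())                # numerical rank of A
    hard_rank = len(np.unique(scores.argmax(axis=-1)))            # rank in the one-hot (hard-max) limit
    return num_rank, hard_rank

if __name__ == "__main__":
    n, d_model = 256, 512
    print(f"(1 - 1/e) * n ~= {(1 - np.exp(-1)) * n:.1f}")
    for d_h in (2, 4, 8, 16, 32, 64, 128):
        num_rank, hard_rank = attention_rank(n, d_model, d_h)
        print(f"d_h={d_h:4d}  numerical rank={num_rank:4d}  hard-max rank={hard_rank:4d}")
```

Note that the reported numerical rank depends on the singular-value tolerance and the input/weight scaling chosen here; this sketch is only meant to convey what "attention rank" refers to, not to reproduce the paper's bounds.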
Primary Area: learning theory
Submission Number: 14520