On the Efficiency of Transformers: The Effect of Attention Rank

21 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Transformers, Attention Matrix Rank, Softmax Temperature, Head Dimension, Model Performance
TL;DR: The study reveals a positive correlation between attention matrix rank and model performance, highlighting the predictive role of the initial rank and the influence of hyperparameters such as softmax temperature and head dimension on it.
Abstract: Transformers have showcased superior performance across a variety of real-world scenarios, marking the advent of a period dominated by large language models. However, with the escalating complexity of these models and the continuous enlargement of training datasets, efficiency-related challenges have become more pronounced. In this study, we investigate the influence of the rank of attention matrices on the training and performance of these models. We first gain insight from benchmark models such as BERT and GPT-2. Our findings underscore that (i) the mean rank of attention matrices is stable throughout training, and the initial rank is a dependable indicator of the final rank; (ii) a distinct positive relationship exists between the attention rank and the effectiveness of the model, where elevated ranks correlate with diminished loss and expedited convergence. These insights reveal a relationship between the initial attention matrix rank and performance. We proceed to investigate the impact of hyperparameters on the initial rank. The study unveils that modifying the softmax temperature or the head dimension can amplify the ranks, with the former exerting a more significant effect. Notably, we theoretically characterize the attention matrix rank at low temperatures, and we demonstrate the existence of an upper bound on the attention matrix rank with respect to the head dimension. These observations are validated through trials on a high-rank target, underscoring instances where traditional setups fall short.
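The abstract's central quantity, the numerical rank of an attention matrix under different softmax temperatures and head dimensions, can be illustrated with a minimal sketch (not the authors' code; the temperature parameterization, tolerance, and random Q/K inputs are assumptions for illustration only):

```python
# Hypothetical sketch: measure the numerical rank of a scaled dot-product
# attention matrix A = softmax(Q K^T / (temperature * sqrt(d_head))) for
# random Q, K, across several temperatures and head dimensions.
# The paper's exact parameterization and rank tolerance may differ.
import numpy as np

def attention_rank(seq_len=128, d_head=16, temperature=1.0, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((seq_len, d_head))
    K = rng.standard_normal((seq_len, d_head))
    scores = Q @ K.T / (temperature * np.sqrt(d_head))
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax
    s = np.linalg.svd(A, compute_uv=False)
    return int((s > tol * s[0]).sum())             # numerical rank of A

for T in (0.1, 1.0, 10.0):
    for d in (4, 16, 64):
        r = attention_rank(d_head=d, temperature=T)
        print(f"temperature={T:4.1f}  d_head={d:3d}  rank={r}")
```

Varying the temperature and head dimension in this sketch gives a rough sense of the dependencies the abstract describes; it does not reproduce the paper's theoretical bound or experimental setup.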
Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3702