A Study of Necessity & Sufficiency of Linear Transformations in the Attention Mechanism

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Transformers, Attention, Self-Attention
TL;DR: We introduce three variants of the attention mechanism, obtained by removing existing linear transformations from, or adding an extra one to, standard attention, and study their performance and speed at varying scales.
Abstract: Scaled Dot Product Attention (SDPA) is the backbone of many modern deep-learning models. It is so versatile that it has been used in natural language, vision, and multi-modal domains with very little change compared to its original formulation. This paper studies the linear transformations used in SDPA. To this end, we introduce three variants of the attention mechanism by removing consecutive linear transformations or adding an extra one. We name these variants Optimized ($W^V$ removed), Efficient ($W^V$ and $W^K$ removed), and Super Attention ($W^V$ and $W^K$ removed and $W^A$ introduced) to simplify comparison when referring to them. In addition to providing the mathematical intuition behind these choices, we evaluate these variants on several datasets of varying size and complexity in vision and text modalities for predictive and generative tasks. The Optimized and Efficient variants have one and two fewer matrix multiplications per head and 25\% and 50\% fewer parameters, respectively, than standard SDPA, yet the resulting change in performance is small relative to the reduction in parameter count. Super Attention introduces a new linear transformation on the values, transforming them from the left. It outperforms standard SDPA in both modalities by up to 10\% while having one fewer matrix multiplication per head and 25\% fewer parameters than standard SDPA. Consequently, it is also faster than standard SDPA.
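For concreteness, the following is a minimal single-head PyTorch sketch of how the four formulations described in the abstract could differ. The exact shape and placement of $W^A$ (here, a learned matrix that left-multiplies the values along the sequence dimension), the fixed context length, the identity initialization, the presence of an output projection, and all identifier names are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of the attention variants described above (single head).
# Assumptions: W^A is (context_len x context_len) and left-multiplies the values;
# an output projection W^O is kept in all variants; names and sizes are hypothetical.
import math
import torch
import torch.nn as nn


class AttentionVariant(nn.Module):
    def __init__(self, d_model: int, context_len: int, variant: str = "standard"):
        super().__init__()
        assert variant in ("standard", "optimized", "efficient", "super")
        self.variant = variant
        self.scale = 1.0 / math.sqrt(d_model)
        self.w_q = nn.Linear(d_model, d_model, bias=False)      # kept in all variants
        self.w_o = nn.Linear(d_model, d_model, bias=False)      # output projection, kept in all variants
        if variant in ("standard", "optimized"):
            self.w_k = nn.Linear(d_model, d_model, bias=False)  # removed in Efficient and Super
        if variant == "standard":
            self.w_v = nn.Linear(d_model, d_model, bias=False)  # removed in all three variants
        if variant == "super":
            # Assumed form of W^A: acts on the values from the left, i.e. along the sequence dimension.
            self.w_a = nn.Parameter(torch.eye(context_len))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); for "super", seq_len must equal context_len.
        q = self.w_q(x)
        k = self.w_k(x) if hasattr(self, "w_k") else x           # no W^K: keys are the raw inputs
        v = self.w_v(x) if hasattr(self, "w_v") else x           # no W^V: values are the raw inputs
        if self.variant == "super":
            v = self.w_a @ v                                     # left-multiplication by W^A
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.w_o(attn @ v)


if __name__ == "__main__":
    layer = AttentionVariant(d_model=64, context_len=128, variant="super")
    out = layer(torch.randn(2, 128, 64))  # -> (2, 128, 64)
```

Under this reading, removing $W^V$ (Optimized) or both $W^V$ and $W^K$ (Efficient) drops one or two of the per-head projections, while Super Attention trades the two dropped projections for the single left-multiplication by $W^A$.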
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5370