Keywords: Transformers, Attention, Self-Attention
TL;DR: We introduce three variants of the attention mechanism, obtained by removing existing linear transformations from standard attention or adding extra ones, and study these variants in terms of performance and speed at varying scales.
Abstract: Scaled Dot Product Attention (SDPA) is the backbone of many modern
deep-learning models. It is so versatile that it has been used in
natural language, vision, and multi-modal domains with very little
change compared to its original formulation. This paper studies the linear transformations used in SDPA. To this end, we introduce three variants of the attention mechanism, obtained by successively removing linear transformations or adding an extra one. We name these variants Optimized ($W^V$ removed),
Efficient ($W^V$ and $W^K$ removed), and Super Attention ($W^V$ and $W^K$ removed and $W^A$ introduced) to simplify references to them. In addition to providing the mathematical intuition behind these choices, we evaluate the variants on several datasets of varying size and complexity in the vision and text modalities, for both predictive and generative tasks. The Optimized and
Efficient variants have one and two matrix multiplications fewer
per head, respectively, and 25\% and 50\% fewer parameters,
respectively, than standard SDPA. However, the change in performance is small
relative to the difference in parameter count. Super Attention introduces a new linear transformation
on the values, transforming them from the left. It outperforms
standard SDPA in both modalities by up to 10\%
while having one fewer matrix multiplication per head and 25\% fewer
parameters than standard SDPA. Consequently, it is also faster than standard SDPA.
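For concreteness, here is a per-head sketch of the four formulations implied by the abstract, assuming the usual SDPA notation (input $X$, head dimension $d$); the exact formulations, including the output projection and multi-head details, are given in the paper:
$$\mathrm{SDPA}(X) = \mathrm{softmax}\!\left(\tfrac{XW^Q (XW^K)^\top}{\sqrt{d}}\right) XW^V, \qquad \mathrm{Optimized}(X) = \mathrm{softmax}\!\left(\tfrac{XW^Q (XW^K)^\top}{\sqrt{d}}\right) X,$$
$$\mathrm{Efficient}(X) = \mathrm{softmax}\!\left(\tfrac{XW^Q X^\top}{\sqrt{d}}\right) X, \qquad \mathrm{Super}(X) = \mathrm{softmax}\!\left(\tfrac{XW^Q X^\top}{\sqrt{d}}\right) W^A X,$$
where $W^A$ multiplies the values from the left, i.e., it acts along the sequence dimension.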
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5370