Efficient distributed representations with linear-time attention scores normalization

TMLR Paper3597 Authors

30 Oct 2024 (modified: 31 Oct 2024) · Under review for TMLR · CC BY 4.0
Abstract: The attention score matrix ${\rm SoftMax}(XY^T)$ encodes relational similarity patterns between objects and is extremely popular in machine learning. However, computing it scales quadratically with the problem size, making it a computationally heavy solution. In this article, we propose a linear-time approximation of the attention score normalization constants for embedding vectors with bounded norms. We show on several pre-trained embeddings that our estimation formula surpasses competing kernel methods in accuracy, often by orders of magnitude. Building on this result, we design a linear-time, task-agnostic embedding algorithm based on the optimization of the attention scores. The proposed algorithm is highly interpretable and easily adapted to an arbitrary embedding problem. We consider a few use cases and observe similar or higher performance and lower computational time with respect to comparable embedding algorithms.
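To make the estimation problem concrete: the abstract does not give the paper's estimator, so the sketch below only illustrates the quantity being approximated (the per-row normalization constants $Z_i = \sum_j \exp(x_i \cdot y_j)$ of ${\rm SoftMax}(XY^T)$) and one standard linear-time kernel baseline of the kind the paper compares against, namely positive random features in the style of Performer. All array names and sizes are illustrative assumptions, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 512, 512, 64
X = rng.normal(size=(n, d)) / np.sqrt(d)   # query embeddings (hypothetical data)
Y = rng.normal(size=(m, d)) / np.sqrt(d)   # key embeddings (hypothetical data)

# Exact normalization constants Z_i = sum_j exp(x_i . y_j): O(n*m*d) cost.
Z_exact = np.exp(X @ Y.T).sum(axis=1)

# Linear-time kernel baseline (positive random features, Performer-style):
# exp(x . y) = E_w[ exp(w.x - |x|^2/2) * exp(w.y - |y|^2/2) ],  w ~ N(0, I_d),
# so Z can be estimated as phi(X) @ sum_j phi(y_j) at O((n + m)*r*d) cost.
r = 256                                     # number of random features
W = rng.normal(size=(r, d))

def phi(U):
    # Positive random-feature map; the 1/sqrt(r) factor makes the
    # inner product an unbiased Monte Carlo estimate of exp(u . v).
    return np.exp(U @ W.T - 0.5 * (U ** 2).sum(axis=1, keepdims=True)) / np.sqrt(r)

Z_approx = phi(X) @ phi(Y).sum(axis=0)

rel_err = np.abs(Z_approx - Z_exact) / Z_exact
print(f"median relative error: {np.median(rel_err):.3e}")
```

The variance of such random-feature estimates grows with the embedding norms, which is consistent with the abstract's restriction to vectors with bounded norms; the paper's own formula is claimed to be substantially more accurate than baselines of this type.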
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Manzil_Zaheer1
Submission Number: 3597