Deciphering Attention Mechanisms: Optimization and Fenchel Dual Solutions

TMLR Paper2847 Authors

11 Jun 2024 (modified: 19 Jun 2024) · Under review for TMLR · CC BY-SA 4.0
Abstract: Attention has been widely adopted in many state-of-the-art deep learning models. While the significant performance improvements it brings have attracted great interest, the theoretical understanding of attention remains limited. This paper presents a new perspective on understanding attention by showing that it can be seen as a solver for a family of estimation problems. Specifically, we study a convex optimization problem that underlies many estimation tasks arising in the design of deep learning architectures. Instead of solving this problem directly, we address its Fenchel dual and derive a closed-form approximation of the optimal solution. This approach yields a generalized attention framework, with the popular dot-product attention used in transformer networks as a special case. We show that the T5 transformer has implicitly adopted the general form of the solution by demonstrating that this expression unifies its word-masking and positional-encoding functions. Finally, we discuss how these new attention structures can be applied in practical model design and argue that the underlying convex optimization problem offers a principled justification for the architectural choices in attention mechanisms.
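For readers unfamiliar with the special case the abstract refers to, the sketch below shows standard scaled dot-product attention with an optional additive term on the logits, which is how word masks and T5-style relative position biases typically enter the computation. This is a minimal illustrative baseline only; the paper's generalized framework and its Fenchel-dual derivation are not reproduced here, and the function name and toy setup are our own.

```python
import numpy as np

def dot_product_attention(Q, K, V, bias=None):
    """Standard scaled dot-product attention (the special case named in the abstract).
    `bias` is an optional additive term on the logits, covering both word masks
    (large negative values at disallowed positions) and additive positional biases
    of the kind used in T5."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)            # (n_q, n_k) similarity scores
    if bias is not None:
        logits = logits + bias               # masks / positional biases enter additively
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                       # each output is a convex combination of values

# Toy usage: 4 positions, 8-dimensional heads, with a causal mask as the bias.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
causal_mask = np.triu(np.full((4, 4), -1e9), k=1)  # forbid attending to future positions
out = dot_product_attention(Q, K, V, bias=causal_mask)
print(out.shape)  # (4, 8)
```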
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=XtL7cM4fQy
Changes Since Last Submission: 1. We have refined the presentation of our framework. Instead of beginning with a superficially unrelated example, we now start with practical examples and show how they can be unified and abstracted in the same way. We then show how this similarity naturally leads to a unified design problem that can be modelled as an optimization task. We hope this update gives readers a clearer understanding of our main messages. 2. We have added further discussion of how our framework aids in designing new attention mechanisms. Additionally, we have included summarized empirical results for T5 and BERT in the main text, with further details provided in the appendix.
Assigned Action Editor: ~Pin-Yu_Chen1
Submission Number: 2847