Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection

Published: 11 Oct 2024, Last Modified: 10 Nov 2024, M3L Poster, License: CC BY 4.0
Keywords: Attention Mechanism, Mirror Descent, Implicit Regularization, Transformers
TL;DR: We characterize the implicit regularization property of the attention mechanism trained with mirror descent
Abstract: Attention mechanisms have revolutionized numerous domains of artificial intelligence, including natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. Building on recent results characterizing the optimization dynamics of gradient descent (GD) and the structural properties of its preferred solutions in attention-based models, this paper explores the convergence properties and implicit bias of a family of mirror descent (MD) algorithms designed for softmax attention mechanisms, with the potential function chosen as the $p$-th power of the $\ell_p$-norm. Specifically, we show the directional convergence of these algorithms to a generalized hard-margin SVM with an $\ell_p$-norm objective when applied to a classification problem using a one-layer softmax attention model. Our theoretical results demonstrate that these algorithms not only converge directionally to the generalized max-margin solutions but also do so at a rate comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which these two components converge to their respective hard-margin SVM solutions.
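To make the algorithm family concrete, below is a minimal NumPy sketch of $\ell_p$-mirror descent on the key-query matrix of a one-layer softmax attention classifier. It assumes a prediction of the form $v^\top X^\top \mathrm{softmax}(XWz)$ with logistic loss; the model form, helper names, hyperparameters, and the numerical gradient are illustrative assumptions, not the paper's implementation. With potential $\psi(W) = \frac{1}{p}\|W\|_p^p$ (entrywise), the mirror map is $\mathrm{sign}(W)\,|W|^{p-1}$ elementwise, its inverse is $\mathrm{sign}(\theta)\,|\theta|^{1/(p-1)}$, and setting $p = 2$ recovers plain GD.

```python
import numpy as np

def softmax(s):
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

def attn_logit(W, X, z, v):
    # One-layer softmax attention (assumed form): v^T X^T softmax(X W z).
    a = softmax(X @ W @ z)      # attention weights over the tokens in X
    return v @ (X.T @ a)        # scalar decision value

def logistic_loss(W, data, v):
    # data: list of (X, z, y) with X in R^{T x d}, query z in R^d, label y in {-1, +1}
    return np.mean([np.log1p(np.exp(-y * attn_logit(W, X, z, v))) for X, z, y in data])

def grad_W(W, data, v, eps=1e-6):
    # Central-difference gradient (illustrative stand-in for autodiff).
    g = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = eps
            g[i, j] = (logistic_loss(W + E, data, v) - logistic_loss(W - E, data, v)) / (2 * eps)
    return g

def lp_mirror_descent(data, v, d, p=3.0, eta=0.5, steps=500):
    """Mirror descent on W with potential psi(W) = (1/p) * ||W||_p^p (entrywise)."""
    W = np.zeros((d, d))
    theta = np.zeros_like(W)                       # dual iterate, theta = grad psi(W)
    for _ in range(steps):
        theta -= eta * grad_W(W, data, v)          # update in the dual (mirror) space
        W = np.sign(theta) * np.abs(theta) ** (1.0 / (p - 1.0))  # map back via (grad psi)^{-1}
    return W

# Tiny synthetic example: two tokens per sequence, the first token carries the label signal.
rng = np.random.default_rng(0)
d = 4
v = rng.standard_normal(d)
data = []
for _ in range(20):
    X = rng.standard_normal((2, d))
    z = rng.standard_normal(d)
    y = 1.0 if v @ X[0] > 0 else -1.0
    data.append((X, z, y))
W = lp_mirror_descent(data, v, d, p=3.0)
print("normalized iterate:\n", W / np.linalg.norm(W.ravel(), ord=3))
```

In a sketch like this, the quantity relevant to the paper's result is the normalized iterate $W_t/\|W_t\|_p$, whose direction is claimed to approach the $\ell_p$ max-margin token-separating solution as training proceeds.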
Is NeurIPS Submission: No
Submission Number: 31