On the Regularity of Attention

TMLR Paper533 Authors

24 Oct 2022 (modified: 08 Apr 2023) Rejected by TMLR
Abstract: Attention is a powerful component of modern neural networks across a wide variety of domains. In this paper, we seek to quantify the regularity (i.e., the smoothness) of the attention operation. To accomplish this goal, we propose a new mathematical framework that uses measure theory and integral operators to model attention. Specifically, we formulate attention as an operator acting on empirical measures over representations of tokens. We show that this framework is consistent with the usual definition, captures the essential properties of attention, and can handle inputs of arbitrary length. We then use it to prove that, on compact domains, the attention operation is Lipschitz continuous with respect to the 1-Wasserstein distance, and we provide an estimate of its Lipschitz constant. Additionally, by focusing on a specific type of attention, we extend these Lipschitz continuity results to non-compact domains. Finally, we discuss the effects regularity can have on NLP models, as well as applications to invertible and infinitely-deep networks.
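As a hedged illustration of the framework sketched in the abstract (the notation below is an assumption for exposition, not the paper's own), self-attention can be viewed as acting on an empirical measure over token representations:

```latex
% Sketch, assuming query/key/value matrices W_Q, W_K, W_V (notation not taken from the paper).
% A length-n input is encoded as the empirical measure
%   \mu = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i},
% and attention maps it to another empirical measure of the same size:
\[
  \mathrm{Att}(\mu) \;=\; \frac{1}{n} \sum_{i=1}^{n} \delta_{f(x_i,\,\mu)},
  \qquad
  f(x,\mu) \;=\;
  \frac{\int \exp\!\big(\langle W_Q x,\, W_K y\rangle\big)\, W_V y \; d\mu(y)}
       {\int \exp\!\big(\langle W_Q x,\, W_K y\rangle\big)\; d\mu(y)} .
\]
% The Lipschitz claim in the abstract then takes the form: for empirical measures
% \mu, \nu supported on a compact domain, and some constant L depending on the weights,
\[
  W_1\big(\mathrm{Att}(\mu),\, \mathrm{Att}(\nu)\big) \;\le\; L \, W_1(\mu,\, \nu).
\]
```

Because the input enters only through the measure \(\mu\), this formulation is indifferent to the number of tokens, which is how the framework accommodates inputs of arbitrary length.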
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
* Corrected typos pointed out by reviewers.
* Changed the title of Section 6 from "Applications of Regularity" to "Discussion" to clarify its role in the paper.
Assigned Action Editor: ~Jeffrey_Pennington1
Submission Number: 533