The Lipschitz Constant of Self-Attention

Hyunjik Kim; George Papamakarios; Andriy Mnih

28 Sept 2020 (modified: 22 Jun 2025) | ICLR 2021 Conference Blind Submission | Readers: Everyone

Keywords: Lipschitz constant, self-attention, theory

Abstract: Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is *not* Lipschitz, and propose an alternative L2 self-attention that *is* Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.
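As a rough illustration of the two variants the abstract contrasts, the NumPy sketch below computes single-head attention weights from dot-product logits and from negative squared L2 distances. The function names, the `1/sqrt(d)` scaling, and the shared query/key projection in the usage lines are assumptions made for illustration, not the authors' exact formulation; the paper gives the precise definition and the Lipschitz bound.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_weights(X, Wq, Wk):
    # Standard dot-product attention: logits are scaled inner products
    # between queries and keys.
    Q, K = X @ Wq, X @ Wk
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(logits, axis=-1)

def l2_weights(X, Wq, Wk):
    # Illustrative L2 variant (assumed form): logits are negative squared
    # Euclidean distances between queries and keys, with the same scaling.
    Q, K = X @ Wq, X @ Wk
    sq_dist = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)
    logits = -sq_dist / np.sqrt(Q.shape[-1])
    return softmax(logits, axis=-1)

def attend(weights, X, Wv):
    # Weighted average of the value vectors, one output row per token.
    return weights @ (X @ Wv)

# Usage: 5 tokens of dimension 4, random projections (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Wq = rng.normal(size=(4, 4))
Wk = rng.normal(size=(4, 4))
Wv = rng.normal(size=(4, 4))

out_dot = attend(dot_product_weights(X, Wq, Wk), X, Wv)
# Sharing the query and key projection (Wq used for both) is an assumption here.
out_l2 = attend(l2_weights(X, Wq, Wq), X, Wv)
```

With a shared query/key projection, each token's squared distance to itself is zero, so its L2 logit to itself is the largest in its row. Dot-product logits have no such built-in ceiling, since they grow with the norm of the inputs.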

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

One-sentence Summary: Theoretical work showing that standard dot-product self-attention is *not* Lipschitz, and proposing an alternative formulation of self-attention based on L2 distance that *is* Lipschitz.

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/the-lipschitz-constant-of-self-attention/code)

Reviewed Version (pdf): https://openreview.net/references/pdf?id=SubXbokOJx
