Globally 1-Lipschitz Attention Without Sequence-Length Dependence

ICLR 2026 Conference Submission 19647 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: self-attention, Lipschitz continuity, implicit layers, convex potential, proximal operator, monotone operator theory, robustness, Transformers, natural language processing, stability, non-expansive mappings
TL;DR: We introduce $\mathrm{LipAttn}$, an attention block that is unconditionally 1-Lipschitz via an implicit proximal update of a convex potential, guaranteeing stability and admitting efficient first-order solvers.
Abstract: Self-attention powers modern deep learning; however, dot-product attention is not globally Lipschitz, which limits stability and robustness. Prior fixes enforce Lipschitz continuity by changing the geometry of attention, which departs from the standard mechanism and still yields bounds that scale poorly with sequence length or spectral norms. We introduce $\mathrm{LipAttn}$, a new attention block that derives coefficients from a convex potential and realizes them through an implicit proximal update, guaranteeing architectural stability. This design ensures the entire block is firmly non-expansive and thus \emph{unconditionally 1-Lipschitz}, independent of sequence length or parameter norms, while remaining structurally close to dot-product attention. Building on monotone operator theory, we establish its contractive properties and develop efficient first-order solvers. Experiments on OpenWebText show that $\mathrm{LipAttn}$ achieves meaningful token mixing and retains learning capacity.
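The following is a minimal sketch of the underlying principle only, not the paper's $\mathrm{LipAttn}$ architecture: the proximal operator of any convex potential is firmly non-expansive, hence 1-Lipschitz regardless of sequence length. Here we assume a simple fixed-weight quadratic potential $f(X) = \tfrac{\lambda}{2}\,\mathrm{tr}(X^\top L X)$ (with $L$ a graph Laplacian built from hypothetical mixing weights `W`), whose prox is a linear solve that smooths tokens; the function names `laplacian` and `prox_attention` are illustrative, not from the paper.

```python
import numpy as np

def laplacian(W):
    """Graph Laplacian of a symmetric nonnegative weight matrix W (n x n)."""
    return np.diag(W.sum(axis=1)) - W

def prox_attention(Z, W, lam=1.0):
    """Proximal step for the convex potential f(X) = (lam/2) * tr(X^T L X):
        X = argmin_X f(X) + 0.5 * ||X - Z||_F^2 = (I + lam * L)^{-1} Z.
    As the prox of a convex function, this map is firmly non-expansive,
    hence 1-Lipschitz in Frobenius norm for any lam >= 0 and any length n.
    NOTE: this is an illustrative stand-in, not the paper's LipAttn update."""
    n = Z.shape[0]
    L = laplacian(W)
    return np.linalg.solve(np.eye(n) + lam * L, Z)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 16, 8                                  # sequence length, embedding dim
    W = rng.random((n, n)); W = 0.5 * (W + W.T)   # hypothetical symmetric mixing weights
    Z1, Z2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    out_gap = np.linalg.norm(prox_attention(Z1, W) - prox_attention(Z2, W))
    in_gap = np.linalg.norm(Z1 - Z2)
    # Numerical check of the 1-Lipschitz property on random inputs.
    print(f"||T(Z1)-T(Z2)|| = {out_gap:.4f} <= ||Z1-Z2|| = {in_gap:.4f}")
```

The non-expansiveness here follows because the eigenvalues of $(I + \lambda L)^{-1}$ lie in $(0,1]$; the paper's contribution is to obtain an analogous guarantee while keeping the block structurally close to dot-product attention, with content-dependent coefficients realized implicitly.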
Primary Area: learning theory
Submission Number: 19647