Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms

TMLR Paper6139 Authors

07 Oct 2025 (modified: 25 Nov 2025) · Under review for TMLR · CC BY 4.0
Abstract: The softmax function is a basic operator in machine learning and optimization, appearing in classification, attention mechanisms, reinforcement learning, game theory, and problems involving log-sum-exp terms. Existing robustness guarantees for learning models and convergence analyses of optimization algorithms typically assume that the softmax operator has a Lipschitz constant of $1$ with respect to the $\ell_2$ norm. In this work, we prove that the softmax function is contractive with Lipschitz constant $1/2$, uniformly across all $\ell_p$ norms with $p \ge 1$. We also show that the local Lipschitz constant of softmax attains $1/2$ for $p = 1$ and $p = \infty$, whereas for $p \in (1,\infty)$ it remains strictly below $1/2$, with the supremum $1/2$ achieved only in the limit. To our knowledge, this is the first comprehensive norm-uniform analysis of softmax Lipschitz continuity. We demonstrate how the sharper constant directly improves a range of existing theoretical results on robustness and convergence. We further validate the sharpness of the $1/2$ Lipschitz constant of the softmax operator through empirical studies on attention-based architectures (ViT, GPT-2, Qwen3-8B) and on stochastic policies in reinforcement learning.
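The abstract's claim can be probed numerically. Below is a minimal sketch (not part of the submission, and not the authors' code): it treats the local Lipschitz constant of softmax with respect to the $\ell_p$ norm as the induced $p$-norm of the Jacobian $J(x) = \mathrm{diag}(s) - s s^\top$ with $s = \mathrm{softmax}(x)$, and checks that this norm stays at or below $1/2$ for $p \in \{1, 2, \infty\}$, approaching $1/2$ when two logits tie and dominate. The helper names (`softmax_jacobian`, `induced_norm`) are illustrative, not from the paper.

```python
# Minimal numerical sketch (assumption: local Lipschitz constant w.r.t. the
# l_p norm equals the induced p-norm of the softmax Jacobian).
import numpy as np

def softmax(x):
    z = x - x.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(x):
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)   # J(x) = diag(s) - s s^T

def induced_norm(J, p):
    if p == 1:
        return np.abs(J).sum(axis=0).max()   # max absolute column sum
    if p == np.inf:
        return np.abs(J).sum(axis=1).max()   # max absolute row sum
    return np.linalg.norm(J, 2)              # spectral norm (p = 2)

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.normal(scale=5.0, size=8)
    J = softmax_jacobian(x)
    assert all(induced_norm(J, p) <= 0.5 + 1e-12 for p in (1, 2, np.inf))

# Near-extremal point: two logits tie and dominate, so s is close to (1/2, 1/2, 0, ...).
x_star = np.array([10.0, 10.0, -10.0, -10.0])
J_star = softmax_jacobian(x_star)
print({p: round(induced_norm(J_star, p), 4) for p in (1, 2, np.inf)})
# All three induced norms come out at ~0.5, consistent with the tightness claim.
```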
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: **Summary of changes**
1. We corrected the numbering and cross-references of theorems and examples to fix the labelling inconsistencies noted by the reviewers.
2. We rewrote Section 4.2 to remove the ambiguity highlighted by one of the reviewers and to make the statements and notation in the corresponding Theorem 2 clearer.
3. **Additional related work.** In the initial submission, a remark cited *Yudin et al. (2025)*, which establishes that the Lipschitz constant of softmax with respect to the $\ell_2$ norm is upper bounded by $1/2$. In the revision, we additionally cite two independent derivations of the same $\ell_2$ result and update the corresponding remark accordingly:
   a) Wael Alghamdi, Hsiang Hsu, Haewon Jeong, Hao Wang, Peter Michalak, Shahab Asoodeh, and Flavio Calmon. Beyond Adult and COMPAS: Fair multi-class prediction via information projection. *Proc. Advances in Neural Information Processing Systems*, 35:38747--38760, 2022.
   b) Laker Newhouse. Softmax is $\tfrac{1}{2}$-Lipschitz (in a norm that may not matter). https://www.lakernewhouse.com/assets/writing/softmax-is-0-5-lipschitz.pdf, 2025. Unpublished note.
   We also clarify how our contribution differs: we prove a tight uniform $1/2$ Lipschitz constant across all $\ell_p$ norms ($1 \le p \le \infty$) and provide a detailed attainability analysis on the probability simplex across all $p$.
4. We revised the Broader Impact section to include a more elaborate discussion of the impacts of our work.
5. We corrected several minor but important typographical and notational issues in the manuscript, and we thank the reviewers for bringing them to our attention.
Assigned Action Editor: ~Murat_A_Erdogdu1
Submission Number: 6139