Polynomial Alternatives to Softmax in Transformers

ICLR 2026 Conference Submission 17919 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: polynomial activations for transformers, theory for transformers
Abstract: Transformers have rapidly become the backbone of modern machine learning, with attention mechanisms, most often implemented with a softmax activation, at their core. The softmax function is typically motivated by its ability to produce a row-wise probability distribution over the attention matrix, yielding sparse patterns that align with the intuition of attending to different input tokens. In this paper, we uncover an additional and previously overlooked role of softmax: it implicitly regularizes the Frobenius norm of the attention matrix, which contributes to stabilizing training. This observation prompts a fundamental question: are the inductive biases imposed by softmax: positivity, normalization, and sparsity, truly necessary for effective transformer training? To answer this, we explore alternative activations, focusing on polynomial functions that preserve the regularization effect while introducing fundamentally different inductive biases. Through theoretical analysis, we show that specific polynomial activations can serve as viable substitutes for softmax, supporting stable training and strong performance despite abandoning its conventional properties. Extensive experiments across a range of transformer architectures and applications validate our findings, providing new insights into the design of attention mechanisms.
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 17919