Abstract: The softmax function is crucial in Transformer attention, normalizing each row of the attention scores so that it sums to one. Tokens with larger attention scores are usually more important for the final prediction.
However, the softmax function can suffer from vanishing gradients for exactly these important tokens (e.g., those with probabilities close to one), making them difficult to optimize and potentially limiting model performance.
In this paper, we propose Self-Adjust Softmax (SA-Softmax) to address this issue by modifying $\mathrm{softmax}(z)$ to $z \cdot \mathrm{softmax}(z)$, along with its normalized variant $\frac{z - \min(z_{\min}, 0)}{\max(z_{\max}, 0) - \min(z_{\min}, 0)} \cdot \mathrm{softmax}(z)$.
We theoretically show that SA-Softmax provides enhanced gradient properties compared to the vanilla softmax function.
Moreover, SA-Softmax can be seamlessly integrated into the attention mechanisms of existing Transformer models with only minor adjustments.
We conducted experiments to evaluate the empirical performance of Transformer models using SA-Softmax compared to the vanilla softmax function.
These experiments, involving models with up to 2.7 billion parameters, span diverse datasets, language tasks, and positional encoding methods.
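For concreteness, the following is a minimal PyTorch sketch of the two SA-Softmax variants described in the abstract and their drop-in use in scaled dot-product attention. The function names (`sa_softmax`, `sa_softmax_attention`), the epsilon term, and the attention wrapper are illustrative assumptions, not the authors' released implementation.

```python
import torch


def sa_softmax(z: torch.Tensor, dim: int = -1, normalized: bool = True) -> torch.Tensor:
    """Self-Adjust Softmax: scale softmax(z) by z or by a normalized variant of z.

    Hypothetical reference implementation of the formulas in the abstract.
    """
    probs = torch.softmax(z, dim=dim)
    if not normalized:
        # Plain variant: z * softmax(z)
        return z * probs
    # Normalized variant:
    # (z - min(z_min, 0)) / (max(z_max, 0) - min(z_min, 0)) * softmax(z)
    z_min = z.amin(dim=dim, keepdim=True).clamp(max=0.0)  # min(z_min, 0)
    z_max = z.amax(dim=dim, keepdim=True).clamp(min=0.0)  # max(z_max, 0)
    scale = (z - z_min) / (z_max - z_min + 1e-12)         # eps avoids division by zero
    return scale * probs


def sa_softmax_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                         normalized: bool = True) -> torch.Tensor:
    """Scaled dot-product attention with softmax replaced by SA-Softmax."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    weights = sa_softmax(scores, dim=-1, normalized=normalized)
    return weights @ v
```

In this sketch the only change to standard attention is the normalization function applied to the score matrix, which is what allows the method to be dropped into existing Transformer implementations with minor adjustments.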
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Attention Mechanism, Transformer, Softmax
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 392