Normalized Attention Without Probability Cage

Published: 28 Jan 2022, Last Modified: 22 Oct 2023 · ICLR 2022 Submitted
Keywords: Attention, Transformers, Neural Architecture, Aggregators
Abstract: Despite the popularity of attention-based architectures like Transformers, the geometrical implications of softmax attention remain largely unexplored. In this work, we highlight the limitations of constraining attention weights to the probability simplex and, consequently, the outputs to the convex hull of the value vectors. We show that Transformers are biased towards local information at initialization and are sensitive to hyperparameters, contrast attention with max- and sum-pooling, and examine the performance implications of different architectures with respect to biases in the data. Finally, we propose to replace the softmax in self-attention with normalization, resulting in a generally applicable architecture that is robust to hyperparameters and biases in the data. We support our insights with empirical results from more than 30,000 trained models. Implementations are in the supplementary material.
One-sentence Summary: The softmax in attention limits the expressiveness of the Transformer architecture; replacing it with normalization yields increased robustness to hyperparameters.
Supplementary Material: zip
Community Implementations: [3 code implementations](https://www.catalyzex.com/paper/arxiv:2005.09561/code)
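
The abstract proposes replacing the softmax in self-attention with normalization. Below is a minimal sketch contrasting standard softmax attention with one possible normalized variant; the choice of L2-normalizing the score rows, the function names, and the toy shapes are illustrative assumptions, not the paper's exact formulation (the authors' implementation is in the supplementary material).

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention: softmax confines each row of
    # weights to the probability simplex, so every output is a convex
    # combination of the value vectors.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def normalized_attention(q, k, v, eps=1e-6):
    # Hypothetical "normalization instead of softmax" variant: each score
    # row is rescaled by its L2 norm (the norm choice is an assumption), so
    # weights may be negative and need not sum to 1, letting outputs leave
    # the convex hull of the value vectors.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    weights = scores / (scores.norm(dim=-1, keepdim=True) + eps)
    return weights @ v

# Toy usage on random tensors of shape (batch, tokens, dim).
q = torch.randn(2, 5, 16)
k = torch.randn(2, 5, 16)
v = torch.randn(2, 5, 16)
print(softmax_attention(q, k, v).shape)     # torch.Size([2, 5, 16])
print(normalized_attention(q, k, v).shape)  # torch.Size([2, 5, 16])
```

Both variants are drop-in replacements at the attention-weight step; only the mapping from scores to weights differs.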