Doubly Normalized Attention

Nan Ding; Xinjie Fan; Zhenzhong Lan; Dale Schuurmans; Radu Soricut

25 Sept 2019 (modified: 05 May 2023); ICLR 2020 Conference Withdrawn Submission; Readers: Everyone

Abstract: Models based on the Transformer architecture have achieved better accuracy than models based on competing architectures. A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances. In this paper, we provide two alternative views of the attention mechanism: one from the probabilistic view via the Gaussian mixture model, the other from the optimization view via optimal transport. Following these insights, we propose a new attention scheme that requires normalization on both the upper and lower layers, called the doubly-normalized attention scheme. We analyze the properties of both the original and the new attention schemes, and find that the doubly-normalized attention mechanism directly mitigates two unwanted effects: it resolves the explaining-away effect and alleviates mode collapse. We conduct empirical studies that quantify numerical advantages for the doubly-normalized attention model, as well as for a hybrid model that dynamically combines both attention schemes to achieve improved performance on several well-known benchmarks.
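The abstract does not state the formula, so the following is only a minimal sketch of one plausible reading of "normalization on both the upper and lower layers". It assumes that a score matrix `scores[i][j]` relates query `i` (upper layer) to key `j` (lower layer), that the first normalization is a softmax over queries for each key, and that the second renormalizes over keys for each query. The function names are hypothetical and are not taken from the paper.

```python
import math


def softmax_rows(scores):
    """Softmax over each row of a list-of-lists matrix (numerically stabilized)."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def standard_attention(scores):
    """Usual attention weights: for each query i, softmax over keys j."""
    return softmax_rows(scores)


def doubly_normalized_attention(scores):
    """Assumed two-step scheme (illustrative, not confirmed by the abstract)."""
    # Step 1: for each key j, normalize over queries i (a column-wise softmax),
    # so every key distributes a total weight of 1 across the queries.
    col_norm = transpose(softmax_rows(transpose(scores)))
    # Step 2: for each query i, renormalize over keys j so each row sums to 1.
    return [[w / sum(row) for w in row] for row in col_norm]


# Toy case: key 1 scores low against every query.
scores = [[4.0, 0.0], [4.0, 0.0]]
standard = standard_attention(scores)
doubly = doubly_normalized_attention(scores)
```

In this toy case the standard weights almost ignore key 1 for every query, while the assumed two-step scheme still assigns it a non-negligible weight. That is the kind of behaviour the abstract refers to as resolving the explaining-away effect, under the assumptions stated above.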

