Keywords: Transformer, attention, MLP, NLP
TL;DR: This paper proposes computing the Transformer's attention weights with a multi-layer perceptron instead of the query-key dot product, which improves performance on an NLP task.
Abstract: The Transformer architecture has revolutionized natural language processing (NLP) and has achieved state-of-the-art results on a variety of tasks. The attention mechanism is one of its key components, allowing the model to focus on relevant parts of the input. In the standard Transformer, attention weights are computed from the dot product of query and key vectors followed by a softmax function. In this paper, we propose replacing the query-key dot product with a multi-layer perceptron (MLP) that computes attention weights directly from the embeddings. The proposed modification is simple and can be easily incorporated into existing Transformer-based models; we demonstrate the resulting performance improvement on an NLP task. We provide the implementation code at https://github.com/AlirezaMorsali/MLP-Attention for reproducibility and ease of adoption.
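To make the proposed change concrete, the following minimal PyTorch sketch illustrates one possible reading of the idea: an MLP scores each (query, key) pair of token embeddings in place of the dot product, and the resulting logits are passed through a softmax as usual. The class name `MLPAttention`, the hidden size, and the single-head formulation are illustrative assumptions, not the authors' implementation; see the linked repository for the official code.

```python
# Illustrative sketch only -- not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPAttention(nn.Module):
    """Single-head attention where the query/key dot product is replaced by an
    MLP that scores each (query position, key position) pair directly from the
    token embeddings."""

    def __init__(self, d_model: int, d_hidden: int = 128):
        super().__init__()
        # The MLP takes the concatenated embeddings of a query token and a key
        # token and outputs a scalar attention logit.
        self.score_mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )
        self.value_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, T, D = x.shape
        # Build all (query, key) embedding pairs: (B, T, T, 2D).
        q = x.unsqueeze(2).expand(B, T, T, D)  # query embedding repeated over key positions
        k = x.unsqueeze(1).expand(B, T, T, D)  # key embedding repeated over query positions
        logits = self.score_mlp(torch.cat([q, k], dim=-1)).squeeze(-1)  # (B, T, T)
        weights = F.softmax(logits, dim=-1)                             # attention weights
        return weights @ self.value_proj(x)                             # (B, T, D)

# Example usage with random inputs.
attn = MLPAttention(d_model=64)
out = attn(torch.randn(2, 10, 64))  # -> shape (2, 10, 64)
```

Note that scoring every query-key pair with an MLP is quadratic in sequence length, like standard attention, but with a larger constant factor; the sketch above trades the dot product's efficiency for a learned pairwise scoring function, which is the trade-off the paper's proposal rests on.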