Keywords: Transformer; orthogonal projection; BeliefFormer
TL;DR: Incorporating orthogonal projections as residual signals into the attention layer of the Transformer to improve generalization performance
Abstract: In this paper, we consider modifying the attention layer in the Transformer to improve its generalization performance. Conceptually speaking, the standard attention layer takes the softmax-based weighted summation of the V vectors as the residual signal (with a linear mapping for dimensionality alignment) when performing the skip-connection operation. Inspired by distribution optimization, we propose to first compute an orthogonal projection of the softmax-based weighted summation of the V vectors with respect to the original V vectors, and then use this projection as the residual signal (with a linear mapping for dimensionality alignment) when performing the skip-connection operation. By doing so, the token vectors are modified relatively more along their tangential directions than in their magnitudes. Intuitively speaking, the orthogonal projection reflects a belief about the discrepancy between the weighted summation of the V vectors and the V vectors themselves. We refer to the modified layer and the overall architecture as belief-attention and BeliefFormer, respectively. To further improve performance, we also design a variant of belief-attention that incorporates two types of orthogonal projections, referred to as belief-attention$^{\ast}$. Extensive experiments show that the two new variants of the attention layer lead to better performance than standard attention for image classification on ImageNet and for natural language processing when training nano-GPT2.
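The abstract does not spell out the exact form of the projection, so the following is only a minimal single-head sketch of the idea as described: it assumes the residual signal is obtained by removing, for each token, the component of the attention output that lies along that token's own V vector, keeping the part orthogonal (tangential) to it. The module name `BeliefAttentionSketch` and all hyperparameters are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BeliefAttentionSketch(nn.Module):
    """Hypothetical single-head sketch of belief-attention.

    Assumption: the residual signal is the component of the standard
    softmax-weighted summation of V vectors that is orthogonal to each
    token's own V vector.
    """

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)  # linear mapping for dimensionality alignment
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        a = attn @ v  # standard softmax-based weighted summation of V vectors

        # Orthogonal projection of a_i with respect to v_i:
        # subtract the component of a_i parallel to v_i, keeping the tangential part.
        coeff = (a * v).sum(-1, keepdim=True) / (v * v).sum(-1, keepdim=True).clamp_min(self.eps)
        a_perp = a - coeff * v

        # Skip connection uses the orthogonal projection as the residual signal.
        return x + self.out(a_perp)
```

Under this reading, the update to each token is (approximately) orthogonal to its V vector, which is why the abstract describes the tokens as being modified more along their tangential directions than in their magnitudes.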
Primary Area: foundation or frontier models, including LLMs
Submission Number: 25156