Keywords: self-attention, attention, deep learning, large language models, neural networks, transformers
TL;DR: We evaluate K and KV transformers, obtained by simplifying the QKV formulation, and show on-par performance with a drop in model size.
Abstract: Transformers have become the standard solution for various AI tasks. The widely adopted query, key, and value (QKV) formulation has played a significant role in this. Although the performance of transformer models has been widely studied, the individual contributions of these three components, and the precise impact on performance when some are omitted, are still not fully understood. Consequently, we evaluated two transformer variants: one with two projections to construct the K and V vectors, and another with only a single projection. Both resulted in symmetric self-attention maps. Additionally, we explored an asymmetric attention mechanism by incorporating a 2D positional encoding into the attention matrix. Notably, these modified transformers exhibited reduced parameter counts and computational demands compared to the standard architecture. Through experiments encompassing three task types (synthetic tasks such as reversing or sorting a list; vision tasks, namely MNIST, CIFAR, and Tiny ImageNet classification; and NLP tasks, namely character generation and translation), we found that our transformers perform on par with, or occasionally better than, the QKV transformer on vision tasks but underperform slightly on NLP tasks. Our findings suggest that three distinct self-attention representations are not universally required; whether they are needed depends on the specific task.
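A minimal sketch, assuming a PyTorch implementation, of the single-projection ("K-only") variant described in the abstract. The class name, the scaled dot-product softmax, and the output projection are assumptions for illustration, not the authors' code; only the idea of replacing the separate Q, K, and V projections with one shared projection comes from the abstract.

```python
# Sketch of a single-projection self-attention layer (assumed details, not the paper's code).
import math
import torch
import torch.nn as nn


class SymmetricSelfAttention(nn.Module):
    """Self-attention with a single projection: K is reused in place of Q and V.

    The raw score matrix K @ K^T is symmetric, and the layer uses roughly one
    third of the projection parameters of standard QKV attention.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)  # the only input projection
        self.out = nn.Linear(d_model, d_model)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k = self.proj(x)                                # (batch, seq, d_model)
        scores = k @ k.transpose(-2, -1) * self.scale   # symmetric (seq, seq) scores
        attn = scores.softmax(dim=-1)                   # row-wise attention weights
        return self.out(attn @ k)                       # K also plays the role of V


# Usage: drop-in replacement for a standard self-attention block.
x = torch.randn(2, 16, 64)            # (batch, seq, d_model)
y = SymmetricSelfAttention(64)(x)
print(y.shape)                        # torch.Size([2, 16, 64])
```

The two-projection (KV) variant described in the abstract would instead keep a separate value projection and compute the output as `attn @ v`, still sharing K for both sides of the score matrix.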
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20380