Keywords: self-attention, attention, deep learning, large language models, neural networks, transformers
TL;DR: We evaluate K and KV transformers, obtained by simplifying the QKV formulation, and show on-par performance with a drop in model size.
Abstract: Transformers have become the standard solution for various AI tasks. The widely adopted query, key, and value (QKV) formulation has played a significant role in this. Although the performance of transformer models has been widely studied, the individual contributions of these three components, and the precise impact on performance when some are omitted, are still not fully understood. Consequently, we evaluated two transformer variants: one with two projections to construct the K and V vectors, and another with only a single projection. Both resulted in symmetric self-attention maps. Additionally, we explored an asymmetric attention mechanism by incorporating a 2D positional encoding into the attention matrix. Notably, these modified transformers exhibited reduced parameter counts and computational demands compared to the standard architecture. Through experiments encompassing three task types (synthetic tasks such as reversing or sorting a list; vision tasks, namely MNIST, CIFAR, and Tiny ImageNet classification; and NLP tasks, namely character generation and translation), we found that our transformers perform on par with, or occasionally better than, the QKV transformer on vision tasks but underperform slightly on NLP tasks. Our findings suggest that three distinct self-attention representations are not universally required; whether they are needed depends on the specific task.
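A minimal sketch, assuming a PyTorch implementation, of the single-projection ("K-only") variant described in the abstract. The class name, the scaled dot-product softmax, and the output projection are assumptions for illustration, not the authors' code; only the idea of replacing the separate Q, K, and V projections with one shared projection comes from the abstract.

```python
# Sketch of a single-projection self-attention layer (assumed details, not the paper's code).
import math
import torch
import torch.nn as nn


class SymmetricSelfAttention(nn.Module):
    """Self-attention with a single projection: K is reused in place of Q and V.

    The raw score matrix K @ K^T is symmetric, and the layer uses roughly one
    third of the projection parameters of standard QKV attention.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)  # the only input projection
        self.out = nn.Linear(d_model, d_model)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k = self.proj(x)                                # (batch, seq, d_model)
        scores = k @ k.transpose(-2, -1) * self.scale   # symmetric (seq, seq) scores
        attn = scores.softmax(dim=-1)                   # row-wise attention weights
        return self.out(attn @ k)                       # K also plays the role of V


# Usage: drop-in replacement for a standard self-attention block.
x = torch.randn(2, 16, 64)            # (batch, seq, d_model)
y = SymmetricSelfAttention(64)(x)
print(y.shape)                        # torch.Size([2, 16, 64])
```

The two-projection (KV) variant described in the abstract would instead keep a separate value projection and compute the output as `attn @ v`, still sharing K for both sides of the score matrix.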
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20380