$W_K, W_V$ is Probably All You Need: On the Necessity of the Query, Key, and Value Weight Triplet in Self-Attention Transformers
Keywords: Transformer Optimization, Parameter Efficiency, Attention Redundancy, Structural Invariance, Training Stability, Implicit Regularization, Architecture Simplification, Multi-head Attention, ReLU MLP, Skip connections
TL;DR: We propose a theory-motivated structural improvement to a family of transformer models that reduces attention parameters and improves training stability
Abstract: We theoretically investigate whether the Query, Key, Value weight triplet can be reduced in encoder-only and decoder-only transformers. Under mild assumptions, we prove that one of the Query, Key, or Value weight matrices is redundant and can be replaced with the identity matrix, reducing attention parameters by 25\%. If applied to the Query or Key weights, this also simplifies optimization: attention logits become linear rather than quadratic in the learned weights. Validating the Query weight removal on decoder-only GPT-style small models trained from scratch, we find that with adjusted attention scaling and weight decay, reduced models match baseline performance despite having fewer parameters. Training remains stable at over $3\times$ lower weight decay, suggesting that Query weight elimination provides implicit regularization. Our analysis has also led us to a structural expressivity boundary: in the mathematically tractable ReLU setting, skip connections push MLPs into a generically disjoint function class at fixed width. These findings motivate investigation across modalities and at scale, where the observed stability and efficiency gains may prove most consequential.
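The Query weight removal described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: a single-head self-attention where $W_Q$ is fixed to the identity, so the logits $X (X W_K)^\top$ are linear in the sole learned weight $W_K$. The scaling factor is the standard $1/\sqrt{d}$; the abstract notes it may need adjustment in the reduced model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_no_query(X, W_K, W_V, scale=None):
    """Single-head self-attention with the Query weight fixed to identity.

    Logits are X (X W_K)^T: linear in the learned weight W_K, rather than
    quadratic in (W_Q, W_K) as in standard scaled dot-product attention.
    Illustrative sketch only; scale may need re-tuning per the abstract.
    """
    d = X.shape[-1]
    if scale is None:
        scale = 1.0 / np.sqrt(d)
    K = X @ W_K                 # keys are still learned
    V = X @ W_V                 # values are still learned
    logits = (X @ K.T) * scale  # queries are just X itself (W_Q = I)
    return softmax(logits) @ V

# Toy usage with random inputs and weights
rng = np.random.default_rng(0)
T, d = 4, 8
X = rng.standard_normal((T, d))
W_K = rng.standard_normal((d, d)) / np.sqrt(d)
W_V = rng.standard_normal((d, d)) / np.sqrt(d)
out = attention_no_query(X, W_K, W_V)
```

This removes one of the three $d \times d$ projections per head, matching the stated 25\% reduction in attention parameters (counting the output projection as the fourth matrix).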
Submission Number: 10