WK, WV is (Linearly) All You Need: On the Necessity of the QKV Weight Triplet in Self-Attention Transformers

Published: 24 May 2026, Last Modified: 28 May 2026, ICML 2026 Workshop WSS Poster, CC BY 4.0
Keywords: Transformers, Self-Attention, Attention, Symmetries, Deep Learning, MLP
TL;DR: Multi-head attention (MHA) is $GL(d)$-invariant, so any one of the linear maps $W_Q$, $W_K$, $W_V$ is redundant (reducible to $I_d$). Breaking this symmetry with a nonlinear residual query at equal parameter cost empirically outperforms both the standard and the wider-MLP baselines.
Abstract: Multi-head attention is invariant under the joint $GL(d)$ action $(X, W_Q, W_K, W_V) \mapsto (X\Theta,\Theta^{-1}W_Q,\Theta^{-1}W_K,\Theta^{-1}W_V)$, with two consequences for the QKV triplet. First, any one of $W_Q, W_K, W_V$ can be fixed to $I_d$ without loss of expressivity if $X W$ is precomputed; under mild structural conditions the precomputation folds into the preceding MLP (or, in the first layer, the embedding) at no parameter cost, removing $25\%$ of attention parameters per layer. We prove the multi-layer reduction and analyse the normalisation obstructions. Second, every linear $W_Q$ already lies on the orbit of $I_d$, so a learned linear query is redundant: expressive gains in the QKV pathway require at least one of the three to be nonlinear, a branch we realise with the residual query $Q(X)=\frac{1}{2}(X + f_\theta(X))$ at parity of parameters. This research also led us to examine residual skip connections: MLPs with and without a skip form generically disjoint function classes for modern activations. We validate both halves on GPT-style models trained from scratch under batch-matched comparisons: the reduced $117$M model matches the dense $124$M baseline; reallocating the saved parameters to the feed-forward sublayer strictly improves on it; and the residual nonlinear query outperforms a baseline with a wider feed-forward sublayer carrying $12.5\%$ more non-embedding parameters.
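The invariance stated in the abstract can be checked numerically. The following is a minimal NumPy sketch, not the authors' code. It assumes a single head, square invertible $d \times d$ weights, no output projection and no normalisation layer; the function name `attention` and all variable names are illustrative. It verifies that the joint action with a random $\Theta$ leaves the output unchanged, and that choosing $\Theta = W_V$ fixes the value map to $I_d$ once $X W_V$ is precomputed.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    # Single-head scaled dot-product attention (assumed simplification:
    # no masking, no output projection, no normalisation).
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 5, 8  # sequence length and model width, chosen arbitrarily
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

# A Gaussian matrix is invertible with probability one.
Theta = rng.normal(size=(d, d))
Theta_inv = np.linalg.inv(Theta)

base = attention(X, W_Q, W_K, W_V)

# Joint GL(d) action: X -> X Theta, W -> Theta^{-1} W.
transformed = attention(X @ Theta, Theta_inv @ W_Q, Theta_inv @ W_K, Theta_inv @ W_V)
invariance_gap = np.abs(base - transformed).max()

# Reduction: take Theta = W_V, so the value map becomes the identity
# and X W_V plays the role of the precomputed input.
W_V_inv = np.linalg.inv(W_V)
reduced = attention(X @ W_V, W_V_inv @ W_Q, W_V_inv @ W_K, np.eye(d))
reduction_gap = np.abs(base - reduced).max()

print(f"invariance gap: {invariance_gap:.2e}")
print(f"reduction gap:  {reduction_gap:.2e}")
```

Both gaps should sit at floating-point round-off level. The sketch does not model the folding of $X W_V$ into the preceding MLP or the normalisation obstructions that the abstract mentions.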
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 9