Controlling changes to attention logits

ICLR 2026 Workshop Sci4DL Submission38 Authors

03 Feb 2026 (modified: 02 Mar 2026), Submitted to Sci4DL 2026, CC BY 4.0
Keywords: MLA, stability, attention, optimization
Abstract: Stability of neural network weights is critical when training transformer models. The query and key weights are particularly problematic, as they tend to grow large without intervention. Applying normalization to queries and keys, known as "QK norm", fixes stability issues in practice, but is not always applicable. For example, QK norm is not compatible with Multi-head Latent Attention (MLA), because QK norm requires full materialization of queries and keys during inference, which MLA does not do. In this paper, we hypothesize that instability is driven primarily by changes in attention logits, rather than by their absolute magnitude, and that controlling these changes is sufficient for stability. We show that these changes can be controlled by assigning parameter-dependent learning rates to the query and key weights. This cheap intervention allows us to increase the base learning rate of the network, outperform other methods in the MLA setting, and achieve performance competitive with QK norm when using Multi-head Attention.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Challenge: This submission is an entry to the science of DL improvement challenge.
Submission Number: 38
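
Illustrative sketch (not the authors' method): the abstract states that changes in attention logits can be controlled by parameter-dependent learning rates on the query and key weights, but it does not give the rule. The snippet below assumes one plausible rule purely for illustration: each matrix's learning rate is divided by the spectral norm of its partner matrix. The function names `scaled_learning_rates` and `logit_change`, the `eps` parameter, and the scaling rule itself are all hypothetical. The motivation is the first-order bound |x_i ΔW_q W_k^T x_j^T| <= ||x_i|| ||ΔW_q||_2 ||W_k||_2 ||x_j||, so an update to W_q moves the logits in proportion to the size of W_k.

```python
import numpy as np


def spectral_norm(w):
    """Largest singular value of a 2-D weight matrix."""
    return float(np.linalg.norm(w, ord=2))


def scaled_learning_rates(w_q, w_k, base_lr, eps=1e-8):
    """Hypothetical rule (an assumption, not taken from the paper):
    divide each matrix's learning rate by the spectral norm of its
    partner, so the logit change caused by a step does not grow with
    the partner's scale."""
    lr_q = base_lr / (spectral_norm(w_k) + eps)
    lr_k = base_lr / (spectral_norm(w_q) + eps)
    return lr_q, lr_k


def logit_change(x, w_q, w_k, d_q, d_k):
    """Exact change in the unscaled attention logits (x W_q)(x W_k)^T
    after applying the updates d_q and d_k to the two matrices."""
    before = (x @ w_q) @ (x @ w_k).T
    after = (x @ (w_q + d_q)) @ (x @ (w_k + d_k)).T
    return after - before


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))       # 4 tokens, model width 8
    w_q = rng.normal(size=(8, 6))     # query projection
    g_q = rng.normal(size=(8, 6))     # stand-in for a query gradient
    for scale in (1.0, 10.0, 100.0):
        w_k = scale * rng.normal(size=(8, 6))
        lr_q, _ = scaled_learning_rates(w_q, w_k, base_lr=0.1)
        delta = logit_change(x, w_q, w_k, -lr_q * g_q, np.zeros_like(w_k))
        print(scale, lr_q, np.abs(delta).max())
```

Under this assumed rule, a step on the query weights with a unit-spectral-norm gradient changes any logit by at most `base_lr` times the product of the two token norms, regardless of how large the key weights have grown. With a fixed learning rate, the same bound grows linearly with the key weights' spectral norm.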