Whitened Self-Attention

ACL ARR 2025 May Submission 1074 Authors

16 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: Self-attention in Transformer-based generative models such as GPT implicitly assumes that context tokens are independent and identically distributed. However, this contradicts the very premise of attention: that the meaning of words is influenced by their complex interdependencies. We propose whitened self-attention, a filter that optimally accounts for inter-token correlations, and show that it enhances representation learning for autoregressive language modeling. Experiments on a small GPT architecture demonstrate an 11\% improvement in perplexity, equivalent performance in 13x fewer iterations, and, after optimizations, up to 42\% lower training time. This work advances self-attention for generative NLP tasks with a theoretically grounded method for handling token dependencies, and the approach shows promise for improving generalization in large-scale NLP models.
Paper Type: Short
Research Area: Machine Learning for NLP
Research Area Keywords: Representation Learning, Generative Models, Optimization Methods, Transfer Learning / Domain Adaptation
Contribution Types: NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 1074
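
The abstract above does not spell out how the whitening filter is computed. The following is a minimal, hypothetical PyTorch sketch of one plausible reading, in which an inter-token covariance estimated from the keys is inverted to its symmetric square root and used to decorrelate the context tokens before the attention dot product. The function name whitened_self_attention, the covariance estimate, and the eps regularizer are illustrative assumptions, not the authors' formulation.

import torch
import torch.nn.functional as F


def whitened_self_attention(q, k, v, eps=1e-5):
    """Scaled dot-product attention with a whitening step over context tokens.

    Hypothetical sketch: assumes "whitening" means decorrelating the context
    (key) tokens with the inverse symmetric square root of an estimated
    token-by-token covariance before the dot product with the queries.

    q, k, v: tensors of shape (batch, heads, seq_len, d_head).
    """
    *_, T, d = k.shape

    # Estimate an inter-token covariance from the centered keys.
    kc = k - k.mean(dim=-2, keepdim=True)                        # (B, H, T, d)
    cov = kc @ kc.transpose(-2, -1) / d                          # (B, H, T, T)
    cov = cov + eps * torch.eye(T, device=k.device, dtype=k.dtype)

    # Inverse symmetric square root C^{-1/2} via eigendecomposition.
    evals, evecs = torch.linalg.eigh(cov)
    inv_sqrt = evecs @ torch.diag_embed(evals.clamp_min(eps).rsqrt()) @ evecs.transpose(-2, -1)

    # Whiten the keys across the token dimension, then attend as usual.
    # NOTE: a real autoregressive model would need the whitening itself to be
    # causal; this sketch ignores that for brevity.
    k_white = inv_sqrt @ k                                       # (B, H, T, d)
    scores = q @ k_white.transpose(-2, -1) / d ** 0.5            # (B, H, T, T)

    # Standard causal mask for GPT-style decoding.
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))

    return F.softmax(scores, dim=-1) @ v


# Example usage with random tensors (shape check only).
q = k = v = torch.randn(2, 4, 16, 32)
out = whitened_self_attention(q, k, v)   # -> shape (2, 4, 16, 32)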