Keywords: Self-Attention, Whitening Transformation, Covariance Modeling, Transformers
TL;DR: Whitened Self-Attention is a novel, theoretically grounded enhancement to Transformer architectures that improves learning efficiency by better modeling inter-token correlations.
Abstract: Self-attention in Transformer architectures is formulated as a function of the pairwise contributions between target vectors and their context vectors. This construction implicitly assumes that ternary and higher-order relationships are negligible. It further treats the context vectors as though they can be processed individually, as if mutually independent of one another. This model, however, contradicts our understanding of language: the meaning of words is shaped by complex interdependencies. We introduce Whitened Self-Attention, a novel, theoretically motivated enhancement that optimally accounts for inter-token correlations, and, under several covariance modeling assumptions, we derive a computationally feasible implementation of it. Experiments with a small GPT architecture show that whitened self-attention reduces perplexity by 19.3%, reaches the same mean cross-entropy loss in 37 times fewer training iterations, and, after hyperparameter optimization, reduces training time by 91%. Our approach shows significant potential for scaling and for improving the performance and generalization of large-scale language models. Moreover, because whitening decorrelates input sequences, it alters the structure of the trained attention and feedforward weight matrices. This affects their singular value decompositions and should, in turn, influence the results of the many studies on the mechanistic interpretability of Transformers.
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 20232