Superiority of Multi-Head Attention: A Theoretical Study in Shallow Transformers in In-Context Linear Regression

Published: 22 Jan 2025, Last Modified: 11 Mar 2025 · AISTATS 2025 Poster · CC BY 4.0
Abstract: We present a theoretical analysis of the performance of transformers with softmax attention on in-context linear regression tasks. While the existing theoretical literature predominantly focuses on providing convergence upper bounds to show that trained transformers with single-/multi-head attention can achieve good in-context learning performance, our work centers on rigorously comparing the exact convergence of single- and multi-head attention. We conduct an exact theoretical analysis to demonstrate that multi-head attention with a substantial embedding dimension performs better than single-head attention. As the number of in-context examples $D$ increases, the prediction loss of both single- and multi-head attention is $O(1/D)$, with multi-head attention attaining a smaller multiplicative constant. Beyond the simplest data-distribution setting, our technical framework for computing the exact convergence further facilitates studying additional scenarios, e.g., noisy labels, local examples, correlated features, and prior knowledge. We observe that, in general, multi-head attention is preferred over single-head attention. Our results verify the effectiveness of the multi-head attention design in the transformer architecture.
Submission Number: 318
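A minimal empirical sketch of the setting described in the abstract, not the paper's exact theoretical construction: in-context linear regression prompts are fed to a one-layer softmax-attention model, and a single-head variant is trained alongside a multi-head variant on the same task distribution. All dimensions, head counts, and training hyperparameters below are illustrative assumptions.

```python
# Sketch only: compare single- vs. multi-head softmax attention on
# in-context linear regression. Hyperparameters are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, D, batch = 5, 40, 256  # feature dim, in-context examples D, tasks per batch


def sample_prompts(batch, D, d):
    """Each task: w ~ N(0, I); D labeled examples (x_i, y_i) plus one query x_q."""
    w = torch.randn(batch, d, 1)
    X = torch.randn(batch, D + 1, d)                    # last row is the query
    y = (X @ w).squeeze(-1)                             # noiseless linear labels
    tokens = torch.cat([X, y.unsqueeze(-1)], dim=-1)    # (batch, D+1, d+1)
    tokens[:, -1, -1] = 0.0                             # hide the query label
    return tokens, y[:, -1]                             # target: query label


class OneLayerAttention(nn.Module):
    """One softmax self-attention layer with a linear readout at the query token."""

    def __init__(self, d_in, n_heads, d_embed):
        super().__init__()
        self.embed = nn.Linear(d_in, d_embed)
        self.attn = nn.MultiheadAttention(d_embed, n_heads, batch_first=True)
        self.readout = nn.Linear(d_embed, 1)

    def forward(self, tokens):
        h = self.embed(tokens)
        out, _ = self.attn(h, h, h)                     # attend over the prompt
        return self.readout(out[:, -1]).squeeze(-1)     # predict at the query


def train(n_heads, steps=2000, d_embed=64):
    model = OneLayerAttention(d + 1, n_heads, d_embed)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        tokens, target = sample_prompts(batch, D, d)
        loss = ((model(tokens) - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()


print("single-head final loss:", train(n_heads=1))
print("multi-head  final loss:", train(n_heads=4))
```

Under the abstract's claim, sweeping $D$ and fitting the resulting losses should show an $O(1/D)$ decay for both models, with the multi-head variant exhibiting a smaller multiplicative constant; this script only illustrates the experimental setup, not the exact-convergence analysis itself.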