On the Optimization and Generalization of Multi-head Attention

Puneesh Deora; Rouzbeh Ghaderi; Hossein Taheri; Christos Thrampoulidis

Published: 18 Apr 2024, Last Modified: 09 May 2025. Accepted by TMLR. License: CC BY 4.0

Event Certifications: iclr.cc/ICLR/2025/Journal_Track

Abstract: The training and generalization dynamics of the Transformer's core mechanism, namely the Attention mechanism, remain under-explored. Moreover, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits of overparameterization when training fully-connected networks, we investigate the potential optimization and generalization advantages of using multiple attention heads. To this end, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition on the data. We then establish primitive conditions on the initialization that ensure realizability holds. Finally, we demonstrate that these conditions are satisfied for a simple tokenized-mixture model. We expect that the analysis can be extended to various data-model and architecture variations.

Submission Length: Regular submission (no more than 12 pages of main content)

Supplementary Material: pdf

Assigned Action Editor: ~Srinadh_Bhojanapalli1

Submission Number: 1736
