Abstract: The attention mechanism is a fundamental component of the transformer model and plays a significant role in its success.
However, the theoretical understanding of how attention learns to select tokens is still an emerging area of research.
In this work, we study the training dynamics and generalization ability of the attention mechanism in classification problems with label noise.
We show that, under a characterization in terms of the signal-to-noise ratio (SNR), the token selection of the attention mechanism achieves ``benign overfitting'', i.e., it maintains high generalization performance despite fitting the label noise.
Our work also demonstrates an interesting delayed acquisition of generalization after an initial phase of overfitting.
Finally, we provide experiments to support our theoretical analysis using both synthetic and real-world datasets.
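As a minimal illustration of the kind of synthetic setup such analyses typically use (the exact construction in the paper may differ; all names and parameters below are illustrative assumptions, not the authors' specification): each training example is a sequence of tokens containing one label-dependent signal token, whose strength is controlled by the SNR, among Gaussian noise tokens, and each observed label is flipped with some probability.

```python
import numpy as np

# Hypothetical sketch of a synthetic classification dataset with label
# noise, in the style of benign-overfitting analyses of attention.
# Parameters (T, d, snr, rho) are illustrative, not the paper's values.

rng = np.random.default_rng(0)

def make_dataset(n, T=8, d=32, snr=3.0, rho=0.1):
    mu = np.zeros(d)
    mu[0] = snr                              # signal direction, scaled by SNR
    y_clean = rng.choice([-1, 1], size=n)    # clean labels
    flip = rng.random(n) < rho               # label-noise mask
    y = np.where(flip, -y_clean, y_clean)    # observed (possibly flipped) labels
    X = rng.normal(size=(n, T, d))           # T - 1 pure-noise tokens per example
    X[:, 0, :] += y_clean[:, None] * mu      # plant the signal token at position 0
    return X, y, y_clean

X, y, y_clean = make_dataset(200)
print(X.shape, y.shape)        # shapes: (n, T, d) and (n,)
print((y != y_clean).mean())   # empirical label-noise rate, close to rho
```

Under this data model, attention must learn to select the signal token from each sequence even though a rho-fraction of the training labels contradict it; "benign overfitting" means the trained model fits those flipped labels while still classifying fresh examples by the signal.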
Lay Summary: Transformer models are the backbone of modern AI systems, including LLMs, and their attention mechanisms play a crucial role by selecting important tokens from input sequences.
But how attention learns this selection during training with gradient descent, especially when the data contains incorrect labels (label noise), is poorly understood.
We theoretically show that attention mechanisms can exhibit *benign* overfitting: even when the model memorizes incorrect labels, it can still generalize well to unseen data.
We also find that generalization can emerge much later than memorization.
Practically, our results suggest that overfitting need not be feared, even when training on low-quality data, as is often the case with LLMs.
Theoretically, our work deepens the understanding of attention dynamics and provides an analytical framework for studying token selection, not only in label-noise settings but also in any scenario where token-selection behavior varies across training examples.
Link To Code: https://github.com/keitaroskmt/benign-attention
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Attention mechanism, token selection, benign overfitting, gradient descent, over-parameterization
Submission Number: 10581