Focus and Dilution: The Multi-stage Learning Process of Attention

Published: 30 Apr 2026, Last Modified: 24 Jun 2026
Venue: ICML 2026 spotlight
License: CC BY 4.0
Abstract: Transformer-based models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus–dilution cycle in attention learning and provide a rigorous explanation in a one-layer Transformer setting for Markovian data via gradient-flow analysis. Using stage-wise linearization around critical points, we show that a single focus–dilution cycle can be decomposed into a sequence of distinct stages. First, embedding and projection rapidly condense to a rank-one structure, while attention parameters remain effectively frozen. Then, the attention parameters begin to increase, inducing a frequency-driven focus toward high-frequency tokens. As attention continues to evolve, it generates next-order perturbations in embeddings, leading to a mass-redistribution mechanism that progressively dilutes this focus. Finally, small asymmetries among low-frequency tokens lift a degenerate critical point, opening new embedding directions and initiating the next cycle. Experiments on synthetic Markovian data as well as WikiText and TinyStories corroborate the predicted stages and cyclical dynamics.
Lay Summary: Transformer models, the technology behind many modern language systems, are highly successful but still difficult to understand. A key open question is how their attention mechanism—the part that decides which words or tokens to emphasize—changes during training. We study this question using a simplified Transformer model that is small enough to analyze mathematically while still capturing important training behavior. We find that attention does not simply become stronger in one direction; instead, it goes through a repeating “focus and dilution” cycle. First, the model quickly compresses its internal representations into a simple pattern. Then attention begins to focus on frequent tokens, but this focus later weakens as the model’s internal representations adjust. Small differences among less frequent tokens can then open new directions for learning, starting the next cycle. We prove this mechanism in a controlled setting and observe similar patterns in experiments on synthetic data and small language datasets. These findings give researchers a clearer picture of how attention develops during early training and may help guide future analysis and design of Transformer models.
Link To Code: https://github.com/MiracleLin001/Transformer_Training_Dynamics
Primary Area: Deep Learning->Theory
Keywords: Attention mechanism, Training dynamics, Multi-stage analysis, Condensation
Submission Number: 16346
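
Illustrative sketch (not taken from the linked repository): the abstract refers to Markovian data and to high-frequency versus low-frequency tokens. The snippet below is one minimal way such data could be generated and token frequencies inspected. It assumes a first-order Markov chain whose transition rows are drawn from a Dirichlet distribution; the function names, the vocabulary size, and the `concentration` parameter are all hypothetical choices made for illustration, and the paper's actual data-generation procedure may differ.

```python
import numpy as np


def random_transition_matrix(vocab_size, rng, concentration=0.5):
    """Draw a row-stochastic matrix; each row is a Dirichlet sample (assumption)."""
    return rng.dirichlet(np.full(vocab_size, concentration), size=vocab_size)


def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Approximate pi with pi = pi @ P by power iteration from the uniform vector."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        nxt = pi @ P
        converged = np.abs(nxt - pi).sum() < tol
        pi = nxt
        if converged:
            break
    return pi


def sample_chain(P, length, rng, start=0):
    """Sample a token sequence from the first-order Markov chain defined by P."""
    tokens = np.empty(length, dtype=np.int64)
    tokens[0] = start
    for t in range(1, length):
        tokens[t] = rng.choice(P.shape[0], p=P[tokens[t - 1]])
    return tokens


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = random_transition_matrix(vocab_size=8, rng=rng)
    pi = stationary_distribution(P)
    tokens = sample_chain(P, length=5_000, rng=rng)
    empirical = np.bincount(tokens, minlength=P.shape[0]) / len(tokens)
    # Tokens sorted from most to least frequent under the stationary distribution.
    print("tokens by stationary frequency:", np.argsort(-pi))
    print("stationary:", np.round(pi, 3))
    print("empirical: ", np.round(empirical, 3))
```

In a sketch like this, the tokens with the largest stationary probability play the role of the "high-frequency tokens" mentioned in the abstract, and the near-ties among the smallest entries of `pi` correspond loosely to the small asymmetries among low-frequency tokens.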