Abstract: Large-scale pretraining of transformers has been central to the success of foundation models. However, the scale of these models limits our understanding of the mechanisms at play during optimization. In this work, we study the training dynamics of transformers in a controlled and interpretable setting. On the sparse modular addition task, we demonstrate that specialized attention circuits, called clustering heads, can be implemented by gradient descent to solve the problem. Our experiments show that such pathways naturally emerge during training. By monitoring the evolution of tokens via a visual sandbox, we uncover a two-stage learning process and the occurrence of loss spikes due to the high curvature of normalization layers. Our findings provide several insights into patterns observed in more practical settings, such as the pretraining of large language models.
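For concreteness, a common formulation of a sparse modular addition task (the paper's exact setup may differ) labels each sequence of tokens in Z_p by the sum, modulo p, of the tokens at a fixed sparse subset of positions. Below is a minimal data-generation sketch under that assumption; the parameters p, L, and k are hypothetical, not taken from the paper.

```python
import numpy as np

# Hypothetical parameters (illustrative only): vocabulary Z_p,
# sequence length L, and k secret "relevant" positions.
p, L, k = 11, 12, 3
rng = np.random.default_rng(0)
relevant = rng.choice(L, size=k, replace=False)  # fixed sparse subset

def sample_batch(n):
    """Draw n sequences of tokens in Z_p; the label is the sum of the
    tokens at the k relevant positions, taken modulo p."""
    x = rng.integers(0, p, size=(n, L))
    y = x[:, relevant].sum(axis=1) % p
    return x, y

x, y = sample_batch(4)
print(x, y)
```

A model must identify the relevant positions (an attention-selection problem) and compute the modular sum, which is why attention circuits such as the clustering heads studied here are a natural object of analysis.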
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: N/A
Assigned Action Editor: ~Matthew_Walter1
Submission Number: 7494