Keywords: Circuit Analysis, Attribution Graphs, Feature Geometry
Other Keywords: Training Dynamics, Transformers, Visualization
TL;DR: We study the training dynamics of a small transformer model on a mathematical task, using a visualization sandbox that helps inspect each layer of the model during optimization.
Abstract: Large-scale pretraining of transformers has been central to the success of foundation models. However, the scale of those models limits our understanding of the mechanisms at play during optimization. In this work, we study the training dynamics of transformers in a controlled and interpretable setting. On the sparse modular addition task, we demonstrate that specialized attention circuits, called *clustering heads*, can be implemented during gradient descent to solve the problem. Our experiments show that such pathways naturally emerge during training. By monitoring the evolution of tokens via a visual sandbox, we uncover a two-stage learning process and the occurrence of loss spikes due to the high curvature of normalization layers. Our findings provide several insights into patterns observed in more practical settings, such as the pretraining of large language models.
Submission Number: 148