Abstract: We present an analysis of the evolution of the attention head circuits for a list-sorting
attention-only transformer. Through various measures, we identify distinct developmental
stages in the training process. In particular, depending on the training setup, we find that
the attention heads can specialize into one of two different modes: Vocabulary-splitting
or copy-suppression. We study the robustness of these stages by systematically varying the
training hyperparameters, model architecture, and training dataset. This leads us to discover
features in the training data that are correlated with the kind of head specialization the
model acquires.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Laurent_Charlin1
Submission Number: 3613