Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers

Published: 18 Jun 2024, Last Modified: 03 Jul 2024, Venue: TF2M 2024 Poster, Readers: Everyone, License: CC BY 4.0
Keywords: transformer, in-context learning, Markov chain, n-gram model, feature learning, training dynamics, induction head
TL;DR: We prove that a two-layer transformer learns to implement the "induction head" mechanism when trained on n-gram Markov chain data.
Abstract: In-context learning (ICL) is a cornerstone of large language model functionality, yet its theoretical foundations remain elusive due to the complexity of transformer architectures. In particular, most existing theoretical work explains only how the attention mechanism facilitates ICL under certain data models. It remains unclear how the other building blocks of the transformer contribute to ICL. To address this question, we study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data, where each token in the Markov chain statistically depends on the previous $n$ tokens. We analyze a sophisticated transformer model featuring relative positional embedding, multi-head softmax attention, and a feed-forward layer with normalization. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model that performs a generalized version of the "induction head" mechanism with a learned feature, resulting from the congruous contribution of all the building blocks.
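
To make the data model and the target behavior concrete, the following is a minimal Python sketch, not the paper's transformer or its training procedure. It samples a sequence in which each token depends on the previous n tokens, then predicts the next token by finding earlier positions whose preceding n tokens match the current last n tokens and averaging what followed them. This hard-matching counter is only an illustrative stand-in for the induction head idea; the generalized mechanism in the abstract uses a learned feature and softmax attention. All function names and parameters here are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def make_ngram_chain(vocab_size, n, rng):
    """Draw a random transition table (hypothetical helper): each
    length-n context maps to a probability vector over the next token."""
    table = {}
    for ctx in product(range(vocab_size), repeat=n):
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        table[ctx] = [w / total for w in weights]
    return table


def sample_sequence(table, vocab_size, n, length, rng):
    """Sample a sequence whose next token depends on the previous n tokens."""
    seq = [rng.randrange(vocab_size) for _ in range(n)]  # arbitrary start
    while len(seq) < length:
        ctx = tuple(seq[-n:])
        seq.append(rng.choices(range(vocab_size), weights=table[ctx])[0])
    return seq


def induction_head_predict(seq, n, vocab_size):
    """Hard-matching stand-in for an induction head: look up earlier
    positions preceded by the same n tokens as the end of the sequence
    and return the empirical distribution of the tokens that followed."""
    query = tuple(seq[-n:])
    counts = Counter()
    for i in range(n, len(seq)):
        if tuple(seq[i - n:i]) == query:
            counts[seq[i]] += 1
    total = sum(counts.values())
    if total == 0:
        # No earlier match in the context: fall back to a uniform guess.
        return [1.0 / vocab_size] * vocab_size
    return [counts[t] / total for t in range(vocab_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    table = make_ngram_chain(vocab_size=3, n=2, rng=rng)
    seq = sample_sequence(table, vocab_size=3, n=2, length=200, rng=rng)
    print(induction_head_predict(seq, n=2, vocab_size=3))
```

On the deterministic sequence `[0, 1, 2, 0, 1, 2, 0, 1]` with `n=2`, the last two tokens `(0, 1)` were followed by `2` both times they appeared earlier, so the predictor returns `[0.0, 0.0, 1.0]`.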
Submission Number: 62