Training Dynamics of In-Context Learning in Linear Attention

Published: 09 Jun 2025 · Last Modified: 25 Jun 2025 · HiLD at ICML 2025 Poster · License: CC BY 4.0
Keywords: learning dynamics, in-context learning, linear attention
TL;DR: We provide a theoretical description of how in-context learning abilities progressively improve during gradient descent training of multi-head linear attention.
Abstract: While attention-based models have demonstrated remarkable in-context learning (ICL) abilities, theoretical understanding of how these models acquire this ability through gradient descent training remains preliminary. Towards answering this question, we study the gradient descent dynamics of multi-head linear self-attention trained for in-context linear regression. We show that the training dynamics have exponentially many fixed points and that the loss exhibits saddle-to-saddle dynamics, which we reduce to scalar ordinary differential equations. During training, the model implements principal component regression in context, with the number of principal components increasing over training time. Overall, we provide a theoretical description of how ICL abilities progressively improve during gradient descent training of multi-head linear self-attention.
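The abstract's setup can be made concrete with a minimal sketch (not the paper's code): multi-head linear self-attention trained by gradient descent on in-context linear regression. It assumes a standard token construction (context tokens [x_i; y_i], query token [x_q; 0]) and a simple per-head parameterization with matrices Wq, Wk, Wv; the paper's exact architecture, initialization, and training protocol may differ.

```python
# Hypothetical sketch, not the authors' implementation: multi-head linear
# (softmax-free) self-attention trained for in-context linear regression.
import jax
import jax.numpy as jnp

d, N, H = 8, 32, 4  # input dim, context length, number of heads

def sample_task(key):
    # One regression task: context pairs (x_i, y_i) with y_i = w^T x_i,
    # plus a held-out query x_q whose target y_q the model must predict.
    kw, kx, kq = jax.random.split(key, 3)
    w = jax.random.normal(kw, (d,))
    X = jax.random.normal(kx, (N, d))
    xq = jax.random.normal(kq, (d,))
    return X, X @ w, xq, w @ xq

def predict(params, X, y, xq):
    # Tokens: context [x_i; y_i] in R^{d+1}, query [x_q; 0].
    E = jnp.concatenate([X, y[:, None]], axis=1)   # (N, d+1)
    eq = jnp.concatenate([xq, jnp.zeros(1)])       # (d+1,)
    out = jnp.zeros(d + 1)
    for Wq, Wk, Wv in zip(*params):                # loop over heads
        attn = (E @ Wk.T) @ (Wq @ eq) / N          # linear attention scores, (N,)
        out = out + (Wv @ E.T) @ attn              # head output at query position
    return out[-1]                                 # read prediction off the y-slot

def loss(params, key):
    # Population-style ICL loss, estimated over a batch of fresh tasks.
    keys = jax.random.split(key, 256)
    X, y, xq, yq = jax.vmap(sample_task)(keys)
    preds = jax.vmap(lambda a, b, c: predict(params, a, b, c))(X, y, xq)
    return jnp.mean((preds - yq) ** 2)

key = jax.random.PRNGKey(0)
pk, key = jax.random.split(key)
shape = (H, d + 1, d + 1)  # small init, as in typical dynamics analyses
params = tuple(0.01 * jax.random.normal(k, shape) for k in jax.random.split(pk, 3))

step = jax.jit(lambda p, k: jax.tree_util.tree_map(
    lambda w, g: w - 0.05 * g, p, jax.grad(loss)(p, k)))
for t in range(2000):
    key, sk = jax.random.split(key)
    params = step(params, sk)
    if t % 500 == 0:
        print(t, float(loss(params, key)))
```

Under such a setup, tracking the loss curve over training is where the abstract's saddle-to-saddle plateaus would appear, with each drop corresponding (per the paper's analysis) to the model picking up an additional principal component of the input covariance.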
Student Paper: Yes
Submission Number: 8