Incremental Learning of Sparse Attention Patterns in Transformers

Published: 31 Oct 2025, Last Modified: 28 Nov 2025
Venue: EurIPS 2025 Workshop PriGM
License: CC BY 4.0
Keywords: incremental learning, transformers, optimization, dynamics, generalization, sparsity
Abstract: This paper studies simple transformers on a high-order Markov chain, where the model must incorporate knowledge from multiple past positions, each with different statistical importance. We show that transformers learn the task incrementally, with each stage induced by the acquisition or copying of information from a subset of positions via a sparse attention pattern. Notably, the learning dynamics transition from competitive, where all heads focus on the statistically most important attention pattern, to cooperative, where different heads specialize in different patterns. We explain these dynamics using a set of simplified differential equations, which we use to characterize the stage-wise learning process and analyze the training trajectories. Overall, our work provides a theoretical explanation for how transformers learn in stages even without an explicit curriculum, and offers insights into the emergence of complex behaviors and generalization, with relevance to applications such as natural language processing and algorithmic reasoning.
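To make the data setting in the abstract concrete, below is a minimal illustrative sketch (not the authors' code) of an order-k Markov chain in which each past position carries a different amount of statistical information about the next token. The vocabulary size, order, noise level, and position weights are hypothetical choices for illustration only.

```python
# Sketch of an order-k Markov chain with positions of unequal statistical
# importance. All constants below are assumptions, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 3   # number of symbols (hypothetical)
ORDER = 3   # how many past positions influence the next token (hypothetical)
# Hypothetical "importance" of each past position (lag 1 is most informative).
POSITION_WEIGHTS = np.array([0.6, 0.3, 0.1])

def sample_sequence(length: int) -> np.ndarray:
    """Sample one sequence: each new token copies one of the previous ORDER
    tokens, chosen with probability POSITION_WEIGHTS, plus occasional noise."""
    seq = list(rng.integers(0, VOCAB, size=ORDER))       # random prefix
    for _ in range(length - ORDER):
        if rng.random() < 0.9:                           # copy a past token
            lag = rng.choice(ORDER, p=POSITION_WEIGHTS) + 1
            seq.append(seq[-lag])
        else:                                            # small noise component
            seq.append(int(rng.integers(0, VOCAB)))
    return np.array(seq)

if __name__ == "__main__":
    print(sample_sequence(32))
```

Under this construction, predicting the next token from lag 1 alone already captures most of the signal, while lags 2 and 3 add progressively smaller amounts of information; this mirrors the "multiple past positions, each with different statistical importance" described in the abstract, where a sparse attention pattern over the informative lags suffices to solve the task.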
Submission Number: 11