Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: transformers, chain of thought, length generalization, deep learning theory, non-convex optimization
Abstract: The ability to reason lies at the core of artificial intelligence (AI), and challenging problems usually call for deeper, longer reasoning. A crucial question about AI reasoning is whether models can extrapolate learned reasoning patterns to solve harder tasks that require longer chains of thought (CoT). In this work, we present a theoretical analysis of transformers trained via gradient descent on synthetic data for various state tracking tasks, revealing how length-generalizable reasoning can emerge. Specifically, we prove that: (i) for tasks with simple algebraic structure, such as cyclic-group composition, transformers trained on short, constant-length chains learn a solution pattern that extrapolates to much longer chains; and (ii) for more complex tasks, such as symmetric-group composition, a recursive self-training curriculum bootstraps longer reasoning and generalizes well beyond the training horizon, up to the natural limit of our setting. Our results demonstrate that transformers can learn sequential reasoning skills that scale with problem complexity. Moreover, we provide the first optimization-based guarantee that constant-depth transformers can learn state tracking problems in $\mathsf{NC}^1$, surpassing the prior barrier of $\mathsf{TC}^0$, provided the widely believed conjecture $\mathsf{TC}^0 \neq \mathsf{NC}^1$ holds.
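To make the state tracking setting concrete, here is a minimal illustrative sketch (not taken from the paper's materials; all function names are hypothetical) of the two task families the abstract mentions: cyclic-group composition, where the model must track a running sum modulo $n$, and symmetric-group composition, where it must track a running composition of permutations.

```python
import random

def make_cyclic_instance(n: int, length: int, seed: int = 0):
    """Generate one Z_n state-tracking example: a sequence of group
    elements together with the intermediate states (running sums mod n)
    that a chain-of-thought would emit step by step."""
    rng = random.Random(seed)
    elements = [rng.randrange(n) for _ in range(length)]
    states, acc = [], 0
    for g in elements:
        acc = (acc + g) % n  # cyclic-group composition is addition mod n
        states.append(acc)
    return elements, states

def compose_perm(p, q):
    """Compose two permutations of {0, .., k-1}: (p o q)(i) = p[q[i]].
    Tracking a product of permutations in S_5 is the classic NC^1-hard
    word problem underlying the harder task family."""
    return tuple(p[q[i]] for i in range(len(p)))

elements, states = make_cyclic_instance(n=5, length=8, seed=1)
print(states[-1] == sum(elements) % 5)  # final state = total composition
```

The intermediate `states` list is exactly what a length-generalizing CoT would verbalize: the supervision target at each step is the next group element applied to the current state, independent of chain length.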
Supplementary Material: zip
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 19019