Chain-of-Thought Gradient Descent

Published: 26 May 2026, Last Modified: 14 Jun 2026, ICML 2026 FoGen Workshop Poster, CC BY 4.0
Keywords: CoT, attention
TL;DR: We show that CoT expands the expressive power of Transformer in-context learning, and establish an efficiency guarantee unique to CoT.
Abstract: We show that Chain-of-Thought (CoT) expands the expressiveness of Transformer in-context learning (ICL). First, we show that CoT enables efficient simulation of In-Context Gradient Descent (ICGD) for $N$-layer neural networks. Without CoT, a Transformer with fixed depth and hidden dimension has a fixed ICL capacity in a single forward pass, so simulating larger models or more optimization steps in-context requires a deeper or wider Transformer. CoT removes this limitation by providing an expandable workspace via the sequence trajectory. This enables arbitrary-step and arbitrary-capacity ICGD within a constant-depth Transformer. Second, we provide a provable efficiency guarantee unique to CoT through dynamical masking: the attention mechanism processes only the tokens relevant to the current update step. This eliminates the redundant "process everything" cost of single-pass deep models. Specifically, we prove that this CoT mechanism improves the computational cost of the prior best in-context result [Wu et al., ICML 2025] by $O(N)$. Numerical experiments support our theory.
Submission Number: 3
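
The "expandable workspace" idea in the abstract can be illustrated with a toy sketch. This is not the paper's Transformer construction: the scalar-width tanh network, the function names (`forward`, `loss`, `gd_step`, `run_trajectory`), and the hyperparameters are all assumptions made for illustration. The sketch only shows that a fixed step function, applied repeatedly while appending each result to a growing trajectory, yields as many gradient descent steps as desired, and that each step reads only the most recent state rather than the whole history.

```python
import math


def forward(weights, x):
    """Run a scalar N-layer network; return all activations (input first)."""
    acts = [x]
    last = len(weights) - 1
    for i, w in enumerate(weights):
        z = w * acts[-1]
        # Hidden layers use tanh; the final layer is linear.
        acts.append(z if i == last else math.tanh(z))
    return acts


def loss(weights, data):
    """Mean squared error over (x, y) pairs."""
    return sum((forward(weights, x)[-1] - y) ** 2 for x, y in data) / len(data)


def gd_step(weights, data, lr):
    """One full-batch gradient descent step via manual backpropagation."""
    grads = [0.0] * len(weights)
    for x, y in data:
        acts = forward(weights, x)
        # Gradient of the loss w.r.t. the (linear) output pre-activation.
        delta = 2.0 * (acts[-1] - y) / len(data)
        for i in reversed(range(len(weights))):
            grads[i] += delta * acts[i]
            if i > 0:
                # acts[i] = tanh(z_{i-1}), so d tanh / dz = 1 - acts[i]^2.
                delta = delta * weights[i] * (1.0 - acts[i] ** 2)
    return [w - lr * g for w, g in zip(weights, grads)]


def run_trajectory(init_weights, data, lr, num_steps):
    """Apply the same fixed step repeatedly, appending each state.

    The trajectory plays the role of the growing sequence (an assumption of
    this sketch); each step reads only the latest entry.
    """
    trajectory = [list(init_weights)]
    for _ in range(num_steps):
        current = trajectory[-1]  # only the most recent state is read
        trajectory.append(gd_step(current, data, lr))
    return trajectory


if __name__ == "__main__":
    data = [(x, 0.3 * x) for x in (-1.0, -0.5, 0.5, 1.0)]
    traj = run_trajectory([0.5, 0.5, 0.5], data, lr=0.1, num_steps=50)
    print("initial loss:", loss(traj[0], data))
    print("final loss:  ", loss(traj[-1], data))
```

In this sketch, more optimization steps lengthen the trajectory while `gd_step` itself stays unchanged, which mirrors, loosely, the abstract's contrast between extending the sequence and deepening or widening the model.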