Keywords: CoT, attention
TL;DR: We show that CoT expands the expressive power of Transformer in-context learning, and establish an efficiency guarantee unique to CoT.
Abstract: We show that Chain-of-Thought (CoT) expands the expressiveness of Transformer in-context learning (ICL).
Specifically, we show that CoT enables efficient simulation of In-Context Gradient Descent (ICGD) for an $N$-layer neural network.
Without CoT, a Transformer with fixed depth and hidden dimension has fixed ICL capacity in one forward pass.
Simulating larger models or more optimization steps in-context requires deeper or wider Transformers.
CoT removes this limitation by providing an expandable workspace via the sequence trajectory.
This enables arbitrary-step and arbitrary-capacity ICGD within a constant-depth Transformer.
We further provide a provable efficiency guarantee unique to CoT through dynamic masking.
The attention mechanism processes only the tokens relevant to the current update step.
This eliminates the redundant "process everything" cost of single-pass deep models.
Specifically, we prove that this CoT mechanism improves on the computational cost of the prior best in-context result [Wu et al., ICML 2025] by $O(N)$.
Numerical validations support our theory.
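As a rough illustration of the idea that a fixed computation applied repeatedly along a growing trajectory can realize any number of gradient-descent steps, the following minimal sketch may help. It is not the construction from the paper: it uses plain NumPy with no attention layers, a linear model rather than an $N$-layer network, and a hypothetical boolean `visible` mask that stands in for dynamic masking by exposing only the latest weight token. All function names and hyperparameters are assumptions made for this example.

```python
import numpy as np


def gd_step(w, X, y, lr):
    """One gradient-descent step on the mean squared error of a linear model."""
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad


def cot_icgd(X, y, num_steps, lr=0.1):
    """Run `num_steps` GD steps by reapplying one fixed step function.

    The trajectory plays the role of the CoT workspace: every step appends
    a new weight vector, so more steps lengthen the sequence while the
    per-step computation stays the same.
    """
    trajectory = [np.zeros(X.shape[1])]  # initial weights w_0
    for _ in range(num_steps):
        # Hypothetical stand-in for dynamic masking: of all weight tokens
        # written so far, only the most recent one is visible to this step.
        visible = np.zeros(len(trajectory), dtype=bool)
        visible[-1] = True
        w_current = trajectory[int(np.flatnonzero(visible)[0])]
        trajectory.append(gd_step(w_current, X, y, lr))
    return trajectory


def loss(w, X, y):
    """Mean squared error of the linear model with weights `w`."""
    return float(np.mean((X @ w - y) ** 2))


# Synthetic in-context examples (assumed data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true

traj = cot_icgd(X, y, num_steps=200)
print("initial loss:", loss(traj[0], X, y))
print("final loss:  ", loss(traj[-1], X, y))
```

In this toy setting, changing `num_steps` changes only the length of the trajectory, and each step reads one weight token regardless of how many earlier tokens exist.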
Submission Number: 3