Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: A key paradigm to improve the reasoning capabilities of large language models (LLMs) is to allocate more inference-time compute to search against a verifier or reward model. This process can then be used to refine the pretrained model or distill its reasoning patterns into more efficient models. In this paper, we study inference-time computation by viewing chain-of-thought (CoT) generation as a metastable Markov process: easy reasoning steps (e.g., algebraic manipulations) form densely connected clusters, while hard reasoning steps (e.g., applying a relevant theorem) create sparse, low-probability edges between clusters, leading to phase transitions at longer timescales. Under this framework, we prove that implementing a search protocol that rewards sparse edges improves CoT by decreasing the expected number of steps needed to reach different clusters. In contrast, we establish a limit on reasoning capability when the model is restricted to local information of the pretrained graph. We also show that the information gained by search can be used to obtain a better reasoning model: (1) the pretrained model can be directly finetuned to favor sparse edges via policy gradient methods, and (2) a compressed metastable representation of the reasoning dynamics can be distilled into a smaller, more efficient model.
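For intuition, the central hitting-time claim can be illustrated numerically. The minimal sketch below (not the paper's formal construction) models one cluster of n well-mixed "easy" states whose only exit is a sparse edge of probability eps out of a single state; the expected number of steps to escape to the next cluster, obtained by solving (I - Q)h = 1 for the substochastic intra-cluster kernel Q, scales like n/eps, so upweighting the sparse edge, as a search protocol that rewards such edges would, shrinks the escape time proportionally. The function name hitting_time and the uniform-mixing cluster are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's construction): a metastable
# cluster of n states with dense uniform intra-cluster mixing, where state 0
# leaks probability eps to the next cluster via a single sparse edge.
import numpy as np

def hitting_time(eps, n=10):
    """Expected steps to leave an n-state cluster whose only exit is a
    sparse edge of probability eps out of state 0."""
    # Q: transition kernel restricted to the cluster (row-substochastic).
    Q = np.full((n, n), 1.0 / n)   # dense intra-cluster mixing
    Q[0] = (1.0 - eps) / n         # row 0 leaks mass eps to the next cluster
    # Expected hitting times solve (I - Q) h = 1.
    h = np.linalg.solve(np.eye(n) - Q, np.ones(n))
    return h.mean()                # average over uniform starting state

for eps in [1e-3, 1e-2, 1e-1]:
    print(f"eps={eps:.0e}  E[steps to escape] ~ {hitting_time(eps):.1f}")
# The escape time scales like n/eps: boosting the sparse-edge weight by a
# factor k cuts the expected number of CoT steps by roughly that factor.
```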
Lay Summary: A promising way to improve the reasoning ability of large language models (LLMs) is to invest more computation at inference time, for instance by using a reward signal to search among potential output sequences for better reasoning paths. This kind of search can also be used to refine the original model or train smaller, faster ones that inherit its reasoning skills. In this paper, we theoretically study the benefits of inference-time search for chain-of-thought (CoT) reasoning by mathematically modeling the output of the base model as a random process known as a metastable Markov process. Under this model, easy reasoning steps form dense clusters, while harder conceptual leaps (like invoking the right theorem to solve a problem) act as rare transitions between clusters. We show that encouraging these rare transitions can improve the efficiency of CoT search by reducing the time needed to reach new insights. We also demonstrate how to leverage the information gained during this search: the original model can be fine-tuned to favor more insightful steps, and a compressed representation of the search dynamics can be distilled into a smaller model that retains strong reasoning performance.
Primary Area: Theory->Learning Theory
Keywords: large language model, reasoning, search, distillation, metastability
Submission Number: 9095