Keywords: large language models, reinforcement learning
Abstract: Self-correction in Large Language Models (LLMs) has emerged as a promising approach for enhancing their reasoning capabilities at inference time. In this paper, we study how self-correction can enable an LLM to effectively perform search over multiple sequential turns on 3-SAT problems. We train a self-correcting model with reinforcement learning that verifies an initial solution through chain-of-thought reasoning and uses its own evaluation to provide a new solution. Despite being trained to self-correct only once, the model can revise its answers in a sequential loop at inference time, yielding multi-turn gains. Our experiments demonstrate that generating strong chain-of-thought evaluations of candidate solutions is essential: sequential scaling, which refines an initial solution over k turns, surpasses even the strongest oracle-guided parallel scaling methods (i.e., pass@k).
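To make the sequential scaling procedure concrete, the sketch below shows one way the inference-time loop described in the abstract could look. It is illustrative only, not the authors' code: `model.propose`, `model.critique`, and `model.revise` are hypothetical stand-ins for calls to the trained self-correcting model, and the 3-SAT checker assumes DIMACS-style clauses (signed integers for literals).

```python
def is_satisfying(formula, assignment):
    """Check whether a variable assignment satisfies a CNF formula.

    `formula` is a list of clauses; each clause is a list of signed ints
    (DIMACS-style: 3 means x3 is True, -3 means x3 is False).
    `assignment` maps variable index -> bool.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )


def sequential_self_correct(model, formula, k):
    """Refine an initial solution over up to k turns, stopping early on success.

    `model.propose`, `model.critique`, and `model.revise` are hypothetical
    wrappers around the trained self-correcting LLM.
    """
    solution = model.propose(formula)  # turn 0: initial attempt
    for _ in range(k):
        if is_satisfying(formula, solution):
            return solution
        # Chain-of-thought evaluation of the current candidate solution,
        # then a revised attempt conditioned on that evaluation.
        evaluation = model.critique(formula, solution)
        solution = model.revise(formula, solution, evaluation)
    return solution
```

In this view, sequential scaling spends its budget of k samples on successive revisions of one candidate, whereas parallel scaling (pass@k) spends it on k independent attempts and takes the best one.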
Submission Number: 44