Keywords: large language models, reinforcement learning
Abstract: Self-correction in Large Language Models (LLMs) has emerged as a promising approach for enhancing their reasoning capabilities at inference time. In this paper, we study how self-correction can enable an LLM to effectively perform search over multiple sequential turns on 3-SAT problems. We train a self-correcting model with reinforcement learning that verifies an initial solution through chain-of-thought reasoning and uses its own evaluation to provide a new solution. Despite being trained to self-correct only once, the model can revise its answers in a sequential loop at inference time, yielding multi-turn gains. Our experiments demonstrate that generating strong chain-of-thought evaluations of candidate solutions is essential: sequential scaling, which refines an initial solution over k turns, surpasses even the strongest oracle-guided parallel scaling methods (i.e., pass@k).
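To make the sequential scaling procedure concrete, the sketch below shows one way the inference-time loop described in the abstract could look. It is illustrative only, not the authors' code: `model.propose`, `model.critique`, and `model.revise` are hypothetical stand-ins for calls to the trained self-correcting model, and the 3-SAT checker assumes DIMACS-style clauses (signed integers for literals).

```python
def is_satisfying(formula, assignment):
    """Check whether a variable assignment satisfies a CNF formula.

    `formula` is a list of clauses; each clause is a list of signed ints
    (DIMACS-style: 3 means x3 is True, -3 means x3 is False).
    `assignment` maps variable index -> bool.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )


def sequential_self_correct(model, formula, k):
    """Refine an initial solution over up to k turns, stopping early on success.

    `model.propose`, `model.critique`, and `model.revise` are hypothetical
    wrappers around the trained self-correcting LLM.
    """
    solution = model.propose(formula)  # turn 0: initial attempt
    for _ in range(k):
        if is_satisfying(formula, solution):
            return solution
        # Chain-of-thought evaluation of the current candidate solution,
        # then a revised attempt conditioned on that evaluation.
        evaluation = model.critique(formula, solution)
        solution = model.revise(formula, solution, evaluation)
    return solution
```

In this view, sequential scaling spends its budget of k samples on successive revisions of one candidate, whereas parallel scaling (pass@k) spends it on k independent attempts and takes the best one.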
Submission Number: 44