Keywords: Reinforcement Learning, Large Language Model, Vision Language Model
Abstract: Reinforcement Learning (RL) has proven effective at fine-tuning Large Language Models (LLMs) to improve the precision of their chain-of-thought reasoning. However, these methods typically rely on outcome-based rewards and do not directly supervise the cognitive process of reflection. Consequently, while the model's ability to complete reasoning tasks is optimized, its capacity to identify and recover from its own errors within a single, continuous line of thought is not explicitly trained. In this work, we introduce Guided Sampling, a framework designed specifically to cultivate this missing reflection ability. Guided Sampling casts the exploration phase as a sequential process in which, upon generating an incorrect response, the model is prompted to re-evaluate its flawed reasoning and continue its generation. This technique places direct optimization pressure on the act of reflection itself, shifting the learning objective from merely finding a correct answer to actively correcting a wrong one. Experimental results demonstrate that by explicitly training for reflection, our Guided Sampling RL (GSRL) framework not only surpasses traditional RL methods in final task accuracy but also fosters a more robust, self-correcting reasoning process.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10346