Keywords: Reinforcement learning, Reasoning, Large Language Model, Agent
Abstract: Large Language Models (LLMs) often struggle with challenging, multi-step reasoning problems due to a fundamental learning gap -- Reinforcement Learning with Verifiable Rewards (RLVR) suffers from sparse rewards when correct solutions are rarely sampled, while Supervised Fine-Tuning (SFT) tends to overfit to long demonstrations through rigid token mimicry. To bridge this gap, we introduce Supervised Reinforcement Learning (SRL), a framework that reformulates problem-solving as a sequence of logical actions. SRL trains the model on each action: the model first generates an internal reasoning monologue and then commits to an action. The model receives dense rewards based on the similarity between its actions and the expert’s at each step, providing a richer signal than RLVR. More importantly, by rewarding only the action, SRL allows the model flexibility in its self-generated thought process, promoting stronger reasoning abilities than SFT. On challenging mathematical reasoning benchmarks, SRL significantly outperforms both SFT and RLVR. Furthermore, a curriculum that cold-starts with SRL before refining with RLVR achieves the strongest results. SRL also generalizes effectively to agentic software engineering tasks, establishing it as a robust framework for various reasoning tasks.
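For intuition, here is a minimal sketch of the dense step-wise reward the abstract describes. The specific similarity metric (a normalized string match) and the string-valued action format are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumed details, not the paper's implementation): dense per-step
# reward in SRL, scoring only the committed action against the expert's action at
# the same step, while the model's internal reasoning monologue is left unconstrained.
from difflib import SequenceMatcher


def step_rewards(model_actions: list[str], expert_actions: list[str]) -> list[float]:
    """Reward each step by how closely the model's action matches the expert's."""
    rewards = []
    for model_act, expert_act in zip(model_actions, expert_actions):
        # SequenceMatcher.ratio() returns a similarity score in [0, 1];
        # 1.0 means the actions are identical.
        rewards.append(SequenceMatcher(None, model_act, expert_act).ratio())
    return rewards


# Example: a three-step expert trajectory vs. a model rollout. Unlike an
# outcome-only reward, every step yields a learning signal even when the
# rollout diverges from the expert partway through.
expert = ["factor the quadratic", "set each factor to zero", "solve for x"]
rollout = ["factor the quadratic", "apply the quadratic formula", "solve for x"]
print(step_rewards(rollout, expert))
```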
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 22047