Keywords: Reinforcement learning, Reasoning, Large Language Model, Agent
Abstract: Large Language Models (LLMs) often struggle with challenging, multi-step reasoning problems due to a fundamental learning gap -- Reinforcement Learning with Verifiable Rewards (RLVR) suffers from sparse rewards when correct solutions are rarely sampled, while Supervised Fine-Tuning (SFT) tends to overfit to long demonstrations through rigid token mimicry. To bridge this gap, we introduce Supervised Reinforcement Learning (SRL), a framework that reformulates problem-solving as a sequence of logical actions. SRL trains the model on each action: the model first generates an internal reasoning monologue and then commits to an action. The model receives dense rewards based on the similarity between its actions and the expert’s at each step, providing a richer signal than RLVR. More importantly, by rewarding only the action, SRL allows the model flexibility in its self-generated thought process, promoting stronger reasoning abilities than SFT. On challenging mathematical reasoning benchmarks, SRL significantly outperforms both SFT and RLVR. Furthermore, a curriculum that cold-starts with SRL before refining with RLVR achieves the strongest results. SRL also generalizes effectively to agentic software engineering tasks, establishing it as a robust framework for various reasoning tasks.
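For intuition, here is a minimal sketch of the dense step-wise reward the abstract describes. The specific similarity metric (a normalized string match) and the string-valued action format are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumed details, not the paper's implementation): dense per-step
# reward in SRL, scoring only the committed action against the expert's action at
# the same step, while the model's internal reasoning monologue is left unconstrained.
from difflib import SequenceMatcher


def step_rewards(model_actions: list[str], expert_actions: list[str]) -> list[float]:
    """Reward each step by how closely the model's action matches the expert's."""
    rewards = []
    for model_act, expert_act in zip(model_actions, expert_actions):
        # SequenceMatcher.ratio() returns a similarity score in [0, 1];
        # 1.0 means the actions are identical.
        rewards.append(SequenceMatcher(None, model_act, expert_act).ratio())
    return rewards


# Example: a three-step expert trajectory vs. a model rollout. Unlike an
# outcome-only reward, every step yields a learning signal even when the
# rollout diverges from the expert partway through.
expert = ["factor the quadratic", "set each factor to zero", "solve for x"]
rollout = ["factor the quadratic", "apply the quadratic formula", "solve for x"]
print(step_rewards(rollout, expert))
```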
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 22047