Keywords: LLM Reasoning, Reinforcement Learning, Supervised Reinforcement Fine-Tuning
TL;DR: We propose SRFT, an entropy-aware single-stage framework that unifies SFT and RL for LLM reasoning.
Abstract: Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet optimally integrating Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through a comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from an entropy-based perspective, we reveal key differences between these paradigms: SFT induces coarse-grained, global shifts to policy distributions, while RL performs fine-grained, selective optimizations.
Our analysis further establishes entropy as a critical indicator of training efficacy.
Building on these observations, we introduce **S**upervised **R**einforcement **F**ine-**T**uning (**SRFT**), a single-stage framework that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms.
SRFT applies SFT and RL simultaneously, directly optimizing the LLM on both demonstrations and self-exploration rollouts in a single stage rather than through sequential two-stage pipelines.
Extensive experiments show that SRFT outperforms zero-RL baselines by **9.0%** on five mathematical reasoning benchmarks and by **10.9%** on three out-of-distribution benchmarks.
Moreover, by leveraging demonstration data, SRFT maintains more stable policy entropy during training, facilitating sustained policy improvement.
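To make the abstract's loss construction concrete, below is a minimal sketch of a single-stage objective that combines an SFT term (on demonstrations) and an RL term (on self-exploration rollouts) with an entropy-dependent weight. The specific weighting function (`exp(-H)`), the simple policy-gradient surrogate, and all tensor names are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token entropy H of the policy distribution."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()


def srft_style_loss(demo_logits, demo_targets,
                    rollout_logprobs, advantages,
                    policy_entropy):
    """Combine an SFT term and an RL term in one update, with the SFT
    term gated by current policy entropy H. The exp(-H) weight is a
    hypothetical stand-in for the paper's entropy-aware mechanism."""
    # SFT: cross-entropy on demonstration tokens. Down-weighting it at
    # high entropy (assumption) tempers coarse-grained global shifts.
    sft_loss = F.cross_entropy(demo_logits.flatten(0, 1),
                               demo_targets.flatten())
    w_sft = torch.exp(-policy_entropy).detach()
    # RL: plain policy-gradient surrogate on rollout tokens, standing in
    # for the fine-grained, selective RL optimization.
    rl_loss = -(rollout_logprobs * advantages).mean()
    return w_sft * sft_loss + rl_loss


# Toy usage with random tensors (batch=2, seq=4, vocab=8):
B, T, V = 2, 4, 8
demo_logits = torch.randn(B, T, V)
demo_targets = torch.randint(0, V, (B, T))
rollout_logprobs = torch.randn(B, T)
advantages = torch.randn(B, T)
H = token_entropy(demo_logits)
loss = srft_style_loss(demo_logits, demo_targets,
                       rollout_logprobs, advantages, H)
```

The single combined loss means demonstrations and rollouts shape the same gradient step, which is the sense in which the method is single-stage rather than SFT-then-RL.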
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1626