Keywords: Large Language Models, Reasoning, Reinforcement Learning, Supervised Fine-Tuning, Dynamic Loss Weighting
Abstract: The joint optimization of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) has emerged as a prominent paradigm for LLM post-training. Current methods, however, either sequence these stages or combine them with static weights, thereby overlooking sample differences and model dynamics, which can lead to overfitting or reward hacking. To address this, we introduce the Sample Learning State (SLS), characterized by two key metrics: the Degree of Sample Mastery and the Dispersion of the Exploration Trajectory, which capture how the model's handling of each sample evolves during training. Based on SLS, we design a training-state-aware, sample-wise weighting coefficient that dynamically integrates the SFT and RL losses, balancing supervised guidance and autonomous exploration for synergistic optimization. Extensive experiments demonstrate that our method achieves new state-of-the-art (SOTA) results on four in-domain mathematical benchmarks and two out-of-domain tasks. Moreover, it exhibits strong robustness in multi-reward scenarios, effectively mitigating reward hacking under auxiliary constraints while maintaining stable reasoning performance. We will release the code upon publication.
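To make the abstract's core idea concrete, the following is a minimal illustrative sketch of how a per-sample coefficient could blend SFT and RL losses from two learning-state signals. The function name `blended_loss`, the sigmoid-based weight, and the `alpha`/`beta` hyperparameters are assumptions for illustration only, not the paper's actual formulation of SLS or its weighting scheme.

```python
import torch


def blended_loss(sft_loss, rl_loss, mastery, dispersion, alpha=1.0, beta=1.0):
    """Blend per-sample SFT and RL losses with a training-state-aware weight.

    `sft_loss`, `rl_loss`, `mastery` (degree of sample mastery), and
    `dispersion` (exploration-trajectory dispersion) are assumed to be
    per-sample tensors of shape (batch,), with the two state metrics in
    [0, 1]. The specific functional form below is a hypothetical choice.
    """
    # Low mastery -> lean on supervised guidance (larger SFT weight);
    # high mastery and high trajectory dispersion -> allow more RL exploration.
    w = torch.sigmoid(alpha * (1.0 - mastery) - beta * dispersion)
    return (w * sft_loss + (1.0 - w) * rl_loss).mean()


# Example usage with dummy per-sample values:
batch = 4
loss = blended_loss(
    sft_loss=torch.rand(batch),
    rl_loss=torch.rand(batch),
    mastery=torch.rand(batch),
    dispersion=torch.rand(batch),
)
```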
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning, optimization methods, multi-task learning, generalization
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7712