Keywords: Large Language Models, Reasoning, Reinforcement Learning, Supervised Fine-Tuning, Dynamic Loss Weighting
Abstract: The joint optimization of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) has emerged as a prominent paradigm for LLM post-training. Current methods, however, either sequence these stages or combine them with static weights, thereby overlooking sample differences and model dynamics, which can lead to overfitting or reward hacking. To address this, we introduce the Sample Learning State (SLS), characterized by two key metrics: the Degree of Sample Mastery and the Dispersion of the Exploration Trajectory, which capture how the model's handling of each sample evolves during training. Based on SLS, we design a training-state-aware, sample-wise weighting coefficient that dynamically integrates the SFT and RL losses, balancing supervised guidance and autonomous exploration for synergistic optimization. Extensive experiments demonstrate that our method achieves new state-of-the-art (SOTA) results on four in-domain mathematical benchmarks and two out-of-domain tasks. Moreover, it exhibits strong robustness in multi-reward scenarios, effectively mitigating reward hacking under auxiliary constraints while maintaining stable reasoning performance. We will release the code upon publication.
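To make the abstract's core idea concrete, the following is a minimal illustrative sketch of how a per-sample coefficient could blend SFT and RL losses from two learning-state signals. The function name `blended_loss`, the sigmoid-based weight, and the `alpha`/`beta` hyperparameters are assumptions for illustration only, not the paper's actual formulation of SLS or its weighting scheme.

```python
import torch


def blended_loss(sft_loss, rl_loss, mastery, dispersion, alpha=1.0, beta=1.0):
    """Blend per-sample SFT and RL losses with a training-state-aware weight.

    `sft_loss`, `rl_loss`, `mastery` (degree of sample mastery), and
    `dispersion` (exploration-trajectory dispersion) are assumed to be
    per-sample tensors of shape (batch,), with the two state metrics in
    [0, 1]. The specific functional form below is a hypothetical choice.
    """
    # Low mastery -> lean on supervised guidance (larger SFT weight);
    # high mastery and high trajectory dispersion -> allow more RL exploration.
    w = torch.sigmoid(alpha * (1.0 - mastery) - beta * dispersion)
    return (w * sft_loss + (1.0 - w) * rl_loss).mean()


# Example usage with dummy per-sample values:
batch = 4
loss = blended_loss(
    sft_loss=torch.rand(batch),
    rl_loss=torch.rand(batch),
    mastery=torch.rand(batch),
    dispersion=torch.rand(batch),
)
```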
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning, optimization methods, multi-task learning, generalization
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7712