Towards a Unified View of Large Language Model Post-Training

10 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Large Language Models, Post-Training, Reinforcement Learning
Abstract: Many approaches with seemingly disparate loss functions exist for post-training modern language models, most prominently Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive the Unified Policy Gradient Estimator (UPGE), a framework with four interchangeable parts that unifies a wide spectrum of post-training approaches through the form of their loss gradients. We further show that these methods compute the gradient of a common objective under different data distribution assumptions and different bias-variance tradeoffs. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects between training signals. HPT is designed to yield both effective exploitation of demonstrations and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of HPT. Across six mathematical reasoning benchmarks and two out-of-distribution tasks, HPT consistently surpasses strong baselines across models of varying scales and families.
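For concreteness, below is a minimal sketch of the kind of per-prompt signal switching the abstract describes. The switching rule (fall back to SFT when no rollout clears a reward threshold), the `model.log_prob` helper, and the mean-baseline advantage are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def hpt_loss(model, prompt, demo, rollouts, rewards, threshold=0.0):
    """Hypothetical sketch of Hybrid Post-Training's dynamic signal selection.

    Per prompt, pick an RL signal when the model's own rollouts succeed,
    and an SFT signal on the demonstration when they do not. The exact
    switching criterion used by HPT may differ; this is an assumption.
    """
    if max(rewards) > threshold:
        # Exploration succeeded: apply a policy-gradient (RL) signal
        # to the model's own rollouts.
        logps = torch.stack([model.log_prob(prompt, y) for y in rollouts])
        advantages = torch.tensor(rewards) - sum(rewards) / len(rewards)
        return -(advantages * logps).mean()
    # Exploration failed: exploit the demonstration with an SFT signal
    # (negative log-likelihood of the demonstrated answer).
    return -model.log_prob(prompt, demo)
```

In the UPGE view, both branches can be read as the same gradient estimator with different choices of advantage and sampling distribution, which is what makes this kind of hybrid switch coherent rather than ad hoc.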
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 3535