Towards a Unified View of Large Language Model Post-Training

10 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Large Language Models, Post-Training, Reinforcement Learning
Abstract: Many approaches with seemingly disparate loss functions exist for post-training modern language models, most prominently Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive the Unified Policy Gradient Estimator (UPGE), a framework with four interchangeable parts that unifies a wide spectrum of post-training approaches through the form of their loss gradients. We further show that these methods compute the gradient of a common objective under different data distribution assumptions and different bias-variance tradeoffs. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects between training signals. HPT is designed to yield both effective exploitation of demonstrations and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of HPT. Across six mathematical reasoning benchmarks and two out-of-distribution tasks, HPT consistently surpasses strong baselines across models of varying scales and families.
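For concreteness, below is a minimal sketch of the kind of per-prompt signal switching the abstract describes. The switching rule (fall back to SFT when no rollout clears a reward threshold), the `model.log_prob` helper, and the mean-baseline advantage are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def hpt_loss(model, prompt, demo, rollouts, rewards, threshold=0.0):
    """Hypothetical sketch of Hybrid Post-Training's dynamic signal selection.

    Per prompt, pick an RL signal when the model's own rollouts succeed,
    and an SFT signal on the demonstration when they do not. The exact
    switching criterion used by HPT may differ; this is an assumption.
    """
    if max(rewards) > threshold:
        # Exploration succeeded: apply a policy-gradient (RL) signal
        # to the model's own rollouts.
        logps = torch.stack([model.log_prob(prompt, y) for y in rollouts])
        advantages = torch.tensor(rewards) - sum(rewards) / len(rewards)
        return -(advantages * logps).mean()
    # Exploration failed: exploit the demonstration with an SFT signal
    # (negative log-likelihood of the demonstrated answer).
    return -model.log_prob(prompt, demo)
```

In the UPGE view, both branches can be read as the same gradient estimator with different choices of advantage and sampling distribution, which is what makes this kind of hybrid switch coherent rather than ad hoc.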
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 3535