Iterative Preference Optimization with Proximal Policy Regularization for Large Language Model Alignment
Abstract: Aligning large language models (LLMs) with human preferences is commonly achieved via supervised fine-tuning followed by preference optimization. While direct preference optimization (DPO) offers a simple and efficient alternative to RLHF, its offline, off-policy nature can induce a distribution shift between the policy used to sample preference pairs and the continually updated policy being optimized, reducing data efficiency and limiting alignment gains. We propose \emph{Iterative Proximal Policy Regularized Preference Optimization} (Iterative PRPO), which introduces a proximal regularization term that explicitly constrains the optimized policy to remain close to the sampling policy within each iteration, thereby mitigating distribution shift while preserving the efficiency of DPO-style updates. Starting from an RLHF objective with a KL constraint to the sampling policy, we derive an equivalent direct preference optimization formulation that requires only offline preference pairs collected under the sampling policy. Across summarization and dialogue alignment benchmarks, Iterative PRPO consistently improves win rates over offline DPO and iterative DPO baselines under both reward-model and GPT-4o evaluations, at comparable computational cost. Moreover, the same proximal regularization principle generalizes to advanced preference optimization objectives, including Identity Preference Optimization (IPO), Self-Play Preference Optimization (SPPO), and Efficient Exact Optimization (EXO), yielding Iterative PR-IPO, PR-SPPO, and PR-EXO variants that further strengthen alignment across model scales.
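For intuition, the derivation claimed in the abstract can be sketched along the lines of the standard DPO recipe, here with the iteration's sampling policy as the reference; the following is a minimal sketch under that assumption, with notation ($\pi_s$ for the sampling policy, $r$, $\beta$, $Z$) ours rather than the paper's, and the paper's exact per-iteration objective may include additional terms:

\[
\max_{\pi}\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi(\cdot\mid x)}\!\left[r(x,y)\right] \;-\; \beta\,\mathrm{KL}\!\left(\pi(\cdot\mid x)\,\|\,\pi_s(\cdot\mid x)\right),
\]

whose optimum satisfies $r(x,y) = \beta\log\frac{\pi(y\mid x)}{\pi_s(y\mid x)} + \beta\log Z(x)$. Substituting this reward into the Bradley--Terry preference model over pairs $(y_w, y_l)$ sampled under $\pi_s$ makes the intractable $\log Z(x)$ cancel and yields a DPO-style logistic loss that needs only offline pairs from the sampling policy:

\[
\mathcal{L}(\pi) \;=\; -\,\mathbb{E}\!\left[\log\sigma\!\left(\beta\log\frac{\pi(y_w\mid x)}{\pi_s(y_w\mid x)} \;-\; \beta\log\frac{\pi(y_l\mid x)}{\pi_s(y_l\mid x)}\right)\right].
\]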
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=Aw5tYh8QET
Changes Since Last Submission: This submission corrects formatting issues in the previous version.
The manuscript has been reformatted using the official TMLR LaTeX template and now fully complies with all submission and anonymization requirements.
No changes have been made to the technical content.
Assigned Action Editor: ~Amrit_Bedi1
Submission Number: 7047