MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

TMLR Paper5518 Authors

31 Jul 2025 (modified: 23 Dec 2025) · Rejected by TMLR · CC BY 4.0
Abstract: As the era of large language models (LLMs) acting on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. While existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. More importantly, MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin that yields consistent improvements on DPO variants, including the widely used SimPO, IPO, and CPO. Extensive empirical evaluations across different model sizes and model series on three standard benchmarks, including MT-Bench, AlpacaEval 2.0, and Arena-Hard, demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
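As context for the MLE-to-MaP framing in the abstract, the standard DPO objective takes the MLE form below, and a MaP-style objective adds a log-prior term to it. The second line is only a schematic illustration under the assumption that the prior is informed by reward estimates $\hat{r}$; it is not the exact MaPPO objective, which is defined in the paper itself.

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$

$$\mathcal{L}_{\mathrm{MaP}}(\theta) = \mathcal{L}_{\mathrm{MLE}}(\theta) - \log p(\theta;\hat{r}), \quad \text{where } p(\theta;\hat{r}) \text{ is a prior informed by reward estimates } \hat{r}(x,y).$$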
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We added new subsections in Appendix C.2 with supplementary experimental results, as well as Appendix D.2 (both shown in blue). The additions contain three parts: 1. (Appendix D.2) Further discussion of recent RL post-training methods, as suggested by reviewer ``2SVh``. 2. (Appendix C.2) An ablation study on reward models, answering the questions regarding reward signals raised by reviewers ``wQhG`` and ``U27W``. 3. (Appendix C.2) A comparison with other methods (SFT-B), answering the question regarding other approaches raised by reviewer ``wQhG``.
Assigned Action Editor: ~Han_Zhao1
Submission Number: 5518