Plackett–Luce Preference Optimization (PLPO): Listwise Ranking for Preference Optimization

ACL ARR 2025 May Submission 5691 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Plackett–Luce Preference Optimization (PLPO) is a reinforcement-learning loss framework for training and fine-tuning sequence models when ground truth is unavailable. It uses the Plackett–Luce choice model to estimate the likelihood of ranked output lists under reward signals from a critic, which may be a human, a code executor, or another evaluator. PLPO supports hard or soft constraints and applies during both primary training and downstream fine-tuning. At inference time, it adapts online by sampling from a Gaussian-perturbed policy until a reward threshold is reached. We derive a closed-form gradient estimator and show that in the pairwise case it matches standard policy-gradient updates. Unlike DPO and PPO, both of which require a fixed reference model as "ground truth", PLPO operates without any such reference point, making it suitable for real-world settings where true supervisor signals are unavailable and only reward bounds or constraints are known.
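
The ranking likelihood named in the abstract has a well-known closed form: under the Plackett–Luce model, the top-ranked item is selected by a softmax over all candidates, the next by a softmax over the remaining candidates, and so on. The sketch below illustrates only that generic log-likelihood, assuming the policy assigns one scalar score (e.g. a sequence log-probability) to each candidate output and that candidates are pre-sorted by the critic's reward; the function name, the use of PyTorch, and the scalar-score assumption are illustrative and not taken from the submission, which does not expose its implementation here.

```python
import torch

def plackett_luce_log_likelihood(scores: torch.Tensor) -> torch.Tensor:
    """Log-likelihood of a ranking under the Plackett-Luce choice model.

    `scores` holds the policy's scalar scores for K candidate outputs,
    ordered from most to least preferred by the critic. The PL probability
    of that ordering is  prod_i exp(s_i) / sum_{j >= i} exp(s_j),  i.e. a
    product of softmax terms over successive suffixes of the list.
    """
    k = scores.shape[-1]
    log_lik = scores.new_zeros(())
    for i in range(k):
        # log P(item i is chosen next | items i..K-1 are still available)
        log_lik = log_lik + scores[i] - torch.logsumexp(scores[i:], dim=-1)
    return log_lik

# Example: scores for 4 sampled outputs, already sorted by critic reward.
scores = torch.tensor([2.1, 1.3, 0.4, -0.5], requires_grad=True)
loss = -plackett_luce_log_likelihood(scores)  # maximize the ranking likelihood
loss.backward()                               # gradient w.r.t. the scores
```

With K = 2 candidates the sum collapses to a single log-sigmoid of the score difference, which is the sense in which a pairwise special case can recover a standard policy-gradient-style update.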
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Machine Learning for NLP, Reinforcement Learning, Algorithms and Resources for Learning with Limited Supervision, Generation
Contribution Types: NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 5691