Abstract: Existing alignment techniques, including Direct Preference Optimization (DPO), are primarily designed for pairwise preference data, in which rewards are inferred rather than explicitly provided. In this paper, we propose a comprehensive framework for aligning large language models (LLMs) by introducing a new optimization objective that accommodates reward datasets, which consist of lists of responses explicitly annotated with scalar preference scores. Our contributions include a novel algorithm, termed Listwise Preference Optimization (LPO), which derives an LLM policy directly from both reward and preference datasets. At the heart of LPO is a listwise preference optimization objective formulated with an exponential-logarithmic function and an adaptive loss coefficient, which effectively integrates listwise preference signals into the LLM. We evaluate our approach in both reward and preference settings using Mistral models of different sizes. Experimental results show that our method outperforms several preference-based baselines, particularly when reward datasets are used. Moreover, our method demonstrates a clear advantage over DPO on complex reasoning tasks such as mathematical problem solving and coding.
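The abstract does not spell out the LPO objective beyond this summary. As a rough, hypothetical illustration of what a listwise preference loss over score-annotated responses can look like, the sketch below combines DPO-style implicit rewards with a listwise cross-entropy against a target distribution built from the scalar scores; the function name, the softmax target, and the beta/tau parameters are all assumptions for illustration, not the paper's actual formulation.

    import torch
    import torch.nn.functional as F

    def listwise_preference_loss(policy_logps, ref_logps, scores, beta=0.1, tau=1.0):
        # Hypothetical listwise preference loss (illustration only, not the paper's LPO).
        # policy_logps, ref_logps: (K,) summed log-probs of K candidate responses
        #   under the trained policy and a frozen reference model.
        # scores: (K,) scalar preference scores attached to the same K responses.
        implicit_rewards = beta * (policy_logps - ref_logps)   # DPO-style implicit rewards
        target = F.softmax(scores / tau, dim=-1)               # listwise target from the scores
        log_model = F.log_softmax(implicit_rewards, dim=-1)    # model-induced listwise distribution
        return -(target * log_model).sum()                     # listwise cross-entropy

    # Toy usage with three candidate responses for one prompt:
    policy_logps = torch.tensor([-12.3, -15.1, -14.0])
    ref_logps = torch.tensor([-13.0, -14.8, -14.5])
    scores = torch.tensor([0.9, 0.2, 0.5])
    loss = listwise_preference_loss(policy_logps, ref_logps, scores)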
Keywords: large language models, preference alignment, listwise optimization objective
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10079