VPO: Leveraging the Number of Votes in Preference Optimization

Authors: ACL ARR 2024 December Submission 830 Authors

15 Dec 2024 (modified: 05 Feb 2025), ACL ARR 2024 December Submission, CC BY 4.0
Abstract: Direct Preference Optimization (DPO) trains a language model using human preference data. Preference datasets, typically labeled with votes or scores, indicate whether one response in a pair is clearly preferable or whether the pair is controversial, but current methods do not fully exploit this information. In this paper, we introduce a technique that leverages user voting data to better align with diverse subjective preferences. We employ the Bayesian Minimum Mean Square Error (Bayesian MMSE) estimator to model the probability that one generation is preferable to another. Using this estimated probability as a target, we develop the Vote-based Preference Optimization (VPO) framework, which incorporates the number of votes on both sides to distinguish between controversial and obvious generation pairs. We show that previous algorithms, such as DPO and Identity Preference Optimization (IPO), can be extended within the proposed framework, yielding VDPO and VIPO. Our experiments demonstrate that these proposed algorithms outperform various existing methods, including their base algorithms.
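The sketch below illustrates one plausible reading of the abstract, not the authors' released implementation: under a Beta(alpha, beta) prior over the preference probability, the Bayesian MMSE (posterior-mean) estimate from n_w votes for the preferred response and n_l votes for the rejected response is (n_w + alpha) / (n_w + n_l + alpha + beta), and this soft target replaces DPO's hard label of 1 in a binary cross-entropy over the implicit reward margin. All function and tensor names (vote_target_probability, vdpo_loss, logp_*) are hypothetical placeholders.

```python
# Hedged sketch of a vote-based DPO variant (assumed formulation, not the paper's code).
import torch
import torch.nn.functional as F


def vote_target_probability(n_w: torch.Tensor, n_l: torch.Tensor,
                            alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Posterior-mean (MMSE) estimate of P(winner preferred) under a Beta(alpha, beta) prior."""
    return (n_w + alpha) / (n_w + n_l + alpha + beta)


def vdpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref,
              n_w, n_l, beta_dpo: float = 0.1, alpha: float = 1.0, beta: float = 1.0):
    """DPO-style loss with a soft, vote-derived target instead of a hard preference label."""
    # Implicit reward margin, as in DPO: beta * (winner log-ratio minus loser log-ratio).
    margin = beta_dpo * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    target = vote_target_probability(n_w.float(), n_l.float(), alpha, beta)
    # Soft-label cross-entropy: controversial pairs (target near 0.5) pull the margin
    # toward zero, while near-unanimous pairs behave like standard DPO.
    return F.binary_cross_entropy_with_logits(margin, target)


if __name__ == "__main__":
    torch.manual_seed(0)
    B = 4
    logp_w_policy, logp_l_policy = torch.randn(B), torch.randn(B)
    logp_w_ref, logp_l_ref = torch.randn(B), torch.randn(B)
    n_w = torch.tensor([30.0, 6.0, 1.0, 50.0])   # votes for the preferred response
    n_l = torch.tensor([2.0, 5.0, 1.0, 0.0])     # votes for the rejected response
    print(vdpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, n_w, n_l))
```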
Paper Type: Long
Research Area: Generation
Research Area Keywords: Inference methods, Analysis, Few-shot generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 830