UPO: Unpaired Preference Optimization for Large Language Models

ACL ARR 2024 April Submission 898 Authors

16 Apr 2024 (modified: 14 May 2024), ACL ARR 2024 April Submission, CC BY 4.0
Abstract: While Large Language Models (LLMs) have made remarkable progress on various NLP tasks, there is no guarantee that LLMs will provide helpful, honest, and harmless answers without proper alignment. Reinforcement Learning from Human Feedback (RLHF) has been shown to be an effective alignment method, though it is complex and costly. Advancing further, Direct Preference Optimization (DPO) simplifies the alignment process by bypassing the reward modeling and reinforcement learning steps, achieving performance comparable to RLHF with Proximal Policy Optimization (PPO). However, both methods require paired preference data, which is costly to obtain in practice. We propose a new alignment method, dubbed Unpaired Preference Optimization (UPO), which does not require paired examples to align with human preferences. Building on DPO, we derive from its loss function a new objective that processes positive and negative examples separately. Our findings indicate that UPO performs comparably to DPO trained on a complete paired dataset. Moreover, when a paired preference dataset is available, UPO matches the performance of DPO while being more memory- and time-efficient. When the data are unpaired, UPO retains most of the performance achieved with fully paired data, with only minimal loss in effectiveness, and it significantly outperforms Unified Language Model Alignment (ULMA, an alignment method for point-wise preference data) and fine-tuning on only the positive examples (Preferred-SFT).
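For a concrete picture of the setting, the sketch below contrasts the standard paired DPO loss with one plausible unpaired variant in which each response carries only a positive or negative label rather than belonging to a (chosen, rejected) pair. The separate-case objective, the beta value, and the toy inputs are illustrative assumptions for this sketch only, not the UPO loss derived in the paper.

```python
# Illustrative sketch: paired DPO loss vs. a hypothetical unpaired objective.
# The unpaired form below is an assumption for illustration; the paper's UPO
# derivation may differ.
import torch
import torch.nn.functional as F

beta = 0.1  # assumed KL-penalty strength

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    """Paired DPO loss: requires a (chosen, rejected) pair for the same prompt."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def unpaired_loss(logp, ref_logp, label):
    """Hypothetical unpaired objective: each response has only a +1/-1 label and
    is pushed toward (label=+1) or away from (label=-1) the reference policy."""
    reward = beta * (logp - ref_logp)  # implicit reward, as in DPO
    return -F.logsigmoid(label * reward).mean()

# Toy usage with made-up per-response log-probabilities (summed over tokens).
logp = torch.tensor([-12.3, -15.1, -9.8])
ref_logp = torch.tensor([-13.0, -14.2, -10.5])
label = torch.tensor([1.0, -1.0, 1.0])  # unpaired preference labels
print(unpaired_loss(logp, ref_logp, label))
```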
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: generative models; reinforcement learning
Contribution Types: Theory
Languages Studied: English
Submission Number: 898