Abstract: Aligning Large Language Models (LLMs) with human feedback is important and challenging. \citet{rafailov2023direct} propose Direct Preference Optimization (DPO), a simple yet effective alignment method that is free of reinforcement learning. However, DPO requires paired preference data, which is harder and more expensive to obtain than binary preference data. We propose a retrieval-based method named Retrieval-DPO to align LLMs when only binary preference data is available. The core idea, inspired by how humans learn, is that alignment can be learned from non-paired preference data over similar questions rather than strictly paired preference data. For instance, to teach an LLM to consider multiple perspectives, comprehensive gold answers to similar questions may have a positive effect comparable to the gold answer to the question itself. Following this idea, for each binary preference example in the training set we retrieve an example with the opposite label from a retrieval database. This yields a preference pair whose responses may belong to different questions, to which we apply the DOVE \cite{bansal2024comparing} optimization objective for alignment. We compare Retrieval-DPO with other preference optimization algorithms that do not require paired preference data, such as Kahneman-Tversky Optimization (KTO) and Unified Language Model Alignment (ULMA). Our method significantly outperforms KTO and ULMA on the helpful-base subset of the HH dataset (by over 13\%) and slightly outperforms KTO on the harmless-base subset of HH and on a controlled sentiment generation task. Moreover, our method is insensitive to the ratio of positive to negative examples, without requiring additional hyperparameter tuning.
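As a rough illustration of the retrieval step described in the abstract, the sketch below pairs each binary-labelled example with its most prompt-similar opposite-label example from a retrieval database; the resulting (chosen, rejected) pairs could then feed a DPO/DOVE-style objective. TF-IDF similarity here merely stands in for whatever retriever the paper actually uses, and all names (`BinaryExample`, `build_pairs`) are hypothetical, not taken from the submission.

```python
# Minimal sketch, assuming prompt-level similarity retrieval over a database
# of binary-labelled examples. Not the authors' implementation.
from dataclasses import dataclass
from typing import List, Tuple

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


@dataclass
class BinaryExample:
    prompt: str
    response: str
    label: int  # 1 = positive (preferred), 0 = negative (dispreferred)


def build_pairs(train: List[BinaryExample],
                database: List[BinaryExample]
                ) -> List[Tuple[BinaryExample, BinaryExample]]:
    """For each training example, retrieve the most prompt-similar
    opposite-label example and return (chosen, rejected) pairs."""
    vectorizer = TfidfVectorizer().fit(
        [ex.prompt for ex in train] + [ex.prompt for ex in database])
    db_vecs = vectorizer.transform([ex.prompt for ex in database])

    pairs = []
    for ex in train:
        # Only consider database entries whose label differs from ex.label.
        candidates = [i for i, d in enumerate(database) if d.label != ex.label]
        if not candidates:
            continue
        sims = cosine_similarity(vectorizer.transform([ex.prompt]),
                                 db_vecs[candidates])[0]
        match = database[candidates[int(sims.argmax())]]
        # Order the pair as (chosen, rejected) for a DPO/DOVE-style loss.
        pairs.append((ex, match) if ex.label == 1 else (match, ex))
    return pairs
```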
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: optimization methods; generative models
Contribution Types: Theory
Languages Studied: English
Submission Number: 5851