Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language. However, because they are trained on massive text corpora, LLMs can also inherit harmful biases and produce outputs that are not aligned with human values. This paper explores two main approaches to LLM alignment: Reinforcement Learning from Human Feedback (RLHF) and contrastive learning-based methods such as Direct Preference Optimization (DPO). We discuss the advantages and disadvantages of each approach, highlighting the complexity and instability of RLHF compared to the simpler but potentially less robust DPO. With that in mind, we propose MPO (Mixed Preference Optimization), a novel method that combines the strengths of both approaches. Specifically, we introduce a data selection method that uses the score difference of a reward model to divide the data into two parts: an easy set and a hard set. We then propose a two-stage training procedure: first train DPO on the easy set, and then train PPO on the hard set with the DPO model as the reference model. Experiments on two public alignment datasets, HH-RLHF and TLDR, demonstrate the effectiveness of MPO under both GPT-4 evaluation and human evaluation.
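Below is a minimal sketch of the data-selection step and two-stage pipeline as described in the abstract. All names (PreferencePair, split_by_reward_margin, score, margin_threshold) are illustrative assumptions, not the authors' implementation; the threshold on the reward-score margin is hypothetical.

```python
# Sketch of MPO-style data selection: split preference pairs into an "easy"
# and a "hard" set by the reward model's score margin. Illustrative only.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # preferred response
    rejected: str    # dispreferred response


def split_by_reward_margin(
    pairs: List[PreferencePair],
    score: Callable[[str, str], float],   # reward model: (prompt, response) -> scalar
    margin_threshold: float = 1.0,        # hypothetical cutoff on the score difference
) -> Tuple[List[PreferencePair], List[PreferencePair]]:
    """Divide preference data into an 'easy' and a 'hard' set.

    A pair is 'easy' when the reward model clearly prefers the chosen
    response (large score margin); otherwise it is 'hard'.
    """
    easy, hard = [], []
    for pair in pairs:
        margin = score(pair.prompt, pair.chosen) - score(pair.prompt, pair.rejected)
        (easy if margin >= margin_threshold else hard).append(pair)
    return easy, hard


# Two-stage training outline, per the abstract:
#   1) train a DPO policy on the easy set;
#   2) run PPO on the hard set, using the DPO policy as the reference model.
```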
Paper Type: long
Research Area: Generation
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English