Triple Preferences Optimization (TPO): A Simple One-Step Combination of SFT and Preference Optimization with Better Performance
Abstract: Large Language Models (LLMs) excel across a wide range of tasks, but aligning them with human demonstrations remains challenging. Prior approaches relied on Reinforcement Learning from Human Feedback (RLHF) with online RL methods such as Proximal Policy Optimization (PPO). More recently, RL-free methods such as Direct Preference Optimization (DPO) have emerged as appealing alternatives, offering improved stability and scalability while retaining competitive performance. However, these methods still require a separate supervised fine-tuning (SFT) step, followed by sampling responses from the post-SFT model and ranking them. In this paper, we introduce Triple Preferences Optimization (TPO), a new preference learning method that aligns an LLM with three preferences without a separate supervised fine-tuning step. TPO maximizes the log probability of preferred over less-preferred responses while simultaneously learning the gold-standard response, all in a single optimization step. For a comprehensive evaluation, we use the HuggingFace Open LLM benchmarks and MT-Bench, which cover dialogue and a variety of NLP capabilities. The results show that TPO surpasses other alignment methods, such as DPO and SFT, in average accuracy by 1.8% and 2.5%, respectively. Notably, TPO without the SFT component exceeds the average accuracy of DPO and SFT by 4% and 4.7%, respectively. Overall, TPO removes the need for post-SFT sampling, combines SFT and preference optimization into a single step, and delivers better performance.
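To make the stated objective concrete, the following is a minimal sketch of such a combined loss, written here as an illustrative assumption rather than the paper's exact formulation: the weighting and temperature symbols \alpha and \beta, and the reference-free form of the preference term, are introduced only for illustration. Given a prompt x with a gold-standard response y_{\text{gold}}, a preferred response y_w, and a less-preferred response y_l, one could combine an SFT-style term with a DPO-style preference term in a single step as

\mathcal{L}_{\text{TPO}}(\theta) = -\log \pi_\theta(y_{\text{gold}} \mid x) \;-\; \alpha \,\log \sigma\!\big(\beta \,[\log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x)]\big),

where the first term teaches the model the gold-standard response, the second pushes the log probability of the preferred response above that of the less-preferred one, and \sigma denotes the logistic function.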
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English