Triple Preferences Optimization (TPO): A Simple One-Step Combination of SFT and Preference Optimization with Better Performance
Abstract: Large Language Models (LLMs) excel across a wide range of tasks, but aligning them with human demonstrations remains challenging. Prior approaches relied on Reinforcement Learning from Human Feedback (RLHF) with online RL methods such as Proximal Policy Optimization (PPO). More recently, RL-free methods such as Direct Preference Optimization (DPO) have emerged as appealing alternatives, offering improved stability and scalability while retaining competitive performance. However, these methods still require a separate supervised fine-tuning (SFT) step, followed by sampling responses from the post-SFT model and ranking them. In this paper, we introduce Triple Preferences Optimization (TPO), a new preference learning method that aligns an LLM with three preferences without a separate supervised fine-tuning step. TPO maximizes the log probability of preferred over less-preferred responses while simultaneously learning the gold-standard response, all in a single optimization step. For a comprehensive evaluation, we use the HuggingFace Open LLM benchmarks and MT-Bench, which cover dialogue and a variety of NLP capabilities. The results show that TPO surpasses other alignment methods, such as DPO and SFT, in average accuracy by 1.8% and 2.5%, respectively. Notably, TPO without the SFT component exceeds the average accuracy of DPO and SFT by 4% and 4.7%, respectively. Overall, TPO removes the need for post-SFT sampling, combines SFT and preference optimization into a single step, and delivers better performance.
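To make the stated objective concrete, the following is a minimal sketch of such a combined loss, written here as an illustrative assumption rather than the paper's exact formulation: the weighting and temperature symbols \alpha and \beta, and the reference-free form of the preference term, are introduced only for illustration. Given a prompt x with a gold-standard response y_{\text{gold}}, a preferred response y_w, and a less-preferred response y_l, one could combine an SFT-style term with a DPO-style preference term in a single step as

\mathcal{L}_{\text{TPO}}(\theta) = -\log \pi_\theta(y_{\text{gold}} \mid x) \;-\; \alpha \,\log \sigma\!\big(\beta \,[\log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x)]\big),

where the first term teaches the model the gold-standard response, the second pushes the log probability of the preferred response above that of the less-preferred one, and \sigma denotes the logistic function.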
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English