Abstract: Pairwise preference optimization, such as Direct Preference Optimization (DPO), was originally designed to align large language models (LLMs) with human values. It has recently been used to improve the supervised fine-tuning (SFT) performance of LLMs. Using pairs of single samples, DPO estimates the probability of preferring one response over another. However, in reasoning tasks, which involve more complicated preferences than those in human value alignment, this sampling method is likely to deviate from the ground-truth preference distribution. Correcting this typically requires extra effort, such as external annotations or modifications to the loss function. In this paper, we hypothesize that preferences can be better estimated through a multi-sampling process. Accordingly, we propose an Expectation Preference Optimization (EPO) algorithm that takes pairs of sample groups, rather than pairs of single samples as in DPO, for preference learning. Compared to pairwise DPO, the proposed EPO tends to produce more accurate preference estimates. Applying different preference optimization methods in a self-training paradigm, we conduct extensive experiments on various reasoning benchmarks. The results show that our EPO approach outperforms a range of baselines in terms of zero-shot accuracy on all benchmarks.
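To illustrate the contrast the abstract draws between pairwise DPO and group-based EPO, here is a minimal sketch. The standard DPO loss on single (chosen, rejected) pairs is well established; the group-level variant shown alongside it is only an assumption for illustration (averaging implicit rewards over groups of K sampled responses before the logistic preference loss), since the abstract does not specify EPO's exact objective. All function names (`dpo_loss`, `epo_loss_sketch`) and the group size K are hypothetical.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard pairwise DPO loss on single (chosen, rejected) samples.
    Inputs: per-example sequence log-probabilities, shape (batch,)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def epo_loss_sketch(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Hypothetical group-level variant (NOT the paper's exact objective):
    inputs have shape (batch, K), i.e. K sampled responses per side; the
    implicit rewards are averaged over each group before applying the
    logistic preference loss."""
    chosen_group = (beta * (policy_chosen_logps - ref_chosen_logps)).mean(dim=-1)
    rejected_group = (beta * (policy_rejected_logps - ref_rejected_logps)).mean(dim=-1)
    return -F.logsigmoid(chosen_group - rejected_group).mean()
```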
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: text-to-text generation, optimization methods, commonsense reasoning, generative models
Contribution Types: NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 1323