TL;DR: We propose a unified, RL-free probabilistic framework for language model preference alignment that achieves state-of-the-art performance on math reasoning tasks.
Abstract: Offline preference alignment methods for language models, such as Direct Preference Optimization (DPO), are favored for their effectiveness and simplicity, as they eliminate the need for costly reinforcement learning. Various offline algorithms have been developed for different data settings, yet a unified understanding of them is lacking. In this study, we introduce Prior-Informed Preference Alignment (PIPA), a unified, RL-free probabilistic framework that formulates language model preference alignment as a Maximum Likelihood Estimation (MLE) problem with prior constraints. This method effectively accommodates both paired and unpaired data, as well as answer-level and step-level annotations. We show that DPO and KTO are special cases of our framework corresponding to different prior constraints. By integrating different types of prior information, we develop two variants of PIPA: PIPA-M and PIPA-N. Both algorithms yield a $3\sim10\%$ performance improvement on the GSM8K and MATH benchmarks across all configurations, without additional training or computational cost compared to existing algorithms.
Lay Summary: 1. Fine-tuning a pre-trained LLM on high-quality data, called alignment, is crucial for enhancing specific capabilities such as reasoning. However, current offline alignment approaches inspired by offline RL (e.g., DPO and KTO) face several limitations: (1) they rely on contrastive training with positive-negative pairs, which may be unnecessary for reasoning tasks; (2) they struggle to incorporate step-wise supervision; and (3) they lack a unified theoretical framework.
2. In this work, we recast the alignment problem within a probabilistic framework, formulating it as an MLE problem with constraints on prior distributions (see the illustrative sketch below). Within this framework, we show that existing methods such as DPO and KTO emerge as special cases corresponding to different choices of prior. We introduce two new algorithms, PIPA-M and PIPA-N, which impose priors on the marginal and negative-conditioned distributions, respectively. Our formulation naturally extends to settings with step-level supervision while maintaining a consistent framework.
3. We evaluate our approach on mathematical reasoning benchmarks and find that both PIPA-M and PIPA-N outperform existing methods.
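As a minimal illustrative sketch of the prior-informed MLE formulation described above (our own rendering under stated assumptions, not the paper's exact derivation or loss): let $y$ denote a response to prompt $x$, let $c\in\{0,1\}$ be an acceptance label, let $\mathcal{D}^{+}$ be the set of accepted examples, and let $\pi_{\mathrm{ref}}$ be a reference policy; these symbol names are ours, not the paper's.

% Illustrative sketch only; the exact form of the objective and constraints is an assumption.
\begin{align*}
  \max_{\theta}\quad
    & \mathbb{E}_{(x,y)\sim\mathcal{D}^{+}}\bigl[\log p_{\theta}(y \mid x, c{=}1)\bigr]
    && \text{(MLE on accepted answers)} \\
  \text{where}\quad
    & p_{\theta}(y \mid x, c{=}1)
      = \frac{p_{\theta}(c{=}1 \mid x, y)\, p_{\theta}(y \mid x)}{p_{\theta}(c{=}1 \mid x)}
    && \text{(Bayes' rule)} \\
  \text{subject to, e.g.,}\quad
    & p_{\theta}(y \mid x) = \pi_{\mathrm{ref}}(y \mid x)
    && \text{(prior on the marginal, PIPA-M-style)} \\
  \text{or}\quad
    & p_{\theta}(y \mid x, c{=}0) = \pi_{\mathrm{ref}}(y \mid x)
    && \text{(prior on the negative-conditioned distribution, PIPA-N-style)}
\end{align*}

Read this way, the two PIPA variants differ only in which factor is pinned to the prior, and (per the abstract) DPO and KTO correspond to other prior choices within the same framework.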
Primary Area: Deep Learning->Large Language Models
Keywords: Preference alignment, Prior, Probabilistic estimation, Reasoning
Submission Number: 8018