On the Role of Preference Variance in Preference Optimization

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Direct Preference Optimization, LLM Alignment
Abstract: Direct Preference Optimization (DPO) has emerged as an important approach for learning from human preferences when aligning large language models (LLMs). However, collecting human preference data is costly and inefficient, motivating methods that reduce the required annotations. In this work, we investigate the impact of \emph{preference variance} (PVar), which measures the variance in model preferences when comparing pairs of responses, on the effectiveness of DPO training. We provide theoretical insight by establishing an upper bound on the DPO gradient norm for any given prompt and proving that PVar also controls the directional descent component and signal-to-noise ratio (SNR) of the updates. This implies that prompts with low PVar can only produce small and noisy gradient updates, making them less valuable for learning. We validate this finding by fine-tuning LLMs with preferences generated by a reward model and evaluating on general instruction-following and code-generation benchmarks. Experimental results demonstrate that training on prompts with higher PVar outperforms training on randomly selected prompts and on prompts chosen by other active selection baselines. We also show that our PVar-based selection method is robust across different algorithms (e.g., SimPO, KTO, ORPO) and when using smaller reward models (1B, 3B) for selection. Notably, in a separate experiment using the original human annotations from the UltraFeedback dataset, we found that training on only the top 10\% of prompts with the highest PVar yields better evaluation performance than training on the full dataset, highlighting the importance of preference variance in identifying informative examples for efficient LLM alignment.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23489
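To make the selection idea in the abstract concrete, below is a minimal, hypothetical sketch of PVar-based prompt selection. It is not the paper's implementation: it assumes the preference probability between two sampled responses comes from a Bradley-Terry model over reward-model scores, and it approximates PVar by the Bernoulli variance p(1 - p) of that preference, which is largest when the comparison is most ambiguous. The function names and the selection fraction are illustrative only.

```python
# Hypothetical sketch of PVar-based prompt selection (assumptions noted above;
# not the authors' implementation).
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B,
    given scalar reward-model scores for each response (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def pvar_proxy(reward_a: float, reward_b: float) -> float:
    """Bernoulli variance p * (1 - p) of the preference: a proxy for PVar that
    is highest when the reward model is most uncertain about the pair."""
    p = preference_probability(reward_a, reward_b)
    return p * (1.0 - p)


def select_high_pvar_prompts(scored_pairs, fraction=0.10):
    """Rank prompts by the PVar proxy and keep the top `fraction`.

    scored_pairs: list of (prompt, reward_a, reward_b) tuples, where the two
    rewards score two candidate responses to the same prompt.
    """
    ranked = sorted(
        scored_pairs,
        key=lambda item: pvar_proxy(item[1], item[2]),
        reverse=True,
    )
    k = max(1, int(len(ranked) * fraction))
    return [prompt for prompt, _, _ in ranked[:k]]


# Usage example with toy scores: ambiguous pairs (similar rewards) rank first.
pairs = [
    ("prompt A", 1.2, 1.1),   # near-tie -> high PVar proxy
    ("prompt B", 3.0, -2.0),  # clear winner -> low PVar proxy
    ("prompt C", 0.4, 0.5),   # near-tie -> high PVar proxy
]
print(select_high_pvar_prompts(pairs, fraction=0.5))
```

The 10% fraction mirrors the top-10% UltraFeedback experiment mentioned in the abstract, but the proxy itself is only one plausible instantiation of "variance in model preferences".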