The Delta Learning Hypothesis: Preference Tuning on Weak Data Can Yield Strong Gains

Published: 06 Mar 2025, Last Modified: 30 Apr 2025 · ICLR 2025 Workshop Data Problems Poster · CC BY 4.0
Keywords: Preference tuning, LLM post-training, synthetic data, weak-to-strong generalization
Abstract: Preference tuning has greatly improved large language models (LLMs), yet obtaining preference data remains challenging, often requiring expensive human annotation or strong LLM judges to assess response quality. We explore whether synthetically generated preference pairs, in which even the preferred response is not optimized for quality, can train LLMs that surpass the quality of those preferred responses. We formulate the **delta learning hypothesis**, which posits that models can improve beyond the quality of their training data by learning solely from the relative quality difference, rather than the absolute quality, of paired responses. To validate this hypothesis, we conduct controlled experiments across diverse domains: a toy stylistic task (bold section generation), a math reasoning task (GSM8K), and real-world instruction following. We show that preference tuning via Direct Preference Optimization (DPO) can enable models to extrapolate improvements from suboptimal data, whereas directly imitating weak data through supervised fine-tuning (SFT) can degrade performance. Armed with these insights, we build a simple weak-to-strong setup that achieves consistent gains over Llama-3.1-8B-Instruct, as well as a SOTA-competitive preference dataset, all without any strong judge.
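
As a rough illustration of the setup the abstract describes, the sketch below pairs a response from a weak generator ("chosen") with a response from an even weaker one ("rejected") and applies the standard DPO objective, so only the quality delta between the pair drives learning. This is a minimal sketch under assumed conventions, not the paper's released code: the names `build_pair`, `weak_generate`, `weaker_generate`, and the choice of `beta=0.1` are illustrative placeholders.

```python
# Minimal sketch (assumptions noted above): construct a synthetic preference
# pair from two weak generators and compute the standard DPO loss on it.
import torch
import torch.nn.functional as F


def build_pair(prompt, weak_generate, weaker_generate):
    """Hypothetical pair construction: 'chosen' comes from a weak model and
    'rejected' from an even weaker one; neither response needs to be strong."""
    return {
        "prompt": prompt,
        "chosen": weak_generate(prompt),      # weak, but relatively better
        "rejected": weaker_generate(prompt),  # weaker, relatively worse
    }


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: only the log-probability margin between chosen
    and rejected (relative to a frozen reference model) enters the loss, so
    the absolute quality of the chosen response never appears."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()


# Toy usage with placeholder sequence log-probabilities (summed over tokens).
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```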
Submission Number: 75