The Delta Learning Hypothesis: Preference Tuning on Weak Data can Yield Strong Gains

Published: 08 Jul 2025 · Last Modified: 26 Aug 2025 · COLM 2025 · CC BY 4.0
Keywords: Preference tuning, LLM post-training, synthetic data, weak-to-strong generalization
TL;DR: We show that preference pairs built from weak data can improve a stronger language model beyond the strength of each individual data point, an insight that enables new state-of-the-art post-training recipes that work without strong supervision.
Abstract: Improvements in language models are often driven by increasing the quality of the data we train them on, which can be limiting when strong supervision is not readily available. In this work, we show that paired preference data consisting of individually weak data points can enable gains beyond the strength of each individual sample. We formulate the **delta learning hypothesis** to explain this phenomenon, positing that the relative quality _delta_ between points suffices to drive learning via preference tuning—even when supervised finetuning on the weak data hurts. We validate our hypothesis in controlled experiments and at scale, where we post-train 8B models on preference data generated by pairing a small 3B model's responses with outputs from an even smaller 1.5B model to ensure a meaningful delta. Strikingly, on a standard 11-benchmark evaluation suite (MATH, MMLU, etc.), our simple recipe matches the performance of Tülu 3, a state-of-the-art open model that was tuned from the same base as our model while relying on vastly stronger supervisors (e.g., GPT-4o). Delta learning thus enables simpler and cheaper open recipes for state-of-the-art post-training, highlighting that models can learn a surprising amount from data that might typically be considered weak.
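The data recipe described above (pairing a ~3B model's responses as "chosen" with a ~1.5B model's responses as "rejected" to create a quality delta) can be sketched as follows. This is a minimal illustration with placeholder generation functions standing in for the actual models; the function names and dictionary keys are assumptions for illustration, not the paper's implementation.

```python
# Sketch of delta-learning preference-pair construction: both responses are
# weak, but the pairing guarantees a relative quality delta within each pair.

def stronger_weak_model(prompt: str) -> str:
    # Placeholder for the ~3B model's response (the "chosen" side).
    return f"[3B response to: {prompt}]"

def weaker_model(prompt: str) -> str:
    # Placeholder for the even smaller ~1.5B model's response (the "rejected" side).
    return f"[1.5B response to: {prompt}]"

def build_delta_pairs(prompts):
    """Pair each prompt's 3B response (chosen) with its 1.5B response
    (rejected), yielding preference data with a built-in quality delta."""
    return [
        {
            "prompt": p,
            "chosen": stronger_weak_model(p),   # individually weak
            "rejected": weaker_model(p),        # even weaker
        }
        for p in prompts
    ]

pairs = build_delta_pairs(["What is 2 + 2?"])
print(len(pairs))  # one preference pair per prompt
```

A preference-tuning method such as DPO would then be applied to the stronger 8B model using these pairs; only the delta between the two sides, not their absolute quality, drives the learning signal.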
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 940