SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
TL;DR: Combining on-policy and off-policy preference data enhances language model alignment, outperforming single-source and more complex methods.
Abstract: Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data are task-dependent, highlighting the need for a systematic exploration of their interplay.
In this work, we show that on-policy and off-policy data offer complementary strengths: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on subjective tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SimpleMix, an approach that combines the complementary strengths of on-policy and off-policy preference learning by simply mixing the two data sources. Our empirical results across diverse tasks and benchmarks demonstrate that SimpleMix substantially improves language model alignment. Specifically, SimpleMix improves upon on-policy DPO and off-policy DPO by an average of 6.03 points on Alpaca Eval 2.0. Moreover, it surpasses prior, considerably more complex approaches to combining on- and off-policy data, such as HyPO and DPO-Mix-P, by an average of 3.05 points. These findings validate the effectiveness and efficiency of SimpleMix for enhancing preference-based alignment.
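To make the mixing step concrete, here is a minimal sketch (not the authors' released code) of combining the two preference-data sources before running DPO. It assumes each example is a dict with `prompt`, `chosen`, and `rejected` fields, and the `on_policy_fraction` mixing ratio is a hypothetical knob; the paper's exact proportions may differ.

```python
import random

def simple_mix(on_policy, off_policy, on_policy_fraction=0.5, seed=0):
    """Mix on-policy and off-policy preference pairs into one training set.

    Each example is assumed to be a dict with 'prompt', 'chosen', 'rejected'.
    `on_policy_fraction` is a hypothetical parameter controlling how much of
    the mixed dataset is drawn from on-policy (model-generated) pairs.
    """
    rng = random.Random(seed)
    # Largest total size achievable without oversampling either source.
    n_total = int(min(
        len(on_policy) / max(on_policy_fraction, 1e-9),
        len(off_policy) / max(1 - on_policy_fraction, 1e-9),
    ))
    n_on = int(n_total * on_policy_fraction)
    n_off = n_total - n_on
    mixed = rng.sample(on_policy, n_on) + rng.sample(off_policy, n_off)
    rng.shuffle(mixed)  # interleave the two sources before DPO training
    return mixed

# Example usage: a 50/50 mix of model-generated (on-policy) and
# publicly sourced (off-policy) preference pairs.
on_policy_pairs = [
    {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
]
off_policy_pairs = [
    {"prompt": "Write a haiku about rain.",
     "chosen": "Soft rain on the roof...",
     "rejected": "Rain is wet."},
]
train_set = simple_mix(on_policy_pairs, off_policy_pairs, on_policy_fraction=0.5)
```

The mixed set is then used as an ordinary pairwise preference dataset for DPO; the method itself adds no new loss terms or training machinery.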
Lay Summary: Language models, like ChatGPT, get better at giving helpful and accurate responses when trained using human preferences—meaning they learn what responses people prefer. However, there's a debate about which data works best for training these models: on-policy data (generated by the same model you're improving) or off-policy data (from different, often publicly available, models).
In this study, we discovered that both types of data have unique strengths. On-policy data is especially good for tasks with clear, right-or-wrong answers, like solving math problems or writing computer code. In contrast, off-policy data excels in tasks that are more subjective or creative, such as storytelling or giving personal recommendations.
Inspired by these findings, we developed SIMPLEMIX, a straightforward method that combines both types of data to leverage their complementary advantages. SIMPLEMIX significantly improved the overall quality of responses, outperforming existing methods. Specifically, SIMPLEMIX boosted performance by about 6% compared to using just one data type and about 3% compared to previous, more complicated methods.
Our approach highlights that the best outcomes don't always come from complicated solutions—sometimes, mixing existing resources intelligently is enough to yield significant improvements.
Primary Area: Deep Learning->Large Language Models
Keywords: Language Models, Alignment, Preference Optimization, RLHF
Submission Number: 9711