Are Easier or Harder Examples Better? Rethinking Data Selection for Reward Models and Preference Optimization

Published: 02 Mar 2026 · Last Modified: 02 Mar 2026 · ICLR 2026 Workshop DATA-FM · CC BY 4.0
Keywords: LLM alignment, data selection, reward model, policy optimization, reward gap, sample efficiency, dataset curation, preference data
TL;DR: This paper studies difficulty-based data selection for RM, DPO, and GRPO, showing that easier examples improve performance under oracle or noisy rewards, but not when rewards are estimated by a weak proxy RM, helping reconcile prior findings.
Abstract: Despite being crucial for effective LLM alignment, data selection remains understudied. Prior work examining data selection for reward model (RM) training and policy optimization methods (e.g. Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO)) has identified *example difficulty*, measured by the reward gap between chosen and rejected responses, as a key factor. However, findings are contradictory: some studies favor easier examples with larger gaps, while others prefer harder ones. To isolate the role of difficulty from confounding factors, we *assume access to an oracle RM* and systematically study data selection across RM, DPO, and GRPO training. We find that *training on easier pairs consistently leads to better performance* than harder ones, particularly for smaller base models. This advantage persists even when reward estimates are noisy. Notably, using only the top 20\% easiest examples often matches or exceeds full-dataset performance while reducing post-training costs by 5$\times$.
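The selection criterion described in the abstract can be illustrated with a short sketch: score each preference pair by the reward gap between its chosen and rejected responses, treat larger gaps as "easier," and keep the top 20%. This is a minimal illustration, not the authors' code; `oracle_reward` and `select_easiest_pairs` are hypothetical names standing in for any scalar reward model and selection routine.

```python
# Minimal sketch (not the authors' implementation) of difficulty-based
# selection: keep the "easiest" preference pairs, i.e. those with the
# largest oracle reward gap between chosen and rejected responses.

from typing import Callable, Dict, List

def select_easiest_pairs(
    pairs: List[Dict[str, str]],                  # each: {"prompt", "chosen", "rejected"}
    oracle_reward: Callable[[str, str], float],   # hypothetical reward(prompt, response) -> scalar
    keep_fraction: float = 0.2,                   # top 20% easiest, as in the abstract
) -> List[Dict[str, str]]:
    # Reward gap = r(chosen) - r(rejected); a larger gap means an "easier" pair.
    scored = [
        (
            oracle_reward(p["prompt"], p["chosen"])
            - oracle_reward(p["prompt"], p["rejected"]),
            p,
        )
        for p in pairs
    ]
    scored.sort(key=lambda t: t[0], reverse=True)  # largest gaps (easiest) first
    k = max(1, int(len(scored) * keep_fraction))
    return [p for _, p in scored[:k]]
```

The retained subset would then be passed to RM, DPO, or GRPO training in place of the full dataset.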
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 17