ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

ICLR 2026 Conference Submission 19473 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Active Data Selection, Direct Preference Optimization, Human Feedback, LLM Alignment
TL;DR: We propose an active learning algorithm that uses a theoretically grounded selection criterion and leverages the LLM itself to parameterize the reward model, enabling efficient collection of human preference feedback when the latent reward function is non-linear.
Abstract: The recent success in using human preferences to align large language models (LLMs) has significantly improved their performance on various downstream tasks, such as question answering, mathematical reasoning, and code generation. However, effective LLM alignment depends on high-quality human preference datasets. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or rely on restrictive reward function assumptions, such as a linear latent reward function. To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model used for active data selection. As a result, ActiveDPO explicitly accounts for the influence of the LLM on data selection, unlike methods that select data without considering the LLM being aligned, thereby leading to more effective and efficient data collection. Our extensive experiments demonstrate that ActiveDPO outperforms existing methods across various models and real-world preference datasets.
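To make the abstract's description concrete, the sketch below (not taken from the paper) shows an active preference-data selection loop in which the LLM being aligned parameterizes the reward via the standard DPO implicit reward, r_theta(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)). The margin-based uncertainty score, the function names, and the candidate-batch format are all illustrative assumptions; they are not the paper's theoretically grounded selection criterion.

```python
# Hedged sketch: active selection of preference pairs for annotation, where the
# LLM being aligned supplies a DPO-style implicit reward. The selection rule
# (small implicit-reward margin = uncertain) is an illustrative placeholder.
import torch
import torch.nn.functional as F


def sequence_logprob(model, input_ids, response_mask):
    """Sum of token log-probabilities of the response tokens under `model`."""
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1, :]
    logprobs = F.log_softmax(logits, dim=-1)
    targets = input_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_lp * response_mask[:, 1:]).sum(dim=-1)


def implicit_reward(policy, ref, input_ids, response_mask, beta=0.1):
    """DPO implicit reward: beta * (log pi_policy(y|x) - log pi_ref(y|x))."""
    return beta * (
        sequence_logprob(policy, input_ids, response_mask)
        - sequence_logprob(ref, input_ids, response_mask)
    )


def select_for_annotation(policy, ref, candidates, k=8, beta=0.1):
    """Pick the k candidate (prompt, response_a, response_b) pairs whose
    implicit-reward margin is smallest, i.e. where the current model is least
    sure which response a human would prefer (an assumed uncertainty proxy).

    Each candidate is a tuple (ids_a, mask_a, ids_b, mask_b) of token ids and
    response masks for the two responses appended to the same prompt.
    """
    scores = []
    for ids_a, mask_a, ids_b, mask_b in candidates:
        r_a = implicit_reward(policy, ref, ids_a, mask_a, beta)
        r_b = implicit_reward(policy, ref, ids_b, mask_b, beta)
        scores.append((r_a - r_b).abs().item())
    order = sorted(range(len(candidates)), key=lambda i: scores[i])
    return order[:k]  # indices of pairs to send to human annotators
```

In such a loop, the selected pairs would be labeled by annotators, used to update the policy with the DPO objective, and the selection step repeated; because the score is computed from the policy itself, the choice of data adapts to the model being aligned, which is the property the abstract emphasizes.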
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19473