Data-efficient Online Training for Direct Alignment in LLMs

15 Sept 2025 (modified: 30 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: direct preference alignment, online alignment, data selection
Abstract: In recent years, online Direct Alignment from Preferences (DAP) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF) due to its training stability and simplicity. In online DAP, training relies on preference data, each point consisting of a question and a pair of large language model (LLM) responses. However, annotating preference data, i.e., generating responses for questions, is computationally expensive. To address this, we propose $\texttt{DOTA}$, a data selection framework that minimizes the cost of generating preference data while still ensuring the quality of training. First, we propose a metric called Preference Perplexity ($\texttt{PFP}$) that enables a low-cost, gradient-based method to effectively estimate the contribution of each preference data point to model performance -- critical to data selection. Second, rather than first generating responses for all candidate questions and then selecting preference data points by measuring their $\texttt{PFP}$, we design an iterative multi-armed bandit (MAB)-based strategy that generates responses for only a small subset of questions without missing valuable data points. Experiments on $\texttt{UltraChat-200k}$ and $\texttt{HH-RLHF}$ across 13 downstream tasks demonstrate that $\texttt{DOTA}$ reduces computation cost by a factor of three on LLaMA-3-8B, Qwen-3-4B, and Qwen-3-1.7B without compromising training effectiveness. Code and data are available at https://anonymous.4open.science/r/DOTA-5CC5.
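To make the MAB-based selection loop concrete, below is a minimal sketch of an iterative UCB-style bandit over clusters of candidate questions. The arm definition (question clusters), the reward (a stand-in for the $\texttt{PFP}$ utility), the batch size, and the keep threshold are all illustrative assumptions; the abstract does not specify these details, and the stub `pfp_score` does not implement the paper's gradient-based estimator.

```python
# Hypothetical sketch: UCB-style bandit over question clusters.
# Pull an arm = annotate a small batch of questions from that cluster,
# score them with a stand-in PFP utility, and keep the high-value ones.
import math
import random


def pfp_score(question: str) -> float:
    """Stand-in for the Preference Perplexity (PFP) utility of a question.
    The actual method would generate a response pair and use a gradient-based
    estimate; here a random proxy keeps the sketch self-contained."""
    return random.random()


def ucb_select(clusters: dict[str, list[str]],
               rounds: int = 50,
               batch_size: int = 4,
               c: float = 1.0,
               keep_threshold: float = 0.5) -> list[str]:
    """Iteratively pick the cluster (arm) with the highest UCB score,
    annotate only a small batch of its questions, and keep those whose
    PFP proxy suggests they are worth generating preference data for."""
    arms = list(clusters.keys())
    counts = {a: 0 for a in arms}    # number of pulls per arm
    means = {a: 0.0 for a in arms}   # running mean reward per arm
    selected: list[str] = []

    for t in range(1, rounds + 1):
        if not arms:
            break

        # UCB1 rule: exploit high-reward arms, explore rarely pulled ones.
        def ucb(a: str) -> float:
            if counts[a] == 0:
                return float("inf")
            return means[a] + c * math.sqrt(2 * math.log(t) / counts[a])

        arm = max(arms, key=ucb)
        pool = clusters[arm]
        if not pool:
            arms.remove(arm)  # exhausted cluster
            continue

        # Annotate only a small batch instead of the whole candidate set.
        batch = [pool.pop(random.randrange(len(pool)))
                 for _ in range(min(batch_size, len(pool)))]
        scores = [pfp_score(q) for q in batch]

        # Update the arm's running mean reward.
        reward = sum(scores) / len(scores)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]

        # Keep questions clearing a (hypothetical) utility threshold.
        selected.extend(q for q, s in zip(batch, scores) if s > keep_threshold)

    return selected


if __name__ == "__main__":
    demo = {"math": [f"math-q{i}" for i in range(20)],
            "chat": [f"chat-q{i}" for i in range(20)]}
    print(ucb_select(demo, rounds=10))
```

The key property this sketch illustrates is that responses (and PFP scores) are only ever computed for the batches the bandit chooses to pull, so the annotation budget grows with the number of rounds rather than with the size of the full candidate pool.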
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5850