The Role of Preference Data and Unembeddings in the Convergence Rate of DPO

Published: 23 Sept 2025, Last Modified: 01 Dec 2025
Venue: ARLET
License: CC BY 4.0
Track: Research Track
Keywords: DPO, convergence, preference data, Bradley-Terry
Abstract: Human or AI feedback in the form of preference data over response pairs plays a crucial role in fine-tuning Large Language Models (LLMs) using Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and their variants. For these methods to be effective, the representations or unembeddings of responses must be expressive enough to align with the preference data. In this paper, we study the convergence of gradient descent for DPO with finite samples in the realizable setting, for example, preferences generated by a Bradley-Terry model with linear reward functions over query and response representations. Unlike previous theoretical analyses that make stronger assumptions about the underlying unembeddings, our analysis works with a parameterization that is more representative of LLM implementations and does not assume independence of the logits. We derive a linear convergence rate bound for gradient descent on the DPO objective. Our bound crucially depends on the condition number of the matrix of query embeddings, and on the algebraic connectivity and maximum degree of the comparison graph over responses. The bound can guide the selection of preference feedback so as to optimize both the cost of data acquisition and the cost of training. We show that, in addition to DPO converging to the optimum of the loss function, the learned reward differences converge to the ground truth. These results, shown for pairwise preference data, extend to listwise preference data as well as discrete choice data, and are validated through experiments on both synthetic and real-world datasets. To ensure the sufficiency of the available data, we study both the identifiability of the ground truth and the generalizability of the aligned model. Additionally, we obtain linear convergence results for DPO under a tabular parameterization of the policy.
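For intuition, below is a minimal sketch (not the authors' code) of the realizable setting the abstract describes: synthetic preferences drawn from a Bradley-Terry model with a bilinear reward on query and response representations, followed by gradient descent on the resulting DPO objective. The log-linear policy, the uniform reference policy, the bilinear form u_y^T Theta x, and all names and constants (`beta`, `Theta_star`, dimensions, step size) are illustrative assumptions, not details taken from the paper.

```python
# Sketch: gradient descent on the DPO objective in a realizable Bradley-Terry
# setting with a bilinear reward on query embeddings and response unembeddings.
# Assumptions (ours, not the paper's): log-linear policy, uniform reference
# policy, fixed step size, Gaussian synthetic embeddings.
import numpy as np

rng = np.random.default_rng(0)
d_q, d_r, n_queries, n_responses, n_pairs = 8, 8, 200, 12, 2000
beta = 1.0  # DPO inverse-temperature (assumed)

X = rng.normal(size=(n_queries, d_q))       # query embeddings
U = rng.normal(size=(n_responses, d_r))     # response unembeddings
Theta_star = rng.normal(size=(d_r, d_q))    # ground-truth reward parameters

def reward(Theta, q_idx, r_idx):
    """Bilinear reward r(x, y) = u_y^T Theta x, vectorized over index arrays."""
    return np.einsum("nd,de,ne->n", U[r_idx], Theta, X[q_idx])

# Sample comparison pairs and label them with the Bradley-Terry model.
q_idx = rng.integers(n_queries, size=n_pairs)
a_idx = rng.integers(n_responses, size=n_pairs)
b_idx = (a_idx + 1 + rng.integers(n_responses - 1, size=n_pairs)) % n_responses
diff_star = reward(Theta_star, q_idx, a_idx) - reward(Theta_star, q_idx, b_idx)
a_wins = rng.random(n_pairs) < 1.0 / (1.0 + np.exp(-diff_star))
w_idx = np.where(a_wins, a_idx, b_idx)      # preferred ("winner") responses
l_idx = np.where(a_wins, b_idx, a_idx)      # dispreferred ("loser") responses

def dpo_loss_and_grad(Theta):
    # For a log-linear policy pi(y|x) ∝ exp(u_y^T Theta x) with a uniform
    # reference policy, the DPO implicit-reward difference
    # beta*log(pi(y_w|x)/pi_ref(y_w|x)) - beta*log(pi(y_l|x)/pi_ref(y_l|x))
    # equals beta * (u_{y_w} - u_{y_l})^T Theta x: the partition functions cancel.
    margins = beta * (reward(Theta, q_idx, w_idx) - reward(Theta, q_idx, l_idx))
    loss = np.mean(np.log1p(np.exp(-margins)))          # logistic loss on margins
    coeff = -beta / (1.0 + np.exp(margins))              # dL/dmargin per pair
    dU = U[w_idx] - U[l_idx]                             # (n_pairs, d_r)
    grad = (dU * coeff[:, None]).T @ X[q_idx] / n_pairs  # (d_r, d_q)
    return loss, grad

Theta = np.zeros((d_r, d_q))
for step in range(500):
    loss, grad = dpo_loss_and_grad(Theta)
    Theta -= 0.5 * grad                                  # fixed step size (assumed)
    if step % 100 == 0:
        print(f"step {step:4d}  DPO loss {loss:.4f}")
```

Under these assumptions the DPO objective reduces to a logistic loss on linear score differences, so the observed decrease of the loss is consistent with the linear rate the abstract reports; the abstract attributes the rate's constants to the conditioning of the query embeddings and to the algebraic connectivity and maximum degree of the comparison graph over responses.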
Submission Number: 106