Keywords: Large language models (LLMs), safe or constrained alignment, rigorous theoretical guarantees for LLM alignment, primal-dual direct preference optimization (DPO)
Abstract: The widespread application of Large Language Models (LLMs) imposes increasing demands on safety, such as reducing harmful content and misinformation and avoiding certain tokens forbidden by rules and laws. While several recent works study the safe alignment of LLMs, they either require training separate reward and cost models, incurring high memory and computational costs, or require prior knowledge of the optimal Lagrange multiplier. Motivated by these limitations, we study the problem of constrained alignment in LLMs, i.e., maximizing the output reward while keeping the cost incurred by potentially unsafe content below a threshold. For this problem, we propose a novel primal-dual DPO approach, which first trains a model with standard DPO on reward preference data to provide reward information, and then fine-tunes the LLM on cost preference data with a rearranged Lagrangian DPO objective that utilizes this reward information. Our approach needs to train only two models rather than the three required by prior works that rely on trained reward and cost models, which significantly reduces memory cost, and it requires no extra prior knowledge. Moreover, we establish rigorous theoretical guarantees on the suboptimality and constraint violation of the output policy. We also extend our approach to an online data setting by incorporating exploration bonuses, which enable exploration of uncovered regions of the prompt-response space, and provide theoretical results that remove the dependence on preference data coverage. Experimental results on the widely used preference dataset PKU-SafeRLHF demonstrate the effectiveness of our approach.
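For context, a minimal sketch of the standard formulation the abstract alludes to: the constrained alignment problem, its Lagrangian relaxation with projected dual ascent on the multiplier, and the standard DPO loss used in the first stage. The paper's specific "rearranged" Lagrangian DPO objective is not given in the abstract, so only these generic, textbook forms are shown; the symbols $r$, $c$, $b$, $\beta$, and $\eta$ are assumed notation, not taken from the paper.

```latex
% Constrained (safe) alignment: maximize expected reward while keeping the
% expected cost below a budget b (any KL regularization to the reference
% policy is omitted here for brevity).
\max_{\pi} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
\quad \text{s.t.} \quad
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ c(x, y) \big] \le b .

% Lagrangian relaxation with multiplier \lambda \ge 0, typically solved by
% alternating primal updates on \pi and projected dual ascent on \lambda:
\mathcal{L}(\pi, \lambda)
  = \mathbb{E}\big[ r(x, y) \big]
    - \lambda \Big( \mathbb{E}\big[ c(x, y) \big] - b \Big),
\qquad
\lambda \leftarrow \Big[ \lambda + \eta \big( \mathbb{E}[ c(x, y) ] - b \big) \Big]_{+} .

% Standard DPO loss used in the first stage on reward preference pairs
% (y_w preferred over y_l), with reference policy \pi_{\mathrm{ref}}:
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = - \mathbb{E}_{(x, y_w, y_l)}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right].
```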
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 8307