Keywords: Direct Preference Optimization, Meta-Learning, Robust Alignment
Abstract: Direct Preference Optimization (DPO) offers an effective paradigm for aligning Large Language Models (LLMs), yet its performance can be compromised by the noisy or ambiguous preference data common in real-world scenarios. Standard DPO formulations lack mechanisms to adapt to varying levels of reliability across training instances. This paper introduces Meta-Target DPO (MT-DPO), a novel framework that achieves robust preference alignment by dynamically learning an adaptive confidence target for each preference pair. MT-DPO employs a meta-learning approach in which an auxiliary confidence module predicts a sample-specific target probability representing the degree of belief in the observed preference label. This module is informed by intrinsic signals, notably perplexity differentials derived from an anchored reference model, which serve as indicators of label consistency. Guided by a small, trusted meta-dataset, the confidence module is trained to generate targets that optimally steer the main policy optimization. MT-DPO optimizes the LLM policy with a cross-entropy objective, minimizing the divergence between the policy's implied preference probability and the learned confidence target for each pair. This allows the learning process to naturally down-weight uncertain instances and to rectify contributions from mislabeled data by adapting the target across the full confidence spectrum. Comprehensive experiments on standard alignment benchmarks demonstrate that MT-DPO significantly outperforms vanilla DPO and other robust alignment strategies on both clean and synthetically noisy datasets, showcasing its adaptability and effectiveness in handling preference uncertainty through learned target modulation.
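The cross-entropy objective described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' released implementation: the function name `mt_dpo_loss`, the argument names, and the assumption that the confidence target enters as the soft label of a binary cross-entropy over the standard DPO margin are all inferred from the abstract's description. Setting `confidence_targets` to all ones recovers the vanilla DPO loss.

```python
import torch
import torch.nn.functional as F

def mt_dpo_loss(policy_chosen_logps: torch.Tensor,
                policy_rejected_logps: torch.Tensor,
                ref_chosen_logps: torch.Tensor,
                ref_rejected_logps: torch.Tensor,
                confidence_targets: torch.Tensor,
                beta: float = 0.1) -> torch.Tensor:
    """Hypothetical MT-DPO loss: cross-entropy between the policy's
    implied preference probability and a per-sample soft target.

    confidence_targets: values in [0, 1] produced by the (not shown)
    meta-learned confidence module; 1.0 means full trust in the label,
    0.5 means maximal uncertainty, values below 0.5 flip the preference.
    """
    # Same log-ratio margin as vanilla DPO.
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    # The policy's implied preference probability is sigmoid(logits);
    # BCE-with-logits against the soft target implements the
    # cross-entropy objective described in the abstract.
    return F.binary_cross_entropy_with_logits(logits, confidence_targets)
```

With a fully trusted pair (target 1.0) the loss reduces to the familiar `-log sigmoid(beta * margin)`; with a target of 0.5 the optimum is an indifferent policy, so uncertain pairs contribute little gradient in either direction.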
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12175