Keywords: Large Language Models, Direct Preference Optimization
Abstract: Noise in preference data significantly impedes the robust alignment of large language models (LLMs) with human values. Existing methods that rely on global noise assumptions or static pre-processing heuristics are often insufficient, as they fail to address the instance-specific and dynamic nature of preference noise. To overcome these limitations, we introduce Dynamic Preference Calibration, a novel framework that meta-learns to generate adaptive soft labels directly from noisy data. Our approach employs a lightweight meta-learner that maps a perplexity difference (PPLDiff) signal to a calibrated soft label. Crucially, the power of our dynamic approach stems from computing this PPLDiff signal online, using the main, evolving LLM itself. This creates a symbiotic loop in which the main model's improving understanding continuously informs and refines the calibration strategy, allowing the two to co-evolve. Guided by a small, clean meta-dataset, the meta-learner is optimized to produce labels that maximize alignment performance. Extensive experiments on benchmark datasets demonstrate that our method establishes a new state of the art for noisy preference alignment, significantly outperforming strong baselines. It maintains high performance and stability even under extreme noise levels of up to 40% label flips, highlighting the promise of meta-learning for building fundamentally more robust and reliable alignment techniques.
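To make the pipeline described in the abstract concrete, below is a minimal sketch in PyTorch of its three ingredients: computing a PPLDiff signal online from the current policy, mapping it through a lightweight meta-learner to a soft label, and using that label to interpolate a DPO-style pairwise loss. All names (`SoftLabelMetaLearner`, `ppl_diff`, `soft_label_dpo_loss`) and architectural details are hypothetical assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftLabelMetaLearner(nn.Module):
    """Tiny MLP mapping a scalar PPLDiff signal to a calibrated soft label in (0, 1)."""

    def __init__(self, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, ppl_diff_signal: torch.Tensor) -> torch.Tensor:
        # Sigmoid keeps the label a valid mixing weight between the two responses.
        return torch.sigmoid(self.net(ppl_diff_signal.unsqueeze(-1))).squeeze(-1)


def ppl_diff(policy_logps_chosen, policy_logps_rejected,
             len_chosen, len_rejected):
    # Length-normalized negative log-likelihood approximates log-perplexity.
    # The signal is the gap between the rejected and chosen responses under
    # the *current* policy, so it evolves as the main model trains.
    nll_chosen = -policy_logps_chosen / len_chosen
    nll_rejected = -policy_logps_rejected / len_rejected
    return nll_rejected - nll_chosen


def soft_label_dpo_loss(policy_chosen, policy_rejected,
                        ref_chosen, ref_rejected,
                        soft_label, beta: float = 0.1):
    # Standard DPO logits; the soft label interpolates between treating the
    # pair as correctly ordered (label -> 1) and flipped (label -> 0).
    logits = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    return -(soft_label * F.logsigmoid(logits)
             + (1.0 - soft_label) * F.logsigmoid(-logits)).mean()
```

In the full framework, the meta-learner's parameters would additionally be optimized in an outer loop against the small clean meta-dataset (e.g., via a bilevel one-step-lookahead update), which this sketch omits for brevity.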
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14982