Aligner, Diagnose Thyself: A Meta-Learning Paradigm for Fusing Intrinsic Feedback in Preference Alignment
Keywords: Large Language Models, Direct Preference Optimization
Abstract: The alignment of Large Language Models (LLMs) with human preferences is critically undermined by noisy labels in training datasets.
Existing noise-robust methods often prove insufficient because they rely on a single narrow heuristic, such as perplexity or training loss, and thus fail to address the diverse nature of real-world noise.
We challenge this limited-scope approach by introducing a new paradigm in which models learn to "diagnose themselves," systematically fusing multiple streams of intrinsic feedback into a holistic reliability assessment of each preference pair.
We instantiate this paradigm through a meta-learning methodology that learns to adaptively reweight samples based on a rich diagnostic vector.
This vector captures three complementary perspectives: preference consistency, learning difficulty, and generation confidence.
Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods across various noise conditions.
Crucially, our work provides the first quantitative analysis of these intrinsic diagnostics, revealing that their fusion is essential for overcoming the blind spots inherent in any single heuristic.
This diagnostic-driven paradigm offers a principled path towards developing more robust and trustworthy LLMs.
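To make the methodology concrete, below is a minimal sketch of how such diagnostic fusion could plug into a DPO training step. The three diagnostics follow the abstract (preference consistency via the implicit reward margin, learning difficulty via the per-sample loss, generation confidence via the chosen-response likelihood), but every name here, including the `MetaWeighter` MLP and the `beta` value, is an illustrative assumption, not the paper's actual implementation.

```python
# Illustrative sketch only: the paper's architecture and meta-learning
# loop may differ. Fuses three intrinsic diagnostics into sample weights.
import torch
import torch.nn.functional as F

class MetaWeighter(torch.nn.Module):
    """Tiny MLP (assumed design) that fuses a per-sample diagnostic
    vector into a reliability weight in (0, 1)."""
    def __init__(self, n_diag: int = 3, hidden: int = 16):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_diag, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1),
        )

    def forward(self, diag: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(diag)).squeeze(-1)

def weighted_dpo_loss(policy_logps_w, policy_logps_l,
                      ref_logps_w, ref_logps_l,
                      weighter: MetaWeighter, beta: float = 0.1):
    """DPO loss reweighted by fused intrinsic diagnostics.

    Inputs are summed token log-probs of the chosen (w) and rejected (l)
    responses under the trainable policy and the frozen reference model.
    """
    # Implicit DPO reward margin: proxy for preference consistency.
    margin = beta * ((policy_logps_w - ref_logps_w)
                     - (policy_logps_l - ref_logps_l))
    # Standard per-sample DPO loss: proxy for learning difficulty.
    per_sample = -F.logsigmoid(margin)
    # Policy likelihood of the chosen response: proxy for generation confidence.
    diag = torch.stack([margin, per_sample, policy_logps_w], dim=-1)
    # Detach: the weighter consumes diagnostics as features; in a full
    # meta-learning setup its parameters would be updated on a clean
    # held-out set (bilevel optimization), not through this loss.
    weights = weighter(diag.detach())
    return (weights * per_sample).mean()

# Toy usage: random log-probs for a batch of 4 preference pairs.
weighter = MetaWeighter()
lp = [torch.randn(4) for _ in range(4)]
loss = weighted_dpo_loss(*lp, weighter=weighter)
```

One plausible design, in the spirit of Meta-Weight-Net-style approaches, would train the weighter in an outer loop against a small clean validation set, so that the fused weights, rather than any single heuristic threshold, determine each pair's influence on the policy update.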
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15064