Abstract: Direct Preference Optimization (DPO) has shown significant promise in reducing hallucinations in Multimodal Large Language Models (MLLMs). However, existing multimodal DPO methods suffer from overfitting caused by difficulty-level imbalance in preference data. Our analysis reveals that MLLMs tend to overfit on easily distinguishable pairs, which limits their ability to suppress hallucinations at a fine-grained level and degrades the model's overall capabilities.
To address this challenge, we introduce Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework comprising two key components: (1) \textit{Difficulty Estimation}, where we leverage pre-trained vision-language models with complementary generative and contrastive objectives, integrating their outputs through a distribution-aware voting strategy to obtain robust difficulty scores without additional training; and (2) \textit{Difficulty-Aware Training}, where we reweight preference data according to the estimated difficulty, down-weighting easy samples while emphasizing harder ones to mitigate overfitting.
This paradigm enhances preference optimization by efficiently exploiting challenging examples without requiring new data or additional fine-tuning stages.
Extensive experiments demonstrate that DA-DPO significantly improves multimodal preference optimization, achieving stronger robustness against hallucinations and better generalization on standard benchmarks, all in a cost-efficient manner.
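To make the reweighting idea concrete, the following is a minimal sketch (not the authors' released code) of a difficulty-weighted DPO objective. It assumes per-pair difficulty scores in [0, 1] are already available, e.g. from the voting over generative and contrastive scorers described in the abstract; the function name, the weighting scheme (a simple power of the difficulty score), and the hyperparameter `alpha` are hypothetical illustrations, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def da_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                difficulty, beta=0.1, alpha=1.0):
    """Difficulty-weighted DPO loss (illustrative sketch).

    Args:
        *_logps: summed log-probabilities of the chosen/rejected responses
                 under the policy and the frozen reference model, shape (batch,).
        difficulty: estimated difficulty scores in [0, 1]; higher = harder pair.
        beta: DPO temperature.
        alpha: controls how strongly easy pairs are down-weighted (assumed knob).
    """
    # Standard DPO logits: implicit reward margin between chosen and rejected.
    pi_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratio - ref_logratio)

    # Vanilla per-pair DPO objective.
    per_pair_loss = -F.logsigmoid(logits)

    # Emphasize hard pairs and damp easy ones via difficulty-based weights.
    weights = difficulty.clamp(0, 1) ** alpha
    return (weights * per_pair_loss).sum() / weights.sum().clamp_min(1e-8)
```

With `alpha = 0`, this reduces to standard DPO; larger `alpha` shifts the effective training distribution toward harder, more informative pairs without requiring new data or an extra fine-tuning stage.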
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Tian_Li1
Submission Number: 5835