Keywords: Vision-Language, Direct Preference Optimization
TL;DR: We propose a contrastive self-rewarding approach that leverages the probability shift of responses under contrastive inputs as a reward signal to construct a preference dataset for preference optimization.
Abstract: Direct Preference Optimization (DPO) has proven effective and efficient for aligning Multi-modal Large Language Models (MLLMs) with human preferences and improving their multimodal understanding. However, most existing work relies heavily on either human annotations or auxiliary reward models to construct preference data, which limits scalability and introduces potential inconsistencies between the reward model and the fine-tuned MLLM. This paper presents ConSR, a Contrastive Self-rewarded Preference Optimization framework that constructs contrastive inputs and frames the resulting variation in model outputs as a self-reward signal. We perturb the visual input in two ways, degrading its fine-grained details and enriching it with semantic context, and contrast each perturbed input with the original. The variation of the corresponding model responses reveals the model’s sensitivity to the visual input, which we exploit to rank responses and construct preference pairs without external supervision. In addition, we reformulate the DPO objective to mitigate its length bias and reweight visual tokens so that tokens more responsive to visual cues receive higher weights. Extensive experiments on multiple visual understanding benchmarks demonstrate that ConSR consistently outperforms prior approaches across diverse tasks.
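Below is a minimal sketch (not the authors' released code) of the contrastive self-rewarding idea described in the abstract: the log-probability a frozen MLLM assigns to a candidate response is compared under the original, detail-degraded, and context-enriched visual inputs, and the shift is used as a reward to rank responses into preference pairs for a length-normalized DPO-style loss. The function names, the specific reward combination, and the length normalization are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def sequence_logprob(token_logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of a response under the model.

    token_logits: (seq_len, vocab_size) logits at each response position.
    token_ids:    (seq_len,) response token ids.
    In practice these would come from a forward pass of the MLLM conditioned
    on one of the three visual inputs.
    """
    logps = F.log_softmax(token_logits, dim=-1)
    return logps.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1).sum()


def contrastive_self_reward(logp_original: torch.Tensor,
                            logp_degraded: torch.Tensor,
                            logp_enriched: torch.Tensor,
                            alpha: float = 1.0,
                            beta: float = 1.0) -> torch.Tensor:
    """Score responses by how strongly their likelihood depends on the image.

    A response whose likelihood drops when fine-grained details are degraded
    and rises when semantic context is enriched is treated as more visually
    grounded. (Hypothetical combination with tunable weights.)
    """
    return alpha * (logp_original - logp_degraded) + beta * (logp_enriched - logp_original)


def build_preference_pair(responses, rewards):
    """Take the highest- and lowest-reward responses as (chosen, rejected)."""
    order = sorted(range(len(responses)), key=lambda i: rewards[i].item(), reverse=True)
    return responses[order[0]], responses[order[-1]]


def length_debiased_dpo_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             len_chosen, len_rejected, tau: float = 0.1):
    """DPO loss on per-token (length-normalized) log-ratios.

    Length normalization is one common way to reduce DPO's tendency to favor
    longer responses; the paper's reformulation may differ.
    """
    chosen_ratio = (logp_chosen - ref_logp_chosen) / len_chosen
    rejected_ratio = (logp_rejected - ref_logp_rejected) / len_rejected
    return -F.logsigmoid(tau * (chosen_ratio - rejected_ratio))


if __name__ == "__main__":
    # Dummy per-response log-probs, as if scored by a frozen MLLM under the
    # original / detail-degraded / context-enriched visual inputs.
    responses = ["resp_a", "resp_b", "resp_c"]
    logp_orig = torch.tensor([-12.0, -15.0, -11.0])
    logp_deg = torch.tensor([-18.0, -15.5, -13.0])
    logp_enr = torch.tensor([-10.0, -14.8, -10.5])
    rewards = contrastive_self_reward(logp_orig, logp_deg, logp_enr)
    chosen, rejected = build_preference_pair(responses, rewards)
    print("chosen:", chosen, "rejected:", rejected, "rewards:", rewards.tolist())
```

The resulting (chosen, rejected) pairs would then be used as the preference data for DPO-style fine-tuning; the visual-token reweighting mentioned in the abstract is not sketched here.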
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2316