Self-Supervised Visual Preference Alignment

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Oral, CC BY 4.0
Abstract: This paper makes the first attempt towards unsupervised preference alignment in Vision-Language Models (VLMs). We generate chosen and rejected responses from the original and augmented versions of each image, respectively, and conduct preference alignment with direct preference optimization. The approach rests on one core idea: a properly designed augmentation of the image input induces the VLM to generate false but hard negative responses, which the model can learn from to produce more robust and powerful answers. The whole pipeline no longer hinges on supervision from GPT-4 or human involvement during alignment, and is highly efficient, requiring only a few lines of code. With only 8k randomly sampled unsupervised data, it achieves a 90% score relative to GPT-4 on complex reasoning in LLaVA-Bench, and improves LLaVA-7B/13B by 6.7%/5.6% on the complex multimodal benchmark MM-Vet. Visualizations show its improved ability to align with user intentions. A series of ablations reveals the underlying mechanism of the approach and indicates its potential for further scaling.
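The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `generate` and `augment` are hypothetical callables standing in for a VLM and an image augmentation, the dictionary keys are illustrative, and `beta=0.1` is an assumed default. The loss is the standard direct preference optimization objective, computed here on scalar sequence log-probabilities for a single pair.

```python
import math
from typing import Any, Callable, Dict


def build_preference_pair(
    image: Any,
    question: str,
    generate: Callable[[Any, str], str],  # hypothetical VLM call: (image, question) -> answer
    augment: Callable[[Any], Any],        # hypothetical image augmentation
) -> Dict[str, Any]:
    """Build one preference pair without external labels.

    The answer to the original image is treated as 'chosen'; the answer to
    the augmented image is treated as the hard negative ('rejected').
    """
    chosen = generate(image, question)
    rejected = generate(augment(image), question)
    return {"image": image, "prompt": question, "chosen": chosen, "rejected": rejected}


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In this sketch the loss equals log 2 when the policy and the reference model agree, and it falls as the policy assigns relatively more probability to the chosen response than to the rejected one.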
Primary Subject Area: [Generation] Multimedia Foundation Models
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work focuses on multimodal domains and deals with preference alignment in large vision-language models (VLMs). It makes a first attempt towards constructing preference data in an unsupervised manner, which largely avoids the expensive cost of annotation and eases the difficulty of scaling preference data. We verify the advantage of our pipeline through multimodal benchmark comparisons, quantitative visualizations and in-depth analysis. The proposed method gives current VLMs improved abilities for practical usage, such as better alignment with user intentions, fewer hallucinations and stronger chain-of-thought ability. Our pipeline is efficient and simple to implement, which paves the way for future preference alignment in multimodal and vision-language domains.
Supplementary Material: zip
Submission Number: 2131