Keywords: Vision-Language Models, Contrastive Decoding
TL;DR: This paper proposes a decoding method that selects appropriate visual augmentations based on the task and uses them contrastively to mitigate the bias in large vision-language models.
Abstract: Despite the astonishing performance of recent Large Vision-Language Models (LVLMs), these models are known to generate inaccurate responses by relying on biases in prior knowledge rather than properly using the given visual information. To address this issue, previous studies have applied contrastive decoding (CD), using augmented images with contrastive features to mitigate the bias. These earlier approaches typically used a single augmentation method, such as noise injection, or relied on external knowledge from pre-trained models or human-labeled bounding boxes. Both choices have limitations: external knowledge is costly to obtain, and a single augmentation restricts the types of task the method suits. In this study, we address these limitations by exploring how to utilize a variety of data augmentations. Through extensive experimental observations, we find that certain augmentations modify key features and thereby produce strong contrast, making them powerful candidates for CD in LVLMs. Based on this observation, we introduce a novel method called VACoDe, Visual Augmented Contrastive Decoding. For each query, VACoDe selects the visual augmentation that most enhances contrast, as measured by the proposed softmax distance metric. Our empirical tests show that VACoDe outperforms previous methods and improves output quality in various vision-language tasks. Additionally, VACoDe can be applied across different model types and sizes without additional training and without external models or data.
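The selection-then-contrast idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and several details are assumptions: the softmax distance is taken here to be the L2 distance between the softmax distributions for the original and augmented image, the contrastive step uses the common form (1 + alpha) * original - alpha * augmented, and the augmentation names, logit values, and helper functions are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_distance(logits_orig, logits_aug):
    # Assumption: L2 distance between the two softmax distributions.
    p, q = softmax(logits_orig), softmax(logits_aug)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def select_augmentation(logits_orig, logits_by_aug):
    # Pick the augmentation whose output distribution differs most
    # from the one obtained with the original image.
    return max(logits_by_aug,
               key=lambda name: softmax_distance(logits_orig, logits_by_aug[name]))

def contrastive_logits(logits_orig, logits_aug, alpha=1.0):
    # Assumption: a common contrastive decoding form, not necessarily the paper's.
    return [(1.0 + alpha) * o - alpha * a for o, a in zip(logits_orig, logits_aug)]

# Hypothetical next-token logits over a three-token vocabulary.
logits_orig = [2.0, 1.0, 0.0]
logits_by_aug = {
    "noise": [1.9, 1.1, 0.0],         # barely changes the distribution
    "color_invert": [0.0, 2.0, 1.0],  # changes it strongly
}

best = select_augmentation(logits_orig, logits_by_aug)
adjusted = contrastive_logits(logits_orig, logits_by_aug[best])
next_token = max(range(len(adjusted)), key=adjusted.__getitem__)
```

In this toy case the strongly contrasting augmentation is selected, and subtracting its logits widens the margin of the token favored by the original image. In a real LVLM the two logit vectors would come from forward passes on the original and augmented images.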
Submission Number: 14