Making LVLMs Look Twice: Contrastive Decoding with Contrast Images

ACL ARR 2024 December Submission 2291 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract:

Large Vision-Language Models (LVLMs) are becoming increasingly popular for text-vision tasks that require reasoning over both modalities, but they often struggle with fine-grained visual discrimination. This limitation is evident in recent benchmarks like NaturalBench and D3, where closed models such as GPT-4o achieve only 39.6% accuracy and open-source models perform below random chance (25%). We introduce Contrastive Decoding with Contrast Images (CoCI), which adjusts LVLM outputs by contrasting them against outputs for similar images (Contrast Images, CIs). We first evaluate CoCI using naturally occurring CIs in benchmarks with curated image pairs, achieving improvements of up to 98.9% on NaturalBench, 69.5% on D3, and 37.6% on MMVP. For real-world applications where natural CIs are unavailable, we show that, given sufficient training data, a lightweight neural classifier can effectively select CIs from similar images at inference time, improving NaturalBench performance by up to 36.8%. For scenarios lacking training data, we develop a caption-matching technique that selects CIs by comparing LVLM-generated descriptions of candidate images. Our method demonstrates the potential for improving LVLMs at inference time through different CI selection approaches, each suited to a particular data-availability scenario.
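
The abstract does not spell out the exact CoCI adjustment, so the following is only a minimal sketch of what "contrasting outputs against a Contrast Image" could look like at the logit level, assuming a standard contrastive-decoding formulation; the function and parameter names (`coci_adjust_logits`, `alpha`) are hypothetical and not taken from the paper.

```python
# Illustrative sketch: contrastive decoding with a Contrast Image (CI).
# Assumption: the LVLM produces next-token logits twice, once conditioned on
# the original image and once on a similar CI, and tokens that the CI also
# supports are down-weighted. The exact CoCI rule may differ.

import torch


def coci_adjust_logits(logits_original: torch.Tensor,
                       logits_contrast: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Amplify evidence specific to the original image relative to the CI.

    logits_original: next-token logits given (question, original image)
    logits_contrast: next-token logits given (question, contrast image)
    alpha: contrast strength (hypothetical hyperparameter)
    """
    return (1.0 + alpha) * logits_original - alpha * logits_contrast


# Toy usage with random tensors standing in for real LVLM logits.
vocab_size = 32000
logits_orig = torch.randn(vocab_size)
logits_ci = torch.randn(vocab_size)

adjusted = coci_adjust_logits(logits_orig, logits_ci, alpha=0.5)
next_token_id = torch.argmax(adjusted).item()
print(f"Greedy next-token id under the CoCI-style adjustment: {next_token_id}")
```

In this sketch the two forward passes share the same question and decoding prefix and differ only in the conditioning image, which is what lets the subtraction isolate image-specific evidence.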

Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, cross-modal content generation, inference methods, model architectures
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 2291