Making LVLMs Look Twice: Contrastive Decoding with Contrast Images

ACL ARR 2025 February Submission7602 Authors

16 Feb 2025 (modified: 09 May 2025) | ACL ARR 2025 February Submission | Readers: Everyone | CC BY 4.0
Abstract: Large Vision-Language Models (LVLMs) are increasingly popular for text-vision tasks that require cross-modal reasoning, but they often struggle with fine-grained visual discrimination. This limitation is evident in recent benchmarks such as NaturalBench and D3, where closed models such as GPT-4o achieve only 39.6%, and open-source models perform below random chance (25%). We introduce Contrastive decoding with Contrast Images (CoCI), which adjusts LVLM outputs by contrasting them against outputs for similar images (Contrast Images, CIs). CoCI demonstrates strong performance across three distinct supervision regimes. First, when using naturally occurring CIs in benchmarks with curated image pairs, we achieve improvements of up to 98.9% on NaturalBench, 69.5% on D3, and 37.6% on MMVP. Second, for scenarios with modest training data (~5k samples), we show that a lightweight neural classifier can effectively select CIs from similar images at inference time, improving NaturalBench performance by up to 36.8%. Third, for scenarios with no training data, we develop a caption-matching technique that selects CIs by comparing LVLM-generated descriptions of candidate images. Notably, on VQAv2, our method improves VQA performance even in pointwise evaluation settings without explicit contrast images. Our approach demonstrates the potential for enhancing LVLMs at inference time through different CI selection approaches, each suited to a different data availability scenario.
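The abstract does not state the exact scoring rule, so the following is only a minimal sketch of the general idea of contrasting a model's output for the target image against its output for a contrast image. It assumes the commonly used contrastive decoding form, (1 + alpha) * log p(y | target) - alpha * log p(y | contrast), which may differ from what CoCI actually uses. The function names, the `alpha` parameter, and the toy logits are all hypothetical.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(logits_target, logits_contrast, alpha=1.0):
    """Assumed contrastive rule (not taken from the paper):
    (1 + alpha) * log p(y | target) - alpha * log p(y | contrast).
    Answers the model also favors for the contrast image are penalized.
    """
    lp_t = log_softmax(logits_target)
    lp_c = log_softmax(logits_contrast)
    return [(1 + alpha) * t - alpha * c for t, c in zip(lp_t, lp_c)]


def pick_answer(options, logits_target, logits_contrast, alpha=1.0):
    """Return the option with the highest contrastive score."""
    scores = contrastive_scores(logits_target, logits_contrast, alpha)
    best = max(range(len(scores)), key=scores.__getitem__)
    return options[best]


# Toy, made-up logits: the model leans toward "yes" for both images,
# but far more strongly for the contrast image.
options = ["yes", "no"]
logits_target = [2.0, 1.5]
logits_contrast = [2.5, 0.5]

# alpha = 0 disables the contrast term, so this is plain argmax decoding.
plain = pick_answer(options, logits_target, logits_contrast, alpha=0.0)
adjusted = pick_answer(options, logits_target, logits_contrast, alpha=1.0)
```

In this toy case the plain prediction follows the shared bias toward "yes", while the contrast term cancels much of that bias and flips the prediction to "no". This matches the paired-image setting described in the abstract only under the stated assumptions.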
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, cross-modal content generation, inference methods, model architectures
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7602