Probing Visual Language Priors in VLMs

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We probe the visual language priors of VLMs by constructing Question-Image-Answer triplets that deliberately deviate from the training data distribution, and we propose Image-DPO to encourage the model to rely more on its visual inputs.
Abstract: Vision-Language Models (VLMs) may over-rely on visual language priors from their training data rather than perform true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q\&A pairs. Each question in ViLP is coupled with three potential answers and three corresponding images: one that can be resolved by text priors alone and two that demand visual reasoning. Although humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4o achieves only 66.17\% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA data and then apply pixel-level and semantic corruptions to form ``good-bad'' image pairs for self-training. Our proposed training objective, Image-DPO, compels VLMs to focus more on the actual visual inputs, and we demonstrate its effectiveness on LLaVA-v1.5 and Cambrian. Project Page: \href{https://vilp-team.github.io/}{ViLP}.
Lay Summary: Modern AI systems that look at pictures and read text—called vision-language models—sometimes answer by guessing from familiar patterns in their training data instead of truly “seeing” the image. We built a new test, ViLP, that shows them unusual, computer-generated pictures with tricky questions. Each question has three possible answers: one you could guess from general knowledge and two that require careful visual inspection. People get almost everything right, but even the powerful GPT-4o is correct only 66 percent of the time, revealing a heavy reliance on shortcuts. To help these models improve, we created a self-training recipe called Image-DPO. The model first invents its own image–question pairs, then practices on slightly “corrupted” versions of those images and learns to change its answer whenever the picture changes. This teaches the model to pay closer attention to what it actually sees. Two open-source models, LLaVA-v1.5 and Cambrian, showed clear gains after this training. All code, data, and a demo are freely available at the ViLP website: https://vilp-team.github.io/.
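The abstract and lay summary describe Image-DPO as a DPO-style objective over "good-bad" image pairs sharing the same question-answer text. The paper's exact formulation is available via the project page; below is a minimal, hypothetical sketch of such an image-level DPO loss, assuming PyTorch and placeholder per-sample log-likelihood inputs (all names here are illustrative, not the authors' code).

```python
# Hypothetical sketch of an image-level DPO objective (not the authors' exact implementation).
# Each argument is a tensor of summed log-likelihoods of the same answer text given the
# question and either the clean ("good") or corrupted ("bad") image.
import torch
import torch.nn.functional as F

def image_dpo_loss(policy_logp_good, policy_logp_bad,
                   ref_logp_good, ref_logp_bad, beta=0.1):
    """DPO-style loss over clean vs. corrupted image pairs for identical Q&A text,
    pushing the policy (relative to a frozen reference model) to ground its answer
    in the uncorrupted image rather than in text priors."""
    # Policy-vs-reference log-ratio on the clean image...
    good_margin = policy_logp_good - ref_logp_good
    # ...and on the corrupted image.
    bad_margin = policy_logp_bad - ref_logp_bad
    # Standard Bradley-Terry / DPO objective on the difference of margins.
    return -F.logsigmoid(beta * (good_margin - bad_margin)).mean()

# Example with dummy per-sample log-likelihoods:
pg = torch.tensor([-12.3, -9.8]); pb = torch.tensor([-11.9, -10.1])
rg = torch.tensor([-12.0, -10.0]); rb = torch.tensor([-12.0, -10.0])
loss = image_dpo_loss(pg, pb, rg, rb)
```

In this reading, corrupting the image (pixel-level or semantic) plays the role of the "rejected" sample in standard DPO, so the preference signal comes from the image rather than from alternative answer texts.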
Link To Code: https://vilp-team.github.io/
Primary Area: Applications->Computer Vision
Keywords: Visual Language Model, Visual Language Priors, DPO
Flagged For Ethics Review: true
Submission Number: 1396