Abstract: Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks, yet they often behave inconsistently when faced with semantically equivalent inputs, undermining their reliability and robustness. Recent benchmarks, such as MM-R$^3$, highlight that even state-of-the-art VLMs can produce divergent predictions across such inputs despite maintaining high average accuracy. Prior work addresses this issue by modifying model architectures or conducting large-scale fine-tuning on curated datasets. In contrast, we propose a simple and effective *test-time consistency framework* that enhances semantic consistency *without supervised re-training*.
Our method is entirely *post-hoc*, model-agnostic, and applicable to any VLM whose weights are accessible. Given a single test point, we enforce consistent predictions via two complementary objectives: (i) a **Cross-Entropy Agreement Loss** that aligns predictive distributions across semantically equivalent inputs, and (ii) a **Pseudo-Label Consistency Loss** that draws outputs toward a self-averaged consensus. Our method is *plug-and-play* and leverages only the single test input itself to improve consistency. Experiments on the MM-R$^3$ benchmark show that our framework yields substantial gains in consistency across state-of-the-art models, establishing a new direction for inference-time adaptation in multimodal learning.
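The sketch below is a minimal, illustrative rendering of the two objectives described above, not the authors' implementation. It assumes the logits for several semantically equivalent variants of one test input are already available; the function name `consistency_losses`, the symmetric pairwise form of the agreement term, and the hard pseudo-label from the averaged distribution are our assumptions about plausible details.

```python
import torch
import torch.nn.functional as F

def consistency_losses(logits_list):
    """Hypothetical sketch of the two test-time objectives.

    logits_list: list of [num_classes] logit tensors, one per semantically
    equivalent variant of a single test input.
    Returns (agreement_loss, pseudo_label_loss), which would then be used to
    update (a subset of) the model's weights at inference time.
    """
    probs = [F.softmax(l, dim=-1) for l in logits_list]
    n = len(probs)

    # (i) Cross-entropy agreement: align each variant's predictive
    # distribution with every other variant's (detached) distribution.
    agreement = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                agreement = agreement - (
                    probs[j].detach() * F.log_softmax(logits_list[i], dim=-1)
                ).sum()
    agreement = agreement / (n * (n - 1))

    # (ii) Pseudo-label consistency: pull every variant toward the label of
    # the self-averaged consensus distribution.
    consensus = torch.stack(probs).mean(dim=0).detach()
    pseudo_label = consensus.argmax(dim=-1)
    pseudo = torch.stack([
        F.cross_entropy(l.unsqueeze(0), pseudo_label.unsqueeze(0))
        for l in logits_list
    ]).mean()

    return agreement, pseudo
```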
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Test-time adaptation, Consistency, VLMs
Contribution Types: Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: English
Keywords: Test-time adaptation, Consistency, VLMs
Submission Number: 56