Keywords: Large Vision-Language Models; Consistency Prediction; Internal Representations
TL;DR: This paper shows that LVLM consistency can be predicted efficiently from a single forward pass.
Abstract: Large Vision-Language Models (LVLMs) have shown strong performance on a wide range of multimodal tasks, yet their reliability, especially their consistency across semantically equivalent inputs, remains imperfect. Prior work has evaluated consistency by aggregating model responses over multiple paraphrased or restyled variants of an input. Such repeated sampling is computationally expensive, which makes these methods difficult to use in real time. In this paper, we consider a more efficient and competitively effective alternative: asking whether consistency can be predicted directly from the model's internal states. Specifically, we introduce \textit{single-pass consistency prediction}, which estimates an LVLM's consistency from a single forward pass. Intuitively, consistent examples occupy coherent, in-distribution regions of the representation space, whereas unstable examples exhibit distinctive deviations that can be detected before inconsistency arises in the generated tokens. Across several consistency and robustness evaluations, we find that features available from a single forward pass, including hidden-state representations and output logits, contain predictive signals of future model stability. Further systematic analysis of layers, components, and token positions provides insight into where consistency-related information resides within the model. Together, our findings suggest that internal representations provide both a practical, effective mechanism for low-cost consistency monitoring and a useful lens for understanding the internal basis of reliable multimodal reasoning in LVLMs.
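As a rough illustration of the idea in the abstract, the sketch below trains a linear probe that maps features from one forward pass to a consistency score. It is not the paper's method: the abstract does not specify the predictor, so the logistic-regression probe, the function names `train_linear_probe` and `predict_consistency`, and the synthetic feature vectors (standing in for real LVLM hidden states) are all assumptions made for illustration. In practice the labels would presumably come from an offline multi-variant consistency evaluation.

```python
import numpy as np


def train_linear_probe(features, labels, lr=0.1, epochs=500, l2=1e-3):
    """Fit a logistic-regression probe by plain gradient descent.

    features: (n, d) array, one vector per example (hypothetically a
              hidden state taken from a single forward pass).
    labels:   (n,) array of 0/1, where 1 means "model was consistent".
    """
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (features.T @ grad / n + l2 * w)
        b -= lr * grad.mean()
    return w, b


def predict_consistency(features, w, b):
    """Return the predicted probability that each example is consistent."""
    return 1.0 / (1.0 + np.exp(-(features @ w + b)))


# Synthetic stand-in for hidden states: the two classes differ by a shift
# along one random direction. Real features would come from the LVLM itself.
rng = np.random.default_rng(0)
dim, n = 16, 400
labels = rng.integers(0, 2, size=n).astype(float)
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
features = rng.normal(size=(n, dim)) + 2.0 * (2 * labels[:, None] - 1) * direction

# Train on the first 300 examples and evaluate on the remaining 100.
w, b = train_linear_probe(features[:300], labels[:300])
probs = predict_consistency(features[300:], w, b)
accuracy = float(((probs > 0.5) == (labels[300:] == 1)).mean())
print(f"held-out probe accuracy: {accuracy:.2f}")
```

Once such a probe is trained, scoring a new input needs only the features from a single forward pass, which is the source of the cost saving over repeated sampling.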
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 371