Information Loss in Vision–Language Models

ACL ARR 2025 May Submission 1121 Authors

16 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Vision-language models (VLMs) often process visual inputs through a pretrained vision encoder, followed by a projection into the language model's embedding space via a connector component. While this projection is crucial for modality fusion, the information loss it induces and its direct impact on model capabilities remain understudied. We propose two novel approaches to quantify this visual information loss by analyzing the latent representation space. First, we evaluate semantic information preservation by analyzing changes in k-nearest-neighbor relationships among image representations before and after projection. Second, we measure information loss directly by reconstructing visual embeddings from the projected representations, localizing the loss at the image-patch level. Our experiments reveal that connectors fundamentally alter visual semantic relationships: the k-nearest neighbors of visual embeddings diverge by 40-60% after projection, and this divergence correlates strongly with degradation in retrieval performance. Patch-level embedding reconstruction yields interpretable insights into model behavior on visual question-answering tasks, showing that areas of high information loss reliably predict instances where models struggle.
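The two probes the abstract describes lend themselves to a compact sketch. The snippet below is illustrative only, not the authors' code: `pre` and `post` stand for pooled image embeddings before and after the connector, and the neighbor count `k` and the cosine-similarity choice are assumptions, since the abstract does not fix these details.

```python
# Minimal sketch (assumed setup, not the paper's implementation):
# k-NN overlap between pre- and post-projection image embeddings.
import numpy as np

def knn_overlap(pre: np.ndarray, post: np.ndarray, k: int = 10) -> float:
    """Mean fraction of shared k-nearest neighbors before vs. after projection.

    pre, post: (n_images, dim) pooled image embeddings.
    Returns 1.0 when the projection preserves local neighborhoods exactly;
    lower values indicate the kind of divergence the abstract reports.
    """
    def knn_ids(x: np.ndarray) -> np.ndarray:
        x = x / np.linalg.norm(x, axis=1, keepdims=True)  # cosine similarity
        sim = x @ x.T
        np.fill_diagonal(sim, -np.inf)                    # exclude self-matches
        return np.argsort(-sim, axis=1)[:, :k]

    ids_pre, ids_post = knn_ids(pre), knn_ids(post)
    return float(np.mean([len(set(a) & set(b)) / k
                          for a, b in zip(ids_pre, ids_post)]))
```

The second probe can be sketched as a trained decoder that maps connector outputs back to vision-encoder patch embeddings, with per-patch reconstruction error serving as a localized loss map. The linear decoder and MSE objective here are assumptions for illustration; the paper may use a different probe architecture.

```python
# Hypothetical reconstruction probe for patch-level information loss.
import torch
import torch.nn as nn

class ReconstructionProbe(nn.Module):
    """Decoder from LM-space patch embeddings back to vision-encoder space."""
    def __init__(self, lm_dim: int, vision_dim: int):
        super().__init__()
        self.decoder = nn.Linear(lm_dim, vision_dim)

    def forward(self, projected: torch.Tensor) -> torch.Tensor:
        # projected: (batch, n_patches, lm_dim) -> (batch, n_patches, vision_dim)
        return self.decoder(projected)

@torch.no_grad()
def patch_loss_map(probe: ReconstructionProbe,
                   projected: torch.Tensor,
                   vision: torch.Tensor) -> torch.Tensor:
    """Per-patch MSE after the probe is trained: high values mark patches
    whose visual information the connector failed to preserve."""
    return ((probe(projected) - vision) ** 2).mean(dim=-1)  # (batch, n_patches)
```

A linear probe gives a conservative lower bound on how much vision-encoder information remains linearly recoverable from the projected tokens; a higher-capacity decoder would tighten the bound at the cost of a less interpretable probe.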
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Vision-language models, information loss, visual question answering, image captioning
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1121