Lost in Embeddings: Information Loss in Vision-Language Models

ACL ARR 2025 February Submission 2673 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Vision-language models typically process visual inputs through a pretrained vision encoder followed by a projection into the language model's embedding space. While crucial for modality fusion, this projection step induces an under-characterized information loss that directly impacts model capabilities. We propose two novel approaches to quantifying the visual information lost at this projection step. First, we evaluate the preservation of semantic information and structural relationships by analyzing changes in nearest-neighbor rankings between representations. Second, to localize information loss within image representations at the patch level, we measure it directly through visual embedding reconstruction. Focusing on connector-based VLMs, our experiments reveal that projection layers fundamentally alter visual semantic relationships: nearest-neighbor similarity rankings diverge by 40-60% after projection, directly explaining observed drops in retrieval performance. Our embedding reconstruction approach provides interpretable insights into model behavior on visual question-answering tasks, finding that areas of high information loss reliably predict instances where models struggle.
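
The abstract's first diagnostic compares nearest-neighbor rankings of visual embeddings before and after the connector projection. Below is a minimal sketch of one plausible instantiation of that idea, not the authors' released code: it measures divergence as the fraction of top-k cosine neighbors that disagree across the projection. The variable names (`pre_proj`, `post_proj`) and the random stand-in projection are illustrative assumptions.

```python
import numpy as np

def knn_indices(embs: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbors (cosine similarity) for each row."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-matches
    return np.argsort(-sims, axis=1)[:, :k]  # top-k, most similar first

def neighbor_divergence(pre_proj: np.ndarray, post_proj: np.ndarray, k: int = 10) -> float:
    """Fraction of k-NN slots that change across the projection.

    0.0 means identical neighbor sets; 1.0 means completely disjoint sets.
    A value of ~0.4-0.6 would correspond to the 40-60% divergence the
    abstract reports for connector-based VLMs.
    """
    pre_nn = knn_indices(pre_proj, k)
    post_nn = knn_indices(post_proj, k)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(pre_nn, post_nn)]
    return 1.0 - float(np.mean(overlap))

# Toy usage: 128 images, encoder dim 768, language-model dim 4096 (stand-in data).
rng = np.random.default_rng(0)
pre = rng.normal(size=(128, 768))
post = np.tanh(pre @ rng.normal(size=(768, 4096)) * 0.1)  # stand-in for a learned connector
print(f"k-NN divergence: {neighbor_divergence(pre, post):.2f}")
```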
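
The second diagnostic reconstructs pre-projection visual embeddings from post-projection ones and reads the per-patch reconstruction error as a localized information-loss map. The following is a hedged sketch under assumed dimensions and an assumed two-layer decoder; the abstract specifies only that loss is measured at the patch level via embedding reconstruction, not this architecture.

```python
import torch
import torch.nn as nn

class ReconstructionProbe(nn.Module):
    """Small decoder mapping projected (LM-space) patch embeddings back to encoder space."""
    def __init__(self, lm_dim: int = 4096, vis_dim: int = 768):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(lm_dim, vis_dim), nn.GELU(), nn.Linear(vis_dim, vis_dim)
        )

    def forward(self, projected: torch.Tensor) -> torch.Tensor:
        return self.decoder(projected)

def patch_loss_map(probe: ReconstructionProbe, projected: torch.Tensor,
                   original: torch.Tensor) -> torch.Tensor:
    """Per-patch squared reconstruction error, shape (batch, num_patches)."""
    with torch.no_grad():
        recon = probe(projected)
    return ((recon - original) ** 2).mean(dim=-1)

# Illustrative fitting on stand-in data: 8 images x 576 ViT patches.
probe = ReconstructionProbe()
opt = torch.optim.Adam(probe.parameters(), lr=1e-4)
orig = torch.randn(8, 576, 768)   # pre-projection encoder embeddings
proj = torch.randn(8, 576, 4096)  # post-projection connector outputs
for _ in range(3):                # a few steps for illustration only
    loss = ((probe(proj) - orig) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(patch_loss_map(probe, proj, orig).shape)  # torch.Size([8, 576])
```

Patches with persistently high error after the probe converges are the candidate regions where the projection discarded information, which the abstract links to model failures on visual question answering.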
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality, Visual question answering, information loss
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2673