Keywords: Object Hallucination, Vision-Language Models, Efficient Vision-Language Models
Abstract: Large Vision-Language Models (LVLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs) to achieve seamless integration of visual and textual information. This paper investigates the embedding spaces generated by a representative vision encoder (ViT) and a powerful LLM (Vicuna), uncovering a critical disparity. Our analysis reveals that ViT token embeddings exhibit a surprisingly uniform distribution, lacking the rich semantic structure inherent in Vicuna's LLM embeddings. This absence of a well-defined semantic space in visual token embeddings poses a significant challenge to multimodal alignment, hindering the model's ability to establish meaningful correspondences between visual and textual elements. We demonstrate the implications of this embedding space divergence through a rigorous analysis of statistical properties. We argue that bridging this semantic gap requires complex mappings, ultimately limiting current LVLMs' multimodal reasoning capabilities. These findings provide valuable insights for future research aimed at developing more effective alignment strategies and achieving enhanced visual and linguistic understanding in LVLMs.
Submission Number: 192
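The statistical comparison of embedding spaces described in the abstract can be illustrated with a minimal sketch. The snippet below is not the authors' code: it contrasts a placeholder set of "visual" token embeddings with a placeholder set of LLM vocabulary embeddings using mean pairwise cosine similarity and the uniformity metric of Wang and Isola (2020). The tensor shapes, sample size, and temperature `t` are assumptions; in practice the random tensors would be replaced by ViT patch-token outputs and rows of Vicuna's input embedding matrix.

```python
# Illustrative sketch (assumed setup, not the paper's implementation):
# compare distributional statistics of visual vs. text token embeddings.
import torch
import torch.nn.functional as F


def pairwise_cosine_stats(emb: torch.Tensor, sample: int = 2048):
    """Mean/std of off-diagonal pairwise cosine similarity on a random subsample."""
    idx = torch.randperm(emb.size(0))[:sample]
    x = F.normalize(emb[idx], dim=-1)
    sims = x @ x.T
    mask = ~torch.eye(x.size(0), dtype=torch.bool)
    off_diag = sims[mask]
    return off_diag.mean().item(), off_diag.std().item()


def uniformity(emb: torch.Tensor, t: float = 2.0, sample: int = 2048):
    """Uniformity metric (Wang & Isola, 2020): lower values = more uniform on the hypersphere."""
    idx = torch.randperm(emb.size(0))[:sample]
    x = F.normalize(emb[idx], dim=-1)
    sq_dists = torch.cdist(x, x).pow(2)
    mask = ~torch.eye(x.size(0), dtype=torch.bool)
    return torch.log(torch.exp(-t * sq_dists[mask]).mean()).item()


if __name__ == "__main__":
    # Random placeholders standing in for real embeddings (shapes are assumptions).
    visual_tokens = torch.randn(4096, 1024)   # e.g., ViT patch embeddings
    text_tokens = torch.randn(32000, 4096)    # e.g., Vicuna vocabulary embeddings

    for name, emb in [("visual", visual_tokens), ("text", text_tokens)]:
        mean, std = pairwise_cosine_stats(emb)
        print(f"{name}: cos-sim mean={mean:.3f} std={std:.3f}, "
              f"uniformity={uniformity(emb):.3f}")
```

A more uniform (structureless) distribution would show near-zero mean cosine similarity with low variance and a low uniformity value, whereas a semantically clustered space would show higher similarity variance; this is only one possible way to operationalize the comparison the abstract refers to.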