AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding

Ahmed Masry; Juan A. Rodriguez; Tianyu Zhang; Suyuchen Wang; Chao Wang; Aarash Feizi; Akshay Kalkunte Suresh; Abhay Puri; Xiangru Jian; Pierre-Andre Noel; Sathwik Tejaswi Madhusudhan; Marco Pedersoli; Bang Liu; Nicolas Chapados; Yoshua Bengio; Enamul Hoque; Christopher Pal; Issam H. Laradji; David Vazquez; Perouz Taslakian; Spandana Gella; Sai Rajeswar

Published: 18 Sept 2025, Last Modified: 29 Oct 2025. NeurIPS 2025 poster. CC BY 4.0

Keywords: vision language models, multimodal, large language models

Abstract: Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), lack inductive bias to constrain visual features within the linguistic structure of the LLM’s embedding space, making them data-hungry and prone to cross-modal misalignment. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where visual and textual modalities are highly correlated. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods, with larger gains on document understanding and under low-resource setups. We provide further analysis demonstrating its efficiency and robustness to noise.
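The core idea in the abstract, mapping a visual feature to a weighted average of LLM text embeddings, can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not say how the weights are computed, so the linear projection followed by a softmax used here is an assumption, and every name (`align_connector`, `proj`, `text_embeddings`, the toy sizes) is hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def align_connector(visual_feature, proj, text_embeddings):
    """Map one visual feature to a weighted average of text embeddings.

    visual_feature:  list of d_v floats (hypothetical vision-encoder output)
    proj:            V x d_v matrix (hypothetical learned projection)
    text_embeddings: V x d matrix (stand-in for the LLM embedding table)
    """
    # Assumption: one logit per vocabulary entry, turned into weights by softmax.
    logits = [sum(w * x for w, x in zip(row, visual_feature)) for row in proj]
    weights = softmax(logits)
    dim = len(text_embeddings[0])
    # Weighted average of the embedding rows, one output coordinate at a time.
    output = [
        sum(p * emb[j] for p, emb in zip(weights, text_embeddings))
        for j in range(dim)
    ]
    return output, weights


# Toy data: random values stand in for real encoder outputs and embeddings.
rng = random.Random(0)
V, D_V, D = 6, 4, 3  # toy vocabulary size, visual dim, text-embedding dim
text_embeddings = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(V)]
proj = [[rng.uniform(-1, 1) for _ in range(D_V)] for _ in range(V)]
visual_feature = [rng.uniform(-1, 1) for _ in range(D_V)]

aligned, weights = align_connector(visual_feature, proj, text_embeddings)
```

Because the weights in this sketch are nonnegative and sum to one, the output always lies inside the convex hull of the text embeddings. That is one way to read the abstract's claim that visual features land in regions of the space the LLM can interpret, whereas an unconstrained MLP connector has no such guarantee.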

Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)

Submission Number: 9649
