Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding

Published: 26 Jan 2026 · Last Modified: 26 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: large vision-language models, multimodality, language prior
TL;DR: A formal framework for understanding and quantifying the language prior in LVLMs by contrasting the chain-of-embedding between visual and blind contexts.
Abstract: Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP), i.e., memorized textual patterns from pre-training, while under-utilizing visual evidence. Prior analyses of LP mostly rely on input–output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of the language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding for multimodal reasoning. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates the representational discrepancy beyond the VIP to quantify how strongly the visual query influences response generation. Across 60 model–dataset combinations spanning 10 contemporary LVLMs and 6 benchmarks, we demonstrate that VIP consistently emerges and that TVI reliably predicts the strength of the language prior. This offers a principled toolkit for diagnosing and understanding the language prior in LVLMs.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8070
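
The abstract describes TVI only at a high level. The sketch below is a minimal illustration of the idea it outlines: extract the layer-wise chain-of-embedding for the same query under a visual and a blind (image-removed) context, measure a per-layer representational discrepancy, locate a VIP, and aggregate the discrepancy beyond it. The cosine-distance metric, the threshold-based VIP criterion, and all names (`layer_discrepancy`, `find_vip`, `total_visual_integration`, `tau`) are assumptions for illustration, not the paper's actual definitions.

```python
import numpy as np

def layer_discrepancy(h_visual: np.ndarray, h_blind: np.ndarray) -> np.ndarray:
    """Per-layer cosine distance between the two chains of embedding.

    h_visual, h_blind: (num_layers, hidden_dim) arrays holding a per-layer
    hidden state (e.g., of the last input token), extracted once with the
    image present and once with it removed (the "blind" context).
    """
    dots = np.sum(h_visual * h_blind, axis=-1)
    norms = np.linalg.norm(h_visual, axis=-1) * np.linalg.norm(h_blind, axis=-1)
    return 1.0 - dots / np.clip(norms, 1e-12, None)

def find_vip(disc: np.ndarray, tau: float = 0.05) -> int:
    """Hypothetical VIP criterion: the first layer whose discrepancy
    exceeds a threshold tau (the paper's exact rule may differ)."""
    above = np.flatnonzero(disc > tau)
    return int(above[0]) if above.size else len(disc)

def total_visual_integration(h_visual: np.ndarray, h_blind: np.ndarray) -> float:
    """TVI sketch: aggregate discrepancy over layers at and beyond the VIP."""
    disc = layer_discrepancy(h_visual, h_blind)
    vip = find_vip(disc)
    return float(disc[vip:].sum())

# Toy usage with random hidden states standing in for real LVLM activations:
# here visual information only reshapes representations after layer 16.
rng = np.random.default_rng(0)
h_blind = rng.normal(size=(32, 4096))
h_visual = h_blind + np.concatenate(
    [np.zeros((16, 4096)), rng.normal(scale=0.5, size=(16, 4096))]
)
print(total_visual_integration(h_visual, h_blind))
```

Under this reading, a small TVI means the visual and blind chains barely diverge after the VIP, i.e., the response is driven largely by the language prior, while a large TVI indicates strong visual integration.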