Textural or Textual: How Vision-Language Models Read Text in Images

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This study explores how vision-language models represent the semantics of text in images and the mechanisms through which such text disrupts visual understanding.
Abstract: Typographic attacks are often attributed to the ability of multimodal pre-trained models to fuse textual semantics into visual representations, yet the mechanisms and locus of such interference remain unclear. We examine whether such models genuinely encode textual semantics or primarily rely on texture-based visual features. To disentangle orthographic form from meaning, we introduce the ToT dataset, which contains controlled word pairs that either share semantics but differ in appearance (synonyms) or share appearance but differ in semantics (paronyms). A layer-wise analysis of Intrinsic Dimension (ID) reveals that early layers exhibit competing dynamics between orthographic and semantic representations. In later layers, semantic accuracy increases as ID decreases, but this improvement largely stems from orthographic disambiguation. Notably, clear semantic differentiation emerges only in the final block, challenging the common assumption that semantic understanding is progressively constructed across depth. These findings reveal how current vision-language models construct text representations through texture-dependent processes, prompting a reconsideration of the gap between visual perception and semantic understanding. The code is available at: https://github.com/Ovsia/Textural-or-Textual
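The layer-wise ID analysis described above can be approximated with off-the-shelf tools. Below is a minimal sketch, assuming a TwoNN-style estimator (Facco et al., 2017) and the Hugging Face `transformers` CLIP vision tower; the paper's exact estimator, model checkpoint, and ToT image-rendering pipeline may differ, and the image paths are placeholders supplied by the caller.

```python
# Sketch: layer-wise intrinsic dimension (ID) of CLIP vision features.
# Assumptions (not from the paper): CLIP ViT-B/32 via Hugging Face transformers
# and a TwoNN-style maximum-likelihood ID estimator over CLS-token features.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel


def twonn_id(x: np.ndarray) -> float:
    """TwoNN maximum-likelihood ID estimate for an (N, D) feature matrix."""
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    sorted_d = np.sort(dists, axis=1)
    r1, r2 = sorted_d[:, 0], sorted_d[:, 1]        # nearest / second-nearest distances
    mu = r2 / np.clip(r1, 1e-12, None)
    mu = mu[mu > 1.0]                              # guard against duplicate points
    return float(len(mu) / np.sum(np.log(mu)))


@torch.no_grad()
def layerwise_id(image_paths, model_name="openai/clip-vit-base-patch32"):
    """Estimate ID of the CLS-token representation at every vision-encoder layer."""
    model = CLIPVisionModel.from_pretrained(model_name).eval()
    processor = CLIPImageProcessor.from_pretrained(model_name)
    images = [Image.open(p).convert("RGB") for p in image_paths]
    pixel_values = processor(images=images, return_tensors="pt").pixel_values
    out = model(pixel_values=pixel_values, output_hidden_states=True)
    # out.hidden_states: one tensor per layer (plus the embedding output)
    return [twonn_id(h[:, 0, :].cpu().numpy()) for h in out.hidden_states]
```

Comparing the resulting ID curves for synonym pairs versus paronym pairs, layer by layer, is one way to reproduce the kind of orthography-versus-semantics analysis the abstract describes.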
Lay Summary: Can AI models that read text in images really understand what the words mean? Or are they simply recognizing shapes and patterns, like reading handwriting without knowing the language? We investigated this question and found that, in most layers of these models, words are treated as visual textures rather than meaningful language. As the model compresses these visual features, its ability to recognize words improves. However, this improvement is still rooted in how the text looks, not what it means. Only in the final block, across its last few layers, does the model begin to show signs of actual language understanding. This discovery led us to a simple but effective solution. By fine-tuning this final part of the model, we can help it better distinguish between meaningless text patterns and meaningful words. This makes the system more robust against typographic attacks that try to confuse it with misleading or irrelevant text.
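The lay summary's remedy, fine-tuning only the final block, can be set up as follows. This is a minimal sketch assuming the Hugging Face `CLIPModel` layout; the training objective, data pipeline, and choice of which parameters to unfreeze are placeholders, not the paper's actual recipe.

```python
# Sketch: unfreeze only the last vision-transformer block of CLIP for fine-tuning.
# Assumption (not from the paper): Hugging Face CLIPModel; loss and data omitted.
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Freeze every parameter first.
for p in model.parameters():
    p.requires_grad = False

# Unfreeze the final encoder block of the vision tower, plus its post layer norm
# so the updated block's output statistics stay well conditioned.
for p in model.vision_model.encoder.layers[-1].parameters():
    p.requires_grad = True
for p in model.vision_model.post_layernorm.parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
# The training loop and loss (e.g., a contrastive objective over ToT word pairs)
# are hypothetical here and depend on the authors' released code.
```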
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Typographic Attack, Vision-Language Models, Intrinsic Dimension, CLIP
Submission Number: 7117