What is the Color of RED? Vision–Language Models Prefer to Read Rather Than See

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Vision–Language Models, Stroop Effect, Multimodal Conflict, Latent Interventions, Bias and Interpretability
TL;DR: We show that Vision–Language Models consistently favor words over ink colors in Stroop-style conflicts, and probe this bias via latent interventions in CLIP.
Abstract: A Vision–Language Model (VLM) learns a joint representation of images and text and generates text based on this understanding. Yet when multiple visual cues within an image conflict, such as a written word and its ink color, we do not fully understand how the model decides which signal to prioritize. A classical psychological paradigm for studying how conflicting cues affect decisions is the Stroop test, in which participants are shown words in incongruent ink colors (e.g., the word "red" written in blue) and are instructed to report the ink color rather than read the word. We adapt the Stroop paradigm to VLMs and study how conflicting cues in the written word or the ink color influence model behavior. Applying the Stroop test to a range of contrastive and generative VLMs suggests that the models favor textual cues over color when text and color conflict. Analyzing the representations of the two cue types suggests that text cues in images are more salient than color cues. This difference in saliency also translates to different intervention success in steering the VLMs: we find it is easier to steer the embedding to make the model favor text cues than color cues. Overall, using the Stroop test, our findings suggest that VLMs, similar to humans, are biased to "read" an image rather than to "see" it, and that the saliency of the two cue types is reflected in their embedding space. We will release our dataset and code to support future research upon acceptance.
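Code sketch (illustrative, not from the submission): a minimal version of a contrastive Stroop probe, assuming a standard HuggingFace CLIP checkpoint ("openai/clip-vit-base-patch32"). The rendered stimulus and the prompt templates below are hypothetical stand-ins for the paper's dataset and prompts, but they show the core measurement: whether the image embedding aligns more with the written word or with the ink color.

from PIL import Image, ImageDraw, ImageFont
import torch
from transformers import CLIPModel, CLIPProcessor

# Render an incongruent Stroop stimulus: the word "red" written in blue ink.
img = Image.new("RGB", (224, 224), "white")
draw = ImageDraw.Draw(img)
try:
    font = ImageFont.truetype("DejaVuSans-Bold.ttf", 48)  # font path varies by system
except OSError:
    font = ImageFont.load_default()
draw.text((40, 88), "red", fill="blue", font=font)

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Two probes over the same image: the first pair targets the written word,
# the second pair targets the ink color.
prompts = ["the word red", "the word blue",
           "text in red ink", "text in blue ink"]
inputs = processor(text=prompts, images=img, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]  # image-text similarity scores

word_pref = logits[:2].softmax(-1)   # does CLIP "read" red or blue?
color_pref = logits[2:].softmax(-1)  # does CLIP "see" red or blue ink?
print("word probe  (red vs blue):", word_pref.tolist())
print("color probe (red vs blue):", color_pref.tolist())

A text-favoring model, as the abstract reports, assigns high probability to "red" in both probes: it answers with the written word even when the prompt asks about the ink color.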
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11244