Keywords: Vision Language Models, Color Vision, Psychophysics, Color Perception, Representation
TL;DR: VLMs inherit the perceptual color space of humans.
Abstract: Vision language models (VLMs) receive raw sRGB pixel values and could, in principle, discriminate colors at machine precision. But do they? And if not, what determines their perceptual thresholds? We use psychophysics-inspired experiments to characterize the color-discrimination boundary of two VLMs (Gemini 3 Flash and Qwen3-VL-8B-Instruct) and ask which color-distance metric best explains it. Across three tasks (odd-one-out, same/different, triplet matching), two models, and both 2D chromaticity and full 3D CIELAB color spaces (totaling over 68,000 trials), CIE ∆E00 (a metric engineered to match human perception) consistently outperforms all input-space metrics, including sRGB L2, linear RGB L2, and CIE XYZ L2. Residual analysis confirms that ∆E00 is a sufficient statistic for VLM sensitivity in the chromaticity plane, though systematic axis-dependent deviations emerge when lightness varies. Layerwise probing of Qwen3-VL-8B-Instruct reveals that patch embeddings strongly prefer sRGB (R² = 0.97) over ∆E00 (R² = 0.46), indicating that perceptual structure is not built into the input projection but emerges downstream in the network. Overall, we demonstrate that VLMs, through large-scale training, have inherited the perceptual color space of the humans involved in data generation.
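To make the metric comparison concrete, here is a minimal sketch (not the authors' code) of the two kinds of color distances the abstract contrasts: a raw input-space sRGB L2 distance versus the perceptual CIE ∆E00 distance computed in CIELAB. It assumes the third-party `colour-science` package, and the helper names `srgb_l2` and `delta_e00` are hypothetical, chosen for illustration.

```python
import numpy as np
import colour  # pip install colour-science

def srgb_l2(rgb_a, rgb_b):
    """Euclidean distance taken directly in 8-bit sRGB input space."""
    return float(np.linalg.norm(np.asarray(rgb_a, float) - np.asarray(rgb_b, float)))

def delta_e00(rgb_a, rgb_b):
    """Perceptual distance: sRGB -> XYZ -> CIELAB, then CIE Delta E 2000."""
    lab_a, lab_b = (
        colour.XYZ_to_Lab(colour.sRGB_to_XYZ(np.asarray(c, float) / 255.0))
        for c in (rgb_a, rgb_b)
    )
    return float(colour.delta_E(lab_a, lab_b, method="CIE 2000"))

# Two pairs with identical sRGB L2 distance need not be equally
# discriminable perceptually; Delta E 2000 captures that asymmetry.
pairs = {
    "blue shift":  ((0, 0, 120), (0, 0, 140)),
    "green shift": ((0, 120, 0), (0, 140, 0)),
}
for name, (a, b) in pairs.items():
    print(f"{name}: sRGB L2 = {srgb_l2(a, b):.1f}, dE00 = {delta_e00(a, b):.2f}")
```

Both pairs are 20 sRGB units apart, so any L2-in-input-space account treats them identically; a ∆E00 account does not, which is the kind of dissociation the paper's trials are designed to probe.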
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 101