The Visual Faithfulness Paradox: Scaling Vision–Language Models Degrades Glyph Recognition in Logographic Scripts
Keywords: Vision–Language Models, Multilingual OCR, Visual Faithfulness, Scaling Laws, Language Priors, Chinese, Japanese
Abstract: We identify a Visual Faithfulness Paradox in scaling Vision–Language Models (VLMs): although larger models improve aggregate OCR metrics, they can systematically degrade glyph-level accuracy for non-alphabetic scripts. In controlled trilingual OCR experiments with InternVL3 (1B–14B), English performance increases monotonically. Chinese follows a non-monotonic scaling curve: the small models (1B/2B) suffer visual collapse, the 8B model achieves the best visual–language balance, and the 14B model drifts into language-prior-driven glyph substitution. Japanese exhibits an intermediate pattern, while visually similar kana introduce persistent variability. A fine-grained evaluation on a mixed-script Chinese phrase confirms the shift: the 14B model’s perfect-match rate falls to 25% (vs. 54% at 8B), while its semantic-deviation error rate is 7.28x higher. We explain these effects with a Visual Signal-to-Noise Ratio (VSNR) account: beyond a script-dependent threshold, strengthened language priors override ambiguous visual evidence during decoding. These results expose a fundamental trade-off between linguistic fluency and visual faithfulness, challenging the assumption that “bigger is always better”. Future VLMs must explicitly reinforce glyph-structure perception and coordinate the glyph stream with the knowledge stream.
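The abstract's glyph-level metrics (perfect-match rate, substitution-style errors) are not defined in this listing; the sketch below is a minimal, illustrative Python example of how such character-level scores could be computed over parallel reference/prediction lists. The function names and toy data are assumptions for illustration and do not reproduce the paper's actual evaluation pipeline.

```python
# Illustrative only: a simple glyph-level evaluation sketch (not the paper's pipeline).
# Assumes `references` and `predictions` are parallel lists of OCR ground-truth and
# model-output strings (e.g., mixed-script Chinese phrases).
from difflib import SequenceMatcher

def perfect_match_rate(references, predictions):
    """Fraction of samples whose prediction matches the reference exactly."""
    exact = sum(ref == pred for ref, pred in zip(references, predictions))
    return exact / len(references)

def substitution_rate(references, predictions):
    """Rough per-character substitution rate via sequence alignment."""
    subs, total = 0, 0
    for ref, pred in zip(references, predictions):
        total += len(ref)
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, pred).get_opcodes():
            if tag == "replace":
                subs += max(i2 - i1, j2 - j1)
    return subs / max(total, 1)

if __name__ == "__main__":
    refs = ["深度学习模型", "视觉语言模型"]
    preds = ["深度学习模型", "视觉语言摸型"]  # one glyph substituted in the second sample
    print(perfect_match_rate(refs, preds))   # 0.5
    print(substitution_rate(refs, preds))    # 1 substituted glyph out of 12
```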
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal pretraining, multimodality, vision question answering, cross-modal content generation, vision language navigation
Contribution Types: Model analysis & interpretability
Languages Studied: English, Chinese, Japanese
Submission Number: 1878