The Visual Faithfulness Paradox: Scaling Vision–Language Models Degrades Glyph Recognition in Logographic Scripts
Keywords: Vision–Language Models, Multilingual OCR, Visual Faithfulness, Scaling Laws, Language Priors, Chinese, Japanese
Abstract: We identify a Visual Faithfulness Paradox in scaling Vision–Language Models (VLMs): although larger models improve aggregate OCR metrics, they can systematically degrade glyph-level accuracy for non-alphabetic scripts. In controlled trilingual OCR experiments with InternVL3 (1B–14B), English performance increases monotonically. Chinese follows a non-monotonic scaling curve: the small models (1B/2B) suffer visual collapse, the 8B model achieves the best visual–language balance, and the 14B model drifts into language-prior-driven glyph substitution. Japanese exhibits an intermediate pattern, while visually similar kana introduce persistent variability. A fine-grained evaluation on a mixed-script Chinese phrase confirms the shift: the 14B model’s perfect-match rate falls to 25% (vs. 54% at 8B), while its semantic-deviation error rate is 7.28x higher. We explain these effects with a Visual Signal-to-Noise Ratio (VSNR) account: beyond a script-dependent threshold, strengthened language priors override ambiguous visual evidence during decoding. These results expose a fundamental trade-off between linguistic fluency and visual faithfulness, challenging the assumption that “bigger is always better”. Future VLMs must explicitly reinforce glyph-structure perception and coordinate the glyph stream with the knowledge stream.
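The abstract's glyph-level metrics (perfect-match rate, substitution-style errors) are not defined in this listing; the sketch below is a minimal, illustrative Python example of how such character-level scores could be computed over parallel reference/prediction lists. The function names and toy data are assumptions for illustration and do not reproduce the paper's actual evaluation pipeline.

```python
# Illustrative only: a simple glyph-level evaluation sketch (not the paper's pipeline).
# Assumes `references` and `predictions` are parallel lists of OCR ground-truth and
# model-output strings (e.g., mixed-script Chinese phrases).
from difflib import SequenceMatcher

def perfect_match_rate(references, predictions):
    """Fraction of samples whose prediction matches the reference exactly."""
    exact = sum(ref == pred for ref, pred in zip(references, predictions))
    return exact / len(references)

def substitution_rate(references, predictions):
    """Rough per-character substitution rate via sequence alignment."""
    subs, total = 0, 0
    for ref, pred in zip(references, predictions):
        total += len(ref)
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, pred).get_opcodes():
            if tag == "replace":
                subs += max(i2 - i1, j2 - j1)
    return subs / max(total, 1)

if __name__ == "__main__":
    refs = ["深度学习模型", "视觉语言模型"]
    preds = ["深度学习模型", "视觉语言摸型"]  # one glyph substituted in the second sample
    print(perfect_match_rate(refs, preds))   # 0.5
    print(substitution_rate(refs, preds))    # 1 substituted glyph out of 12
```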
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal pretraining, multimodality, vision question answering, cross-modal content generation, vision language navigation
Contribution Types: Model analysis & interpretability
Languages Studied: English, Chinese, Japanese
Submission Number: 1878