Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance in processing and reasoning over text and images. However, they remain susceptible to hallucinations, i.e., instances where the generated content deviates from the input or contradicts established knowledge. While hallucinations in MLLMs have attracted increasing attention, the specific impact of font variation, a common yet overlooked source of hallucination, has not been systematically investigated. Moreover, existing OCR benchmarks include limited font diversity and focus primarily on layout or background changes, offering little fine-grained control over font factors and neglecting long-tail fonts. To address this gap, we introduce and categorize font-induced hallucinations and conduct comprehensive experiments examining how fonts affect MLLMs across dimensions such as font perturbations, style shifts, font-semantic interactions, and sentiment recognition. Based on these findings, we propose FontHalu, a benchmark with diverse font types and scenario settings, specifically designed to evaluate MLLMs' robustness in OCR, key information extraction (KIE), and sentiment analysis under font variation. We will release FontHalu and the related code to support research on improving the reliability and robustness of MLLMs.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English, Chinese
Submission Number: 5714