FontHalu: Font-based Hallucinations in Multimodal Large Language Models

ACL ARR 2025 February Submission 5884 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · License: CC BY 4.0
Abstract: Multimodal large language models (MLLMs) have achieved remarkable performance in processing and reasoning over text and images. However, they remain susceptible to hallucinations, instances where generated content deviates from the input or contradicts established knowledge. While extensive research has explored general hallucinations in MLLMs, font-induced hallucinations remain an overlooked yet critical challenge, particularly in OCR-based applications and high-stakes domains such as medical and legal text analysis. In this work, we formally define font hallucinations and systematically categorize them into three types: font style, font semantics, and font sentiment. We then conduct comprehensive experiments to quantify their impact on model reliability. Building on this analysis, we propose FontHalu, the first dedicated benchmark for evaluating MLLMs' robustness to font hallucinations. To mitigate them, we apply LoRA-based parameter-efficient fine-tuning, which improves generalization to unseen fonts while exposing the limitations of current adaptation techniques. We will publicly release the benchmark and datasets to advance the development of more reliable multimodal AI systems.
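The submission page does not include the authors' training code. As a rough illustration of the LoRA-based parameter-efficient fine-tuning the abstract describes, a minimal sketch using the Hugging Face transformers and peft libraries might look like the following; the model identifier, target modules, and hyperparameters are placeholder assumptions, not the paper's actual configuration.

```python
# Minimal LoRA fine-tuning sketch for an MLLM.
# Hypothetical configuration: model id, rank, and target modules are
# illustrative assumptions, not the authors' reported setup.
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"  # placeholder MLLM backbone

model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Inject low-rank adapters into the attention projections; the frozen
# backbone stays untouched, so only the small adapter matrices train.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights are trainable
```

Training only low-rank adapters on a font-rendered dataset keeps the parameter update small, which is consistent with the abstract's framing: the adaptation can improve robustness to unseen fonts without retraining the full model, while its limited capacity also bounds how much font-specific behavior it can correct.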
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English, Chinese
Submission Number: 5884