Too Long or Too Fake? Disentangling the Causes of Hallucination in Vision–Language Models

ACL ARR 2025 May Submission4289 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Training vision–language models (VLMs) on long, synthetic captions has been shown to increase hallucination compared to using short, human-written ones. Prior work attributes this to errors in synthetic data, but it confounds caption origin (human vs. synthetic) with caption length. We disentangle these factors through controlled experiments on three matched sets of captions: short human-written, long human-written, and long synthetic. VLMs trained on these datasets are evaluated with recent, advanced hallucination metrics, broken down by objects, attributes, and relations. We find that caption length is the main driver of hallucination, though synthetic origin also contributes, particularly through object and attribute errors.
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, image text matching, cross-modal pretraining, multimodality
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 4289