Low-Hallucination Synthetic Captions via Visual Checklist-Based Reinforcement Learning for Vision-Language Model Pre-training

18 Sept 2025 (modified: 25 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Synthetic data, recaption, vision-language pre-training, vision-language model
Abstract: Current pre-training of Vision-Language Models (VLMs) relies on large-scale, high-quality alt-text datasets. However, alt-text data is typically short and noisy, an issue that is more pronounced for non-English languages. To address this limitation, this paper proposes a recaptioning model that rewrites original alt-text into richly detailed captions while maintaining low hallucination rates. The key to mitigating hallucinations is a reinforcement learning approach that leverages preference data produced via visual checklists. Using this recaptioning model, we construct X-Recap, a dataset of 1 billion synthetic image-caption pairs with low hallucination rates. We empirically demonstrate that a VLM pre-trained on X-Recap substantially outperforms its counterpart trained on the original alt-text data, achieving an average performance improvement of approximately 4.6% across 15 vision-language tasks. To facilitate further research in the community, 20% of the X-Recap dataset will be released to the public.
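To make the checklist-based preference construction concrete, below is a minimal, hypothetical sketch; it is not the authors' implementation and every name in it (ChecklistItem, checklist_score, build_preference_pair, the entails callback) is an illustrative assumption. The idea it illustrates: each candidate recaption is scored by the fraction of verifiable visual-checklist items it satisfies, and the highest- and lowest-scoring captions form a (chosen, rejected) preference pair that can feed a preference-based RL objective such as DPO.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ChecklistItem:
    # One verifiable visual fact about the image, e.g. "a dog is present".
    question: str
    expected: bool


def checklist_score(caption: str,
                    checklist: List[ChecklistItem],
                    entails: Callable[[str, str], bool]) -> float:
    """Fraction of checklist items the caption answers correctly.

    `entails(caption, question) -> bool` stands in for any caption-level
    entailment or VQA-style checker (hypothetical, not from the paper).
    """
    if not checklist:
        return 0.0
    correct = sum(
        1 for item in checklist
        if entails(caption, item.question) == item.expected
    )
    return correct / len(checklist)


def build_preference_pair(captions: List[str],
                          checklist: List[ChecklistItem],
                          entails: Callable[[str, str], bool]) -> Tuple[str, str]:
    """Rank candidate recaptions by checklist score; return (chosen, rejected)."""
    ranked = sorted(
        captions,
        key=lambda c: checklist_score(c, checklist, entails),
        reverse=True,
    )
    return ranked[0], ranked[-1]


if __name__ == "__main__":
    # Toy entailment stub (keyword match only), purely for illustration.
    def entails(caption: str, question: str) -> bool:
        keyword = question.split()[1]  # crude: take the object noun
        return keyword in caption

    checklist = [
        ChecklistItem("a dog is present", True),
        ChecklistItem("a cat is present", False),
    ]
    candidates = [
        "a dog playing in the park",                 # consistent with checklist
        "a dog and a cat playing in the park",       # hallucinates the cat
    ]
    chosen, rejected = build_preference_pair(candidates, checklist, entails)
    print("chosen:", chosen)
    print("rejected:", rejected)
```

In practice the entailment check would be a VLM- or VQA-based verifier rather than keyword matching, and the resulting (chosen, rejected) pairs would be used as preference data for the reinforcement learning stage described in the abstract.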
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 11923