Low-Hallucination Synthetic Captions via Visual Checklist-Based Reinforcement Learning for Vision-Language Model Pre-training

18 Sept 2025 (modified: 25 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Synthetic data, recaption, vision-language pre-training, vision-language model
Abstract: Current pre-training of Vision-Language Models (VLMs) relies on large-scale, high-quality alt-text datasets. However, alt-text data is typically short and noisy, an issue that is more pronounced for non-English languages. To address this limitation, this paper proposes a recaptioning model that rewrites original alt-text into richly detailed captions while maintaining low hallucination rates. The key to mitigating hallucinations is a reinforcement learning approach that leverages preference data produced via visual checklists. Using this recaptioning model, we construct X-Recap, a dataset of 1 billion synthetic image-caption pairs with low hallucination rates. We empirically demonstrate that a VLM pre-trained on X-Recap substantially outperforms its counterpart trained on the original alt-text data, achieving an average performance improvement of approximately 4.6% across 15 vision-language tasks. To facilitate further research in the community, 20% of the X-Recap dataset will be released to the public.
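To make the checklist-based preference construction concrete, below is a minimal, hypothetical sketch; it is not the authors' implementation and every name in it (ChecklistItem, checklist_score, build_preference_pair, the entails callback) is an illustrative assumption. The idea it illustrates: each candidate recaption is scored by the fraction of verifiable visual-checklist items it satisfies, and the highest- and lowest-scoring captions form a (chosen, rejected) preference pair that can feed a preference-based RL objective such as DPO.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ChecklistItem:
    # One verifiable visual fact about the image, e.g. "a dog is present".
    question: str
    expected: bool


def checklist_score(caption: str,
                    checklist: List[ChecklistItem],
                    entails: Callable[[str, str], bool]) -> float:
    """Fraction of checklist items the caption answers correctly.

    `entails(caption, question) -> bool` stands in for any caption-level
    entailment or VQA-style checker (hypothetical, not from the paper).
    """
    if not checklist:
        return 0.0
    correct = sum(
        1 for item in checklist
        if entails(caption, item.question) == item.expected
    )
    return correct / len(checklist)


def build_preference_pair(captions: List[str],
                          checklist: List[ChecklistItem],
                          entails: Callable[[str, str], bool]) -> Tuple[str, str]:
    """Rank candidate recaptions by checklist score; return (chosen, rejected)."""
    ranked = sorted(
        captions,
        key=lambda c: checklist_score(c, checklist, entails),
        reverse=True,
    )
    return ranked[0], ranked[-1]


if __name__ == "__main__":
    # Toy entailment stub (keyword match only), purely for illustration.
    def entails(caption: str, question: str) -> bool:
        keyword = question.split()[1]  # crude: take the object noun
        return keyword in caption

    checklist = [
        ChecklistItem("a dog is present", True),
        ChecklistItem("a cat is present", False),
    ]
    candidates = [
        "a dog playing in the park",                 # consistent with checklist
        "a dog and a cat playing in the park",       # hallucinates the cat
    ]
    chosen, rejected = build_preference_pair(candidates, checklist, entails)
    print("chosen:", chosen)
    print("rejected:", rejected)
```

In practice the entailment check would be a VLM- or VQA-based verifier rather than keyword matching, and the resulting (chosen, rejected) pairs would be used as preference data for the reinforcement learning stage described in the abstract.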
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 11923